Science University Research Symposium (SURS)
Intelligent PDF Extraction: Building a Regex-Driven and LLM-Assisted Pipeline for Document Analysis
Publication Date
Winter 11-24-2025
Department
Math and Computer Science, Department of
SURS Faculty Advisor
Dr. Christina Davis
Presentation Type
Poster Presentation
Abstract
This project presents a prototype for automated ingestion, extraction, and language model–based summarization of technical documents. The system integrates with Dropbox storage to identify, enumerate, and download files, then extracts and structures textual content from diverse document formats. Extracted text is stored locally and processed through OpenAI’s language models for summarization, question answering, and insight generation. The primary objective is to produce concise, human-readable summaries and targeted responses that significantly reduce manual review time. By automating information retrieval and contextual interpretation, the pipeline enhances both accuracy and efficiency in document management workflows. Future development could incorporate retrieval-augmented generation (RAG) techniques to provide context-aware reasoning and improved traceability to original sources. Overall, this project demonstrates a scalable and practical application of large language models for domain-specific document analysis, supporting faster decision-making, improved collaboration, and operational excellence.
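The extraction and structuring stage described in the abstract might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the poster's implementation: the regular expressions, the function names (clean_text, chunk_paragraphs, summarize_document), and the 2000-character chunk limit are all hypothetical, and the language model call is represented by a plain callable so that no particular API is assumed.

```python
import re
from typing import Callable, List

# Hypothetical patterns; the abstract does not list the expressions it uses.
HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")   # word split across a line break
EXTRA_SPACE = re.compile(r"[ \t]+")         # runs of spaces or tabs
PARA_BREAK = re.compile(r"\n\s*\n+")        # one or more blank lines


def clean_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace.

    Limitation: a genuine compound broken at its hyphen (e.g. "well-" / "known")
    is also merged, so this is only a first-pass cleanup.
    """
    text = HYPHEN_BREAK.sub(r"\1\2", raw)
    text = EXTRA_SPACE.sub(" ", text)
    # Keep paragraph breaks, but unwrap single newlines inside a paragraph.
    paragraphs = [p.replace("\n", " ").strip() for p in PARA_BREAK.split(text)]
    return "\n\n".join(p for p in paragraphs if p)


def chunk_paragraphs(text: str, max_chars: int = 2000) -> List[str]:
    """Group whole paragraphs into chunks of at most max_chars where possible.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def summarize_document(
    raw: str, summarize: Callable[[str], str], max_chars: int = 2000
) -> List[str]:
    """Clean the text, chunk it, and pass each chunk to the given summarizer."""
    return [summarize(c) for c in chunk_paragraphs(clean_text(raw), max_chars)]
```

In this sketch, summarize stands in for whatever function sends a chunk to a language model and returns its reply. Keeping it as a parameter lets the cleaning and chunking logic be tested without network access.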
Recommended Citation
Croke, Brielle, "Intelligent PDF Extraction: Building a Regex-Driven and LLM-Assisted Pipeline for Document Analysis" (2025). Science University Research Symposium (SURS). 286.
https://repository.belmont.edu/surs/286
