Science University Research Symposium (SURS)

Intelligent PDF Extraction: Building a Regex-Driven and LLM-Assisted Pipeline for Document Analysis

Publication Date

Winter 11-24-2025

Department

Math and Computer Science, Department of

SURS Faculty Advisor

Dr. Christina Davis

Presentation Type

Poster Presentation

Abstract

This project presents a prototype pipeline for the automated ingestion, extraction, and language model–based summarization of technical documents. The system connects to Dropbox storage to identify, enumerate, and download files, then extracts and structures the text of documents in a range of formats. The extracted text is stored locally and passed to OpenAI's language models for summarization, question answering, and insight generation. The primary objective is to produce concise, human-readable summaries and targeted responses that significantly reduce manual review time. By automating information retrieval and contextual interpretation, the pipeline aims to improve both accuracy and efficiency in document management workflows. Future development could incorporate retrieval-augmented generation (RAG) to provide context-aware reasoning and better traceability to original sources. Overall, the project demonstrates a scalable, practical application of large language models to domain-specific document analysis, supporting faster decision-making, improved collaboration, and more efficient operations.
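
The regex-driven extraction named in the title and the summarization step described above might look roughly like the following minimal Python sketch. It is an illustration under stated assumptions, not the project's actual code: the patterns in PATTERNS, the helper names extract_fields, chunk_text, and summarize, and the max_chars limit are all hypothetical. The language model is passed in as a plain callable so that no particular OpenAI or Dropbox client call is assumed.

```python
import re
from typing import Callable, Dict, List

# Hypothetical patterns: the abstract does not say which fields are extracted.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def extract_fields(text: str) -> Dict[str, List[str]]:
    """Return every match of each named pattern found in the text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


def chunk_text(text: str, max_chars: int = 2000) -> List[str]:
    """Split on blank lines and pack paragraphs into chunks of at most
    max_chars characters. A single paragraph longer than max_chars is
    kept whole rather than cut mid-sentence."""
    chunks: List[str] = []
    current = ""
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def summarize(text: str, llm: Callable[[str], str], max_chars: int = 2000) -> str:
    """Summarize each chunk, then merge the partial summaries.

    `llm` is any function mapping a prompt string to a response string;
    in a real pipeline it would wrap a language-model API call."""
    partials = [llm("Summarize:\n" + chunk) for chunk in chunk_text(text, max_chars)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return llm("Combine these summaries:\n" + "\n".join(partials))
```

Passing the model as a callable keeps the sketch testable offline: a stub function can stand in for the model, and a wrapper around a real client can be substituted later without changing the chunking or extraction logic.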
