OCR, Chunking, and Image Indexing for Scientific Papers

nlp · retrieval · ocr

I was working on a project that needed to extract content from scientific papers (PDFs) and make it searchable. The kind of papers with complex layouts: equations, tables, figures, multi-column text. Not easy stuff.

I ended up comparing three OCR services, testing different chunking strategies for retrieval, and figuring out how to index the images from papers so they could actually be found later. Here's what I learned.

OCR: Tesseract vs Textract vs Azure Document Intelligence

I tested Tesseract, AWS Textract, and Azure Document Intelligence on the same set of scientific papers.

Tesseract is free and open source, which is nice. But it struggles with complex layouts. Multi-column papers confuse it. Tables come out garbled. It's fine for simple single-column documents, but scientific papers are not that.

AWS Textract is better. It handles tables reasonably well and can detect different content blocks on a page. It understands that a paper has columns and doesn't just merge them into one stream of text. Still, it sometimes mixes up the reading order on complicated layouts.

Azure Document Intelligence was the clear winner for this use case. It detected page breaks correctly. It identified polygons around images and tables accurately, which matters when you need to know where figures are on the page and extract them separately. Its PDF-to-markdown conversion was the best of the three. It also did a solid job with key-value extraction, which is useful for pulling out metadata from paper headers.

Textract was fine too, honestly. If you're already deep in the AWS ecosystem, it'll work. But if you're choosing from scratch, Azure Document Intelligence gave me the best results by a noticeable margin.
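The polygons Azure returns are what make figure extraction practical: once you have a figure's bounding polygon, you can crop it out of a rendered page image. Here's a minimal sketch of that coordinate conversion, assuming the polygon is a flat `[x1, y1, x2, y2, x3, y3, x4, y4]` list in inches (which is what Document Intelligence returns for PDF input, as I understand it) and that you've rendered the page to an image at a known DPI:

```python
def polygon_to_pixel_bbox(polygon, dpi=200):
    """Convert a bounding polygon (flat list of x, y pairs, in inches)
    into an integer pixel bounding box at the given DPI, suitable for
    cropping the figure out of a rendered page image."""
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    left, top = min(xs) * dpi, min(ys) * dpi
    right, bottom = max(xs) * dpi, max(ys) * dpi
    return (int(left), int(top), int(right), int(bottom))
```

The axis-aligned min/max is a simplification: it assumes the polygon is roughly rectangular, which figure regions on paper pages almost always are.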

Chunking for retrieval

After extracting the text, I needed to chunk it for indexing in a vector store. The goal is retrieval: someone asks a question, and you want to find the right chunk of the paper that answers it.

I tried a few approaches:

Fixed-size chunking (just splitting every N tokens) is the baseline. It's fast but dumb. You end up splitting in the middle of paragraphs, in the middle of arguments. The chunks don't represent coherent ideas.

Recursive character splitting is better. It tries to split on paragraph boundaries first, then sentence boundaries. It respects the structure of the text more. But it's still rule-based and doesn't actually understand what it's splitting.
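To make the rule-based approach concrete, here's my own simplified sketch of a recursive splitter (not any particular library's implementation): it tries the coarsest separator first and falls back to finer ones, with a hard fixed-size split as the last resort.

```python
def recursive_split(text, max_chars=800, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_chars, preferring paragraph
    boundaries, then lines, then sentences, then words."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) <= max_chars:
                        current = part
                    else:
                        # A single part can still be too long: recurse,
                        # which will try the finer separators.
                        chunks.extend(recursive_split(part, max_chars, separators))
                        current = ""
            if current:
                chunks.append(current)
            return chunks
    # No separator found at all: fall back to fixed-size chunking.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

The fallback at the bottom is exactly the fixed-size baseline from above, which is why recursive splitting is a strict improvement: it only degrades to dumb splitting when nothing better is available.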

What worked best was semantic chunking using a small 7B-parameter LLM. I had the model read through the text and decide where topic shifts happen. It would mark boundaries where one idea ends and another begins. The chunks that came out of this were way more coherent. When you retrieved a chunk, it actually contained a complete thought or argument rather than an arbitrary slice of text.

It's slower and costs more compute than the rule-based approaches. But the retrieval quality improvement was worth it for this project. The chunks made more sense, and the search results were more relevant.
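The marker-based version of this is easy to sketch. Here `complete` is any prompt-to-completion callable standing in for the 7B model, and the prompt wording and marker token are my own illustrative choices, not a specific API:

```python
BOUNDARY = "<<<SPLIT>>>"

def semantic_chunks(text, complete):
    """Ask an LLM to mark topic boundaries in the text, then split on
    the markers. Assumes the model returns the input text verbatim with
    boundary markers inserted."""
    prompt = (
        "Copy the following passage back verbatim, inserting the marker "
        f"{BOUNDARY} at each point where one topic ends and another "
        f"begins:\n\n{text}"
    )
    marked = complete(prompt)
    return [chunk.strip() for chunk in marked.split(BOUNDARY) if chunk.strip()]
```

In practice you'd also want to verify the model really did return the text verbatim (small models sometimes paraphrase), and fall back to rule-based splitting when it doesn't.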

Image indexing

Scientific papers are full of figures, charts, and diagrams. If someone asks "show me the architecture diagram from that paper," you need to be able to find it. This is harder than text retrieval because images don't have obvious text content to index.

I tried two approaches:

Approach 1: Vision model descriptions. Take the image, send it to a vision model, and ask it to describe what's in it. The description gets embedded as text and goes into the vector index.

The trick that made this work well was adding context. Instead of just sending the image by itself, I also sent the surrounding text from the same page, or sometimes the caption, or a reference to the figure from elsewhere in the paper. This gave the vision model enough context to write a description that was actually specific. Instead of "a bar chart showing results," it would say "a bar chart comparing F1 scores of BERT, RoBERTa, and GPT-2 on the SQuAD benchmark, with BERT achieving the highest score."

That extra specificity in the description meant the text embedding captured the actual content of the image, not just its visual type.
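The context assembly itself is just prompt construction. A sketch of how the caption, surrounding page text, and cross-references can be combined before sending everything to the vision model alongside the image (function name and prompt wording are illustrative):

```python
def build_description_prompt(caption=None, page_text=None, references=()):
    """Assemble the text context that accompanies a figure image when
    asking a vision model for a description. More context produces a
    more specific, more retrievable description."""
    parts = [
        "Describe this figure from a scientific paper. Be specific about "
        "what is plotted, which methods or datasets are named, and the "
        "main takeaway."
    ]
    if caption:
        parts.append(f"Figure caption: {caption}")
    if page_text:
        parts.append(f"Surrounding text from the same page: {page_text}")
    for ref in references:
        parts.append(f"Reference to this figure elsewhere in the paper: {ref}")
    return "\n\n".join(parts)
```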

Approach 2: CLIP embeddings. Encode the image directly with CLIP and put the embedding in the index. This works but it's limited. CLIP is good at matching images to general descriptions, but it doesn't capture the specific content of scientific figures very well.

Approach 1 with contextual descriptions was significantly better for retrieval. The key was not just sending the image in isolation but giving the vision model enough surrounding context to understand what the figure is actually about. Then the generated text, once embedded, was much more findable by natural language queries.

The full pipeline

PDF goes in. Azure Document Intelligence extracts everything into markdown with images and tables separated out. Text gets semantically chunked by a 7B LLM and embedded. Images get described by a vision model (with context from the surrounding text) and those descriptions get embedded. Everything goes into the same vector index, and you can search across both text and images with natural language queries.
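To make the "same vector index" part concrete, here's a toy in-memory version: text chunks and image descriptions both go in as embedded text and are searched with cosine similarity. The `embed` callable is a stand-in for a real embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorIndex:
    """Minimal shared index: text chunks and image descriptions are both
    stored as (embedding, payload) pairs, so one natural-language query
    searches across both."""
    def __init__(self, embed):
        self.embed = embed  # any text -> list[float] callable
        self.items = []     # list of (vector, payload) pairs

    def add(self, text, payload):
        self.items.append((self.embed(text), payload))

    def search(self, query, k=3):
        q = self.embed(query)
        scored = sorted(
            ((cosine(q, vec), payload) for vec, payload in self.items),
            key=lambda s: s[0], reverse=True,
        )
        return [payload for _, payload in scored[:k]]
```

A real deployment would use an actual vector store, but the shape is the same: because image descriptions are indexed as text, nothing downstream needs to know whether a hit came from a paragraph or a figure.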

Nothing here is groundbreaking individually. The Azure OCR, semantic chunking, and contextual image descriptions are all known techniques. But getting the details right on each step made a big difference in the end-to-end retrieval quality. Especially the OCR choice and the image context. Those two had the biggest impact.