
How to get structured data from any PDF with Gemini

2026-03-12

You want typed JSON out of a PDF — numbers as numbers, dates as dates, tables with columns. Gemini can read PDFs natively (scanned or text-based) and return structured JSON. Here's how to wire it up in TypeScript.

Setup

npm install @google/generative-ai

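Grab an API key from Google AI Studio and export it as GOOGLE_API_KEY, which the code below reads from the environment. Gemini 2.5 Flash is a fast, inexpensive default for this kind of extraction.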

Basic extraction

Send the PDF as base64 and set responseMimeType: "application/json" so Gemini returns valid JSON instead of markdown-wrapped text.

extract.ts

import { GoogleGenerativeAI } from "@google/generative-ai";
import { readFileSync } from "fs";

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);

const model = genAI.getGenerativeModel({
  model: "gemini-2.5-flash",
  generationConfig: {
    responseMimeType: "application/json",
    maxOutputTokens: 65536,
  },
});

const pdf = readFileSync("report.pdf");

const result = await model.generateContent([
  {
    inlineData: {
      mimeType: "application/pdf",
      data: pdf.toString("base64"),
    },
  },
  {
    text: `Extract all tables and key-value pairs from this document.
Return JSON in this format:
{
  "title": "string or null",
  "date": "ISO 8601 or null",
  "sections": [{
    "heading": "string",
    "tables": [{
      "name": "string",
      "columns": [{ "name": "string", "type": "string | number | date | currency | percentage" }],
      "rows": [["cell", ...]]
    }]
  }],
  "keyValuePairs": [{ "key": "string", "value": "string" }]
}`,
  },
]);

const data = JSON.parse(result.response.text());
console.log(JSON.stringify(data, null, 2));

This works on scanned docs and images too, not just text-based PDFs.
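JSON.parse hands you an untyped value, so it can help to pin down the shape you asked for. Here is a minimal sketch of types mirroring the prompt's format, plus a shallow runtime check. The names Extraction, isExtraction, and parseExtraction are my own for illustration and not part of any SDK, and the check only verifies the outer structure rather than every field.

```typescript
// Types mirroring the JSON format requested in the prompt.
type ColumnType = "string" | "number" | "date" | "currency" | "percentage";
type Cell = string | number | null;

interface Column { name: string; type: ColumnType; unit?: string }
interface Table { name: string; columns: Column[]; rows: Cell[][] }
interface Section { heading: string; tables: Table[] }
interface Extraction {
  title: string | null;
  date: string | null;
  sections: Section[];
  keyValuePairs: { key: string; value: string }[];
}

// Hypothetical helper: a shallow structural check, not a full schema validator.
function isExtraction(value: unknown): value is Extraction {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (!Array.isArray(v.sections) || !Array.isArray(v.keyValuePairs)) return false;
  return v.sections.every((s) => {
    if (typeof s !== "object" || s === null) return false;
    const tables = (s as Record<string, unknown>).tables;
    return (
      Array.isArray(tables) &&
      tables.every(
        (t) =>
          typeof t === "object" &&
          t !== null &&
          Array.isArray((t as Record<string, unknown>).columns) &&
          Array.isArray((t as Record<string, unknown>).rows),
      )
    );
  });
}

// Parse the model's text and fail loudly if the outer shape is wrong.
function parseExtraction(raw: string): Extraction {
  const parsed: unknown = JSON.parse(raw);
  if (!isExtraction(parsed)) {
    throw new Error("Response does not match the expected shape");
  }
  return parsed;
}
```

With this in place, the JSON.parse call in extract.ts could become parseExtraction(result.response.text()), which gives you autocomplete on data.sections and an early error when the model drifts from the format.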

Gotchas

Gemini returns everything as strings

Without explicit instructions, you'll get "$1,234.56" instead of 1234.56. Tell it to use native JSON types:

prompt snippet

const prompt = `
CRITICAL: For table cell values, use native JSON types:
- Numbers (quantities, prices, totals): use JSON numbers, e.g. 1234.56, not "1,234.56"
- Currency amounts: strip the symbol, e.g. 49.99, not "$49.99"
- Percentages: return as a number, e.g. 19.5, not "19.5%"
- Empty cells: use null
- Dates: ISO 8601 strings (YYYY-MM-DD)

UNIT SUFFIXES: If values have a unit like "563.66p" (pence) or "150bps",
strip the suffix and return the raw number. Add a "unit" field to the column.
`;
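Even with that prompt, the odd cell may still come back as a string, so a small fallback on your side is cheap insurance. The sketch below is one possible approach and not something Gemini or the SDK provides. coerceNumber is a hypothetical helper, and it assumes US/UK-style formatting (comma as thousands separator, dot as decimal point). Anything it does not recognise, including unit suffixes like "563.66p" and European-style "1.234,56", is returned unchanged.

```typescript
// Hypothetical fallback: coerce a cell the model still returned as a string.
function coerceNumber(cell: string | number | null): string | number | null {
  if (typeof cell !== "string") return cell; // already a number or null
  const trimmed = cell.trim();
  if (trimmed === "") return null; // treat empty cells as null

  // Accounting-style negatives: "(1,234.56)" means -1234.56.
  const negative = /^\(.*\)$/.test(trimmed);

  // Drop currency symbols, whitespace, parentheses, and a trailing percent sign.
  const stripped = trimmed.replace(/[$£€\s()]/g, "").replace(/%$/, "");

  // Accept plain digits or correctly grouped thousands, with an optional decimal part.
  if (!/^-?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$/.test(stripped)) return cell;

  const n = Number(stripped.replace(/,/g, ""));
  return negative ? -n : n;
}

console.log(coerceNumber("$1,234.56"));
console.log(coerceNumber("19.5%"));
console.log(coerceNumber("N/A"));
```

Run it only over columns whose declared type is number, currency, or percentage, so that identifiers such as invoice numbers stay strings.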

Ambiguous dates

03/04/2024 — March 4th or April 3rd? Depends on the document's locale. You can have Gemini detect it from context clues:

date detection

const prompt = `
DATE FORMAT DETECTION: Before parsing dates, determine the document's locale:
- Currency: USD/$ → US format (MM/DD), GBP/£/EUR/€ → UK/EU format (DD/MM)
- Addresses: US states/zip codes → US format, UK postcodes → UK/EU format
- Tax IDs: EIN → US format, VAT number → UK/EU format
Always output dates in ISO 8601 (YYYY-MM-DD).
Include a "dateFormatSource" field with what you detected.
`;

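In practice this works well when the document carries such clues: Gemini infers the locale from currency symbols, addresses, and tax IDs. The dateFormatSource field lets you audit which clue it relied on.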
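If you also ask the model to return the raw date string, you can cross-check its ISO output locally. This is a rough sketch under that assumption. toIsoDate, isAmbiguous, and the DateOrder type are hypothetical names of mine, and the code only handles slash-separated dates with a four-digit year.

```typescript
type DateOrder = "MDY" | "DMY";

// Hypothetical helper: convert "03/04/2024" to ISO 8601 given the detected field order.
// Returns null for unparseable or impossible dates.
function toIsoDate(input: string, order: DateOrder): string | null {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(input.trim());
  if (!m) return null;
  const a = Number(m[1]);
  const b = Number(m[2]);
  const year = Number(m[3]);
  const [month, day] = order === "MDY" ? [a, b] : [b, a];

  // Round-trip through Date.UTC to reject impossible dates such as 31/02.
  const d = new Date(Date.UTC(year, month - 1, day));
  if (
    d.getUTCFullYear() !== year ||
    d.getUTCMonth() !== month - 1 ||
    d.getUTCDate() !== day
  ) {
    return null;
  }
  return d.toISOString().slice(0, 10);
}

// A date is only ambiguous when both leading fields could be a month and they differ.
function isAmbiguous(input: string): boolean {
  const m = /^(\d{1,2})\/(\d{1,2})\/\d{4}$/.exec(input.trim());
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return a >= 1 && a <= 12 && b >= 1 && b <= 12 && a !== b;
}

console.log(toIsoDate("03/04/2024", "MDY")); // 2024-03-04
console.log(toIsoDate("03/04/2024", "DMY")); // 2024-04-03
```

A date like 13/04/2024 is not ambiguous, so unambiguous dates elsewhere in the same document are another clue you can use to confirm the detected order.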

Things left as an exercise

The above handles single-page or short documents. For anything bigger, you'll also need to deal with:

Page-by-page extraction — long PDFs overflow the output token limit. Extract per page with rolling context (last columns, last heading).
Table merging across pages — match columns by normalized name, stitch rows.
Merged cells — Gemini returns vertically merged cells as empty. Fill down.
Non-determinism — column names drift across pages, phantom tables appear. Normalize and absorb.
Truncation — check finishReason for "MAX_TOKENS".
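To give a feel for what a few of these involve, here is a simplified sketch of fill-down, column matching by normalized name, and the truncation check. All function and type names are hypothetical. Real documents need more care (renamed columns, phantom tables, repeated header rows), and wasTruncated uses a structural type and assumes the response exposes candidates[0].finishReason as the string "MAX_TOKENS".

```typescript
type CellValue = string | number | null;
interface PageTable { columns: string[]; rows: CellValue[][] }

// Normalize a header so "Unit Price", "unit_price" and " UNIT PRICE " compare equal.
function normalizeName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Fill vertically merged cells: an empty cell in a listed column inherits the value above it.
function fillDown(rows: CellValue[][], columnIndexes: number[]): CellValue[][] {
  const last = new Map<number, CellValue>();
  return rows.map((row) =>
    row.map((cell, i) => {
      if (!columnIndexes.includes(i)) return cell;
      if (cell === null || cell === "") return last.get(i) ?? null;
      last.set(i, cell);
      return cell;
    }),
  );
}

// Append a continuation page to the first page's table, matching columns by normalized name.
// Columns missing from the continuation page are filled with null.
function mergeTables(first: PageTable, next: PageTable): PageTable {
  const nextIndex = new Map(
    next.columns.map((c, i) => [normalizeName(c), i] as const),
  );
  const rows = next.rows.map((row) =>
    first.columns.map((c) => {
      const i = nextIndex.get(normalizeName(c));
      return i === undefined ? null : (row[i] ?? null);
    }),
  );
  return { columns: first.columns, rows: [...first.rows, ...rows] };
}

// Structural type so this sketch does not depend on the SDK's own typings.
function wasTruncated(response: { candidates?: { finishReason?: string }[] }): boolean {
  return response.candidates?.[0]?.finishReason === "MAX_TOKENS";
}
```

Only fill down columns that act as row labels or group keys. Filling a numeric column would invent values for cells that were genuinely empty.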

Or just use an API

I built Contexa to handle all of the above — typed columns, multi-page table merging, unit detection, date disambiguation. One call in, structured JSON out. There's also a PDF-to-Excel endpoint that gives you a .xlsx with one sheet per table.

import { ContexaPdfToExcel } from "@contexaworks/pdf-to-excel";
import { readFileSync, writeFileSync } from "fs";

const client = new ContexaPdfToExcel({ apiKey: "your-rapidapi-key" });
const result = await client.pdfToExcel(readFileSync("report.pdf"));

// Write the returned workbook to disk.
writeFileSync("report.xlsx", Buffer.from(result.data));

Try it in the playground, or see the API tutorial.