Outclassing Frontier LLMs at Extracting Information
Nov 25, 2025 | 11:00 AM - 11:10 AM
Startup Stage
Description
Accurately extracting information from documents has been a decades-old dream. Important workflows — from automated back-office processing to enterprise RAG — depend on it.
LLMs promise to fulfill this dream but currently fall short: they hallucinate information, struggle with long documents, and break down on complex layouts.
The solution: LLMs specialized in information extraction.
In this talk, I will present:
- **NuExtract** — the first LLM specialized in extracting structured information (JSON output)
- **NuMarkdown** — the first reasoning OCR LLM (RAG-ready Markdown output)
**These low-hallucination, open-source models outclass frontier LLMs like GPT-5 and Gemini 2.5 while being orders of magnitude smaller**, enabling private usage.
I will demonstrate the abilities of these LLMs, show how to use them at scale, and discuss what’s coming next in information extraction.

