Build a self-improving AI agent that turns documents into structured data (with LangGraph)



This content originally appeared on DEV Community and was authored by Oliver S

Project: Unstructured to structured

What does this AI agent actually do?

This self-improving AI agent takes messy documents (invoices, contracts, medical reports, whatever) and turns them into clean, structured data and CSV tables. But here’s the kicker – it actually gets better at its job over time.

Full code open source at: https://github.com/Handit-AI/handit-examples/tree/main/examples/unstructured-to-structured

Table of Contents

  • What does this AI agent do?
  • Example: from messy to structured
  • 1. Architecture overview
  • 3. The core: LangGraph workflow
  • 4. Node classes: specialized tools for every task
    • Inference Schema Node — the schema detective
    • Document Data Capture Node — the data extractor
    • Generate CSV Node — the table builder
  • 5. Self-improvement (the best part)
  • 6. Results
  • 7. Conclusions

Example: from messy to structured

Input:

(Two sample document images are shown in the original post.)

Output:

Structured data (CSV files) extracted with high accuracy. (The CSV screenshots appear in the original post.)

You also get a JSON file per document. The data is structured with a general schema. The LLM picks the best schema for the document type, so you can use the data however you want.

{
  "header": {
    "document_title": {
      "value": "Purchase Order",
      "normalized_value": "Purchase Order",
      "reason": "Top-right prominent heading reads 'Purchase Order' (visual title header).",
      "confidence": 0.95
    },
    "purchase_order_number": {
      "value": "PO000495",
      "normalized_value": "PO000495",
      "reason": "Label 'PO No: PO000495' printed near the header on the right; matched to schema synonyms 'PO No' / 'Purchase Order #'.",
      "confidence": 0.95
    },
    "po_date": {
      "value": "04/26/2017",
      "normalized_value": "2017-04-26",
      "reason": "Date '04/26/2017' directly under PO number in header; normalized to ISO-8601.",
      "confidence": 0.95
    }
  },
  "parties": {
    "bill_to_name": {
      "value": "PLANERGY Boston Office",
      "normalized_value": "PLANERGY Boston Office",
      "reason": "Top-left block under the company logo lists 'PLANERGY' and 'Boston Office' — interpreted as the billing/requesting organization.",
      "confidence": 0.88
    },
...
 "items": {
    "items": {
      "value": [
        {
          "item_name": "Nescafe Gold Blend Coffee 7oz",
          "item_description": null,
          "item_code": "QD2-00350",
          "sku": "QD2-00350",
          "quantity": 1.0,
          "unit": null,
          "unit_price": 34.99,
          "discount": 0.0,
          "line_total": 34.99,
          "currency": "USD"
        },
        {
          "item_name": "Tettley Tea Round Tea Bags 440/Pk",
          "item_description": null,
          "item_code": "QD2-TET440",
          "sku": "QD2-TET440",
          "quantity": 1.0,
          "unit": null,
          "unit_price": 20.49,
          "discount": 0.0,
          "line_total": 20.49,
          "currency": "USD"
        },
...
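
Because every extracted field follows the same {"value", "normalized_value", "reason", "confidence"} shape, downstream code can treat the output uniformly. As a small illustrative sketch (the file name and the 0.8 threshold are my own assumptions, not part of the repo), you could flag low-confidence fields for manual review like this:

import json

# Load one of the per-document JSON outputs (path is an assumption)
with open("outputs/purchase_order_po000495.json", encoding="utf-8") as f:
    doc = json.load(f)

# Walk every section (header, parties, items, ...) and inspect each field object
for section_name, fields in doc.items():
    for field_name, field in fields.items():
        if not isinstance(field, dict) or "value" not in field:
            continue  # skip anything that isn't a {"value", ...} field object
        if field.get("confidence", 1.0) < 0.8:
            print(f"Review {section_name}.{field_name}: {field['value']!r} ({field.get('reason', '')})")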

So I built this thing called “Unstructured to Structured”, and honestly, it’s doing some pretty wild stuff. Let me break down what’s actually happening under the hood.

Let’s dive in!

1. Architecture Overview

Let’s understand the architecture of our AI agent at a very high level:

This architecture separates concerns into distinct nodes (a minimal LangGraph wiring sketch follows this list):

  1. inference_schema

    • Purpose: AI analyzes uploaded documents to create a unified JSON schema
    • Input: Images, PDFs, text files
    • Output: Structured schema defining data fields and relationships
    • AI capability: Multimodal analysis (vision + text)
  2. document_data_capture

    • Purpose: Maps document content to the inferred schema using AI extraction
    • Input: Documents + inferred schema
    • Output: Structured JSON with field mappings
    • AI capability: Field extraction with confidence scores
  3. generate_csv

    • Purpose: Converts structured JSON into clean CSV tables
    • Input: Structured JSON from the previous node
    • Output: CSV files ready for analysis
    • AI capability: Intelligent table structure planning
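
Putting the three nodes together, here is a minimal LangGraph wiring sketch. The GraphState fields and the node stubs are assumptions for illustration; the real node implementations live in the repo:

from typing import Any, Dict, List, TypedDict

from langgraph.graph import StateGraph, START, END


class GraphState(TypedDict, total=False):
    document_paths: List[str]         # uploaded documents (images, PDFs, text)
    inferred_schema: Dict[str, Any]   # unified JSON schema from inference_schema
    structured_json_paths: List[str]  # per-document extraction results
    generated_files: List[str]        # CSV files written by generate_csv


# Stub nodes: the real versions call the LLM (see the repo for the full code)
def inference_schema(state: GraphState) -> Dict[str, Any]:
    return {"inferred_schema": {}}

def document_data_capture(state: GraphState) -> Dict[str, Any]:
    return {"structured_json_paths": []}

def generate_csv(state: GraphState) -> Dict[str, Any]:
    return {"generated_files": []}


workflow = StateGraph(GraphState)
workflow.add_node("inference_schema", inference_schema)
workflow.add_node("document_data_capture", document_data_capture)
workflow.add_node("generate_csv", generate_csv)

workflow.add_edge(START, "inference_schema")
workflow.add_edge("inference_schema", "document_data_capture")
workflow.add_edge("document_data_capture", "generate_csv")
workflow.add_edge("generate_csv", END)

app = workflow.compile()
# result = app.invoke({"document_paths": ["docs/po_000495.png"]})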

Another Perspective of the Workflow 🧠

Think of it as a smart pipeline that processes documents step by step. Here’s what happens:

  1. You upload documents – like invoices, contracts, medical reports (any type)
  2. The agent analyzes everything – it looks at all your documents and figures out the best structure (schema)
  3. It creates a unified schema – one JSON schema that can represent all your documents
  4. Then extracts the data – maps each document to the schema with AI
  5. Finally builds tables – creates CSV files and structured data you can actually use

Inference Schema Node – the schema detective (this is the most important node)

This is where the magic starts:

  1. Reads images, PDFs, and text
  2. Proposes a unified JSON schema that fits everything
  3. Works across any document type
  4. Adds useful field types and reasoning

from typing import Dict, List, Optional

from pydantic import BaseModel, Field


class SchemaField(BaseModel):
    """
    Represents a single field in the inferred schema.

    Each field defines the structure, validation rules, and metadata for a piece
    of data extracted from documents. Fields can be simple (string, number) or
    complex (objects, arrays) depending on the document structure.
    """

    name: str = Field(description="Field name")
    types: List[str] = Field(description="Allowed JSON types, e.g., ['string'], ['number'], ['string','null'], ['object'], ['array']")
    description: str = Field(description="What the field represents and how to interpret it")
    required: bool = Field(description="Whether this field is commonly present across the provided documents")
    examples: Optional[List[str]] = Field(default=None, description="Representative example values if known")
    enum: Optional[List[str]] = Field(default=None, description="Enumerated set of possible values when applicable")
    format: Optional[str] = Field(default=None, description="Special format hint like 'date', 'email', 'phone', 'currency', 'lang' etc.")
    reason: str = Field(description="Brief rationale for inferring this field (signals, patterns, layout cues)")


class SchemaSection(BaseModel):
    """
    Logical grouping of fields to organize the schema structure.

    Sections help organize related fields into meaningful groups rather than
    having all fields in a flat list. This improves schema readability and
    makes it easier to understand the document structure.
    """

    name: str = Field(description="Section name (generic), e.g., 'core', 'entities', 'dates', 'financial', 'items', 'metadata'")
    fields: List[SchemaField] = Field(description="Fields within this section")


class InferredSchema(BaseModel):
    """
    Top-level inferred schema for a heterogeneous set of documents.

    This schema represents the unified structure that can accommodate various
    document types and formats. It combines common patterns found across
    multiple documents into a single, flexible schema definition.
    """

    title: str = Field(description="Human-readable title of the inferred schema")
    version: str = Field(description="Schema semantic version, e.g., '0.1.0'")
    description: str = Field(description="High-level description of the schema and how it was inferred")

    common_sections: List[SchemaSection] = Field(description="Sections that apply broadly across the provided documents")
    specialized_sections: Optional[Dict[str, List[SchemaSection]]] = Field(
        default=None,
        description="Optional mapping of document_type -> sections specific to that type",
    )

    rationale: str = Field(description="Concise explanation of the main signals used to infer this schema")

system = """
You are a senior information architect. Given multiple heterogeneous documents (any type, any language), infer the most appropriate, general JSON schema that can represent them.

Guidance:
- Infer structure purely from the supplied documents; avoid biasing toward any specific document type.
- Use lower_snake_case for field names.
- Use JSON types: string, number, boolean, object, array, null. When a field may be missing, include null in its types.
- Allow nested objects and arrays where the documents imply hierarchical structure.
- Include brief, useful descriptions for fields when possible without inventing content.
- Return ONLY JSON that matches the provided Pydantic model for an inferred schema.

Per-field requirements:
- For each field, add a short 'reason' explaining the signals used to infer the field (keywords, repeated labels, table headers, layout proximity, visual grouping, etc.).
"""

Document Data Capture Node – the data extractor

This node maps each document to the inferred schema, returning a value, a normalized value, a short reason, and a confidence score for every field:

system = """
You are a robust multimodal (vision + text) document-to-schema mapping system. Given an inferred schema and a document (image/pdf/text), analyze layout and visual structure first, then map fields strictly to the provided schema.

Requirements:
- Use the provided schema as the contract for output structure (keep sections/fields as-is).
- For each field, search labels/headers/aliases using the 'synonyms' provided by the schema and semantic similarity (including multilingual variants).
- Prioritize visual layout cues (titles, headers, table columns, proximity, group boxes) before plain text.
- Do NOT invent values. If a value isn't found, set it to null and add a short reason.
- For every field, include a short 'reason' explaining the mapping (signals used) and a 'normalized_value' when applicable (e.g., date to ISO, amounts to numeric, emails lowercased, trimmed strings).
- Return ONLY a JSON object that mirrors the schema sections/fields. Each field should be an object: {{"value": <any|null>, "normalized_value": <any|null>, "reason": <string>, "confidence": <number optional>}}.
"""

Generate CSV Node – the table builder

Finally, it creates structured tables from all your data:

import json
import os
from typing import Any, Dict

# Excerpt: csv_generation_planner, tracker, output_dir, agent_name, execution_id,
# get_system_prompt, and get_user_prompt are defined elsewhere in the repo.
def generate_csv(state: GraphState) -> Dict[str, Any]:
    # Load all the extracted JSON data produced by document_data_capture
    structured_json_paths = state["structured_json_paths"]
    all_json_data = []
    for json_path in structured_json_paths:
        with open(json_path, "r", encoding="utf-8") as f:
            json_data = json.load(f)
        all_json_data.append({"filename": os.path.basename(json_path), "data": json_data})

    # Ask the LLM to plan the tables
    llm_response = csv_generation_planner.invoke({
        "documents_inventory": all_json_data
    })
    plan = llm_response          # the planner's table plan
    tables = plan["tables"]      # key assumed for this excerpt

    # Generate CSV files
    generated_files = _save_tables_to_csv(tables, output_dir)

    # Track with Handit.ai
    tracker.track_node(
        input={"systemPrompt": get_system_prompt(), "userPrompt": get_user_prompt(), "documents_inventory": all_json_data},
        output={"tables": tables, "plan": plan, "generated_files": generated_files},
        node_name="generate_csv",
        agent_name=agent_name,
        node_type="llm",
        execution_id=execution_id
    )

    return {"generated_files": generated_files}
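
The helper _save_tables_to_csv is not shown above. A plausible sketch, assuming the planner returns each table as a {"name", "columns", "rows"} dictionary, could look like this:

import csv
import os
from typing import Any, Dict, List


def _save_tables_to_csv(tables: List[Dict[str, Any]], output_dir: str) -> List[str]:
    """Write each planned table to <output_dir>/<name>.csv and return the file paths."""
    os.makedirs(output_dir, exist_ok=True)
    generated_files = []
    for table in tables:
        path = os.path.join(output_dir, f"{table['name']}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(table["columns"])   # header row
            writer.writerows(table["rows"])     # data rows
        generated_files.append(path)
    return generated_files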

Want to dig into the nodes and prompts? Check the repo.

5. Self-improvement (the best part)

Here’s the really cool thing: this AI agent actually gets better over time. The secret weapon is Handit.ai.

Every action, every response is fully observed and analyzed. The system can see:

  • If the schema inferences worked well
  • Which data extractions failed
  • How long processing takes
  • What document types cause issues
  • When the LLM makes mistakes
  • And more…

And yes, when this tool detects a mistake, it fixes it automatically.

This means the AI agent can actually improve itself. If the LLM extracts the wrong field or generates an incorrect schema or CSV, Handit.ai tracks that failure and automatically adjusts the AI agent to prevent the same mistake from happening again. It’s like having an AI engineer who is constantly monitoring, evaluating, and improving your AI agent.

Here are the results after implementing Handit.ai:

6. Results

(The screenshots and metrics appear in the original post.)

If you try this out, I’d love your feedback or PRs. 🙌
