Document Understanding and Extraction Agent

Agent Description:

The Document Understanding and Extraction Agent automates the extraction of data from unstructured documents such as PDF invoices or reports. It retrieves content via an API and performs initial entity recognition (for example, Supplier, Date, Items, Amounts). The agent then cleans, validates, and formats the extracted data, correcting data types and structuring line items appropriately. Finally, it outputs clean, structured JSON ready for analysis or system integration.

Purpose and Components
  • Purpose: To automate the ingestion and processing of unstructured documents, turning raw text from invoices or reports into a standardized JSON format with correct data types and structures (like line items) for use in downstream systems.
  • Components:
    • Document Extractor: An agent that fetches the raw document text and performs initial entity recognition.

    • Structured Data Formatter: An agent that cleans, validates, and properly formats the extracted data.

    • API Connector (GET): A tool to retrieve the raw document content from a remote source.

Supported Capabilities
  • Fetching raw document content (simulated via JSON) using a GET tool.

  • Identifying and extracting key entities from text:

    • Supplier Name
    • Invoice / Document Number
    • Date
    • Items Purchased / Services
    • Quantity, Unit Price, Total Amount
  • Formatting extracted entities into a preliminary JSON with all fields as strings.

  • Validating and type-casting data (e.g., ensuring dates are in YYYY-MM-DD format and amounts are numbers).

  • Organizing flat line items into a proper nested JSON array (e.g., an "items" array).

  • Standardizing supplier names or categories where applicable.

  • Generating a final, structured JSON output ready for ingestion or analytics.
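
The validation and type-casting capability can be illustrated with a minimal Python sketch. The function names, the list of accepted date layouts, and the assumption that "." is the decimal separator are all hypothetical; the agent itself performs these steps through the LLM rather than through this code.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical list of input date layouts; day-first is assumed for slashes.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {raw!r}")


def parse_amount(raw: str) -> float:
    """Drop currency symbols and thousands separators, then convert to a number."""
    # Assumes "." is the decimal separator and "," only groups thousands.
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
    try:
        return float(Decimal(cleaned))
    except InvalidOperation as exc:
        raise ValueError(f"Unrecognized amount: {raw!r}") from exc


print(normalize_date("15/03/2024"))
print(parse_amount("$1,250.50"))
```

Raising an error on unrecognized input, rather than guessing, keeps malformed values from reaching downstream systems silently.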

LLM Used
  • GPT_4O_MINI

    Note: To learn more about the LLM and to modify its behavior, refer to the Configuring LLM settings section.

Sub-Agents

1. Document Extractor

  • Role: Raw data extractor

  • Scope: Convert unstructured text from PDFs into preliminary structured JSON.

  • Description: This agent is the starting point. It uses the GET tool to fetch the raw content of a document (e.g., an invoice). It identifies key entities (like Supplier Name, Invoice Number, Date, line items, and Total Amount) and formats them into a preliminary JSON, keeping all fields as basic strings. This raw JSON is then passed to the formatter.

  • LLM Used: Default (Inherits from parent agent).

  • Tools Used: Request - Get
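
As an illustration of what "preliminary JSON with all fields as strings" might look like, here is a small Python sketch. The field names (such as `item_1_description`) and the helper `build_preliminary_record` are assumptions made for the example; the actual keys depend on how the agent is prompted.

```python
import json


def build_preliminary_record(entities: dict) -> dict:
    """Coerce every extracted value to a trimmed string, keeping the structure flat."""
    return {
        key: "" if value is None else str(value).strip()
        for key, value in entities.items()
    }


# Hypothetical entities as the extractor might identify them in an invoice.
entities = {
    "supplier_name": "Acme Supplies Ltd ",
    "invoice_number": "INV-1001",
    "date": "15/03/2024",
    "item_1_description": "Printer paper",
    "item_1_quantity": 10,
    "item_1_unit_price": "4.50",
    "total_amount": "$45.00",
}

preliminary = build_preliminary_record(entities)
print(json.dumps(preliminary, indent=2))
```

Keeping everything as strings at this stage leaves all type decisions to the Structured Data Formatter.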

2. Structured Data Formatter

  • Role: Data structurer

  • Scope: Ensure JSON output is clean, typed, and ready for analytics.

  • Description: This agent receives the raw, string-based JSON from the extractor. It performs the crucial data cleaning and formatting tasks: it validates dates, converts amounts to numbers, and organizes all individual line items into a properly nested "items" array. The final, clean, and structured JSON is the output.

  • LLM Used: Default (Inherits from parent agent).

  • Tools Used: None
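
The nesting step described above could be sketched in Python as follows. This assumes the extractor emits flat keys of the form `item_<n>_<field>`, which is a hypothetical naming convention chosen for the example; type conversion is left out to keep the focus on the restructuring.

```python
import re

# Hypothetical flat-key convention: item_<index>_<field>, e.g. item_2_quantity.
ITEM_KEY = re.compile(r"^item_(\d+)_(\w+)$")


def nest_line_items(flat: dict) -> dict:
    """Move flat item_<n>_<field> keys into a nested "items" array."""
    header = {}
    items = {}
    for key, value in flat.items():
        match = ITEM_KEY.match(key)
        if match:
            index, field = int(match.group(1)), match.group(2)
            items.setdefault(index, {})[field] = value
        else:
            header[key] = value
    # Order line items by their numeric index, not by key insertion order.
    header["items"] = [items[i] for i in sorted(items)]
    return header


flat_record = {
    "supplier_name": "Acme Supplies Ltd",
    "item_2_description": "Toner",
    "item_2_quantity": "1",
    "item_1_description": "Printer paper",
    "item_1_quantity": "10",
}
print(nest_line_items(flat_record))
```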

Tools Used:
  • Request - Get Tool: Fetches the raw document content (simulated as JSON text) from a remote endpoint.
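
For readers who want to picture what the GET step does, the following Python sketch fetches JSON over HTTP using only the standard library. It is not the Request - Get tool's implementation; the URL, the `raw_text` field, and the injectable `opener` parameter (used here so the example runs without network access) are assumptions.

```python
import io
import json
import urllib.request


def fetch_document(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> dict:
    """Issue a GET request and decode the JSON body of the response."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(request, timeout):
    """Stand-in for urlopen that returns a canned payload (no network call)."""
    payload = {"raw_text": "Invoice INV-1001 from Acme Supplies Ltd"}
    return io.BytesIO(json.dumps(payload).encode("utf-8"))


# Hypothetical endpoint; replace with the real document source.
document = fetch_document("https://example.com/documents/123", opener=fake_opener)
print(document["raw_text"])
```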

Note: For details on modifying the tools, refer to the Tools Library section.

Agent Workflow Behavior Summary
  1. The Document Extractor (start node) is triggered, using the Request - Get tool to fetch the unstructured data of an invoice.

  2. It identifies key entities (Supplier, Date, Total, etc.) and formats them into a raw JSON where all values are strings.

  3. This raw JSON is passed to the Structured Data Formatter.

  4. The Structured Data Formatter (end node) cleans the data, converts data types (for example, strings to numbers), validates dates, and organizes line items into a nested array.

  5. The agent outputs the final, clean, structured JSON, ready for database ingestion.
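
The hand-off order in the steps above can be sketched as a simple function chain in Python. The three stage functions below are toy stand-ins, not the agents themselves: in the real workflow the extraction and formatting are performed by LLM-backed sub-agents, and the names and the "Key: value" text layout are assumptions.

```python
from typing import Callable


def run_pipeline(
    fetch: Callable[[], str],
    extract: Callable[[str], dict],
    format_record: Callable[[dict], dict],
) -> dict:
    """Run the stages in the documented order and return the final record."""
    raw_text = fetch()                  # step 1: fetch the unstructured data
    preliminary = extract(raw_text)     # step 2: string-only preliminary JSON
    return format_record(preliminary)   # steps 3-5: hand off, clean, output


# Toy stand-ins for the tool and the two sub-agents.
def fetch() -> str:
    return "Supplier: Acme Supplies Ltd\nTotal: 45.00"


def extract(text: str) -> dict:
    return dict(line.split(": ", 1) for line in text.splitlines())


def format_record(record: dict) -> dict:
    return {"supplier": record["Supplier"], "total": float(record["Total"])}


result = run_pipeline(fetch, extract, format_record)
print(result)
```

Passing the stages in as callables keeps the ordering explicit and makes each stage easy to replace or test in isolation.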

Sample Questions:
  • Process the attached invoice and extract its line items.

  • Extract the supplier name, invoice number, and total amount from this document.