Data Clustering Agent

Agent Description:

The Data Clustering Agent automates the grouping of similar business records for easier analysis. It extracts raw data from sources such as PDFs (simulated via API) and validates essential fields. The agent then cleans and normalizes this data (for example, by formatting names and numbers). Finally, it organizes the records into meaningful clusters based on criteria such as client or region and outputs the results in structured JSON format.

Purpose and Components

Purpose: To automate extracting raw data from PDF documents, cleaning it, and organizing it into meaningful clusters (for example, by client or region) for easier analysis.
Components:
- Data Extractor: An agent to fetch and parse data from a source (simulating a PDF).
- Clustering Agent: An agent that normalizes and groups the extracted data based on defined criteria.
- API Connector (GET): A tool to retrieve the raw data.

Supported Capabilities

Fetching PDF content (simulated via JSON) using a GET tool.
Extracting relevant fields from text (e.g., Transaction ID, Client Name, Amount, Date, Region).
Validating that each row contains essential fields and flagging incomplete rows.
Returning a clean JSON array of raw records.
Normalizing text fields (trimming spaces, lowercasing names) and numeric fields (converting strings to numbers).
Defining clustering criteria, such as client-based grouping or region-based grouping.
Creating cluster objects with a cluster_id, a list of record IDs, and cluster metadata.
Returning a final clustered JSON output for downstream analytics.
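
The normalization capability listed above can be illustrated with a minimal Python sketch. It is not the agent's actual implementation: the helper name normalize_record and the field keys (client_name, amount) are assumptions made for illustration, as is the choice to strip thousands separators before converting the amount.

```python
from typing import Any


def normalize_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with trimmed text, a lowercased client name, and a numeric amount."""
    # Trim surrounding whitespace on every string field.
    cleaned = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }
    # Lowercase the client name so "Acme Corp" and "acme corp" compare equal.
    if isinstance(cleaned.get("client_name"), str):
        cleaned["client_name"] = cleaned["client_name"].lower()
    # Convert a string amount to a number (assumed format: "1,250.50").
    amount = cleaned.get("amount")
    if isinstance(amount, str):
        cleaned["amount"] = float(amount.replace(",", ""))
    return cleaned
```

A string amount that is empty or non-numeric raises ValueError in this sketch, so it assumes incomplete rows have already been flagged during validation.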

LLM Used

GPT_4O_MINI

Note: To learn more about the LLM and to modify its behavior, refer to the Configuring LLM settings section.

Sub-Agents

1. Raw Data Extractor

Role: Raw data extractor
Scope: Fetch PDF documents via API and extract line-by-line textual data for further processing.
Description: This agent is the starting point. It uses the GET tool to fetch content (simulating a PDF), extracts the relevant fields from each line (such as Transaction ID and Client Name), validates that essential fields are present, and then passes a clean JSON array of these records to the next agent.
LLM Used: Default (Inherits from parent agent).
Tools Used: Request - Get
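
The validation step described for this sub-agent might look roughly like the following Python sketch. It takes the response body as a string, so no endpoint is assumed. The function name extract_records, the snake_case field keys, and the shape of the flagged entries are all hypothetical; only the field list itself comes from the capabilities above.

```python
import json
from typing import Any

# Assumed keys for the fields named in this document:
# Transaction ID, Client Name, Amount, Date, Region.
REQUIRED_FIELDS = ("transaction_id", "client_name", "amount", "date", "region")


def extract_records(payload: str) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split the rows of a JSON array into complete records and flagged rows."""
    rows = json.loads(payload)
    valid: list[dict[str, Any]] = []
    flagged: list[dict[str, Any]] = []
    for index, row in enumerate(rows):
        # A field counts as missing when it is absent, null, or an empty string.
        missing = [name for name in REQUIRED_FIELDS if row.get(name) in (None, "")]
        if missing:
            flagged.append({"row": index, "missing": missing})
        else:
            valid.append(row)
    return valid, flagged
```

Returning the flagged rows separately, rather than dropping them, keeps incomplete data visible to whoever reviews the output.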

2. Record Clustering Agent

Role: Cluster maker
Scope: Organizes raw records into clusters based on shared attributes such as client name, region, and transaction amount.
Description: This agent receives the raw JSON data from the extractor. It first normalizes the data (e.g., trimming spaces, converting strings to numbers) and then applies clustering logic. It groups transactions by criteria such as client or region, creates cluster objects, and returns the final clustered JSON output.
LLM Used: Default (Inherits from parent agent).
Tools Used: None
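
As a rough illustration of the clustering logic described for this sub-agent, the Python sketch below groups already-normalized records by a single criterion and builds one cluster object per distinct value. The cluster_id, record ID list, and metadata structure follow the capabilities listed earlier. The function name cluster_records, the cluster_id format, and the specific metadata keys are assumptions, not the agent's documented output schema.

```python
from collections import defaultdict
from typing import Any


def cluster_records(records: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Group records by `key` and build one cluster object per distinct value."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        groups[str(record[key])].append(record)

    clusters = []
    # Sort by group value so cluster numbering is deterministic.
    ordered = sorted(groups.items(), key=lambda item: item[0])
    for number, (value, members) in enumerate(ordered, start=1):
        clusters.append({
            "cluster_id": f"{key}-{number}",  # hypothetical ID format
            "record_ids": [m["transaction_id"] for m in members],
            "metadata": {  # hypothetical metadata keys
                "criterion": key,
                "value": value,
                "record_count": len(members),
                "total_amount": sum(m["amount"] for m in members),
            },
        })
    return clusters
```

Calling cluster_records(records, "region") would correspond to region-based grouping, and cluster_records(records, "client_name") to client-based grouping.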

Tools Used:

Request - Get Tool: Fetches the raw data from a JSON endpoint that simulates PDF content.

Note: For details on modifying the tools, refer to the Tools Library section.

Agent Workflow Behavior Summary

The Raw Data Extractor (start node) is triggered and uses the Request - Get tool to fetch the dataset.
It extracts and validates the raw records, formatting them into a JSON array.
This raw JSON array is passed to the Record Clustering Agent.
The Record Clustering Agent (end node) normalizes the data (e.g., lowercases names) and then groups the records into clusters based on predefined criteria (like client or region).
It outputs the final clustered JSON, ready for analysis.

Sample Questions:

Cluster the latest transaction data by region.
Extract and segment the transaction data by client name.