Reference Datasets

This page describes how reference datasets in HCL Detect are used to populate and update profile tables. It covers dataset structure, attribute definitions, supported formats (JSON/CSV), refresh behavior, and how to configure them within the solution.

Reference datasets are configured to populate the profile tables. A reference dataset definition specifies the data model and format of the reference data, along with the destination profile and profile table. Reference data generally comes from data warehouses or operational systems, or is supplied as static files.

Below is an example of a solution containing one reference dataset definition:

$ tree etc/model/reference_datasets/
etc/model/reference_datasets/
├── towers.json
└── towers.schema.json

The reference dataset is named Towers and is used to periodically update the Tower profile.

The towers.schema.json file is structured as below:

{
    "attributes": [
        {
            "name": "cellId",
            "type": "String"
        },
        {
            "name": "city",
            "skipped": true,
            "type": "String"
        },
        {
            "name": "cityCode",
            "type": "String"
        },
        {
            "name": "date",
            "type": "String",
            "allowedToBeNull": true
        },
        {
            "name": "district",
            "type": "String"
        },
        {
            "name": "equipment",
            "type": "List(String)"
        },
        {
            "name": "lat",
            "type": "Double"
        },
        {
            "name": "lon",
            "type": "Double"
        },
        {
            "name": "siteType",
            "type": "String"
        },
        {
            "name": "towerType",
            "type": "String"
        }
    ],
    "dataFileFormat": "JSON",
    "dropTableBeforeInsertion": true,
    "keyAttribute": "cellId",
    "name": "Towers",
    "profile": "Tower",
    "refreshIntervalInMillis": 6000000,
    "refreshable": true,
    "table": "TOWER"
}

attributes defines, for each attribute, its name, type, null constraint (allowedToBeNull) and skipped flag. By default, all attributes are non-nullable. If an attribute is marked "skipped": true, it is skipped while loading the profile table. The order of the attributes in this section defines the order of the attributes in the reference dataset's data file.
dataFileFormat is the format of the data file, either JSON or CSV.
dropTableBeforeInsertion is a flag that controls whether the table is dropped before a new file is loaded.
keyAttribute is the attribute from the file that is used as the key while loading data into the profile table.
name is the name of the reference dataset.
profile is the name of the profile to be loaded.
refreshIntervalInMillis is the interval, in milliseconds, at which Detect checks whether the file has been modified (6000000 in the example above, i.e. 100 minutes).
refreshable is a flag that enables refreshing of the reference dataset.
table is the name of the table to load, chosen from the list of tables of the configured profile.
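
The attribute rules above can be sketched in a few lines of Python. This is only an illustration under assumed semantics, not Detect's loader: the function name project_record and the sample row are hypothetical, and the schema is a trimmed copy of the example.

```python
import json

# Trimmed copy of the example schema (only a few attributes kept).
SCHEMA = json.loads("""
{
    "attributes": [
        {"name": "cellId", "type": "String"},
        {"name": "city", "skipped": true, "type": "String"},
        {"name": "date", "type": "String", "allowedToBeNull": true},
        {"name": "lat", "type": "Double"}
    ],
    "keyAttribute": "cellId"
}
""")


def project_record(schema, record):
    """Return the values that would be loaded, or raise ValueError.

    Assumed semantics (not Detect's actual code): skipped attributes are
    dropped, and a null value is accepted only when allowedToBeNull is true.
    """
    loaded = {}
    for attr in schema["attributes"]:
        name = attr["name"]
        if attr.get("skipped", False):
            continue  # present in the data file, not loaded into the table
        value = record.get(name)
        if value is None and not attr.get("allowedToBeNull", False):
            raise ValueError(f"attribute {name!r} must not be null")
        loaded[name] = value
    return loaded


# Hypothetical sample row: "city" is skipped, "date" may be null.
row = {"cellId": "C-1", "city": "Pune", "date": None, "lat": 18.52}
print(project_record(SCHEMA, row))
```

With this sketch, a row whose lat is null would be rejected, because lat does not set allowedToBeNull.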

For a CSV data file, the schema has some additional parameters, as below:

{
    ..
    ..
    "dataFileFormat": "CSV",
    "dropTableBeforeInsertion": false,
    "keyAttribute": "cardType",
    "name": "Card Logo",
    "parameters": [
        {
            "name": "delimiter",
            "type": "String",
            "value": "|"
        },
        {
            "name": "hasHeader",
            "type": "Bool",
            "value": true
        }
    ],
    ..
    ..
}

delimiter is the field delimiter used in the CSV file.
hasHeader is a flag that indicates whether the file has a header row.
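
To illustrate how these two parameters shape parsing, here is a minimal Python sketch that reads pipe-delimited text with a header row. It uses the standard csv module rather than anything from Detect; the function read_rows, the attribute logo and the sample data are hypothetical.

```python
import csv
import io

# Parameters as they appear in the schema fragment.
PARAMETERS = [
    {"name": "delimiter", "type": "String", "value": "|"},
    {"name": "hasHeader", "type": "Bool", "value": True},
]


def read_rows(text, parameters, attribute_names):
    """Parse delimited text into dicts keyed by attribute name."""
    params = {p["name"]: p["value"] for p in parameters}
    reader = csv.reader(io.StringIO(text), delimiter=params["delimiter"])
    rows = list(reader)
    if params["hasHeader"]:
        rows = rows[1:]  # the first line holds column names, not data
    # Values are matched to attributes by position, in schema order.
    return [dict(zip(attribute_names, row)) for row in rows]


# Hypothetical data file content; "logo" is an invented attribute.
sample = "cardType|logo\nVISA|visa.png\nAMEX|amex.png\n"
rows = read_rows(sample, PARAMETERS, ["cardType", "logo"])
print(rows)
```

Values are matched to attributes by position, which mirrors the rule that the order of the attributes section defines the column order of the data file.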

Once the reference datasets are defined in the etc/model/reference_datasets directory, the product configuration must be changed to point to them. The reference datasets are configured in the drive.referenceDatasets section of the product configuration file, etc/drive.json:

..
..
"drive": {
    "profiles": [
        "Customer",
        "Tower"
    ],
    "referenceDatasets": [
        "Towers"
    ]
}
..
..
"masterProfile": "Customer",
..
..
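
A small sanity check can catch mismatches between etc/drive.json and the schema files before deployment. The Python sketch below is a hypothetical helper, not a Detect tool. It assumes that every name in drive.referenceDatasets should match the name field of a schema, and that the schema's profile should be listed in drive.profiles; the second rule is an assumption drawn from the example, not a documented requirement.

```python
def check_drive_config(drive, schemas):
    """Return a list of problems found; an empty list means none found.

    `schemas` maps a reference dataset name to its parsed schema document.
    """
    problems = []
    for name in drive.get("referenceDatasets", []):
        schema = schemas.get(name)
        if schema is None:
            problems.append(f"no schema found for reference dataset {name!r}")
        elif schema["profile"] not in drive.get("profiles", []):
            # Assumed rule: the target profile must be enabled in drive.profiles.
            problems.append(
                f"{name!r} targets profile {schema['profile']!r}, "
                "which is not listed in drive.profiles"
            )
    return problems


# Values taken from the examples on this page.
drive = {"profiles": ["Customer", "Tower"], "referenceDatasets": ["Towers"]}
schemas = {"Towers": {"name": "Towers", "profile": "Tower", "table": "TOWER"}}
print(check_drive_config(drive, schemas))
```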

Note:

File naming in the reference datasets directory is very important, because Detect uses it to locate a particular reference dataset configuration. The reference dataset Towers is defined in reference_datasets/towers.schema.json and, since its format is JSON, Detect expects towers.json in the same directory. If the format were CSV, Detect would expect a towers.dat file.
Detect maintains a hash of the file in the database to track changes in the data file, and only refreshes when the file has been modified.
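
The naming rule and the hash-based change check in this note can be sketched as follows. This is a Python illustration, not Detect's implementation: the function names are hypothetical, and SHA-256 is an assumed hash algorithm, since this page does not say which one Detect uses.

```python
import hashlib
from pathlib import Path

# Data file extension per dataFileFormat, as described in the note.
DATA_FILE_EXTENSION = {"JSON": ".json", "CSV": ".dat"}
SCHEMA_SUFFIX = ".schema.json"


def data_file_for(schema_path, data_file_format):
    """Map e.g. towers.schema.json to towers.json (JSON) or towers.dat (CSV)."""
    schema_path = Path(schema_path)
    if not schema_path.name.endswith(SCHEMA_SUFFIX):
        raise ValueError(f"not a schema file: {schema_path}")
    base = schema_path.name[: -len(SCHEMA_SUFFIX)]
    return schema_path.with_name(base + DATA_FILE_EXTENSION[data_file_format])


def file_digest(path):
    """Hash the file contents (SHA-256 is an assumption, see lead-in)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_refresh(path, stored_digest):
    """True when the file content differs from the digest recorded earlier."""
    return file_digest(path) != stored_digest


schema = "etc/model/reference_datasets/towers.schema.json"
print(data_file_for(schema, "JSON"))
print(data_file_for(schema, "CSV"))
```

Comparing content hashes rather than modification times means that rewriting a file with identical content does not trigger a reload.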