Reference Datasets
This page describes how reference datasets in HCL Detect are used to populate and update profile tables. It covers dataset structure, attribute definitions, supported formats (JSON/CSV), refresh behavior, and how to configure them within the solution.
Reference datasets are configured to populate the profile tables. A reference dataset definition specifies the data model and format of the reference data, along with the destination profile and profile table. Reference data generally comes from data warehouses or operational systems, or is supplied as static files.
Below is an example of a solution containing one reference dataset definition:
$ tree etc/model/reference_datasets/
etc/model/reference_datasets/
├── towers.json
└── towers.schema.json
The name of the reference dataset is Towers. It is used to periodically update the Tower profile.
The towers.schema.json file is structured as below:
{
"attributes": [
{
"name": "cellId",
"type": "String"
},
{
"name": "city",
"skipped": true,
"type": "String"
},
{
"name": "cityCode",
"type": "String"
},
{
"name": "date",
"type": "String",
"allowedToBeNull": true
},
{
"name": "district",
"type": "String"
},
{
"name": "equipment",
"type": "List(String)"
},
{
"name": "lat",
"type": "Double"
},
{
"name": "lon",
"type": "Double"
},
{
"name": "siteType",
"type": "String"
},
{
"name": "towerType",
"type": "String"
}
],
"dataFileFormat": "JSON",
"dropTableBeforeInsertion": true,
"keyAttribute": "cellId",
"name": "Towers",
"profile": "Tower",
"refreshIntervalInMillis": 6000000,
"refreshable": true,
"table": "TOWER"
}
The fields in the schema file are:
- attributes defines the name, type, null constraint, and skipped flag of each attribute. By default, all attributes are non-nullable; set "allowedToBeNull": true to permit null values. If an attribute is marked as "skipped": true, that attribute is skipped while loading the profile table. The order of attributes in the attributes section defines the order of the attributes in the reference dataset's data file.
- dataFileFormat is the format of the data file, either JSON or CSV.
- dropTableBeforeInsertion is a flag to drop the table before loading a new file.
- keyAttribute is one of the attributes from the file, used as the key while loading data into the profile table.
- name is the name of the reference dataset.
- profile is the name of the profile to be loaded.
- refreshIntervalInMillis is how often, in milliseconds, Detect checks whether the file has been modified. The value 6000000 in the example corresponds to 100 minutes.
- refreshable is a flag that enables refreshing of the reference dataset.
- table is the name of the destination table, taken from the list of tables in the configured profile.
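As an illustration of how the skipped and allowedToBeNull flags partition the attributes, here is a minimal Python sketch. It is not part of Detect: the helper name summarize_schema is hypothetical, the schema is a trimmed copy of the example above, and the key check is an assumption based on the description of keyAttribute.

```python
import json

# Trimmed copy of the towers.schema.json example (only a few attributes kept).
SCHEMA_TEXT = """
{
  "attributes": [
    {"name": "cellId", "type": "String"},
    {"name": "city", "skipped": true, "type": "String"},
    {"name": "date", "type": "String", "allowedToBeNull": true},
    {"name": "lat", "type": "Double"}
  ],
  "dataFileFormat": "JSON",
  "keyAttribute": "cellId",
  "name": "Towers",
  "profile": "Tower",
  "table": "TOWER"
}
"""


def summarize_schema(schema):
    """Hypothetical helper: group attribute names by the documented flags."""
    attrs = schema["attributes"]
    names = [a["name"] for a in attrs]
    # Assumption: the key must be one of the declared attributes.
    if schema["keyAttribute"] not in names:
        raise ValueError("keyAttribute must be one of the declared attributes")
    return {
        # Order in the schema is the order expected in the data file.
        "file_order": names,
        # Attributes without "skipped": true are loaded into the profile table.
        "loaded": [a["name"] for a in attrs if not a.get("skipped", False)],
        # Attributes are non-nullable unless "allowedToBeNull": true is set.
        "nullable": [a["name"] for a in attrs if a.get("allowedToBeNull", False)],
    }


summary = summarize_schema(json.loads(SCHEMA_TEXT))
print(summary["loaded"])
print(summary["nullable"])
```

A check like this can be run on a schema file before deploying it, to confirm which columns will actually reach the profile table.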
For a CSV data file, the schema has some additional parameters, as below:
{ ..
..
"dataFileFormat": "CSV",
"dropTableBeforeInsertion": false,
"keyAttribute": "cardType",
"name": "Card Logo",
"parameters": [
{
"name": "delimiter",
"type": "String",
"value": "|"
},
{
"name": "hasHeader",
"type": "Bool",
"value": true
}
],
..
..
}
- delimiter is the delimiter used in the CSV file.
- hasHeader is a flag that indicates whether the file has a header row.
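To make the effect of these two parameters concrete, the following Python sketch reads a pipe-delimited file with a header row using the standard csv module. It only illustrates the file layout and is not how Detect itself parses the file. The sample rows and the second attribute name, logoFile, are made up for the example; only cardType appears in the schema fragment above.

```python
import csv
import io


def read_csv_rows(text, attribute_names, delimiter, has_header):
    """Hypothetical helper: map each row to the schema's attribute order."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if has_header:
        # The first line is a header, not data, so drop it.
        rows = rows[1:]
    # Columns are matched by position, following the attribute order in the schema.
    return [dict(zip(attribute_names, row)) for row in rows]


# Made-up sample data for illustration only.
SAMPLE = "cardType|logoFile\nVISA|visa.png\nAMEX|amex.png\n"

records = read_csv_rows(SAMPLE, ["cardType", "logoFile"], delimiter="|", has_header=True)
print(records)
```

With hasHeader set to false, the first line would be treated as data, so a file that does contain a header would produce one spurious record.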
Once the reference datasets are defined in the etc/model/reference_datasets directory,
the product's configuration must be changed to point to them. Reference
datasets are configured in the drive.referenceDatasets section of the
product configuration file etc/drive.json:
..
..
"drive": {
"profiles": [
"Customer",
"Tower"
],
"referenceDatasets": [
"Towers"
]
}
..
..
"masterProfile": "Customer",
..
..
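A quick consistency check between the configuration and the schema files can catch typos before deployment. The Python sketch below is an assumption-laden illustration, not a Detect tool: the function name check_reference_datasets is hypothetical, and the rule that a dataset's profile should also be listed under profiles is inferred from the example above rather than stated by the product.

```python
def check_reference_datasets(drive_section, schemas):
    """Hypothetical check: return a list of problems, empty if consistent."""
    by_name = {schema["name"]: schema for schema in schemas}
    profiles = drive_section.get("profiles", [])
    problems = []
    for dataset in drive_section.get("referenceDatasets", []):
        schema = by_name.get(dataset)
        if schema is None:
            problems.append(f"no schema found with name {dataset!r}")
        elif schema["profile"] not in profiles:
            # Assumption: the destination profile should be configured as well.
            problems.append(
                f"dataset {dataset!r} targets profile {schema['profile']!r}, "
                "which is not listed in profiles"
            )
    return problems


# The "drive" section from the example, plus the relevant schema fields.
drive = {"profiles": ["Customer", "Tower"], "referenceDatasets": ["Towers"]}
schemas = [{"name": "Towers", "profile": "Tower"}]

print(check_reference_datasets(drive, schemas))
```

The entries in referenceDatasets are compared against the name field of each schema, so the comparison is case-sensitive.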
- File naming in the reference datasets directory is very important, as Detect uses it to locate and access a particular reference dataset configuration. The reference dataset Towers is kept under reference_datasets/towers.schema.json and, since it is of type JSON, Detect expects towers.json in the same directory. If the type were CSV, Detect would expect a towers.dat file.
- Detect maintains a hash of the file in the database in order to track changes in the data file, and only refreshes if the file has been modified.
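The two notes above can be sketched in Python as follows. This is a simplified model rather than Detect's implementation: the function names are hypothetical, and SHA-256 is an assumed hash algorithm, since the documentation does not say which one Detect uses.

```python
import hashlib
from pathlib import Path

# Data file extension expected for each dataFileFormat, per the note above.
DATA_FILE_EXTENSIONS = {"JSON": ".json", "CSV": ".dat"}
SCHEMA_SUFFIX = ".schema.json"


def expected_data_file(schema_path, data_file_format):
    """Hypothetical helper: derive the data file path from the schema path."""
    schema_path = Path(schema_path)
    if not schema_path.name.endswith(SCHEMA_SUFFIX):
        raise ValueError(f"not a schema file: {schema_path.name}")
    stem = schema_path.name[: -len(SCHEMA_SUFFIX)]
    return schema_path.with_name(stem + DATA_FILE_EXTENSIONS[data_file_format])


def needs_refresh(data, stored_hash):
    """Return (changed, current_hash) for the given file contents."""
    # Assumption: SHA-256 stands in for whatever hash Detect stores.
    current_hash = hashlib.sha256(data).hexdigest()
    return current_hash != stored_hash, current_hash


schema = "etc/model/reference_datasets/towers.schema.json"
print(expected_data_file(schema, "JSON").name)
print(expected_data_file(schema, "CSV").name)

changed, first_hash = needs_refresh(b'{"cellId": "A1"}', None)
unchanged, _ = needs_refresh(b'{"cellId": "A1"}', first_hash)
print(changed, unchanged)
```

Comparing content hashes rather than timestamps means that rewriting the file with identical contents does not trigger a reload.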