Reference Datasets
Reference datasets are configured to populate the profile tables. A reference dataset definition specifies the data model and format of the reference data, along with the destination profile and profile table. Reference data generally comes from data warehouses or operational systems, or is supplied as static files.
Below is an example of a solution containing one reference dataset definition:
$ tree etc/model/reference_datasets/
etc/model/reference_datasets/
├── towers.json
└── towers.schema.json
The name of the reference dataset is Towers. It is used to periodically update the profile Tower.
The towers.schema.json file is structured as below:
{
  "attributes": [
    {
      "name": "cellId",
      "type": "String"
    },
    {
      "name": "city",
      "skipped": true,
      "type": "String"
    },
    {
      "name": "cityCode",
      "type": "String"
    },
    {
      "name": "date",
      "type": "String",
      "allowedToBeNull": true
    },
    {
      "name": "district",
      "type": "String"
    },
    {
      "name": "equipment",
      "type": "List(String)"
    },
    {
      "name": "lat",
      "type": "Double"
    },
    {
      "name": "lon",
      "type": "Double"
    },
    {
      "name": "siteType",
      "type": "String"
    },
    {
      "name": "towerType",
      "type": "String"
    }
  ],
  "dataFileFormat": "JSON",
  "dropTableBeforeInsertion": true,
  "keyAttribute": "cellId",
  "name": "Towers",
  "profile": "Tower",
  "refreshIntervalInMillis": 6000000,
  "refreshable": true,
  "table": "TOWER"
}
- attributes defines the name, type, null constraint and skipped flag of each attribute. By default all attributes are non-nullable; an attribute marked as "allowedToBeNull": true may be null. If an attribute is marked as "skipped": true, it is skipped while loading the profile table. The order of attributes in the attributes section defines the order of the attributes in the reference dataset's data file.
- dataFileFormat is the format of the data file, either JSON or CSV.
- dropTableBeforeInsertion is a flag to drop the table before loading a new file.
- keyAttribute is the attribute from the file to be used as the key while loading data into the profile table.
- name is the name of the reference dataset.
- profile is the name of the profile to be loaded.
- refreshIntervalInMillis is the interval, in milliseconds, at which the file is checked for modification.
- refreshable is a flag that enables refreshing of the reference dataset.
- table is the name of the table to be loaded, chosen from the list of tables of the configured profile.
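The attribute rules above (skipped attributes are not loaded, and attributes are non-nullable unless "allowedToBeNull" is set) can be illustrated with a minimal Python sketch. This is not Detect's loader: the function row_for_table and the sample record are hypothetical and exist only to show how the flags interact.

```python
import json

# Trimmed copy of the "attributes" section from towers.schema.json.
SCHEMA = json.loads("""
{
  "attributes": [
    {"name": "cellId", "type": "String"},
    {"name": "city", "type": "String", "skipped": true},
    {"name": "date", "type": "String", "allowedToBeNull": true},
    {"name": "lat", "type": "Double"}
  ],
  "keyAttribute": "cellId"
}
""")


def row_for_table(record, schema):
    """Hypothetical helper: return the values that would be loaded.

    Raises ValueError when a non-nullable, non-skipped attribute is missing.
    """
    row = {}
    for attr in schema["attributes"]:
        name = attr["name"]
        if attr.get("skipped", False):
            continue  # present in the data file, but not loaded into the table
        value = record.get(name)
        if value is None and not attr.get("allowedToBeNull", False):
            raise ValueError(f"attribute {name!r} must not be null")
        row[name] = value
    return row


# Made-up record, for illustration only.
record = {"cellId": "C-1", "city": "Pune", "date": None, "lat": 18.52}
print(row_for_table(record, SCHEMA))
```

Here city is dropped because it is skipped, date may be null, and a record without cellId would be rejected.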
The schema for a CSV data file has some additional parameters, as below:
{
  ..
  ..
  "dataFileFormat": "CSV",
  "dropTableBeforeInsertion": false,
  "keyAttribute": "cardType",
  "name": "Card Logo",
  "parameters": [
    {
      "name": "delimiter",
      "type": "String",
      "value": "|"
    },
    {
      "name": "hasHeader",
      "type": "Bool",
      "value": true
    }
  ],
  ..
  ..
}
- delimiter is the delimiter used in the CSV file.
- hasHeader is a flag indicating whether the file has a header row.
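To make the two CSV parameters concrete, the following Python sketch reads a pipe-delimited sample with a header row using the standard csv module. It is an illustration under assumptions, not Detect's reader: read_rows is a hypothetical helper, the attribute logo and the sample rows are invented, and the fallback values (comma delimiter, no header) are guesses rather than documented defaults.

```python
import csv
import io

# Made-up sample standing in for a pipe-delimited data file with a header row.
SAMPLE = "cardType|logo\nVISA|visa.png\nAMEX|amex.png\n"

# Mirrors the "parameters" section of the CSV schema shown above.
PARAMETERS = [
    {"name": "delimiter", "type": "String", "value": "|"},
    {"name": "hasHeader", "type": "Bool", "value": True},
]

# Attribute order as it would appear in the "attributes" section.
# "logo" is an invented attribute name.
ATTRIBUTE_NAMES = ["cardType", "logo"]


def read_rows(text, attribute_names, parameters):
    """Hypothetical reader: map each CSV line to a dict keyed by attribute name."""
    params = {p["name"]: p["value"] for p in parameters}
    # Assumed fallbacks: "," and False are not documented defaults.
    reader = csv.reader(io.StringIO(text), delimiter=params.get("delimiter", ","))
    rows = list(reader)
    if params.get("hasHeader", False):
        rows = rows[1:]  # drop the header line
    return [dict(zip(attribute_names, row)) for row in rows]


rows = read_rows(SAMPLE, ATTRIBUTE_NAMES, PARAMETERS)
print(rows)
```

Because values are matched to attributes by position, the order of the attributes section has to agree with the column order of the file.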
Once the reference datasets are defined in the etc/model/reference_datasets directory, the product's configuration must be changed to point to them. Reference datasets are configured in the drive.referenceDatasets section of the product configuration file etc/drive.json:
..
..
"drive": {
  "profiles": [
    "Customer",
    "Tower"
  ],
  "referenceDatasets": [
    "Towers"
  ]
}
..
..
"masterProfile": "Customer",
..
..
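As a rough illustration of how these names relate, the Python sketch below checks that every entry in referenceDatasets has a schema with a matching name. The helper check_reference_datasets is hypothetical, and the second check (that the schema's profile also appears under profiles) is an assumption about what a consistent configuration looks like, not a documented Detect rule.

```python
def check_reference_datasets(drive, schemas):
    """Hypothetical check: return a list of problems, empty if the names line up."""
    by_name = {s["name"]: s for s in schemas}
    problems = []
    for name in drive.get("referenceDatasets", []):
        schema = by_name.get(name)
        if schema is None:
            problems.append(f"no schema found for reference dataset {name!r}")
        # Assumption: the target profile should be listed under "profiles".
        elif schema["profile"] not in drive.get("profiles", []):
            problems.append(
                f"{name!r} targets unlisted profile {schema['profile']!r}"
            )
    return problems


# The "drive" section from the example above, and the relevant schema fields.
drive = {"profiles": ["Customer", "Tower"], "referenceDatasets": ["Towers"]}
schemas = [{"name": "Towers", "profile": "Tower", "table": "TOWER"}]

print(check_reference_datasets(drive, schemas))
```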
- File naming in the reference datasets directory is very important, as Detect uses it to locate and access a particular reference dataset configuration. The schema of the reference dataset Towers is kept in reference_datasets/towers.schema.json and, since its format is JSON, Detect expects towers.json in the same directory. If the format were CSV, Detect would expect a towers.dat file.
- Detect maintains a hash of the file in the database in order to track changes in the data file, and only refreshes if the file is modified.
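The naming convention and the hash-based change check can be sketched in Python as follows. This only mirrors the behaviour described above: data_file_for, file_hash and needs_refresh are hypothetical names, and SHA-256 is an assumption, since the hash algorithm Detect uses is not stated here.

```python
import hashlib
from pathlib import Path

# Data file extension per dataFileFormat, as described above.
EXTENSIONS = {"JSON": ".json", "CSV": ".dat"}
SCHEMA_SUFFIX = ".schema.json"


def data_file_for(schema_path, data_file_format):
    """Derive the expected data file path from the schema file path."""
    schema_path = Path(schema_path)
    if not schema_path.name.endswith(SCHEMA_SUFFIX):
        raise ValueError(f"not a schema file: {schema_path.name}")
    stem = schema_path.name[: -len(SCHEMA_SUFFIX)]
    return schema_path.with_name(stem + EXTENSIONS[data_file_format])


def file_hash(path):
    """Hash the file contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_refresh(path, stored_hash):
    """Reload only when the current hash differs from the stored one."""
    return file_hash(path) != stored_hash


schema = "etc/model/reference_datasets/towers.schema.json"
print(data_file_for(schema, "JSON"))
print(data_file_for(schema, "CSV"))
```

With a stored hash, a periodic check every refreshIntervalInMillis would call needs_refresh and reload the table only when it returns True.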