Monitoring, Logging, and Typical Failure Modes
This section covers daily monitoring responsibilities, log access, and first-response actions for typical pipeline failure modes.
What the Service Implementation Team Is Expected to Monitor
The Service Implementation Team is expected to regularly monitor:
- Job Schedules and Status
- Airflow (or equivalent) DAG status for:
- Marketing Data Mart → LDZ ingestion
- LDZ → RDV Data Vault jobs
- RDV/Mart → Canonical/UDS Interface Layer jobs
- Interface → Customer 360 / Campaign 360 / Flowchart 360 jobs
- ML Model ingestion and Customer 360 enrichment pipeline jobs (cdm_ingest_db schema).
- Oracle / SQL Server / Snowflake setup DAGs (airflow_variable_sync, ddl_execution_dag_multidb, etl_date_control_update_dag): monitor run status on all new deployments.
- System Health Metrics
- Job duration vs historical baseline.
- Job failure/retry counts.
- Key Technical Indicators
- Source file arrival in LDZ (file presence and size).
- Major row counts / batch counts where defined in runbooks.
- Error logs for failed tasks.
- Flowchart 360 table refresh timestamps — confirm GENERATE_FLOWCHART_360_DATA completed successfully.
- ML Model write-back: confirm cdm_ingest_db records present and Customer 360 enrichment timestamp updated.
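The "job duration vs historical baseline" check listed above can be sketched as a simple threshold test. This is a minimal illustration only: the function name, the three-sigma threshold, and the minimum-history rule are assumptions, not part of the documented runbooks, and the durations are made-up sample values.

```python
from statistics import mean, stdev


def is_duration_anomalous(history_s, current_s, sigmas=3.0, min_runs=5):
    """Return True when current_s exceeds mean + sigmas * stdev of history_s.

    history_s: durations (seconds) of earlier successful runs (assumed input).
    sigmas and min_runs are illustrative defaults, not documented thresholds.
    """
    if len(history_s) < min_runs:
        # Too little history to form a baseline; do not flag.
        return False
    baseline = mean(history_s)
    spread = stdev(history_s)
    return current_s > baseline + sigmas * spread


# Made-up sample durations in seconds for one job.
history = [610, 595, 640, 620, 605, 630]
print(is_duration_anomalous(history, 625))   # False
print(is_duration_anomalous(history, 1800))  # True
```

A flagged run is a prompt to look at the task logs, not a failure in itself; the actual thresholds should come from the runbook where one is defined.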
Where Service Implementation Team Checks Logs
Depending on implementation, the Service Implementation Team should know how to access:
- Orchestrator logs (Airflow, Control-M, etc.) for task execution logs, error stacks, and return codes.
- ETL/ELT engine logs (DBT logs, SQL logs, ETL tool logs) for SQL execution errors, connection timeouts and resource issues.
- Database logs where relevant (the DBA typically assists). The Service Implementation Team should still know which schemas are involved (LDZ, RDV, Metadata, 360, cdm_ingest_db) and how to retrieve basic row counts via pre-defined SQL scripts.
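Retrieving basic row counts from pre-defined queries could look roughly like the sketch below. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; the table names, query labels, and the `collect_row_counts` helper are hypothetical, and the real queries and connection details come from the runbook.

```python
import sqlite3

# Hypothetical pre-defined count queries; the real ones live in the runbook.
ROW_COUNT_QUERIES = {
    "ldz_customer": "SELECT COUNT(*) FROM ldz_customer",
    "rdv_hub_customer": "SELECT COUNT(*) FROM rdv_hub_customer",
}


def collect_row_counts(conn, queries):
    """Run each read-only count query and return {label: row_count}."""
    counts = {}
    for label, sql in queries.items():
        cur = conn.execute(sql)
        counts[label] = cur.fetchone()[0]
    return counts


def demo_connection():
    """Build a throwaway SQLite database standing in for the real schemas."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ldz_customer (customer_id TEXT)")
    conn.execute("CREATE TABLE rdv_hub_customer (customer_id TEXT)")
    conn.executemany(
        "INSERT INTO ldz_customer VALUES (?)", [("C1",), ("C2",), ("C3",)]
    )
    conn.executemany(
        "INSERT INTO rdv_hub_customer VALUES (?)", [("C1",), ("C2",)]
    )
    return conn


conn = demo_connection()
print(collect_row_counts(conn, ROW_COUNT_QUERIES))
# {'ldz_customer': 3, 'rdv_hub_customer': 2}
```

Keeping the queries in one read-only dictionary means the same counts are captured the same way each time, which makes them easier to attach as evidence when escalating.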
Typical Failure Modes and First Response
Below are typical failure patterns and what the Service Implementation Team does first:
- Source File Missing or Late (Mart → LDZ)
Symptom: LDZ ingestion job fails due to missing file or zero bytes.
Service Implementation Team Action: Confirm file presence and size in the inbound location, and check with the client's upstream team whether file delivery is delayed. Once the file is delivered, rerun the ingestion job as per the runbook.
Escalate to: L3/Services team if the file structure has changed (extra/missing columns) or the content is repeatedly invalid.
- Schema Mismatch (LDZ or RDV)
Symptom: "Column not found", "Too many columns", or type mismatch.
Service Implementation Team Action: Capture the exact error message and job log snippet, and verify whether a recent schema change was deployed upstream.
Escalate immediately to: Services team.
- Key Constraint / Uniqueness Issues in RDV
Symptom: Duplicate business keys, violated Data Vault constraints.
Service Implementation Team Action: Confirm which batch/date is affected, and log the counts and sample records using pre-defined diagnostic queries.
Escalate to: Services team.
- Interface Layer Load Failures (Mart/RDV → Canonical/UDS Interface)
Symptom: Transformation failure, invalid data for mandatory interface fields.
Service Implementation Team Action: Capture failed step, error messages, affected entity, and execute basic row-count queries to see if partial load occurred.
Escalate to: Services team.
- 360 Layer Materialization Failures (Customer 360 / Campaign 360 / Flowchart 360)
Symptom: 360 tables or views not refreshed, reporting showing stale data.
Service Implementation Team Action: Check upstream job completion (Interface Layer jobs), and capture logs for the failing 360 build step. For Flowchart 360 failures, also confirm that flowchart source data is present in the BDV layer.
Escalate to: Services team.
- ML Model Ingestion / Customer 360 Enrichment Failure
Symptom: cdm_ingest_db tables not populated with ML predictions, or Customer 360 enrichment timestamp has not updated with latest NBC/STO values.
Service Implementation Team Action: Check the ML model ingestion DAG run status in Airflow. Capture the failed task log and confirm whether the issue is in the ingestion step or the enrichment step. Verify row counts in cdm_ingest_db using pre-defined diagnostic queries.
Escalate to: Services team.
- Oracle / SQL Server / Snowflake Setup DAG Failures
Symptom: ddl_execution_dag, airflow_variable_sync, or etl_date_control_update_dag fails during initial environment setup or rebuild.
Service Implementation Team Action: Capture the failing task name and full error log from the Airflow UI. Confirm the execution order was followed (airflow_variable_sync first, then ddl_execution_dag, then etl_date_control_update_dag). Verify Oracle connectivity and DBA credentials are valid.
Escalate to: Services team.
The Service Implementation Team should never attempt to "patch" data or change mappings; they only execute documented procedures and gather evidence for the right owners.
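As one example of evidence gathering, the execution-order check for the setup DAGs (airflow_variable_sync, then ddl_execution_dag, then etl_date_control_update_dag) can be sketched as follows. This is a read-only illustration: the `check_setup_order` helper is hypothetical, the start times are assumed to be copied from the Airflow UI by hand, and the DAG ids in `EXPECTED_ORDER` should be adjusted to the ids actually deployed.

```python
from datetime import datetime

# Order taken from the failure-mode description; adjust ids to the deployment.
EXPECTED_ORDER = [
    "airflow_variable_sync",
    "ddl_execution_dag",
    "etl_date_control_update_dag",
]


def check_setup_order(start_times):
    """Return a list of problems found in the setup DAG run order.

    start_times: {dag_id: datetime of the latest run start} (assumed input,
    e.g. copied from the Airflow UI). An empty list means the order looks right.
    """
    problems = [
        f"{dag_id}: no run found"
        for dag_id in EXPECTED_ORDER
        if dag_id not in start_times
    ]
    present = [d for d in EXPECTED_ORDER if d in start_times]
    # Compare each DAG with the one expected to run immediately after it.
    for earlier, later in zip(present, present[1:]):
        if start_times[earlier] >= start_times[later]:
            problems.append(f"{later} did not start after {earlier}")
    return problems


# Made-up timestamps where the DDL DAG ran before the variable sync.
runs = {
    "airflow_variable_sync": datetime(2024, 1, 10, 9, 30),
    "ddl_execution_dag": datetime(2024, 1, 10, 9, 0),
    "etl_date_control_update_dag": datetime(2024, 1, 10, 10, 0),
}
print(check_setup_order(runs))
# ['ddl_execution_dag did not start after airflow_variable_sync']
```

The output is meant to be pasted into the escalation ticket alongside the failing task log; it does not rerun or modify anything.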
- Grandmaster DAG Monitoring
The cdm_grandmaster_dag serves as the primary orchestration entry point for end-to-end CDM execution.
The Service Implementation Team should monitor:
- Grandmaster DAG execution status
- Master DAG dependency completion status
- Failed dependency chains between Customer, Campaign, Aggregate Layer, 360, and ML processing stages
- DAG execution duration against historical baseline
Failure Mode:
If cdm_grandmaster_dag completes with downstream dependency failures, capture the failed Master DAG and task logs and escalate to the Services Team.
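The triage of a Grandmaster run described above can be sketched as a small summary step. The stage names, the `summarize_grandmaster_run` helper, and the idea of collecting states into a dictionary are assumptions for illustration; the state strings are modelled on Airflow's ("success", "failed", "upstream_failed"), and in practice the states would be read from the orchestrator UI.

```python
def summarize_grandmaster_run(stage_states):
    """Summarize Master DAG states under one Grandmaster run.

    stage_states: {stage_name: state string} (assumed input).
    Returns the failed stages, the stages that neither succeeded nor failed
    (for example blocked by an upstream failure), and whether to escalate.
    """
    failed = [s for s, state in stage_states.items() if state == "failed"]
    incomplete = [
        s for s, state in stage_states.items()
        if state not in ("success", "failed")
    ]
    return {
        "failed": failed,
        "incomplete": incomplete,
        "escalate": bool(failed),
    }


# Made-up run where the aggregate stage failed and blocked later stages.
states = {
    "customer": "success",
    "campaign": "success",
    "aggregate_layer": "failed",
    "360": "upstream_failed",
    "ml": "upstream_failed",
}
print(summarize_grandmaster_run(states))
# {'failed': ['aggregate_layer'], 'incomplete': ['360', 'ml'], 'escalate': True}
```

Listing the blocked stages separately from the failed one helps the Services Team see where the dependency chain actually broke, so that the logs captured are those of the first failing Master DAG.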
RACI Matrix
RACI Key: R = Responsible | A = Accountable | C = Consulted | I = Informed | A/R = Accountable & Responsible
| Activity | Impl. Services/Eng. | Tech BA | ETL Dev | Data Architect | DBA | Unica+ Marketer | Service Implementation Team Ops | MaxAI |
|---|---|---|---|---|---|---|---|---|
| Define Canonical Entities & Semantics | A | A | C | R | I | C | I | C |
| Define Source→Interface Mapping (Marketing DM) | A | A | C | C | I | C | I | I |
| Configure Metadata Tables for Entities/Mappings | C | C | C | A | I | C | I | I |
| Develop ETL Pipelines (LDZ→RDV, Mart→Interface) | C | C | A | C | I | C | I | I |
| Code Generation Setup & Maintenance | I | I | C | A | I | C | I | I |
| Daily Job Monitoring & Basic Checks | I | I | I | I | I | C | A/R | I |
| Execute Runbooks (Restart/Rerun/Backfill) | I | I | C | I | I | C | A/R | I |
| Investigate Job Failures (First-Level) | I | I | C | I | C | I | A/R | I |
| Deep-Dive Failure Analysis (Logic/Design) | A/R | A/R | A/R | C | C | I | C | I |
| DB Performance & Capacity Management | I | I | C | I | A/R | I | C | I |
| Release & Deployment Coordination | C | C | C | C | C | A/R | I | C |
| ML Model Integration (STO/NBC) | I | I | C | A | C | I | I | A/R |
| MaxAI Insight Consumption (No Model Changes) | I | I | I | C | I | I | I | A/R |
| Service Implementation Team Runbook Authoring | C | C | C | C | C | A/R | I | I |
| Service Implementation Team Training & Enablement | C | C | C | C | C | A/R | A/R | C |
| Audience Resolution Configuration & Validation | C | A/R | A | I | C | I | I | I |
| Aggregate Layer Configuration & Validation | C | A/R | A | C | C | I | I | C |
| Pipeline Orchestration Framework Configuration (Grandmaster / Master / Leaf DAGs) | I | A/R | A | C | I | C | C | I |