Monitoring, Logging, and Typical Failure Modes

What L2 Is Expected to Monitor

L2 is expected to regularly monitor:

  • Job Schedules and Status
    • Airflow (or equivalent) DAG status for:
      • Marketing Data Mart → LDZ ingestion
      • LDZ → RDV DV jobs
      • RDV/Mart → Canonical/UDS Interface Layer jobs
      • Interface → Customer 360 / Campaign 360 jobs
  • System Health Metrics
    • Job duration vs historical baseline.
    • Job failure/retry counts.
  • Key Technical Indicators
    • Source file arrival in LDZ (file presence and size).
    • Major row counts / batch counts where defined in runbooks.
    • Error logs for failed tasks.
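The "job duration vs historical baseline" check above can be sketched as a simple deviation test. This is a minimal illustration rather than the project's actual monitoring logic: the function name, the threshold of 3.0 standard deviations, and the minimum of five prior runs are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_duration_anomalous(history_s, current_s, z_threshold=3.0, min_runs=5):
    """Return True if current_s is unusually long compared with past runs.

    history_s: durations (seconds) of previous successful runs (hypothetical input).
    z_threshold and min_runs are illustrative defaults, not project standards.
    """
    if len(history_s) < min_runs:
        return False  # too little history to define a baseline
    baseline = mean(history_s)
    spread = stdev(history_s)
    if spread == 0:
        # All past runs took exactly the same time; any longer run stands out.
        return current_s > baseline
    return (current_s - baseline) / spread > z_threshold
```

A run flagged this way is only a prompt to look at the job log; it does not by itself indicate a failure.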

Where L2 Checks Logs

Depending on implementation, L2 should know how to access:

  • Orchestrator logs (Airflow, Control-M, etc.) for task execution logs, error stack traces, and return codes.
  • ETL/ELT engine logs (dbt logs, SQL logs, ETL tool logs) for SQL execution errors, connection timeouts, and resource issues.
  • Database logs where relevant. The DBA typically assists, but L2 should know which schemas are involved (LDZ, RDV, Metadata, 360) and how to retrieve basic row counts via pre-defined SQL scripts.
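When collecting evidence from any of these logs, it helps to pull the error line together with a few lines of context. The sketch below shows one way to do that on plain log text. The marker strings and the function name are assumptions for illustration; real orchestrator and engine logs may use different wording.

```python
def extract_error_snippets(log_text, markers=("ERROR", "Traceback"), context=2):
    """Return a list of snippets, one per line that contains a marker.

    Each snippet includes `context` lines before and after the matching line.
    Snippets may overlap when error lines are close together.
    """
    lines = log_text.splitlines()
    snippets = []
    for i, line in enumerate(lines):
        if any(marker in line for marker in markers):
            start = max(i - context, 0)
            end = min(i + context + 1, len(lines))
            snippets.append("\n".join(lines[start:end]))
    return snippets
```

The snippets can be pasted directly into the incident ticket alongside the job name and run date.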

Typical Failure Modes and First Response

Below are typical failure patterns for the Marketing Data Mart-based pipeline, and what L2 does first:

  1. Source File Missing or Late (Mart → LDZ)

    Symptom: LDZ ingestion job fails because the source file is missing or has zero bytes.

    L2 Action: Confirm file presence and size in the inbound location, and check with the client's upstream team whether file delivery is delayed.

    Once the file is delivered, rerun the ingestion job as per the runbook.

    Escalate if: File structure changed (extra/missing columns) or content is repeatedly invalid.
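The file presence and size check for this failure mode can be sketched as follows. The status labels, the function name, and the `min_bytes` parameter are hypothetical; the real inbound location and any minimum size come from the runbook.

```python
from pathlib import Path


def check_inbound_file(path, min_bytes=1):
    """Classify an expected inbound file as MISSING, ZERO_BYTES, BELOW_MIN_SIZE, or OK."""
    candidate = Path(path)
    if not candidate.is_file():
        return "MISSING"
    size = candidate.stat().st_size
    if size == 0:
        return "ZERO_BYTES"
    if size < min_bytes:
        return "BELOW_MIN_SIZE"
    return "OK"
```

Any result other than OK is a reason to contact the upstream team before rerunning the ingestion job.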

  2. Schema Mismatch (LDZ or RDV)

    Symptom: "Column not found", "Too many columns", or type mismatch.

    L2 Action: Capture exact error message and job log snippet, and verify if a recent schema change was deployed upstream.

    Escalate immediately to: Tech BA + ETL Developer + Metadata Architect (mapping may require update).
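To make the escalation evidence concrete, L2 can state exactly which columns are missing or unexpected. The sketch below compares a delimited file's header row with an expected column list. It assumes the source is a delimited text file with a header row; the expected column list would come from the documented mapping, and the function names are illustrative.

```python
import csv
import io


def read_header(file_obj, delimiter=","):
    """Return the first row of a delimited file as a list of column names."""
    return next(csv.reader(file_obj, delimiter=delimiter), [])


def diff_columns(expected, actual):
    """Report columns that are missing from, or extra in, the actual header."""
    missing = [col for col in expected if col not in actual]
    extra = [col for col in actual if col not in expected]
    return {"missing": missing, "extra": extra}


# Example with an in-memory file; the column names are made up.
sample = io.StringIO("customer_id,campaign_id,channel\n1,10,email\n")
report = diff_columns(["customer_id", "campaign_id", "send_date"], read_header(sample))
```

This only gathers evidence; correcting the mapping remains with the owners named above.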

  3. Key Constraint / Uniqueness Issues in RDV

    Symptom: Duplicate business keys, violated DV constraints.

    L2 Action: Confirm which batch/date is affected, and log the counts and sample records using pre-defined diagnostic queries.

    Escalate to: ETL Developer (for technical logic) and potentially Tech BA (if mapping is wrong or business assumptions changed).
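A pre-defined diagnostic query for duplicate business keys might look like the sketch below. The table name `hub_customer` and the columns `customer_bk` and `load_date` are hypothetical placeholders, and SQLite is used here only so the example is self-contained; the actual queries and database are those documented in the runbook.

```python
import sqlite3

# Hypothetical diagnostic query: business keys that occur more than once
# in a given load date.
DUPLICATE_KEY_SQL = """
SELECT customer_bk, COUNT(*) AS occurrences
FROM hub_customer
WHERE load_date = ?
GROUP BY customer_bk
HAVING COUNT(*) > 1
ORDER BY occurrences DESC, customer_bk
"""


def find_duplicate_keys(conn, load_date):
    """Return (business_key, occurrences) rows for keys duplicated in load_date."""
    return conn.execute(DUPLICATE_KEY_SQL, (load_date,)).fetchall()
```

The query is read-only, which is consistent with L2 gathering counts and samples without modifying data.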

  4. Interface Layer Load Failures (Mart/RDV → Canonical/UDS Interface)

    Symptom: Transformation failure, invalid data for mandatory interface fields.

    L2 Action: Capture failed step, error messages, affected entity, and execute basic row-count queries to see if partial load occurred.

    Escalate to: ETL Developer first; Tech BA if it’s clearly mapping or rule-related.
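The basic row-count check for a partial load can be sketched as below. It assumes that source and target tables both carry a `batch_id` column and that a complete load has at least as many target rows as source rows. Both are assumptions for illustration, since real transformations may legitimately filter or aggregate rows. SQLite stands in for the actual database.

```python
import sqlite3


def count_rows(conn, table, batch_id):
    """Count rows for one batch; the table name must be a simple identifier."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    query = f"SELECT COUNT(*) FROM {table} WHERE batch_id = ?"
    return conn.execute(query, (batch_id,)).fetchone()[0]


def classify_load(source_count, target_count):
    """Label a load as EMPTY_SOURCE, NOT_LOADED, PARTIAL, or COMPLETE."""
    if source_count == 0:
        return "EMPTY_SOURCE"
    if target_count == 0:
        return "NOT_LOADED"
    if target_count < source_count:
        return "PARTIAL"
    return "COMPLETE"
```

Recording both counts and the resulting label in the ticket gives the ETL Developer a starting point without L2 touching the data.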

  5. 360 Layer Materialization Failures (Customer 360 / Campaign 360)

    Symptom: 360 tables or views not refreshed, reporting showing stale data.

    L2 Action: Check upstream job completion (Interface Layer jobs), and capture logs for the failing 360 build step.

    Escalate to: ETL Developer (for transformation logic) and CDP / MAX AI team if downstream integration is affected.
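The order of checks for this failure mode (upstream completion first, then the 360 build itself) can be sketched as a small triage function. The job-state label "success", the 24-hour freshness window, and all names below are assumptions for illustration; actual state names depend on the orchestrator in use.

```python
from datetime import datetime, timedelta


def triage_360_refresh(upstream_states, last_refreshed, now, max_age=timedelta(hours=24)):
    """Return (status, incomplete_upstream_jobs) for a 360 table or view.

    upstream_states: mapping of Interface Layer job name to its latest state.
    last_refreshed / now: datetimes in the same timezone convention.
    """
    incomplete = sorted(job for job, state in upstream_states.items() if state != "success")
    if incomplete:
        # The 360 layer cannot be current if its inputs did not finish.
        return ("UPSTREAM_INCOMPLETE", incomplete)
    if now - last_refreshed > max_age:
        # Inputs are fine, so the problem is in the 360 build step itself.
        return ("STALE_360_BUILD", [])
    return ("FRESH", [])
```

An UPSTREAM_INCOMPLETE result points L2 back to the Interface Layer failure mode, while STALE_360_BUILD means the logs of the 360 build step are the evidence to capture.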

L2 should never attempt to "patch" data or change mappings; they only execute documented procedures and gather evidence for the right owners.

Escalation Paths for L2

L2 must know who to call for what type of issue. At a minimum:

  • Tech BA: When the issue relates to mapping logic, business rules, or the meaning of fields.

    Examples: derived columns no longer make sense, new campaign types not handled, horizontal/vertical slicing assumptions broken.

  • ETL Developer: When pipelines fail due to transformation logic, orchestration issues, or code defects.

    Examples: failing DAG task, broken SQL, performance regression.

  • Metadata Architect / Canonical Architect: When metadata tables seem inconsistent or code generation is impacted.

    Examples: new entity not appearing in generated code, metadata referential errors.

  • DBA: When failures are clearly infrastructure-related.

    Examples: space issues, locking, resource limits, DB performance incidents.

  • Project / Implementation Manager: For repeated incidents, timeline impact, or changes that require coordination across client teams.

Escalation contacts and SLAs should be documented in a separate Support & Contact Matrix, referenced from this document.
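The routing above can also be kept as a simple lookup so that ticket templates or alert handlers name the right owner consistently. The category keys below are hypothetical labels invented for this sketch; the owners are taken from the list in this section, and actual contacts belong in the Support & Contact Matrix.

```python
# Hypothetical issue categories mapped to the owner roles listed in this section.
ESCALATION_OWNERS = {
    "mapping_or_business_rule": ["Tech BA"],
    "pipeline_or_code_defect": ["ETL Developer"],
    "metadata_inconsistency": ["Metadata Architect / Canonical Architect"],
    "infrastructure": ["DBA"],
    "repeated_incident_or_timeline": ["Project / Implementation Manager"],
}


def owners_for(category):
    """Return the owner roles for a category, or raise if it is not defined."""
    try:
        return ESCALATION_OWNERS[category]
    except KeyError:
        raise ValueError(
            f"no escalation owner defined for {category!r}; "
            "consult the Support & Contact Matrix"
        ) from None
```

Raising on an unknown category, rather than guessing a default owner, keeps unclassified issues visible.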

RACI Matrix

                                                 L3/Engineering
Activity                                         Tech BA  ETL Dev  Data Architect  DBA  Implementation  L2 Ops  Unica+ Marketer CDP  MAX AI
Define Canonical Entities & Semantics            A        C        R               I    C               I       C                    C
Define Source→Interface Mapping (Marketing DM)   A        C        C               I    C               I       C                    I
Configure Metadata Tables for Entities/Mappings  C        C        A               I    C               I       I                    I
Develop ETL Pipelines (LDZ→RDV, Mart→Interface)  C        A        C               I    C               I       I                    I
Code Generation Setup & Maintenance              I        C        A               I    C               I       I                    I
Daily Job Monitoring & Basic Checks              I        I        I               I    C               A/R     I                    I
Execute Runbooks (Restart/Rerun/Backfill)        I        C        I               I    C               A/R     I                    I
Investigate Job Failures (First-Level)           I        C        I               C    I               A/R     I                    I
Deep-Dive Failure Analysis (Logic/Design)        A/R      A/R      C               C    I               C       I                    I
DB Performance & Capacity Management             I        C        I               A/R  I               C       I                    I
Release & Deployment Coordination                C        C        C               C    A/R             I       C                    C
Customer 360 Integration with CDP                C        C        C               I    C               I       A/R                  C
Campaign 360 Integration with CDP                C        C        C               I    C               I       A/R                  C
MAX AI Insight Consumption (No Model Changes)    I        I        C               I    I               I       C                    A/R
L2 Runbook Authoring                             C        C        C               C    A/R             I       I                    I
L2 Training & Enablement                         C        C        C               C    A/R             A/R     C                    C