Troubleshooting

Failed DAG Reruns
Best practice: Use the Airflow UI to retrigger only the failed DAG rather than restarting the Grand Master DAG. Navigate to the DAG, open the failed run, and use Clear on the failed task to rerun it. For master DAGs with failed child DAGs, retrigger the specific child DAG directly.
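Where the UI is not convenient, the same Clear action can be scripted. The sketch below assumes the Airflow 2.x stable REST API is enabled on your deployment and builds a request for its clearTaskInstances endpoint. The base URL, the authorization header, and the DAG, run, and task identifiers are placeholders rather than values from this pipeline, and the field names should be checked against the API reference for your Airflow version.

```python
import json
import urllib.request


def build_clear_request(base_url, dag_id, dag_run_id, task_ids):
    """Return (url, body) for clearing failed task instances of one DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/clearTaskInstances"
    body = {
        "dry_run": False,          # set to True to preview what would be cleared
        "dag_run_id": dag_run_id,  # limit the clear to a single run
        "task_ids": list(task_ids),
        "only_failed": True,       # leave successful task instances untouched
    }
    return url, body


def send_clear_request(url, body, auth_header):
    """POST the request. auth_header depends on your deployment (assumption)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    url, body = build_clear_request(
        "http://localhost:8080", "child_dag_example", "manual__2024-01-01", ["load_step"]
    )
    print(url)
    print(json.dumps(body, indent=2))
```

Running with dry_run set to True first is a cautious way to confirm that only the intended task instances would be cleared.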
DAG Failure Recovery
For bespoke ETL pipelines, if a DAG fails due to invalid or incomplete source data:
  1. Identify and correct the data issue in the source system.
  2. Validate that the source data is available and accurate.
  3. Re-trigger the failed DAG from Airflow.
  4. Continue execution of the remaining dependent DAGs after successful completion.
Note: Bespoke pipelines need no additional recovery procedure. Once the source data issue is resolved, the failed DAG can be restarted safely and pipeline execution continues normally.
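Step 2 can be supported by a small pre-flight check run before re-triggering. The sketch below is one possible shape, assuming the corrected source data can be read as a list of dictionaries. The field names in the example are hypothetical and do not come from any pipeline described here.

```python
def find_invalid_rows(rows, required_fields):
    """Return (row_index, missing_fields) for every row lacking a required value."""
    problems = []
    for index, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            problems.append((index, missing))
    return problems


def ready_to_retrigger(rows, required_fields):
    """Source data is usable only if it is non-empty and every row is complete."""
    return bool(rows) and not find_invalid_rows(rows, required_fields)


if __name__ == "__main__":
    # Hypothetical sample rows; the second one is missing customer_id.
    sample = [
        {"customer_id": 1, "order_date": "2024-01-01"},
        {"customer_id": None, "order_date": "2024-01-02"},
    ]
    print(find_invalid_rows(sample, ["customer_id", "order_date"]))
    print(ready_to_retrigger(sample, ["customer_id", "order_date"]))
```

A check like this only covers availability and completeness. Accuracy rules are specific to each source system and would need to be added per pipeline.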
ETL Date Control Not Updating
Symptom: Delta loads process too much or too little data.
Fix: Correct the etl_date_control_update_config variable in the configs/etl_date_control.json file, re-sync the variables, and then re-run etl_date_control_update_dag.
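Before re-syncing, it can help to confirm that the edited file still parses and still contains the variable. The sketch below assumes etl_date_control_update_config is a top-level key in configs/etl_date_control.json. That layout is an assumption, and the check does not validate the contents of the value itself.

```python
import json


def check_date_control_config(text, key="etl_date_control_update_config"):
    """Parse JSON text and return a list of problems (an empty list means OK)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    if key not in data:  # assumes the variable is a top-level key
        return [f"missing key: {key}"]
    if data[key] in (None, "", [], {}):
        return [f"empty value for: {key}"]
    return []


def check_config_file(path="configs/etl_date_control.json"):
    """Read the config file from disk and run the same checks."""
    with open(path, encoding="utf-8") as handle:
        return check_date_control_config(handle.read())
```

Calling check_config_file() from the repository root returns an empty list when the file passes these basic checks.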
Grand Master DAG — Diagnostic Steps
  • Identify the failed task in the Airflow UI Grid or Graph view, and note the task ID corresponding to the child DAG that was triggered.
  • Check the task logs of the TriggerDagRunOperator task. The log indicates which child DAG run was triggered and whether the child DAG itself failed or timed out.
  • Navigate to the child DAG in the Airflow UI and identify the failing task within it using the Grid view.
  • For dbt failures, open the task log and look for the structured log path (/mnt/airflow-data/logs/dbt/). Read the dbt run log file at that path for model-level error details and SQL exceptions.
  • Verify Airflow Variables are populated correctly. Navigate to Admin > Variables and confirm all required variables (CDM_DATABASE, image tags, host/port/credentials) are set to non-empty, valid values.
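The variable check in the last step can also be scripted. The sketch below takes a lookup callable so that it runs without Airflow installed. Inside an Airflow 2.x environment the callable could wrap Variable.get with default_var=None. Only CDM_DATABASE is named in this guide, so the remaining required variable names must be supplied by the caller.

```python
def find_unset_variables(required_names, get_value):
    """Return the names whose value is missing or blank.

    get_value is any callable mapping a name to its value or None. Inside
    Airflow 2.x this could be: lambda n: Variable.get(n, default_var=None)
    """
    unset = []
    for name in required_names:
        value = get_value(name)
        if value is None or not str(value).strip():
            unset.append(name)
    return unset


if __name__ == "__main__":
    # Stand-in store for illustration; a whitespace-only value counts as unset.
    fake_store = {"CDM_DATABASE": "  "}
    print(find_unset_variables(["CDM_DATABASE"], fake_store.get))
```

This confirms only that values are non-empty. Whether a host, port, or image tag is actually valid still has to be checked against the target environment.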
DAG Tasks Stuck in Queued State
Symptom: Tasks remain in queued state and don't start.
Cause                              | Fix
KubernetesExecutor not configured  | Verify AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
Insufficient cluster resources     | Scale up worker nodes or reduce the PARALLEL_TASK_RUNS variable
Airflow scheduler pod is down      | Check kubectl get pods -n <namespace>
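Two of these checks can be scripted as a rough first pass. The sketch below reads AIRFLOW__CORE__EXECUTOR from an environment mapping and inspects the JSON output of kubectl get pods. Identifying the scheduler by the substring "scheduler" in the pod name is an assumption about how your deployment names its pods, and the executor check is only meaningful when run in the scheduler's own environment.

```python
import json
import os
import subprocess


def executor_is_kubernetes(env=os.environ):
    """True when AIRFLOW__CORE__EXECUTOR is set to KubernetesExecutor."""
    return env.get("AIRFLOW__CORE__EXECUTOR") == "KubernetesExecutor"


def scheduler_pod_phases(pods_json):
    """Map scheduler pod names to their phase, from `kubectl get pods -o json`."""
    items = json.loads(pods_json).get("items", [])
    return {
        pod["metadata"]["name"]: pod.get("status", {}).get("phase")
        for pod in items
        if "scheduler" in pod["metadata"]["name"]  # naming assumption
    }


def scheduler_is_up(pods_json):
    """True only if at least one scheduler pod exists and all are Running."""
    phases = scheduler_pod_phases(pods_json)
    return bool(phases) and all(phase == "Running" for phase in phases.values())


def fetch_pods_json(namespace):
    """Run kubectl for the given namespace and return its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A pod in the Running phase can still have containers that are not ready, so a True result from scheduler_is_up is a starting point rather than proof that the scheduler is healthy.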