Troubleshooting

Failed DAG Reruns

Best practice: Use the Airflow UI to re-run only the failed DAG rather than restarting the Grand Master DAG. Navigate to the DAG, open the failed run, and use Clear on the failed task to re-run it. If a master DAG failed because one of its child DAGs failed, retrigger that child DAG directly.
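The same Clear action can also be scripted. The following is a minimal sketch, assuming the deployment runs Airflow 2.x with the stable REST API enabled at `/api/v1`; the base URL, DAG ID, run ID, and task ID are hypothetical placeholders, and the payload field names should be checked against the API reference for your Airflow version. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_clear_failed_request(base_url, dag_id, dag_run_id, task_id):
    """Build (but do not send) a request that clears one failed task instance."""
    payload = {
        "dry_run": False,      # actually clear, rather than only listing matches
        "only_failed": True,   # leave successful task instances untouched
        "task_ids": [task_id],
        "dag_run_id": dag_run_id,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/clearTaskInstances",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_clear_failed_request(
    "http://localhost:8080",
    "child_dag_example",
    "manual__2024-01-01T00:00:00+00:00",
    "load_task",
)
print(req.get_method(), req.full_url)
```

To send the request, add whatever authentication header your deployment requires and pass `req` to `urllib.request.urlopen`.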

DAG Failure Recovery

For bespoke ETL pipelines, if a DAG fails due to invalid or incomplete source data:

1. Identify and correct the data issue in the source system.
2. Validate that the source data is available and accurate.
3. Re-trigger the failed DAG from Airflow.
4. Continue execution of the remaining dependent DAGs after successful completion.
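The last two steps amount to resuming from the failed DAG and then running whatever comes after it. A minimal sketch of that bookkeeping follows, assuming the dependent DAGs run in a simple linear order; the DAG names are hypothetical and the real dependency structure may differ.

```python
def dags_to_run_after_failure(pipeline_order, failed_dag):
    """Return the failed DAG followed by every DAG that runs after it."""
    if failed_dag not in pipeline_order:
        raise ValueError(f"{failed_dag!r} is not part of this pipeline")
    start = pipeline_order.index(failed_dag)
    return pipeline_order[start:]


# Hypothetical DAG names, for illustration only.
order = ["extract_dag", "transform_dag", "load_dag"]
print(dags_to_run_after_failure(order, "transform_dag"))
```

DAGs that completed before the failure are not re-run, which matches the guidance to restart only the failed DAG and then continue.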

Note: Bespoke pipelines require no additional recovery procedure. Once the source data issue is resolved, the failed DAG can be restarted safely and pipeline execution continues normally.

ETL Date Control Not Updating

Symptom: Delta loads process too much or too little data.

Fix: Correct the etl_date_control_update_config variable in configs/etl_date_control.json, re-sync the variables, and then re-run etl_date_control_update_dag.
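Before re-syncing, it can help to confirm that the edited file still parses and that the variable is present and non-empty. The sketch below assumes `etl_date_control_update_config` is a top-level key in the JSON file; that layout and the sample content are assumptions, not the documented schema.

```python
import json


def validate_date_control_config(raw_json, key="etl_date_control_update_config"):
    """Parse the config text and return the value stored under `key`."""
    data = json.loads(raw_json)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict) or key not in data:
        raise KeyError(f"{key} not found at the top level")
    value = data[key]
    if value is None or value == "" or value == {} or value == []:
        raise ValueError(f"{key} is empty")
    return value


# Hypothetical file content, for illustration only.
sample = '{"etl_date_control_update_config": {"example_table": "2024-01-01"}}'
print(validate_date_control_config(sample))
```

In practice the text would come from the file itself, for example `Path("configs/etl_date_control.json").read_text()`.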

Grand Master DAG — Diagnostic Steps

1. Identify the failed task in the Airflow UI Grid or Graph view, and note the task ID corresponding to the child DAG that was triggered.
2. Check task logs for the TriggerDagRunOperator. The log will indicate which child DAG run was triggered and whether the child DAG itself failed or timed out.
3. Navigate to the child DAG in the Airflow UI and identify the failing task within it using the Grid view.
4. For dbt failures, open the task log and look for the structured log path (/mnt/airflow-data/logs/dbt/). Read the dbt run log file at that path for model-level error details and SQL exceptions.
5. Verify Airflow Variables are populated correctly. Navigate to Admin > Variables and confirm all required variables (CDM_DATABASE, image tags, host/port/credentials) are set to non-empty, valid values.
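The variable check in the final step can be automated against an export of the variables. This is a minimal sketch: only `CDM_DATABASE` is named in this guide, so the other required names below are hypothetical stand-ins for the image tag and connection variables your deployment actually uses.

```python
def find_unset_variables(variables, required):
    """Return the required variable names that are missing or blank."""
    unset = []
    for name in required:
        value = variables.get(name)
        if value is None or str(value).strip() == "":
            unset.append(name)
    return unset


# Hypothetical export content and variable names, for illustration only.
exported = {
    "CDM_DATABASE": "cdm",
    "EXAMPLE_IMAGE_TAG": "",
    "EXAMPLE_DB_HOST": "db.internal",
}
required = ["CDM_DATABASE", "EXAMPLE_IMAGE_TAG", "EXAMPLE_DB_HOST", "EXAMPLE_DB_PORT"]
print(find_unset_variables(exported, required))
```

The `exported` dictionary could be loaded with `json.load` from the file written by `airflow variables export <file>`. This only catches missing or blank values; whether a value is valid still needs a manual check.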

DAG Tasks Stuck in Queued State

Symptom: Tasks remain in queued state and don't start.


| Cause | Fix |
| --- | --- |
| `KubernetesExecutor` not configured | Verify that `AIRFLOW__CORE__EXECUTOR=KubernetesExecutor` is set |
| Insufficient cluster resources | Scale up worker nodes or reduce the `PARALLEL_TASK_RUNS` variable |
| Airflow scheduler pod is down | Check pod status with `kubectl get pods -n <namespace>` |
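The scheduler check can be scripted by parsing the JSON form of the same command (`kubectl get pods -n <namespace> -o json`). The sketch below assumes the scheduler pod name contains the string "scheduler", which is common but not guaranteed; the sample pod names are hypothetical.

```python
import json


def pods_not_running(kubectl_json, name_contains="scheduler"):
    """Return (name, phase) for matching pods whose phase is not Running."""
    pods = json.loads(kubectl_json).get("items", [])
    problems = []
    for pod in pods:
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        if name_contains in name and phase != "Running":
            problems.append((name, phase))
    return problems


# Hypothetical kubectl output, trimmed to the fields used above.
sample_output = json.dumps({
    "items": [
        {"metadata": {"name": "airflow-scheduler-0"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "airflow-webserver-0"}, "status": {"phase": "Running"}},
    ]
})
print(pods_not_running(sample_output))
```

A pod can report the Running phase while its containers are crash-looping, so an empty result does not by itself prove the scheduler is healthy; the pod's restart count and logs are still worth checking.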