Troubleshooting
- Failed DAG Reruns
- Best practice: Use the Airflow UI to retrigger only the failed DAG rather than restarting the Grand Master DAG. Navigate to the DAG, click the failed run, and use Clear on the failed task to rerun it. For master DAGs that have child failures, retrigger the specific child DAG directly.
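
The Clear action can also be scripted. The following is a minimal sketch, assuming an Airflow 2.x deployment with the stable REST API (`/api/v1`) enabled and a version that accepts `dag_run_id` on the `clearTaskInstances` endpoint. The base URL, DAG ID, and run ID are hypothetical placeholders, and authentication is left out. The function only builds the request; it does not send it.

```python
import json
from urllib import request


def build_clear_request(base_url, dag_id, dag_run_id, dry_run=True):
    """Build (but do not send) a request that clears failed tasks in one DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/clearTaskInstances"
    payload = {
        "dry_run": dry_run,        # True only reports what would be cleared
        "only_failed": True,       # leave successful tasks untouched
        "dag_run_id": dag_run_id,  # restrict the clear to this run
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_clear_request(
    "http://localhost:8080", "child_dag_example", "manual__2024-01-01"
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Keeping `dry_run=True` as the default means a first call only lists the affected task instances; switch it to `False` once the list looks right, and add the credentials your deployment requires before sending the request with `urllib.request.urlopen`.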
- DAG Failure Recovery
- For bespoke ETL pipelines, if a DAG fails due to invalid or incomplete source data:
- Identify and correct the data issue in the source system.
- Validate that the source data is available and accurate.
- Re-trigger the failed DAG from Airflow.
- Continue execution of the remaining dependent DAGs after successful completion.
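
The validation step above could be sketched as a small pre-flight check that runs before the DAG is re-triggered. This is an illustrative sketch only: the row format (a list of dictionaries) and the field names `order_id` and `amount` are assumptions, not part of the pipeline described here.

```python
def find_invalid_rows(rows, required_fields):
    """Return (row_index, missing_fields) for rows lacking a required value."""
    problems = []
    for index, row in enumerate(rows):
        # A field counts as missing when it is absent, None, or an empty string.
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            problems.append((index, missing))
    return problems


def source_is_ready(rows, required_fields):
    """Source data is ready when it is non-empty and every row is complete."""
    return bool(rows) and not find_invalid_rows(rows, required_fields)


# Hypothetical sample data: the second row is incomplete.
sample = [
    {"order_id": 1, "amount": 10.5},
    {"order_id": 2, "amount": None},
]
print(find_invalid_rows(sample, ["order_id", "amount"]))
print(source_is_ready(sample, ["order_id", "amount"]))
```

Re-trigger the failed DAG only once a check of this kind passes, so the rerun does not fail on the same data issue.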
Note: For bespoke pipelines, no additional recovery procedure is required. Once the source data issue is resolved, the failed DAG can be restarted safely and the pipeline execution can continue normally.
- ETL Date Control Not Updating
- Symptom: Delta loads process too much or too little data.
- Grand Master DAG: Diagnostic Steps
- Identify the failed task in the Airflow UI Grid or Graph view, and note the task ID corresponding to the child DAG that was triggered.
- Check the task logs for the TriggerDagRunOperator. The log will indicate which child DAG run was triggered and whether the child DAG itself failed or timed out.
- Navigate to the child DAG in the Airflow UI and identify the failing task within it using the Grid view.
- For dbt failures, open the task log and look for the structured log path (/mnt/airflow-data/logs/dbt/). Read the dbt run log file at that path for model-level error details and SQL exceptions.
- Verify that Airflow Variables are populated correctly. In the Airflow UI, confirm that all required variables (CDM_DATABASE, image tags, host/port/credentials) are set to non-empty, valid values.
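
The variable check in the last step can be automated. The sketch below is a hedged example: it takes any lookup function, so it runs without Airflow installed. Apart from `CDM_DATABASE`, the variable names (`DBT_IMAGE_TAG`, `SOURCE_DB_HOST`) are hypothetical stand-ins for the image tag and host variables your deployment actually uses.

```python
def find_unset_variables(lookup, required_names):
    """Return the names whose value is missing or blank.

    `lookup` is any callable that maps a variable name to its value
    and returns None when the variable does not exist.
    """
    unset = []
    for name in required_names:
        value = lookup(name)
        if value is None or not str(value).strip():
            unset.append(name)
    return unset


# Stand-in for the Airflow Variable store; names other than
# CDM_DATABASE are hypothetical.
fake_variables = {"CDM_DATABASE": "cdm_prod", "DBT_IMAGE_TAG": "  "}
required = ["CDM_DATABASE", "DBT_IMAGE_TAG", "SOURCE_DB_HOST"]
print(find_unset_variables(fake_variables.get, required))
```

Inside an Airflow 2.x environment, the lookup could be `lambda name: Variable.get(name, default_var=None)` with `from airflow.models import Variable`. This check only catches empty values; whether a host, port, or credential is actually valid still needs a connection test.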
- DAG Tasks Stuck in Queued State
- Symptom: Tasks remain in queued state and don't start.
| Cause | Fix |
| --- | --- |
| KubernetesExecutor not configured | Verify `AIRFLOW__CORE__EXECUTOR=KubernetesExecutor` |
| Insufficient cluster resources | Scale up worker nodes or reduce the `PARALLEL_TASK_RUNS` variable |
| Airflow scheduler pod is down | Check `kubectl get pods -n <namespace>` |
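
To make the scheduler pod check repeatable, the output of `kubectl get pods -n <namespace> -o json` can be parsed for pods that are not in a healthy phase. This is a minimal sketch; the pod names in the sample are hypothetical, and the sample JSON contains only the fields the function reads.

```python
import json


def pods_not_running(kubectl_json):
    """Return (name, phase) for pods whose phase is not Running or Succeeded."""
    pods = json.loads(kubectl_json).get("items", [])
    unhealthy = []
    for pod in pods:
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            unhealthy.append((name, phase))
    return unhealthy


# Trimmed, hypothetical example of `kubectl get pods -o json` output.
sample_output = json.dumps({
    "items": [
        {"metadata": {"name": "airflow-scheduler-0"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "airflow-webserver-0"}, "status": {"phase": "Running"}},
    ]
})
print(pods_not_running(sample_output))
```

A pod in the Running phase can still have restarting containers, so treat an empty result as a first pass and follow up with `kubectl describe pod` on the scheduler if tasks remain queued.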