Task Graphs & Orchestration

This page explains how Snowflake Task Graphs are used to orchestrate ML pipelines in this repository.

What are Snowflake Task Graphs?

Snowflake Task Graphs (DAGs) are a native Snowflake orchestration mechanism that lets you define dependencies between tasks and execute them in order; a minimal sketch follows the list of key properties below.

Key properties:

  • Serverless: no external orchestrator (Airflow, Prefect, etc.) required.
  • Native: tasks run directly in Snowflake, close to the data.
  • Schedulable: tasks can run on a cron schedule or be triggered manually.
  • Observable: task history and status are queryable in Snowflake.
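
A minimal sketch of how a two-task graph is defined. The task, warehouse, and step names here are illustrative, not from this repository:

-- Root task: carries the schedule for the whole graph
CREATE OR REPLACE TASK demo_root
  WAREHOUSE = DEMO_WH
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  SELECT 'extract step';           -- stand-in for real work

-- Child task: runs only after demo_root succeeds
CREATE OR REPLACE TASK demo_child
  WAREHOUSE = DEMO_WH
  AFTER demo_root                  -- the dependency edge
AS
  SELECT 'transform step';

-- Tasks are created suspended; resume children before the root
ALTER TASK demo_child RESUME;
ALTER TASK demo_root RESUME;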

The deploy-vs-run pattern

This repository uses a consistent deploy-then-run pattern:

deploy-training-dag     →  Creates/updates the task graph definition in Snowflake
run-training-dag        →  Triggers an immediate execution of the deployed graph

This separation is important because:

  • Deploy is a code change. It updates what the pipeline does.
  • Run is an execution. It runs the pipeline as currently defined.
  • Deploy happens when an engineer runs the deploy target; run happens on a schedule or on demand (see the sketch below).
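
In SQL terms, the two targets map onto different statements. A hedged sketch, assuming the exact DDL is generated by the deploy scripts; the procedure name is hypothetical:

-- deploy-training-dag: (re)defines the graph idempotently
CREATE OR REPLACE TASK PUDO_DEV.TRAINING_DAG_ROOT
  WAREHOUSE = PUDO_TRAINING_WH
AS
  CALL generate_dataset_proc();    -- hypothetical procedure name

-- run-training-dag: one immediate execution of whatever is currently deployed
EXECUTE TASK PUDO_DEV.TRAINING_DAG_ROOT;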

Training DAG

The training DAG orchestrates the model training pipeline:

graph LR
    A[generate_dataset] --> B[train_model]
    B --> C[evaluate_model]
    C --> D[register_model]

Task               What it does
generate_dataset   Build spine, ASOF JOIN features, temporal split.
train_model        Run distributed XGBoost training via Container Services.
evaluate_model     Compute metrics on validation set.
register_model     Log model with metrics to the Model Registry.
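
Each arrow in the diagram becomes an AFTER dependency on the upstream task. A sketch of the pattern, assuming hypothetical stored procedures for the task bodies (the real definitions are created by deploy-training-dag):

CREATE OR REPLACE TASK train_model
  WAREHOUSE = PUDO_TRAINING_WH
  AFTER generate_dataset           -- the A --> B edge
AS
  CALL train_model_proc();         -- hypothetical

CREATE OR REPLACE TASK evaluate_model
  WAREHOUSE = PUDO_TRAINING_WH
  AFTER train_model                -- the B --> C edge
AS
  CALL evaluate_model_proc();      -- hypothetical

-- register_model follows evaluate_model in the same way.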

Configuration

The training DAG is configured via YAML:

# config/training/base.yaml
schedule: "0 2 * * *"   # Daily at 2 AM (for production)
compute_pool: PUDO_TRAINING_POOL
warehouse: PUDO_TRAINING_WH
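
A sketch of how these values would surface on the deployed root task; the actual translation is done by deploy-training-dag, and the task body is hypothetical (compute_pool is presumably consumed by the Container Services training step rather than by the task DDL itself):

CREATE OR REPLACE TASK PUDO_DEV.TRAINING_DAG_ROOT
  WAREHOUSE = PUDO_TRAINING_WH                 -- warehouse: from base.yaml
  SCHEDULE = 'USING CRON 0 2 * * * UTC'        -- schedule: "0 2 * * *"
AS
  CALL generate_dataset_proc();                -- hypothetical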

Inference DAG

The inference DAG orchestrates the batch prediction pipeline:

graph LR
    A[load_model] --> B[generate_features]
    B --> C[run_predictions]
    C --> D[write_results]

Task               What it does
load_model         Retrieve model version from the Registry.
generate_features  Compute inference-time features from the Feature Store.
run_predictions    Apply model to feature matrix.
write_results      Store predictions with metadata.

CLI alternative

For ad-hoc runs and debugging, the pudo-inference CLI provides the same functionality without the DAG:

pudo-inference run        # Equivalent to running the inference DAG
pudo-inference evaluate   # Post-inference evaluation
pudo-inference alerts     # Check alert conditions
pudo-inference summary    # Print summary

Scheduling

Task graphs support Snowflake's native scheduling:

-- Set a daily schedule for the training DAG
ALTER TASK PUDO_DEV.TRAINING_DAG_ROOT SET SCHEDULE = 'USING CRON 0 2 * * * UTC';

-- Suspend or resume scheduling
ALTER TASK PUDO_DEV.TRAINING_DAG_ROOT SUSPEND;
ALTER TASK PUDO_DEV.TRAINING_DAG_ROOT RESUME;

In this repository, scheduling is typically configured during deployment via the deploy-training-dag and deploy-inference-dag scripts.
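
Note that ALTER TASK ... RESUME affects only the named task; suspended child tasks stay suspended. To resume the root task together with all of its dependents in one call, Snowflake provides SYSTEM$TASK_DEPENDENTS_ENABLE:

-- Resume the root task and every child task in the graph
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('PUDO_DEV.TRAINING_DAG_ROOT');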

Monitoring

You can monitor task graph execution in Snowflake:

-- Recent task executions
SELECT *
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY())
WHERE NAME LIKE 'TRAINING%'
ORDER BY SCHEDULED_TIME DESC
LIMIT 20;

-- Current task graph state
SHOW TASKS IN SCHEMA PUDO_DEV;
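
For a graph-level view (one row per task graph run rather than per task), the INFORMATION_SCHEMA.COMPLETE_TASK_GRAPHS table function reports the outcome of recent runs:

-- Recent end-to-end runs of the training graph
SELECT *
FROM TABLE(INFORMATION_SCHEMA.COMPLETE_TASK_GRAPHS())
WHERE ROOT_TASK_NAME = 'TRAINING_DAG_ROOT'
ORDER BY SCHEDULED_TIME DESC
LIMIT 20;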

Why not an external orchestrator?

This repository uses Snowflake Task Graphs instead of external tools (Airflow, Prefect, Dagster) because:

  • Simplicity: no additional infrastructure to manage.
  • Proximity: tasks run close to the data, minimising data movement.
  • Cost: uses existing Snowflake compute resources.
  • Governance: all execution history is in Snowflake, queryable and auditable.

For teams that already use an external orchestrator or automation system, the deploy scripts are plain make targets and can be called from any wrapper.

See also