
Tutorial 2: Repo Mental Model

This tutorial explains how the repository is structured so you know where things live and why.

What you will learn

  • The hub-spoke architecture and ownership boundaries.
  • What each top-level directory contains.
  • How configuration and deployment work across components.

The hub-spoke model

The repository is organised as a hub and one or more project spokes:

gls-snowflake-workshop/
├── hub/          # Shared platform infrastructure
├── mock_data/    # Data simulation and seeding
└── projects/
    └── pudo/     # Reference project: PUDO capacity prediction

Ownership rule

Hub and shared code may be referenced by projects. Projects may never be referenced by hub or shared code.

This means:

  • The hub creates databases, schemas, roles, and warehouses that projects use.
  • Each project owns its own feature views, training pipelines, and inference pipelines.
  • Projects are independent of each other.
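
A concrete way to read this rule: project code may refer to hub-created objects by name, but nothing under hub/ ever imports from a project. The sketch below is purely illustrative; the table name and function are hypothetical, not the project's real API.

# Allowed: project code referencing a hub-owned database by name.
# The schema and table names here are hypothetical examples.
SOURCE_TABLE = "SHARED_DATA.PUDO.LOCATIONS"

def load_pudo_locations(session):
    """Read shared PUDO data that the hub provisioned."""
    return session.table(SOURCE_TABLE).to_pandas()

# Not allowed: hub or shared code importing from a project,
# e.g. `from pudo.training import ...` anywhere under hub/.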

Component breakdown

hub/: Platform infrastructure

Creates the Snowflake objects that all projects share:

  • Databases and schemas (SHARED_DATA, FEATURE_STORE_<ENV>, MODEL_REGISTRY_<ENV>).
  • Operational roles and grants.
  • Warehouses and compute pools.

Entry point: make -C hub deploy-infra
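
Under the hood this target issues DDL against the account. The sketch below only illustrates the flavour of those statements; whether each object is a database or schema, the warehouse name, and the script structure are assumptions, not the actual hub code.

from snowflake.snowpark import Session

def deploy_shared_objects(session: Session, env: str) -> None:
    # Illustrative only: the real hub script drives this from its own config.
    for ddl in (
        "CREATE DATABASE IF NOT EXISTS SHARED_DATA",
        f"CREATE DATABASE IF NOT EXISTS FEATURE_STORE_{env}",
        f"CREATE DATABASE IF NOT EXISTS MODEL_REGISTRY_{env}",
        "CREATE WAREHOUSE IF NOT EXISTS WORKSHOP_WH WITH WAREHOUSE_SIZE = 'XSMALL'",
    ):
        session.sql(ddl).collect()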

mock_data/: Data simulation

Generates realistic PUDO data and loads it into SHARED_DATA:

  • PUDO locations with geospatial attributes.
  • Parcel volumes, delivery attempts, and occupancy.
  • Temporal patterns and seasonal trends.

Entry points:

Make target         What it does
seed-shared-data    Initial bulk load of PUDO data.
add-morning-data    Simulate morning parcel arrivals.
add-evening-data    Simulate evening delivery completions.
simulation-status   Show current simulation state.
reset-simulation    Reset simulation clock.
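
A typical simulation day strings these targets together: seed once, then advance the simulated clock as the workshop progresses. For example:

make -C mock_data seed-shared-data
make -C mock_data add-morning-data
make -C mock_data simulation-status
make -C mock_data add-evening-data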

projects/pudo/: Reference project

The main MLOps project, organised into lifecycle blocks:

projects/pudo/
├── config/           # YAML configuration with environment overlays
├── scripts/          # Deployment and execution scripts
├── src/pudo/
│   ├── core/         # Shared utilities (session, config, SQL helpers)
│   ├── feature_view/ # Entity definitions and feature view implementations
│   ├── training/     # Training DAG, model training, evaluation
│   └── inference/    # Inference DAG, batch prediction, CLI tools
└── Makefile          # Operational entry points

Entry points:

Make target             What it does
deploy-schema           Create the project schema in Snowflake.
deploy-feature-store    Register entities and feature views.
deploy-training-dag     Deploy the training task graph.
run-training-dag        Execute the training task graph.
deploy-inference-dag    Deploy the inference task graph.
run-inference-dag       Execute the inference task graph.
run-inference           Run batch inference via CLI.
evaluate-predictions    Compare predictions to actuals.
inference-alerts        Check for alert conditions.
inference-summary       Print prediction summary.
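
The targets are typically run in roughly the order listed: deploy the schema and feature store first, then the task graphs, then the operational checks. A first end-to-end pass might look like:

make -C projects/pudo deploy-schema
make -C projects/pudo deploy-feature-store
make -C projects/pudo deploy-training-dag
make -C projects/pudo run-training-dag
make -C projects/pudo deploy-inference-dag
make -C projects/pudo run-inference-dag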

Configuration pattern

Each component uses YAML configuration with environment overlays:

config/
├── base.yaml              # Default values
├── dev.override.yaml      # Dev-specific overrides
├── staging.override.yaml  # Staging-specific overrides
└── prod.override.yaml     # Production-specific overrides

This is similar to Kustomize in Kubernetes: base values are merged with environment-specific overrides. The active environment is determined by the Git branch name.
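
The merge itself is plain dictionary layering: an environment file only needs to state what differs from base.yaml. Below is a minimal sketch of that idea; the project's actual loader lives under src/pudo/core and may differ in detail.

from pathlib import Path
import yaml

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override values win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_config(config_dir: Path, env: str) -> dict:
    """Load base.yaml and overlay <env>.override.yaml on top."""
    base = yaml.safe_load((config_dir / "base.yaml").read_text()) or {}
    override_file = config_dir / f"{env}.override.yaml"
    override = yaml.safe_load(override_file.read_text()) if override_file.exists() else {}
    return deep_merge(base, override or {})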

No root Makefile

There is intentionally no root Makefile or root pyproject.toml. Each component is self-contained with its own:

  • pyproject.toml: Python dependencies managed by uv.
  • Makefile: Operational targets.
  • .env: Snowflake connection credentials.
  • uv.lock: Locked dependency tree.

You always run commands against a specific component, either by changing into its directory or by passing it to make -C from the repository root:

make -C hub deploy-infra
make -C mock_data seed-shared-data
make -C projects/pudo deploy-schema

Next step

Continue to Tutorial 3: Seed Shared Data to populate the shared Snowflake schema with mock PUDO data.