Tutorial 3: Seed Shared Data
This tutorial loads the initial mock PUDO data into the shared Snowflake schema.
What you will learn
- How to seed the `SHARED_DATA` schema with realistic PUDO data.
- How to inspect the seeded data.
- How the mock data generator works.
Before you start
- Hub infrastructure is deployed (Tutorial 1).
- You understand the repo layout (Tutorial 2).
Step 1: Seed the shared data
From the repository root:
This runs `uv run python scripts/seed_shared_data.py`, which:
- Connects to Snowflake using the credentials in `mock_data/.env`.
- Creates the PUDO tables in `SHARED_DATA` if they do not exist.
- Generates a realistic PUDO network with locations, parcels, delivery attempts, and occupancy records.
- Loads the generated data into Snowflake.
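The create-if-missing, generate, and load steps above can be sketched as follows. This is a minimal illustration and not the contents of `scripts/seed_shared_data.py`: it uses an in-memory SQLite database as a stand-in for Snowflake, and the column names, location types, and coordinate ranges are all assumptions made for the example.

```python
import random
import sqlite3

# Hypothetical column layout; the real PUDO_LOCATIONS schema may differ.
CREATE_LOCATIONS = """
CREATE TABLE IF NOT EXISTS PUDO_LOCATIONS (
    location_id INTEGER PRIMARY KEY,
    latitude REAL,
    longitude REAL,
    location_type TEXT,
    capacity INTEGER
)
"""


def generate_locations(n, seed=42):
    """Return n mock PUDO locations as tuples, reproducible for a given seed."""
    rng = random.Random(seed)
    types = ["locker", "shop", "post_office"]  # illustrative values only
    return [
        (
            i,
            round(rng.uniform(47.0, 55.0), 5),  # latitude
            round(rng.uniform(6.0, 15.0), 5),   # longitude
            rng.choice(types),
            rng.randint(20, 200),               # capacity
        )
        for i in range(1, n + 1)
    ]


def seed(conn, n_locations=10):
    """Create the table if it is missing, then load the generated rows."""
    conn.execute(CREATE_LOCATIONS)
    rows = generate_locations(n_locations)
    # INSERT OR REPLACE (SQLite syntax) keeps a re-run from failing on duplicate ids.
    conn.executemany(
        "INSERT OR REPLACE INTO PUDO_LOCATIONS VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for a Snowflake connection
loaded = seed(conn)
print(f"loaded {loaded} locations")
```

Because the table is created only when absent and the generator is seeded, running this sketch twice leaves the same ten rows in place.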
Step 2: Verify the data
You can verify the seeded data by running queries in Snowflake:
USE SCHEMA SHARED_DATA.PUBLIC;
SELECT COUNT(*) AS pudo_count FROM PUDO_LOCATIONS;
SELECT COUNT(*) AS parcel_count FROM PARCELS;
SELECT COUNT(*) AS delivery_count FROM DELIVERY_ATTEMPTS;
SELECT COUNT(*) AS occupancy_count FROM OCCUPANCY;
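If you prefer to run the same checks from Python, the counts can be collected in one pass. The helper below is a sketch that works with any DB-API 2.0 cursor. `count_rows` is a hypothetical name, and the demo uses empty in-memory SQLite tables rather than a real Snowflake connection.

```python
import sqlite3

TABLES = ["PUDO_LOCATIONS", "PARCELS", "DELIVERY_ATTEMPTS", "OCCUPANCY"]


def count_rows(cursor, tables=TABLES):
    """Return {table: row_count} using any DB-API 2.0 cursor."""
    counts = {}
    for table in tables:
        # Table names come from the fixed list above, never from user input.
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cursor.fetchone()[0]
    return counts


# Demo against in-memory SQLite tables as a stand-in for Snowflake.
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
for name in TABLES:
    cur.execute(f"CREATE TABLE {name} (id INTEGER)")
cur.execute("INSERT INTO PARCELS VALUES (1)")
print(count_rows(cur))
```

A table that reports zero rows after seeding suggests the load step did not complete for that table.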
Step 3: Check simulation status
The mock data generator tracks a simulation clock. You can check the current state:
This shows the current simulation date and how many days of data have been generated.
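To make the idea of a simulation clock concrete, here is one way such state could be modelled. The class and field names are hypothetical and the real generator may store its state differently. The sketch also assumes that "days generated" means the distance between the start date and the current simulation date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SimulationClock:
    """Hypothetical clock state; field names are illustrative."""

    start_date: date
    current_date: date

    @property
    def days_generated(self) -> int:
        # Assumption: one day of data exists per elapsed simulation day.
        return (self.current_date - self.start_date).days

    def advance(self, days: int = 1) -> None:
        self.current_date += timedelta(days=days)

    def status(self) -> str:
        return (
            f"simulation date: {self.current_date.isoformat()}, "
            f"days generated: {self.days_generated}"
        )


clock = SimulationClock(date(2024, 1, 1), date(2024, 1, 1))
clock.advance(3)
print(clock.status())  # simulation date: 2024-01-04, days generated: 3
```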
What the mock data contains
| Table | Content |
|---|---|
| `PUDO_LOCATIONS` | PUDO sites with coordinates, type, capacity, and operating hours. |
| `PARCELS` | Individual parcel records with origin, destination, and timestamps. |
| `DELIVERY_ATTEMPTS` | Delivery attempts with success/failure outcomes. |
| `OCCUPANCY` | Hourly occupancy readings per PUDO location. |
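As an illustration of what "hourly occupancy readings per PUDO location" could look like, the sketch below produces 24 readings for one location and one day. The bounded random walk, the function name `hourly_occupancy`, and the tuple layout are assumptions made for the example and do not describe the actual generator.

```python
import random
from datetime import date, datetime, timedelta


def hourly_occupancy(location_id, capacity, day, seed=0):
    """Return 24 (location_id, timestamp, occupied) tuples for one day."""
    rng = random.Random(seed)
    start = datetime(day.year, day.month, day.day)
    occupied = rng.randint(0, capacity)
    readings = []
    for hour in range(24):
        # Random walk of at most 3 parcels per hour, clamped to [0, capacity].
        occupied = max(0, min(capacity, occupied + rng.randint(-3, 3)))
        readings.append((location_id, start + timedelta(hours=hour), occupied))
    return readings


rows = hourly_occupancy(location_id=1, capacity=50, day=date(2024, 1, 1))
print(len(rows), rows[0][1], rows[-1][1])
```

Clamping to the location's capacity keeps every reading consistent with the `capacity` value described for `PUDO_LOCATIONS`.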
Incremental data generation
After the initial seed, you can add data incrementally to simulate daily operations:
# Simulate morning arrivals
make -C mock_data add-morning-data
# Simulate evening completions
make -C mock_data add-evening-data
These commands advance the simulation clock and add a new day's worth of data. You will use them in Tutorial 7 to create evaluation cycles.
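The morning/evening cycle can be pictured as a small state machine. The sketch below is hypothetical: the class and method names are illustrative, and it assumes that the morning step registers new parcels while the evening step completes them and moves the clock to the next day. The real `add-morning-data` and `add-evening-data` targets may divide the work differently.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class MockSimulation:
    """Hypothetical in-memory model of one simulated day."""

    current_date: date
    pending: list = field(default_factory=list)
    completed: list = field(default_factory=list)

    def add_morning_data(self, n_parcels):
        """Register parcels arriving at the start of the current day."""
        first_id = len(self.pending) + len(self.completed)
        self.pending += [(first_id + i, self.current_date) for i in range(n_parcels)]

    def add_evening_data(self):
        """Close out the day's parcels, then move the clock to the next day."""
        self.completed += self.pending
        self.pending = []
        self.current_date += timedelta(days=1)


sim = MockSimulation(date(2024, 1, 1))
sim.add_morning_data(5)
sim.add_evening_data()
print(sim.current_date, len(sim.completed))  # 2024-01-02 5
```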
Resetting the simulation
If you need to start over:
This resets the simulation clock to the initial state. You will need to re-seed the data.
What you have now
- `SHARED_DATA` schema populated with mock PUDO data.
- Understanding of the simulation lifecycle.
Next step
Continue to Tutorial 4: Deploy Schema & Feature Store to create the project-specific schema and register feature views.