← Back to Blog
AI12 min read

Building Your First Data Pipeline: A Hands-On Tutorial for Engineers

NDN Analytics TeamApril 13, 2026

Every data engineer starts the same way: building analysis in a Jupyter notebook. It works great until you need to run it daily. Then notebooks become a liability.


This guide shows you how to move from "notebook that kind of works" to "production data pipeline that you trust."


Architecture: From Notebook to Pipeline


### The Notebook Phase (What You Probably Have)

```

Notebook (runs on your laptop)

Reads from database

Transforms data

Writes to CSV

```


Problems:

  • Only runs when you run it
  • Hard to debug when it fails (was it the data? Your code?)
  • No alerting if something breaks
  • Scaling to larger datasets requires manual optimization

  • ### The Production Pipeline (What You Need)

    ```

    Scheduled Job (Cloud Run or Cloud Functions)

    ↓ (Daily at 2 AM)

    Reads from data warehouse

    Transforms (with error handling)

    Validates output

    Writes to production database

    Monitoring + Alerting (Slack if it fails)

    ```


    This architecture handles failures, scales automatically, and lets you sleep at night.


    The Step-by-Step Guide


    ### Step 1: Choose Your Stack


    For most teams, Google Cloud is the fastest path:

  • Cloud Storage: Data lake (S3 equivalent)
  • BigQuery: Data warehouse (petabyte-scale SQL)
  • Cloud Run: Scheduled containers (no server management)
  • Cloud Logging: Centralized logs and alerts

  • Why Cloud? Because it integrates with NDN products (Demand IQ, Care Predict, Route AI all use Cloud).


    ### Step 2: Define Your Data Flow


    Before writing code, document:

  • **Input source**: Where does raw data come from? (API? Database? S3 dump?)
  • **Transformation**: What processing happens? (Cleaning? Aggregation? ML scoring?)
  • **Output**: Where does final data go? (Data warehouse? Real-time API? Email report?)
  • **Schedule**: How often? (Daily? Hourly? Real-time?)
  • **SLA**: How long can it take? (Must finish before 6 AM? Can run all day?)

  • ### Step 3: Build Locally (Docker)


    Package your code in a Docker container so it runs identically everywhere.


    **Example Dockerfile for a Python data pipeline:**


    ```dockerfile

    FROM python:3.11-slim


    WORKDIR /app

    COPY requirements.txt .

    RUN pip install -r requirements.txt


    COPY pipeline.py .


    CMD ["python", "pipeline.py"]

    ```


    **requirements.txt:**

    ```

    google-cloud-storage==2.10.0

    google-cloud-bigquery==3.13.0

    pandas==2.0.3

    ```


    **pipeline.py:**

    ```python

    from google.cloud import bigquery, storage

    import pandas as pd

    import logging


    logging.basicConfig(level=logging.INFO)

    logger = logging.getLogger(__name__)


    def run():

    logger.info("Starting data pipeline...")


    # Read from BigQuery

    client = bigquery.Client()

    query = """

    SELECT

    date,

    product_id,

    COUNT(*) as sales_count

    FROM `project.dataset.orders`

    WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

    GROUP BY date, product_id

    """

    df = client.query(query).to_dataframe()

    logger.info(f"Read {len(df)} rows from BigQuery")


    # Transform

    df['sales_count'] = df['sales_count'].fillna(0).astype(int)


    # Validate

    assert df['sales_count'].min() >= 0, "Negative sales counts!"

    logger.info(f"Validation passed: all values in valid range")


    # Write to BigQuery

    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")

    client.load_table_from_dataframe(

    df,

    "project.dataset.daily_aggregates",

    job_config=job_config

    )

    logger.info("Pipeline complete")


    if __name__ == "__main__":

    run()

    ```


    ### Step 4: Deploy to Cloud Run


    Cloud Run runs your container on a schedule without managing servers.


    **Deploy your container:**

    ```bash

    # Build and push to Container Registry

    gcloud builds submit --tag gcr.io/YOUR-PROJECT/data-pipeline


    # Deploy to Cloud Run

    gcloud run deploy data-pipeline --image gcr.io/YOUR-PROJECT/data-pipeline --platform managed --region us-central1 --no-allow-unauthenticated

    ```


    **Schedule it with Cloud Scheduler:**

    ```bash

    gcloud scheduler jobs create app-engine daily-pipeline --schedule="0 2 * * *" --http-method=POST --uri=https://us-central1-YOUR-PROJECT.cloudfunctions.net/trigger-pipeline --oidc-service-account-email=SA-EMAIL@YOUR-PROJECT.iam.gserviceaccount.com

    ```


    This runs your pipeline every day at 2 AM. If it fails, you get a notification.


    ### Step 5: Add Monitoring


    Monitor three things:

  • **Execution time**: Did the pipeline finish before SLA?
  • **Data quality**: Are output records valid?
  • **Error rate**: Did any records fail processing?

  • **Cloud Logging setup:**


    ```python

    # In your pipeline.py

    logger.info(f"Pipeline completed: {len(df)} records processed in {elapsed_time}s")


    # Create an alert in Cloud Monitoring

    # Alert if execution time > 30 minutes or error rate > 5%

    ```


    Common Pitfalls


    ### Pitfall 1: Not Handling Failures

    Your pipeline stops halfway through. Old data is left half-processed.


    **Fix:** Use transactions (data warehouse feature) so either all data updates or none. Fail loudly with clear error messages.


    ### Pitfall 2: Not Monitoring Data Quality

    Your pipeline runs successfully but outputs garbage data. Nobody notices for 2 weeks.


    **Fix:** Add validation checks (schema validation, range checks, duplicate detection) and alert if validation fails.


    ### Pitfall 3: Assuming Data Never Changes Format

    Your data source adds a new column. Your pipeline breaks.


    **Fix:** Use schema validation at the start of your pipeline. Fail fast if schema doesn't match expectations.


    ### Pitfall 4: Not Documenting Dependencies

    Your pipeline depends on a third-party API. Nobody knows.


    **Fix:** Document all dependencies (data sources, external APIs, timezone assumptions) in code comments and runbooks.


    Scaling Beyond the Basics


    Once you have a working pipeline, you can scale:


  • Add more pipelines: Build pipelines for different datasets
  • Use a DAG framework: Airflow or Dagster for complex dependencies
  • Implement incremental processing: Only process new data, not the whole dataset
  • Add real-time streaming: Switch from daily batch to continuous (Apache Beam, Kafka)

  • How NDN Products Use Data Pipelines


    Every NDN product includes enterprise data pipelines:

  • Demand IQ: Hourly pipelines ingesting POS, inventory, and weather data
  • Care Predict: Real-time pipelines consuming EHR updates
  • Route AI: Continuous pipelines aggregating traffic and delivery data
  • TraceChain: Event-driven pipelines for supply chain records

  • When you work with NDN, you're getting battle-tested pipeline patterns.


    Your Next Steps


    Start with a simple pipeline and iterate. Don't try to build a perfect system on day one.


    **Week 1:** Build locally, test thoroughly

    **Week 2:** Deploy to Cloud Run with daily schedule

    **Week 3:** Add monitoring and alerting

    **Week 4:** Document and make it someone else's responsibility


    If you need guidance building data pipelines for AI products, book a technical consultation and we'll show you the right architecture for your use case.

    Need Help Implementing AI/Blockchain Solutions?

    NDN Analytics specializes in enterprise AI and blockchain implementation. Our team can help you integrate cutting-edge technology into your existing workflows.