
# Building Your First Data Pipeline: A Hands-On Tutorial for Engineers

NDN Analytics Team, April 13, 2026

Every data engineer starts the same way: building analysis in a Jupyter notebook. It works great until you need to run it daily. Then notebooks become a liability.

This guide shows you how to move from "notebook that kind of works" to "production data pipeline that you trust."

## Architecture: From Notebook to Pipeline

### The Notebook Phase (What You Probably Have)

```
Notebook (runs on your laptop)
↓
Reads from database
↓
Transforms data
↓
Writes to CSV
```

Problems:

- Only runs when you run it
- Hard to debug when it fails (was it the data? Your code?)
- No alerting if something breaks
- Scaling to larger datasets requires manual optimization

### The Production Pipeline (What You Need)

```
Scheduled Job (Cloud Run or Cloud Functions)
↓ (Daily at 2 AM)
Reads from data warehouse
↓
Transforms (with error handling)
↓
Validates output
↓
Writes to production database
↓
Monitoring + Alerting (Slack if it fails)
```

This architecture handles failures, scales automatically, and lets you sleep at night.
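The stages in the diagram can be sketched as one small runner. This is a minimal illustration, not a prescribed design: `run_pipeline` and its five callables are hypothetical names, and `alert` stands in for whatever notification channel you use (for example, a function that posts to Slack).

```python
import logging
from typing import Callable

logger = logging.getLogger("pipeline")

Rows = list[dict]  # hypothetical row format: one dict per record


def run_pipeline(
    read: Callable[[], Rows],
    transform: Callable[[Rows], Rows],
    validate: Callable[[Rows], None],
    write: Callable[[Rows], None],
    alert: Callable[[str], None],
) -> bool:
    """Run the stages in order; on any failure, alert and return False."""
    try:
        rows = read()
        rows = transform(rows)
        validate(rows)  # raises if the output is not fit to publish
        write(rows)     # only reached when validation passed
    except Exception as exc:
        logger.exception("Pipeline failed")
        alert(f"Pipeline failed: {exc}")
        return False
    logger.info("Pipeline wrote %d rows", len(rows))
    return True
```

Because validation runs before the write, a bad batch triggers an alert without touching the production table.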

## The Step-by-Step Guide

### Step 1: Choose Your Stack

For most teams, Google Cloud is the fastest path:

- Cloud Storage: Data lake (S3 equivalent)
- BigQuery: Data warehouse (petabyte-scale SQL)
- Cloud Run: Scheduled containers (no server management)
- Cloud Logging: Centralized logs and alerts

Why Google Cloud? Because it integrates with NDN products: Demand IQ, Care Predict, and Route AI all use Google Cloud.

### Step 2: Define Your Data Flow

Before writing code, document:

**Input source**: Where does raw data come from? (API? Database? S3 dump?)

**Transformation**: What processing happens? (Cleaning? Aggregation? ML scoring?)

**Output**: Where does final data go? (Data warehouse? Real-time API? Email report?)

**Schedule**: How often? (Daily? Hourly? Real-time?)

**SLA**: How long can it take? (Must finish before 6 AM? Can run all day?)
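One way to keep these five answers next to the code is a small spec object. The sketch below is only an illustration: `PipelineSpec`, its field names, and the example values are assumptions, not part of any NDN or Google Cloud API.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class PipelineSpec:
    """One record per pipeline; each field answers one question above."""
    name: str
    input_source: str     # where raw data comes from
    transformation: str   # what processing happens
    output: str           # where final data goes
    schedule_cron: str    # how often, as a cron expression
    sla_deadline: time    # latest acceptable finish time (same day)

    def meets_sla(self, finished_at: time) -> bool:
        """True if a run that finished at `finished_at` met the deadline."""
        return finished_at <= self.sla_deadline


# Hypothetical values for a daily sales aggregation.
DAILY_SALES = PipelineSpec(
    name="daily-sales-aggregates",
    input_source="BigQuery table project.dataset.orders",
    transformation="count sales per date and product",
    output="BigQuery table project.dataset.daily_aggregates",
    schedule_cron="0 2 * * *",
    sla_deadline=time(6, 0),
)
```

A frozen dataclass makes the spec read-only at runtime, so the documented contract cannot drift silently inside the pipeline code.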

### Step 3: Build Locally (Docker)

Package your code in a Docker container so it runs identically everywhere.

**Example Dockerfile for a Python data pipeline:**

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY pipeline.py .

CMD ["python", "pipeline.py"]
```

**requirements.txt:**

```

google-cloud-storage==2.10.0

google-cloud-bigquery==3.13.0

pandas==2.0.3

```

**pipeline.py:**

```python
import logging

import pandas as pd
from google.cloud import bigquery, storage

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run():
    logger.info("Starting data pipeline...")

    # Read yesterday's orders from BigQuery. Filtering on one complete day
    # keeps daily WRITE_APPEND runs from double-counting a partial day.
    client = bigquery.Client()
    query = """
        SELECT
            date,
            product_id,
            COUNT(*) AS sales_count
        FROM `project.dataset.orders`
        WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
        GROUP BY date, product_id
    """
    df = client.query(query).to_dataframe()
    logger.info(f"Read {len(df)} rows from BigQuery")

    # Transform
    df["sales_count"] = df["sales_count"].fillna(0).astype(int)

    # Validate. Raise instead of assert: asserts are skipped under `python -O`.
    if df.empty:
        raise ValueError("No rows returned for yesterday")
    if df["sales_count"].min() < 0:
        raise ValueError("Negative sales counts!")
    logger.info("Validation passed: all values in valid range")

    # Write to BigQuery and wait for the load job so failures surface here
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    load_job = client.load_table_from_dataframe(
        df,
        "project.dataset.daily_aggregates",
        job_config=job_config,
    )
    load_job.result()
    logger.info("Pipeline complete")


if __name__ == "__main__":
    run()
```

### Step 4: Deploy to Cloud Run

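A Cloud Run job runs your container to completion without managing servers. A job suits a batch script like `pipeline.py` better than a Cloud Run service, because a service must listen for HTTP requests. Cloud Scheduler then triggers the job on a schedule.

**Deploy your container:**

```bash
# Build the image and push it to the registry
gcloud builds submit --tag gcr.io/YOUR-PROJECT/data-pipeline

# Create a Cloud Run job from the image
gcloud run jobs create data-pipeline --image gcr.io/YOUR-PROJECT/data-pipeline --region us-central1
```

**Schedule it with Cloud Scheduler:**

```bash
gcloud scheduler jobs create http daily-pipeline --location=us-central1 --schedule="0 2 * * *" --time-zone="Etc/UTC" --http-method=POST --uri="https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/YOUR-PROJECT/jobs/data-pipeline:run" --oauth-service-account-email=SA-EMAIL@YOUR-PROJECT.iam.gserviceaccount.com
```

This runs your pipeline every day at 2 AM in the scheduler's time zone (set `--time-zone` to your own). The service account needs permission to run the job. Nothing notifies you when a run fails yet; Step 5 covers that.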

### Step 5: Add Monitoring

Monitor three things:

**Execution time**: Did the pipeline finish before SLA?

**Data quality**: Are output records valid?

**Error rate**: Did any records fail processing?

**Cloud Logging setup:**

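```python
# In your pipeline.py
import time

start = time.monotonic()
# ... read, transform, validate, and write, as in run() above ...
elapsed_time = time.monotonic() - start
logger.info(f"Pipeline completed: {len(df)} records processed in {elapsed_time:.1f}s")

# Create an alert in Cloud Monitoring
# Alert if execution time > 30 minutes or error rate > 5%
```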
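The execution-time and error-rate checks can also be computed inside the pipeline, so that every run logs one line an alert can match on. The sketch below is an assumption-laden illustration: `timed_run`, `error_rate`, and `SLA_SECONDS` are hypothetical names, and the 30-minute threshold is taken from the comment above.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")

SLA_SECONDS = 30 * 60  # hypothetical threshold: 30 minutes


@contextmanager
def timed_run(stats: dict):
    """Record elapsed seconds and an SLA flag in `stats`, even on failure."""
    start = time.monotonic()
    try:
        yield stats
    finally:
        stats["elapsed_s"] = time.monotonic() - start
        stats["sla_breached"] = stats["elapsed_s"] > SLA_SECONDS
        level = logging.ERROR if stats["sla_breached"] else logging.INFO
        logger.log(level, "run finished: %s", stats)


def error_rate(processed: int, failed: int) -> float:
    """Fraction of records that failed; 0.0 when nothing was attempted."""
    total = processed + failed
    return failed / total if total else 0.0
```

A run would wrap its work in `with timed_run(stats):` and fill in its own counters. Logging a breach at ERROR level means an alert only needs to filter on severity.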

## Common Pitfalls

### Pitfall 1: Not Handling Failures

Your pipeline stops halfway through. Old data is left half-processed.

**Fix:** Use transactions (a data warehouse feature) so that either all of the data updates or none of it does. Fail loudly with clear error messages.
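The all-or-nothing behavior can be tried locally. In this sketch, Python's built-in `sqlite3` stands in for the data warehouse (an assumption made so the example runs anywhere), and `replace_day` is a hypothetical helper. The same pattern applies to any warehouse that supports transactions.

```python
import sqlite3


def replace_day(conn: sqlite3.Connection, day: str, rows: list[tuple[str, int]]) -> None:
    """Swap one day's aggregates atomically: all rows land, or none do."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("DELETE FROM daily_aggregates WHERE date = ?", (day,))
        for product_id, sales_count in rows:
            if sales_count < 0:
                raise ValueError(f"Negative sales count for {product_id}")
            conn.execute(
                "INSERT INTO daily_aggregates (date, product_id, sales_count) "
                "VALUES (?, ?, ?)",
                (day, product_id, sales_count),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_aggregates (date TEXT, product_id TEXT, sales_count INTEGER)"
)
replace_day(conn, "2026-04-12", [("A", 3), ("B", 5)])

try:
    replace_day(conn, "2026-04-12", [("A", 4), ("B", -1)])  # fails partway through
except ValueError as exc:
    print(f"Load rejected: {exc}")

# The failed run rolled back, so the earlier good rows are still in place.
print(conn.execute(
    "SELECT product_id, sales_count FROM daily_aggregates ORDER BY product_id"
).fetchall())
```

Deleting and re-inserting the day inside one transaction also makes reruns safe: running the same day twice leaves one copy of the data, not two.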

### Pitfall 2: Not Monitoring Data Quality

Your pipeline runs successfully but outputs garbage data. Nobody notices for 2 weeks.

**Fix:** Add validation checks (schema validation, range checks, duplicate detection) and alert if validation fails.
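Range checks and duplicate detection might look like the following. This is a sketch under assumptions: `quality_issues` is a hypothetical helper, the column names follow the example pipeline, and rows are plain dicts (with pandas, `df.to_dict("records")` produces this shape).

```python
from collections import Counter


def quality_issues(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the batch passed."""
    if not rows:
        return ["batch is empty"]
    issues = []

    # Range check: counts can never be negative.
    negatives = [r for r in rows if r["sales_count"] < 0]
    if negatives:
        issues.append(f"{len(negatives)} rows with negative sales_count")

    # Duplicate detection: (date, product_id) should identify one row.
    keys = Counter((r["date"], r["product_id"]) for r in rows)
    dupes = [key for key, count in keys.items() if count > 1]
    if dupes:
        issues.append(f"{len(dupes)} duplicated (date, product_id) keys")

    return issues
```

Returning every issue at once, instead of raising on the first, gives the alert message the full picture of what went wrong.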

### Pitfall 3: Assuming Data Never Changes Format

Your data source adds a new column. Your pipeline breaks.

**Fix:** Use schema validation at the start of your pipeline. Fail fast if schema doesn't match expectations.
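A fail-fast schema check can be a few lines run against the first row of each batch. The names here (`EXPECTED_SCHEMA`, `SchemaMismatch`, `check_schema`) are hypothetical, and the expected columns are assumed from the example pipeline.

```python
# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"date": str, "product_id": str, "sales_count": int}


class SchemaMismatch(Exception):
    """Raised when incoming rows do not match the expected schema."""


def check_schema(row: dict, expected: dict = EXPECTED_SCHEMA) -> None:
    """Fail fast, naming every missing, unexpected, or mistyped column."""
    missing = sorted(expected.keys() - row.keys())
    unexpected = sorted(row.keys() - expected.keys())
    mistyped = sorted(
        col for col, typ in expected.items()
        if col in row and not isinstance(row[col], typ)
    )
    if missing or unexpected or mistyped:
        raise SchemaMismatch(
            f"missing={missing} unexpected={unexpected} mistyped={mistyped}"
        )
```

This version treats a new column as an error. Some teams may prefer to log unexpected columns as a warning and fail only on missing or mistyped ones.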

### Pitfall 4: Not Documenting Dependencies

Your pipeline depends on a third-party API. Nobody knows.

**Fix:** Document all dependencies (data sources, external APIs, timezone assumptions) in code comments and runbooks.

## Scaling Beyond the Basics

Once you have a working pipeline, you can scale:

- Add more pipelines: Build pipelines for different datasets
- Use a DAG framework: Airflow or Dagster for complex dependencies
- Implement incremental processing: Only process new data, not the whole dataset
- Add real-time streaming: Switch from daily batch to continuous (Apache Beam, Kafka)
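Incremental processing is often built on a watermark: store the newest timestamp you have processed, and on the next run pick up only rows newer than it. A minimal sketch, assuming each row carries an `updated_at` timestamp (a hypothetical column) and using a hypothetical `select_new_rows` helper:

```python
from datetime import datetime


def select_new_rows(rows: list[dict], watermark: datetime) -> tuple[list[dict], datetime]:
    """Return rows newer than `watermark`, plus the watermark to store next."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    next_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, next_watermark


rows = [
    {"id": 1, "updated_at": datetime(2026, 4, 11, 9, 0)},
    {"id": 2, "updated_at": datetime(2026, 4, 12, 9, 0)},
    {"id": 3, "updated_at": datetime(2026, 4, 12, 17, 30)},
]
last_run = datetime(2026, 4, 11, 23, 59)

fresh, last_run = select_new_rows(rows, last_run)
print([r["id"] for r in fresh])  # [2, 3]
print(last_run)                  # 2026-04-12 17:30:00
```

In a real pipeline the filter would run in the warehouse query, and the watermark would be saved only after the write succeeds, so a failed run is retried from the same point.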

## How NDN Products Use Data Pipelines

Every NDN product includes enterprise data pipelines:

- Demand IQ: Hourly pipelines ingesting POS, inventory, and weather data
- Care Predict: Real-time pipelines consuming EHR updates
- Route AI: Continuous pipelines aggregating traffic and delivery data
- TraceChain: Event-driven pipelines for supply chain records

When you work with NDN, you're getting battle-tested pipeline patterns.

## Your Next Steps

Start with a simple pipeline and iterate. Don't try to build a perfect system on day one.

**Week 1:** Build locally, test thoroughly

**Week 2:** Deploy to Cloud Run with daily schedule

**Week 3:** Add monitoring and alerting

**Week 4:** Document and make it someone else's responsibility

If you need guidance building data pipelines for AI products, book a technical consultation and we'll show you the right architecture for your use case.
