# Building Your First Data Pipeline: A Hands-On Tutorial for Engineers
Every data engineer starts the same way: building an analysis in a Jupyter notebook. It works well until you need to run it every day. Then the notebook becomes a liability.
This guide shows you how to move from "notebook that kind of works" to "production data pipeline that you trust."
## Architecture: From Notebook to Pipeline
### The Notebook Phase (What You Probably Have)
```
Notebook (runs on your laptop)
↓
Reads from database
↓
Transforms data
↓
Writes to CSV
```
Problems:
### The Production Pipeline (What You Need)
```
Scheduled Job (Cloud Run or Cloud Functions)
↓ (Daily at 2 AM)
Reads from data warehouse
↓
Transforms (with error handling)
↓
Validates output
↓
Writes to production database
↓
Monitoring + Alerting (Slack if it fails)
```
This architecture handles failures, scales automatically, and lets you sleep at night.
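The flow in the diagram can be sketched as a small stage runner. This is a minimal illustration, not a prescribed design: `run_stages` and the `on_failure` hook are hypothetical names, and the hook stands in for whatever alerting you wire up (for example, a Slack message).

```python
import logging
from typing import Any, Callable, Iterable

logger = logging.getLogger("pipeline")


def run_stages(
    stages: Iterable[tuple[str, Callable[[Any], Any]]],
    on_failure: Callable[[str, Exception], None],
    initial: Any = None,
) -> Any:
    """Run stages in order, passing each stage's result to the next one.

    On the first exception, call on_failure(stage_name, error) and re-raise,
    so the scheduler still sees the run as failed.
    """
    data = initial
    for name, stage in stages:
        logger.info("Starting stage: %s", name)
        try:
            data = stage(data)
        except Exception as exc:
            logger.exception("Stage failed: %s", name)
            on_failure(name, exc)
            raise
    return data
```

Each stage is a plain function, so read, transform, validate, and write can be tested separately and then composed, for instance `run_stages([("read", read), ("transform", transform)], on_failure=notify)`.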
## The Step-by-Step Guide
### Step 1: Choose Your Stack
For most teams, Google Cloud is the fastest path:
Why Google Cloud? Because it integrates with NDN products: Demand IQ, Care Predict, and Route AI all use Google Cloud.
### Step 2: Define Your Data Flow
Before writing code, document:
### Step 3: Build Locally (Docker)
Package your code in a Docker container so it runs identically everywhere.
**Example Dockerfile for a Python data pipeline:**
```dockerfile
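# Slim base image keeps the container small
FROM python:3.11-slim
WORKDIR /app

# Install dependencies first so this layer is cached when only the code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY pipeline.py .
CMD ["python", "pipeline.py"]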
```
**requirements.txt:**
```
google-cloud-storage==2.10.0
# The [pandas] extra pulls in pyarrow and db-dtypes, needed for DataFrame reads and loads
google-cloud-bigquery[pandas]==3.13.0
pandas==2.0.3
```
**pipeline.py:**
```python
import logging

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run():
    logger.info("Starting data pipeline...")

    # Read orders since yesterday from BigQuery, aggregated per product
    client = bigquery.Client()
    query = """
        SELECT
            date,
            product_id,
            COUNT(*) AS sales_count
        FROM `project.dataset.orders`
        WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
        GROUP BY date, product_id
    """
    df = client.query(query).to_dataframe()
    logger.info("Read %d rows from BigQuery", len(df))

    # Transform: make sure counts are plain integers with no nulls
    df["sales_count"] = df["sales_count"].fillna(0).astype(int)

    # Validate: raise instead of assert, so the check survives `python -O`
    if (df["sales_count"] < 0).any():
        raise ValueError("Negative sales counts!")
    logger.info("Validation passed: all values in valid range")

    # Write to BigQuery and wait for the load job, so failures surface here
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    load_job = client.load_table_from_dataframe(
        df,
        "project.dataset.daily_aggregates",
        job_config=job_config,
    )
    load_job.result()
    logger.info("Pipeline complete")


if __name__ == "__main__":
    run()
```
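The table names in `pipeline.py` are hard-coded. One way to let the same container image run against test and production data is to read them from environment variables. The sketch below assumes hypothetical variable names (`SOURCE_TABLE`, `DEST_TABLE`, `LOOKBACK_DAYS`) and falls back to the values used above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    source_table: str
    dest_table: str
    lookback_days: int


def load_config(env=None) -> PipelineConfig:
    """Build the config from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    return PipelineConfig(
        source_table=env.get("SOURCE_TABLE", "project.dataset.orders"),
        dest_table=env.get("DEST_TABLE", "project.dataset.daily_aggregates"),
        lookback_days=int(env.get("LOOKBACK_DAYS", "1")),
    )
```

Passing a plain dict as `env` makes the function easy to test locally, for example `load_config({"LOOKBACK_DAYS": "7"})`.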
### Step 4: Deploy to Cloud Run
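Cloud Run runs your container without you managing servers, and Cloud Scheduler triggers it on a schedule. Note that a Cloud Run service must listen for HTTP requests on the port given by the `PORT` environment variable; a run-to-completion script like `pipeline.py` either needs a small HTTP wrapper or fits better as a Cloud Run job.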
**Deploy your container:**
```bash
# Build and push to Container Registry
gcloud builds submit --tag gcr.io/YOUR-PROJECT/data-pipeline
# Deploy to Cloud Run
gcloud run deploy data-pipeline \
  --image gcr.io/YOUR-PROJECT/data-pipeline \
  --platform managed \
  --region us-central1 \
  --no-allow-unauthenticated
```
**Schedule it with Cloud Scheduler:**
```bash
# HTTP target: Cloud Scheduler POSTs to the URI, authenticating with an OIDC token
gcloud scheduler jobs create http daily-pipeline \
  --schedule="0 2 * * *" \
  --http-method=POST \
  --uri=https://us-central1-YOUR-PROJECT.cloudfunctions.net/trigger-pipeline \
  --oidc-service-account-email=SA-EMAIL@YOUR-PROJECT.iam.gserviceaccount.com
```
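This triggers your pipeline every day at 2 AM, in UTC unless you pass `--time-zone`. Failure notifications are not automatic; you need an alert such as the one described in Step 5.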
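The schedule string `0 2 * * *` is a five-field cron expression. As a quick illustration of how the fields line up, here is a small helper (`split_cron` is a hypothetical name, not part of any Google Cloud library):

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict[str, str]:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"Expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


print(split_cron("0 2 * * *"))
# {'minute': '0', 'hour': '2', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```

Minute 0 of hour 2, with wildcards everywhere else, means "at 02:00 every day".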
### Step 5: Add Monitoring
Monitor three things:
**Cloud Logging setup:**
```python
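# In your pipeline.py
import time
start = time.monotonic()
# ... read, transform, validate, write ...
elapsed_time = time.monotonic() - start
logger.info(f"Pipeline completed: {len(df)} records processed in {elapsed_time:.1f}s")
# Create an alert in Cloud Monitoring
# Alert if execution time > 30 minutes or error rate > 5%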
```
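For the "Slack if it fails" part of the architecture, one option is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is an assumption about how you might wire it up: `build_alert` and `send_alert` are hypothetical helpers, and the webhook URL is something you would create in Slack and pass in yourself.

```python
import json
import urllib.request


def build_alert(pipeline: str, error: Exception) -> bytes:
    """Encode a failure message as the JSON body Slack webhooks expect."""
    text = f"{pipeline} failed: {type(error).__name__}: {error}"
    return json.dumps({"text": text}).encode("utf-8")


def send_alert(webhook_url: str, payload: bytes, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In `pipeline.py` you could wrap `run()` in a `try`/`except`, call `send_alert(url, build_alert("daily-pipeline", exc))`, and then re-raise so the run is still recorded as failed.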
## Common Pitfalls
### Pitfall 1: Not Handling Failures
Your pipeline stops halfway through. Old data is left half-processed.
**Fix:** Use transactions, where your data warehouse supports them, so that either every row is updated or none is. Fail loudly with clear error messages.
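To illustrate the all-or-nothing behavior, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for the warehouse. The table layout mirrors `daily_aggregates` from Step 3, and the sample rows are made up for the demonstration; your warehouse's transaction syntax will differ.

```python
import sqlite3


def replace_day(conn, day, rows):
    """Replace one day's aggregates atomically: all rows land, or none do."""
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("DELETE FROM daily_aggregates WHERE date = ?", (day,))
        for product_id, sales_count in rows:
            if sales_count < 0:
                raise ValueError(f"Negative sales count for {product_id}")
            conn.execute(
                "INSERT INTO daily_aggregates (date, product_id, sales_count) "
                "VALUES (?, ?, ?)",
                (day, product_id, sales_count),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_aggregates (date TEXT, product_id TEXT, sales_count INTEGER)"
)
replace_day(conn, "2024-01-01", [("A", 5), ("B", 2)])
try:
    # The second row is invalid, so the delete and the first insert are undone
    replace_day(conn, "2024-01-01", [("A", 9), ("B", -1)])
except ValueError as exc:
    print(f"Load rejected: {exc}")
print(conn.execute(
    "SELECT product_id, sales_count FROM daily_aggregates ORDER BY product_id"
).fetchall())
# [('A', 5), ('B', 2)]
```

The failed second load leaves the first load's rows untouched, which is the property you want when a pipeline dies halfway through.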
### Pitfall 2: Not Monitoring Data Quality
Your pipeline runs successfully but outputs garbage data. Nobody notices for 2 weeks.
**Fix:** Add validation checks (schema validation, range checks, duplicate detection) and alert if validation fails.
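The three kinds of check named above can be sketched in plain Python over a list of row dicts. The field names follow the Step 3 query; `validate_rows` and the choice of `(date, product_id)` as the uniqueness key are assumptions made for this example.

```python
from collections import Counter

REQUIRED_FIELDS = ("date", "product_id", "sales_count")


def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for i, row in enumerate(rows):
        # Schema check: every required field must be present
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            problems.append(f"row {i}: missing fields {missing}")
            continue
        # Range check: counts must be non-negative integers
        count = row["sales_count"]
        if not isinstance(count, int) or count < 0:
            problems.append(f"row {i}: sales_count out of range: {count!r}")
    # Duplicate check: at most one row per (date, product_id)
    keys = Counter((r.get("date"), r.get("product_id")) for r in rows)
    for key, n in keys.items():
        if n > 1:
            problems.append(f"duplicate key {key}: {n} rows")
    return problems
```

Collecting every problem before failing gives a more useful alert than stopping at the first bad row; the pipeline can then raise if the returned list is non-empty.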
### Pitfall 3: Assuming Data Never Changes Format
Your data source adds a new column. Your pipeline breaks.
**Fix:** Use schema validation at the start of your pipeline. Fail fast if schema doesn't match expectations.
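A fail-fast schema check might look like the following sketch. The expected column names come from the Step 3 query, but the Python types and the `check_schema` helper itself are assumptions for illustration.

```python
# Assumed schema: column name -> expected Python type
EXPECTED_COLUMNS = {"date": str, "product_id": str, "sales_count": int}


class SchemaError(ValueError):
    """Raised when incoming data does not match the expected schema."""


def check_schema(row: dict, expected=EXPECTED_COLUMNS, allow_extra=False) -> None:
    """Raise SchemaError if the row's columns or types differ from expected."""
    missing = sorted(set(expected) - set(row))
    extra = sorted(set(row) - set(expected))
    wrong_type = sorted(
        name for name, typ in expected.items()
        if name in row and not isinstance(row[name], typ)
    )
    errors = []
    if missing:
        errors.append(f"missing columns: {missing}")
    if extra and not allow_extra:
        errors.append(f"unexpected columns: {extra}")
    if wrong_type:
        errors.append(f"wrong type for: {wrong_type}")
    if errors:
        raise SchemaError("; ".join(errors))
```

Running this on the first row before any transformation turns a silent format change into an immediate, descriptive failure. Setting `allow_extra=True` is a deliberate choice to tolerate new upstream columns rather than break on them.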
### Pitfall 4: Not Documenting Dependencies
Your pipeline depends on a third-party API. Nobody knows.
**Fix:** Document all dependencies (data sources, external APIs, timezone assumptions) in code comments and runbooks.
## Scaling Beyond the Basics
Once you have a working pipeline, you can scale:
## How NDN Products Use Data Pipelines
Every NDN product includes enterprise data pipelines:
When you work with NDN, you're getting battle-tested pipeline patterns.
## Your Next Steps
Start with a simple pipeline and iterate. Don't try to build a perfect system on day one.
**Week 1:** Build locally, test thoroughly
**Week 2:** Deploy to Cloud Run with daily schedule
**Week 3:** Add monitoring and alerting
**Week 4:** Document and make it someone else's responsibility
If you need guidance building data pipelines for AI products, book a technical consultation and we'll show you the right architecture for your use case.