Mastering Python for ETL: A Practical Guide for Analysts & Data Engineers

ETL (Extract, Transform, Load) is the backbone of modern analytics. In this guide—based on a live session led by Maven Analytics instructor Chris Bruehl—you’ll learn why Python is ideal for ETL, how to extract from files, databases, and APIs, the most useful transform patterns (cleaning, typing, joining, feature engineering), and simple ways to load and automate pipelines so they run reliably at scale.

Interested in learning more? Continue learning with Python Foundations for Data Analysis.

Why ETL (and Python) matters

ETL isn’t flashy like interactive dashboards or machine learning—but it’s where business value starts. Companies have more data than ever, and the real constraint is turning raw sources into clean, reliable tables that analysts and BI developers can trust.

Python is a standout for ETL because:

  • It’s free and portable—skills carry with you across roles and employers.

  • The ecosystem is massive (pandas, SQLAlchemy, requests, PySpark, etc.).

  • It’s cloud-friendly (AWS, Azure, GCP, Glue, Data Factory, BigQuery, Redshift).

  • You can write once and automate forever (scheduled scripts, jobs, DAGs).

ETL = Extract → Transform → Load

ELT = Extract → Load → Transform (often into a lake/warehouse first)

Both are valid. Most business reporting still benefits from classic ETL where you deliver ready-to-query tables for analysts.
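To make that shape concrete, here's a minimal ETL skeleton in the spirit of the demo. The file name, column names, and table name are placeholders, not the exact demo assets:

import pandas as pd
import sqlite3

def extract() -> pd.DataFrame:
    # pull raw data from a source: file, database, or API
    return pd.read_csv("raw_sales.csv")          # placeholder source file

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # clean and reshape into a ready-to-query table
    df["date"] = pd.to_datetime(df["date"])      # placeholder column names
    return df.dropna(subset=["dollar_sales"])

def load(df: pd.DataFrame) -> None:
    # write the finished table where analysts can query it
    with sqlite3.connect("local.db") as con:
        df.to_sql("clean_sales", con, if_exists="replace", index=False)

load(transform(extract()))

An ELT flow simply reorders the steps: land the raw extract in the lake or warehouse first, then run the transform there.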

What you’ll build (high level)

From the live demo:

  • Extract data from CSVs (single & many), JSON, Parquet, a SQL database, and an API (OpenWeather).

  • Transform it with pandas: consistent column naming, missing values, types, engineered metrics, joins.

  • Load it into a database table you can query and schedule.

You can adapt the patterns below to your own data and stack.

Extract: Files, databases, APIs

1) CSVs (single & many)

Read a single file:

import pandas as pd

sales_df = pd.read_csv("LA_retail_sales.csv")
sales_df.info()  # 750 rows, 8 columns

Read many daily files with a naming pattern (fast and maintainable):

import glob
import pandas as pd

file_paths = glob.glob("sales_by_day/*.csv")           # grabs all CSVs in folder
sales_df = pd.concat([pd.read_csv(p) for p in file_paths],
                     ignore_index=True)

If you need precise control (e.g., only the first 5 days), loop over a formatted string:

import pandas as pd

daily_frames = []
for day in range(1, 6):  # days 1..5
    daily_frames.append(pd.read_csv(f"sales_by_day/sales_2024-09-{day:02d}.csv"))
sales_df = pd.concat(daily_frames, ignore_index=True)  # one concat beats concatenating in the loop

2) JSON & Parquet

# JSON (newline-delimited/“lines” JSON)
sales_json = pd.read_json("LA_retail_sales.json", lines=True)

# Parquet (columnar, cloud-friendly)
sales_parquet = pd.read_parquet("LA_retail_sales.parquet")

3) SQL databases (read)

import sqlite3, pandas as pd

con = sqlite3.connect("local.db")
stores = pd.read_sql("""
    SELECT store_id, SUM(dollar_sales) AS total_sales
    FROM sales
    GROUP BY store_id
    ORDER BY total_sales DESC
""", con)

Tip: Filter/aggregate in SQL first to minimize the data you pull over the wire.
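As a sketch of that tip, the same query can push a filter into the database via params. The date column and cutoff value here are illustrative assumptions:

import sqlite3, pandas as pd

con = sqlite3.connect("local.db")
recent_stores = pd.read_sql("""
    SELECT store_id, SUM(dollar_sales) AS total_sales
    FROM sales
    WHERE date >= ?              -- filter happens in the database, not pandas
    GROUP BY store_id
    ORDER BY total_sales DESC
""", con, params=("2024-09-01",))  # illustrative cutoff date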

4) APIs (OpenWeather example)

Add context like daily temperature and humidity to your sales:

import requests, time
import pandas as pd

# unique dates from sales (coerce to datetime first; CSV-loaded dates may be strings)
dates = (
    pd.to_datetime(sales_df["date"])
    .dropna()
    .dt.normalize()
    .sort_values()
    .unique()
)

LA_LAT, LA_LON = 34.0522, -118.2437
API_KEY = "YOUR_API_KEY"

weather_rows = []
for dt in dates:
    unix_ts = int(pd.Timestamp(dt).timestamp())
    url = (f"https://api.openweathermap.org/data/3.0/onecall/timemachine"
           f"?lat={LA_LAT}&lon={LA_LON}&dt={unix_ts}&appid={API_KEY}&units=imperial")
    resp = requests.get(url, timeout=15)
    if resp.status_code == 200:
        data = resp.json()["data"][0]   # first record for the day
        weather_rows.append({
            "date": pd.to_datetime(dt),
            "temp_f": data["temp"],
            "humidity": data["humidity"],
            "description": data.get("weather", [{}])[0].get("description")
        })
    time.sleep(1)  # be polite to the API, even when a call fails
weather_df = pd.DataFrame(weather_rows)
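From here, one left join attaches the weather context to each day of sales. This sketch assumes sales_df["date"] parses to a datetime, normalized to midnight like the weather dates:

# enrich sales with same-day weather
sales_df["date"] = pd.to_datetime(sales_df["date"]).dt.normalize()
sales_weather_df = sales_df.merge(weather_df, on="date", how="left")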
