ETL (Extract, Transform, Load) is the backbone of modern analytics. In this guide—based on a live session led by Maven Analytics instructor Chris Bruehl—you’ll learn why Python is ideal for ETL, how to extract from files, databases, and APIs, the most useful transform patterns (cleaning, typing, joining, feature engineering), and simple ways to load and automate pipelines so they run reliably at scale.
Why ETL (and Python) matters
ETL isn’t flashy like interactive dashboards or machine learning—but it’s where business value starts. Companies have more data than ever, and the real constraint is turning raw sources into clean, reliable tables that analysts and BI developers can trust.
Python is a standout for ETL because:
It’s free and portable—skills carry with you across roles and employers.
The ecosystem is massive (pandas, SQLAlchemy, requests, PySpark, etc.).
It’s cloud-friendly (AWS, Azure, GCP, Glue, Data Factory, BigQuery, Redshift).
You can write once and automate forever (scheduled scripts, jobs, DAGs).
ETL = Extract → Transform → Load
ELT = Extract → Load → Transform (often into a lake/warehouse first)
Both are valid. Most business reporting still benefits from classic ETL where you deliver ready-to-query tables for analysts.
What you’ll build (high level)
From the live demo:
Extract data from CSVs (single & many), JSON, Parquet, a SQL database, and an API (OpenWeather).
Transform it with pandas: consistent column naming, missing values, types, engineered metrics, joins.
Load it into a database table you can query and schedule.
You can adapt the patterns below to your own data and stack.
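Before diving into each stage, here's the overall shape of a pipeline as one script. This is a minimal sketch, not the demo's code — the function bodies and sample data are placeholder stand-ins:

```python
import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for a real source (file, database, or API call)
    return pd.DataFrame({"Store_ID": [1, 1, 2], "Dollar_Sales": [10.0, 20.0, 5.0]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Standardize column names, then aggregate to one row per store
    df = df.rename(columns=str.lower)
    return df.groupby("store_id", as_index=False)["dollar_sales"].sum()

def load(df: pd.DataFrame) -> pd.DataFrame:
    # In a real pipeline this would be df.to_sql(...) or df.to_parquet(...)
    return df

clean = load(transform(extract()))
print(clean)
```

Keeping each stage in its own function makes the script easy to test piecewise and to hand off to a scheduler later.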
Extract: Files, databases, APIs
1) CSVs (single & many)
Read a single file:
```python
import pandas as pd

sales_df = pd.read_csv("LA_retail_sales.csv")
sales_df.info()  # 750 rows, 8 columns
```

Read many daily files with a naming pattern (fast and maintainable):
```python
import glob
import pandas as pd

file_paths = glob.glob("sales_by_day/*.csv")  # grabs all CSVs in the folder
sales_df = pd.concat([pd.read_csv(p) for p in file_paths], ignore_index=True)
```

If you need precise control (e.g., only the first 5 days), loop over a formatted string:
```python
import pandas as pd

sales_df = pd.DataFrame()
for day in range(1, 6):  # days 1 through 5
    daily = pd.read_csv(f"sales_by_day/sales_2024-09-{day:02d}.csv")
    sales_df = pd.concat([sales_df, daily], ignore_index=True)
```

2) JSON & Parquet
```python
# JSON (newline-delimited/“lines” JSON)
sales_json = pd.read_json("LA_retail_sales.json", lines=True)

# Parquet (columnar, cloud-friendly)
sales_parquet = pd.read_parquet("LA_retail_sales.parquet")
```

3) SQL databases (read)
```python
import sqlite3
import pandas as pd

con = sqlite3.connect("local.db")
stores = pd.read_sql("""
    SELECT store_id, SUM(dollar_sales) AS total_sales
    FROM sales
    GROUP BY store_id
    ORDER BY total_sales DESC
""", con)
```

Tip: Filter/aggregate in SQL first to minimize the data you pull over the wire.
4) APIs (OpenWeather example)
Add context like daily temperature and humidity to your sales:
```python
import time
import pandas as pd
import requests

# Unique dates from sales (the "date" column must already be datetime,
# e.g. via parse_dates=["date"] at read time)
dates = sales_df["date"].sort_values().dropna().dt.normalize().unique()

LA_LAT, LA_LON = 34.0522, -118.2437
API_KEY = "YOUR_API_KEY"

weather_rows = []
for dt in dates:
    unix_ts = int(pd.Timestamp(dt).timestamp())
    url = (f"https://api.openweathermap.org/data/3.0/onecall/timemachine"
           f"?lat={LA_LAT}&lon={LA_LON}&dt={unix_ts}&appid={API_KEY}&units=imperial")
    resp = requests.get(url, timeout=15)
    if resp.status_code == 200:
        data = resp.json()["data"][0]  # first record for the day
        weather_rows.append({
            "date": pd.to_datetime(dt),
            "temp_f": data["temp"],
            "humidity": data["humidity"],
            "description": data.get("weather", [{}])[0].get("description"),
        })
    time.sleep(1)  # be polite to the API

weather_df = pd.DataFrame(weather_rows)
```
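With sales and weather extracted, the transform and load steps from the overview might look like the sketch below. It runs on small stand-in frames (the real sales_df and weather_df depend on your files and API key), and the table name sales_with_weather is illustrative:

```python
import sqlite3
import pandas as pd

# Stand-ins for the extracted frames
sales_df = pd.DataFrame({
    "Date": ["2024-09-01", "2024-09-01", "2024-09-02"],
    "Units": [2, 3, 1],
    "Dollar_Sales": [20.0, 45.0, 12.0],
})
weather_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-09-01", "2024-09-02"]),
    "temp_f": [78.1, 81.4],
    "humidity": [52, 48],
})

# Transform: consistent lowercase names, real datetimes, an engineered metric
sales_df = sales_df.rename(columns=str.lower)
sales_df["date"] = pd.to_datetime(sales_df["date"])
sales_df["price_per_unit"] = sales_df["dollar_sales"] / sales_df["units"]

# Join sales to daily weather on the normalized date
combined = sales_df.merge(weather_df, on="date", how="left")

# Load: write a queryable table (in-memory SQLite here; a real
# database connection or SQLAlchemy engine works the same way)
con = sqlite3.connect(":memory:")
combined.to_sql("sales_with_weather", con, if_exists="replace", index=False)
```

Once this runs end to end, scheduling it (cron, Task Scheduler, or an orchestrator like Airflow) turns a one-off notebook into a pipeline that refreshes itself.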