Tools used in this project
Coffee Experiment Analysis

About this project

(report is too long, please view it on Tableau or at the bottom of this page)


This data comes from the James Hoffmann Youtube channel where he did a taste test experiment in which participants were shipped 4 different coffees (in the form of frozen coffee extracts), asked to rate the coffees, and answer a few more questions. There were 4,000+ participants from all over the US. I decided to create my own personal project, using SQL to answer questions that I was curious about.

Data Cleaning

I looked through all the questions that were answered and decided which ones I wanted to keep (the coffee ratings information most importantly). There were many questions that had answers as their own column with true or false in the cell, so I deleted most of these. After narrowing down the questions to just a few, I started sorting and filtering through the data to understand and clean it. The only change I made was to the addins column, in which I shortened the answers to make them more simple. I used find and replace to change “just black” to black, “sugar or sweetener” to just sugar, “dairy alternative” to dairy, “coffee creamer” to creamer, and “flavored syrup” to syrup. I counted 3315 records


There wasn’t any business related data such as sales or transaction date so my focus was on just challenging myself using SQL. I mainly created questions that related the columns to each other and from there crafted a query to give me a solution.


(Please follow along with the queries here)

In Query 1 I explored the different ways that people enjoy their coffee. The choices I created were: consuming it black, with milk and sugar, or with just one add-in. To simplify the analysis, I used the "addins" column to consolidate responses. If the entry indicated milk, dairy alternative, or creamer, it became "milk." If it said sugar or syrup, it was simply "sugar." I found that the most prevalent preference is for black coffee. The choices of adding both milk and sugar or just milk are equally popular. The least common response was opting for just sugar with black coffee.

With Query 2, I wanted to know how many people's favorite drink typically contains milk vs no milk. If the “fav_drink” column contained (cold brew, americano, espresso, drip, or pour over) I classified it as having no milk, the rest of the records (such as latte and cortado) as having milk. What I found was that the majority of participants said their favorite type of coffee typically contained no milk: 1,859 people compared to the 1,456 whose favorite drink typically did contain milk. From this insight, you could perhaps say that most of these participants enjoy a more purer taste of coffee.

Continuing with Query 3, I analyzed participants' caffeine preferences, classifying them as full, half, or none. I then compared it between genders and the results show that a higher percentage of males (93.71%) prefer full caffeine than females (85.82%).

Expanding on Query 4, it involves two columns: “bean_origin” is a yes or no question asking whether participants know where their coffee originates from. “Expertise” asks them to rank their coffee expertise on a scale 1-10. I wanted to know if people who care where their coffee comes from rank their expertise higher. The result affirms this correlation, those who do know where their coffee originates from have an average expertise rating of 6.2, significantly higher than the average rating of 4.1 among those who don’t.

Moving forward, the coffees in this experiment were named A,B,C, and D. Each coffee had their own column for how participants rated their bitterness and acidity (ex. a_bitt, a_acid). In query 5, I took the data and averaged it, grouping the results by coffee. They rated coffees B and C as the most bitter, while A and D were the most acidic. This information will be important for the next step in Query 6.

Using “roast_level” (which asked participants if they favored either Light, Medium, or Dark roasts), I made a new column called “taste” which was either bitter or acidic. Medium and Dark roast went into the bitter category and Light roast went into acidic (my logic being that Light roast coffees tend to be the most acidic). Out of all the people who chose bitter, what percentage of them chose either B or C as their favorite in the “overall_fav” column? The result was that 74.46% of people who preferred bitter coffee chose either coffee B or C as their overall favorite. For the people who preferred acidic coffee, 66.28% chose either A and D. Essentially, the goal was to see how well participant’s roast preference aligns with their choice for favorite coffee.

Query 7 takes “cups_day”, which is how many cups of coffee participants drip in a day, and counts the results. The options were 1-4 cups and then above or below that amount. The result shows that 1,398 people drink 2 cups per day and 1,090 people drink only one (combined constituting 75% of the participants). Furthermore, Query 8 takes it a step further by looking at composition of the results by each age group. The insight that surprised me the most was that participants >65 years old had the highest percentage of participants drinking over 4 cups of coffee per day. Plus, 55-64 year olds had the highest percentage of drinkers who had 3 cups per day.1-2 cups was roughly the same percentage for all the age groups.

In Query 9 I took the “mthly_spend” column and categorized them: $40 or less, $40-$80, and $80+ as low, medium, and high spenders respectively. I then compared the splits between age groups. The splits were similar between age groups with 90% being either low or medium spenders. The one that stood out was the 18-24 age group where 2/3 were low spenders and the rest were medium spenders. Then in Query 10, I converted the monthly spend ranges into actual numbers. The ranges spanned $20 so to each range I added $10 to the smaller value and for <$20 and >$100 I +- $10. On average, the group 35-44 years old spent the most money on coffee per month ($49.05) while 18-24 year olds spent $38.09.

Key Takeaways

The key takeaway from this project is that you can extract a myriad of insightful information from coffee data. The older demographic seem to be heavy coffee drinkers, with a high percentage of them consuming 3-4 cups per day. In general, these participants appear to enjoy the purer taste of coffee given the large quantity of them who drink it black. I foresaw this conclusion given that these participants signed up for the experiment and have a higher interest in coffee than most. They appreciate the intricacies of coffee and want to taste the different notes that make each coffee unique. In regards to spending, the 35-54 year old range spends the most per month (close to $50). 18-24 year olds spend the least per month at $38 and everyone else is in between.

Additional project images

Discussion and feedback(0 comments)
2000 characters remaining