Tools used in this project
Maven Coffee Challenge

About this project

Challenge assignment:

Create an explanatory report providing a data-driven strategy for opening the first coffee shop for a group of investors looking to break into the US coffee market.

Possible questions:

  • Target audience: What type of customer should we target, and what are their preferences?
  • Product offering: What types of coffee beans and drinks should we offer?
  • Pricing strategy: How can we align prices with customer value perception?

My first thoughts on the assignment: it is all about the money. Investors, break into a market, it sounds like “we want to be the next Starbucks” or better. The questions are typical for market research.

After that I watched the video in which the results of the coffee survey are discussed. Points the presenter mentioned and I noted were:

  1. far more respondents male than female (average 72% male, 28% female)
  2. around 75% of respondents between 25 and 45 years old
  3. respondents geographically distributed across the USA mainland and city/rural.

Although the unbalance between male/female can be fine when discussing a coffee survey (the presenter mentioned skewed results) it should be fixed when trying to do market research. At least that is my oppionion. Suppose all females like red and males like blue and you have 25% females and 75% males. If you do not fix the unbalance between male and female it would be easy to conclude most people like blue. If you weight the values for male and female the conclusion would be that the preference for red and blue are equal.

I decided that survey results should be weighted. I used https://survalyzer.com/weighting-survey-data/ to see how and wondered if the coffee survey used this software.

I never gave much thought to the age group ranges; I used them automatically.

The location of the first shop is important; it should align with the most important target audience.

Getting started

I began with the idea of profiling the age groups. The primary characteristics I considered were monthly expenses (representing recurring costs, or potentially monthly recurring income for the client), the highest amount ever paid for a coffee, the price willing to pay for a single coffee, and $paid on purchased equipment over the last five years.

Sub-characteristics I considered included typical places to drink coffee, reasons for drinking coffee, preferred coffee types, and the most commonly used equipment at home.

Profiles ordered by monthly expenses. As I was preparing the report, I realized it wasn't leading to any meaningful insights. At least, that's my perspective. My development process tends to be recursive; perhaps due to inexperience, I found myself circling back to the phrase "it's all about the money." That's when I decided on the structure that's now detailed in the report, although not necessarily the order of the blocks.

Data cleaning and data in general (all Powerquery)

  • In this case I imported the raw data and copied it to prepare the copy for basic cleaning and further usage.
  • I did some cleaning on the survey id, in case there were duplicates.
  • In the coffee survey were a few multiselect fields. In the data this was translated into a summary field with for every selectable option a true/false column. I kept the summary field for filter purposes (empty means nothing selected), I converted all fields with true/false values into 1/0 values. If you sum all the ones you have the number of times a value is selected.
  • Fields with values yes/no followed the same procedure as fields with true/false.
  • I converted the age group field into values I would like to see on the screen (replaced empty with unknown)
  • I replaced all values in the field Gender not having the value “Male” or “Female” with Unknown, also the null fields (more on this later, ugly thing 1).
  • The values in the “money” fields were all ranges in string format. If you want to do some calculation/summarization on such a field it has to be a decimal or integer. I decided to convert these fields into the value in the middle with the “<” values being max and the “>”value being lower boudary. I used the same procedure for number of cups a day. (ugly 2)
    • Number of cups a day from [Less than 1, 1,2,3,4, More than 4] to [1,1,2,3,4,5]
    • Monthly expenses from [< $20, $20-$40, $40-$60, $60-$80, $80-$100,> $100] to [20,30,50,70,90,100]
    • Equipment expenses from [Less than $20, $20-$50, $50-$100, $100-$300, $300-$500, $500-$1000, More than $1,000] to [20,35,75,200,400,750,1000]
    • Most paid for a coffee and willing to pay for a coffee from [Less than $2, $2-$4, $4-$6, $6-$8,$8-$10, $10-$15, $15-$20, More than $20] to [2,3,5,7,9,12.5,17.5,20]

Unless mentioned otherwise I kept all the empty and null values. More on this in the summarizing method.

I added an age group table for filtering on visuals.

The summarizing method and the weighting table

Roughly the routine to get an end value was this:

  1. Create a duplicate of the cleaned table. There are probably better ways (reference?) but for easiest way to handle fatigue mistakes.
  2. Remove all columns except the ones your interest in for this grouping. Filter all blank and null values out for the characteristic of interest.
  3. Summarize on agegroup and gender => endresult 3 column (female, male, unknown), rows for every age group + characteristic. Values (summarization of values)
  4. Divide the calculated values in the column unknown over the columns female and male. This is necessary for the weighting.
  5. Pivot until you have rows with age group + gender + a column with the count of surveys and columns with characteristics and the values
  6. Merge with the weighting table on age group + gender.
  7. Multiply the characteristics (not the count of surveys) with the weighting factor.
  8. Summarize the values on age group => totals per age group
  9. Per age group calculate either the average value (values like the money fields) or the percentage (true/false fields)

Remark on step 2: By filtering out null and blank you work with known values explicitly entered by respondents.

Remark on step 4: Although my weighting table (step 6) contained real world percentages for male/female ratio for all age groups, I used the average ratio for all values except the “money” fields. Using the actual ratio to divide the unknown values into new female and male values took long, it crashed my Power Bi one time.

Calculation all except “money” : female new = .28 * unknown + female, male new = .72 * unknown + male.

Remarks on step 6:

Source for the weighting data, Census (american government), download link https://www2.census.gov/programs-surveys/popest/datasets/2020-2022/national/asrh/nc-est2022-alldata-r-file08.csv

The image below is the weighting table before creating a pivoted version to make merging easy.

  • calculating a column with female respondents incl. .28 of the unknown and similar procedure for male with .72.

  • the weighting factor is calculated via real world percentage / survey percentage.


Remarks on the routine:

  • I remember a remark on a report of mine with a visual showing average of average. To never…. all the grouping only used sum or count.

  • Typing this already very long description makes me wonder, what is the decent way to solve a problem like this (ranges to values etc). There must be many books on this topic.

The report

As mentioned before I switched from age group profiles back to “all about the money”.

Why these values and why the ordering of the blocks (except the summary). And which assumptions did I use.

Block “Buying a coffee”

Where do people spend the money on coffee? When not at home mostly in the cafe or on-the-go. Why not the office? At least in the Netherlands there is always free coffee from your employer. I am self-employed and work from home. I would probably select (it was a multi select) at home and in the office. And another person would say home, the third one in the office. There is probably no exchange of money, I narrowed this visual down to “Cafe” or “On-the-go”.

My interpretation of "most ever paid for a coffee" is based on the answer I would give to the question "what is the most you ever paid for a glass of wine". I'll never forget the answer, it is 20 years ago, near Athens, beautiful view, we were discussing if we would enter the place, we did, it was €8, one glass. I interpreted "most ever paid for a caffee" the same way. Yesterday I googled the average coffee prices in the USA and it came up with this overview, it is a few years old, there was inflation but substantially lower than the most-willing prices in the test.

The visual shows max paid versus willing to pay to make obvious that the younger age groups are paying or willing to pay more in absolute term. I added the percentage of people who selected “I’m typically drinking coffee in a cafe” and “on-the-go” to show that these numbers have the same curve.


Block “Spendings per month”

Spendings per month is roughly all you buy except equipment to brew yourself a nice cup of coffee at home + everything you spend in a cafe or on-the-go.

I added this block after the “Buying a coffee” because it makes the calculation example easier to process. In this block I realized I was in trouble when I actually wanted to add some opportunities. There was not much coming up because it is not clear which part of monthly spendings actually goes to buying coffee outside home. And already having mentioned the data I think would have been crucial for a decent Market(ing) Analysis and forecast I decided that “The Great American Coffee Taste Test” was not a good analysis source for decision making but could be nice to compare the results with one or a few official reports. Fun fact: I scanned some pages of the reports I mention in the summary and “young business people” is the way to go.


Block “Home brewing and equipment”

This block is used to dive deeper into home brewing expenses and the equipment most used. This offers opportunities to sell grind, beans etc. I left out selling beans because only approx. 3% overall use the “Beans to cup” method.


Block “Why do you drink”

I chose “I need the ritual” because it is a soft characteristic offering marketing and branding opportunities after some additional research. You could expect that people of a coffee taste test like the taste and "I need caffeine" sounds more like an addiction and as is, offers not too many marketing or branding opportunities.



To be honest, I do not think the detail blocks are of much value having written what I did under monthly expenses. But it would be strange to only submit that block after all the pivoting I did.


Colours, fonts and lay-out

I used my favourite font throughout the whole report (Bahnschrift) for no specific reason.

The report has brown colouring but given the nature of the client and assignment it would be white background with mostly blue if it were something official.

Tools used: Power BI powerquery for all the data manipulation, Power BI for the visuals, Lipstick Column Chart by Colour Nova (min max price) and Affinity Design for the final report.

Text: Chatgtp was in a very poetic mood when I offered my text for improvement, I stopped using it. I hope it's understandable enough.

Additional project images

Discussion and feedback(0 comments)
2000 characters remaining