This project is part of an extracurricular activity program with Quantium, a company that helps clients maximize their data assets. In this opportunity, I'm part of the retail analytics team, and I've been assigned to help the Category Manager for Chips understand the types of customers who purchase chips and their purchasing behavior within the region.
There are two datasets that will be used here: Transaction Data and Customer Behavior. Let's dive in!
The transaction data comprehensively records all details of each transaction. The data, presented in Excel format, includes the following information:
Here is a snippet of the transaction data, offering a glimpse into the details of a few sample transactions:
Customer Behavior data encapsulates comprehensive information about customers. Presented in CSV format, the dataset includes several key fields, as outlined below:
Presented here is a snippet of the customer behavior data:
Before proceeding with data preparation, let's see what information we can extract from the dataset:
To accomplish this, we need to extract three elements from the 'PROD_NAME' column: the brand name, chip variant, and packaging size. Before diving into the extraction process, let's walk through the preceding steps.
In this project, we employ some libraries and a module. The 're' module provides support for regular expressions, which are powerful tools for pattern matching and string manipulation; we'll use it here to clean up text. The next step is reading the transaction data. We can use the Pandas function pd.read_excel() to read the Excel file.
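Here is a minimal sketch of that setup. The file name 'QVI_transaction_data.xlsx' is an assumption on my part and may differ from the actual file used in the project.

```python
import re

import pandas as pd

# Read the transaction data from the Excel file into a DataFrame.
# The file name is assumed; adjust it to match your own copy of the data.
transactions = pd.read_excel("QVI_transaction_data.xlsx")

# Peek at the first few rows to confirm the file loaded as expected.
print(transactions.head())
```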
After reading the data, we need to verify that each column has the correct format. We can use .dtypes to check the data type of each column.
As you can see in the image above, the date column is stored as an integer. Therefore, we need to convert it into a date type.
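A hedged sketch of this step is below; the column name 'DATE' and the Excel origin date '1899-12-30' are assumptions based on how Excel usually stores serial dates.

```python
# Inspect the column types first, then convert the integer dates.
print(transactions.dtypes)

# Excel stores dates as the number of days since its epoch, so unit='D'
# interprets each integer as a day offset from the assumed origin date.
transactions["DATE"] = pd.to_datetime(
    transactions["DATE"], unit="D", origin="1899-12-30"
)
```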
The 'unit' parameter tells the function how to interpret the numeric values in the input data. In this case, the numeric values represent the number of days ('D') since a specific origin.
Now, we're delving into the challenging task of cleaning up the product names. At this stage, we aim to extract the three key elements: the brand name, variant, and packaging size. However, we encounter the challenge of dealing with inconsistent text. In the following image, several issues are apparent, such as:
To remove the multiple spaces, we can use re.sub() to replace one or more consecutive spaces ('\s+') with a single space. It is applied to each element in the 'PROD_NAME' column using the .apply() method with a lambda function. Then, we can map the inconsistent brand names to the corrected versions using the 'replace_item' dictionary.
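Since the original code is shown as an image, here is a rough sketch of what this step might look like. The two entries in 'replace_item' are hypothetical examples; the real dictionary contains the full list of corrections.

```python
# Collapse runs of whitespace into a single space in every product name.
transactions["PROD_NAME"] = transactions["PROD_NAME"].apply(
    lambda name: re.sub(r"\s+", " ", name).strip()
)

# Map inconsistent spellings and abbreviations to a single corrected version.
# These entries are illustrative placeholders, not the article's actual mapping.
replace_item = {
    "Smith ": "Smiths ",    # hypothetical inconsistent brand spelling
    "Infzns": "Infuzions",  # hypothetical abbreviation
}
transactions["PROD_NAME"] = transactions["PROD_NAME"].replace(replace_item, regex=True)
```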
After running the code, we have successfully removed the extra spaces from our product names. We've also replaced the abbreviations and inconsistent text with the correct versions. Let's take a look at the output!
You might be thinking that our job is done. But hold on! Do you see a red underline in the picture above? That's our next issue: some products have variant names and packaging that are not separated by a space. To address this concern, we need to add a space between the variant and packaging size, as illustrated in the code below.
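The original code appears as an image in the article; the sketch below is my reconstruction of it, based on the description that follows.

```python
# Compile a pattern that matches a letter immediately followed by a digit,
# e.g. the "n1" in "Chicken175g".
pattern = re.compile(r"([a-zA-Z])(\d)")

def add_space(match):
    # Rebuild the matched substring with a space between the letter and digit.
    return match.group(1) + " " + match.group(2)

# Apply the replacement to every product name with a list comprehension.
transactions["PROD_NAME"] = [
    pattern.sub(add_space, name) for name in transactions["PROD_NAME"]
]
```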
The initial step is using the 'compile()' function to compile a regular expression pattern. In this case, the pattern is a single alphabetical character ([a-zA-Z]) followed by a single digit (\d). Then, we create a replacement function called 'add_space' to add a space between the letter and the digit in the matched substring. To apply the replacement function, we use the 're.sub()' method and a list comprehension to iterate through each product name. Here is the output produced by the code above.
After ensuring the brand name, variant, and size are separated by spaces, we're ready to extract them from 'PROD_NAME'.
We choose to extract the brand name first because the pattern is easy to spot: our brand name is always the first word. To extract it, we can create a 'BRAND' column and take the first word using the 'str.split()' method and 'str[0]' (selecting the first element). Afterward, we update our brand name using the 'update_brand()' function. This function joins the extracted words with a space. For example, we want to add spaces to "RedRockDeli" or "NaturalChipCo"; the output will be "Red Rock Deli" and "Natural Chip Co".
Once the brand name is addressed, our attention turns to the 'SIZE' column. As illustrated in the code above, the fourth line aims to extract the packaging size, identified by the pattern of a digit (\d) followed by a letter ([a-zA-Z]) (125g, 175g, etc.) from the product name. However, we might encounter an anomaly where the packaging size is positioned in the middle of the text, as shown in the image below.
To anticipate this issue, the fifth line of code above attempts to extract the packaging size from the middle of the text. Additionally, we introduce a new column, 'PACKING_AT_END', to indicate whether the packaging size is located at the end of the product name. You might be wondering why we add this column, but don't worry, I will explain it later. Moving on to the next step, we create an empty 'VARIANT' column.
Now, we're entering the final step of word extraction: the variant name. Remember our underrated guy, 'PACKING_AT_END'? Well, it's time for him to shine. Keep in mind that we have two conditions: the packaging size may sit at the end of the product name, or it may sit in the middle of it.
Based on these conditions, we've created two lines of code to handle them specifically. As you can see in the code above, the first line enables 'VARIANT' to capture any strings between the brand name and the packaging size if the packaging size is located at the end of the text. On the other hand, the second line helps 'VARIANT' concatenate the strings between 'BRAND' and 'SIZE' with the strings after 'SIZE' if the packaging size is located in the middle of the text. The last two lines of code standardize the size unit in lowercase form and replace blank 'VARIANT' values with "Original".
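Before the break, here is a hedged sketch of how those two conditions might be implemented. The row-wise helper below is my own simplification of the article's two lines of code, not the original implementation.

```python
def extract_variant(row):
    # Drop the first word (the raw brand token) from the product name.
    words = row["PROD_NAME"].split()[1:]
    if row["PACKING_AT_END"]:
        # Condition 1: the variant is everything before the trailing size.
        variant_words = words[:-1]
    else:
        # Condition 2: join the words before the size with the words after it.
        variant_words = [w for w in words if w != row["SIZE"]]
    return " ".join(variant_words)

transactions["VARIANT"] = transactions.apply(extract_variant, axis=1)

# Standardize the size unit in lowercase and label blank variants "Original".
transactions["SIZE"] = transactions["SIZE"].str.lower()
transactions["VARIANT"] = transactions["VARIANT"].replace("", "Original")
```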
Well, let’s take a short break! I need to grab a cup of coffee first...
I'm back!
After ensuring the clarity of the variants, the next step involves reshaping our DataFrame through the processes of filtering, removing, and ordering columns. Initially, we'd like to create a new column, which we can call 'CATEGORY'. This column assists us in classifying products into chips and non-chips categories. For your information, we have unknown variants that contain the word "Salsa". According to Google, salsa is a common ingredient in Mexican cuisine, often served as a condiment. You can see a list of "Salsa" products in the image below.
You might be tempted to classify Salsa as a non-chip product. I think you shouldn't do that. As shown in the list above, we have two chips that contain the word "Salsa": Smiths Crinkle Cut Tomato Salsa 150g and Red Rock Deli SR Salsa & Mzzrlla 150g. Instead of using "Salsa", I prefer to use "Dip" as the condition for classifying a product as a non-chip. After searching on Google, I found that a dip is similar to a sauce.
Based on the list of dip products above, Smiths Crinkle Cut French OnionDip is a chip. Therefore, we need to create a condition: if a product name contains "Crinkle", then categorize it as "Chips". On the other hand, if a product name contains "Dip", then set the value to "Non-Chips".
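A minimal sketch of that rule might look like the following; products that match neither keyword are assumed to stay in the chips bucket, since the salsa products get filtered out in the next step.

```python
# "Crinkle" products remain chips even when they mention Dip or Salsa,
# while the remaining "Dip" products are classified as non-chips.
def categorize(name):
    if "Crinkle" in name:
        return "Chips"
    if "Dip" in name:
        return "Non-Chips"
    return "Chips"

transactions["CATEGORY"] = transactions["PROD_NAME"].apply(categorize)
```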
The next thing to do is filtering only the chips category. The first line of code below demonstrates how to achieve this. Now, our DataFrame exclusively contains chips products. However, it's important to note that "Salsa" products still exist. To ensure that "Salsa" products are not included in our next analysis, we can filter out variant names that contain either "Mild" or "Medium", indicating the salsa spiciness level.
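The original filtering code is shown as an image; here is a rough reconstruction of what it likely does.

```python
# Keep only the chips category, then drop the remaining salsa products,
# identified here by their "Mild" or "Medium" spiciness labels.
transactions = transactions[transactions["CATEGORY"] == "Chips"]
transactions = transactions[
    ~transactions["VARIANT"].str.contains("Mild|Medium", case=False)
]
```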
Once we're done, we can proceed to the next steps: removing and ordering columns. Since we have created the brand, variant, and size columns, the product name column becomes unnecessary. Other than that, 'PACKING_AT_END' and 'CATEGORY' are no longer needed because we're already sure that our DataFrame includes only chip products. To remove the columns, we need to specify the columns we want to remove in a list. Afterwards, we use the 'drop()' method to remove the columns, as depicted in the image below.
Then, we finally determine the desired column order using a list and create a new DataFrame with the specified column order.
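A sketch of both steps is below; the exact column order, and the names of the columns not discussed in this article (store, transaction ID, sales), are assumptions.

```python
# Drop the columns we no longer need, then reorder the rest.
transactions = transactions.drop(columns=["PROD_NAME", "PACKING_AT_END", "CATEGORY"])

# The order below is an assumption about how the final DataFrame is arranged.
column_order = [
    "DATE", "STORE_NBR", "LYLTY_CARD_NBR", "TXN_ID",
    "BRAND", "VARIANT", "SIZE", "PROD_QTY", "TOT_SALES",
]
transactions = transactions[column_order]
```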
Here is our final DataFrame:
Checking for outliers in our DataFrame is crucial; we want to ensure that outliers do not influence our next analysis. Therefore, we can use the Seaborn library to visualize a box plot and identify any outliers. Let's call the boxplot() function to help us draw a box plot!
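A minimal sketch of that plot, assuming the quantity column is named 'PROD_QTY' as referenced below:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of the purchased quantity to spot unusual transactions.
sns.boxplot(x=transactions["PROD_QTY"])
plt.title("Distribution of product quantity per transaction")
plt.show()
```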
We use 'PROD_QTY' to identify any unusual purchase amounts in a single transaction. As observed in the box plot above, we have transactions where customers bought chips in unusual quantities. Consequently, we will examine the customer ID using 'LYLTY_CARD_NBR' to investigate further.
Based on the code above, we determined that these transactions were associated with the same customer. Upon inspecting the transaction records, we discovered that this customer consistently buys the same product in the same quantity. Subsequently, we can exclude this entry from our upcoming analysis using the following code.
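The exclusion code is shown as an image in the article; the sketch below is what it might look like. The quantity threshold of 100 is an illustrative cut-off, not the actual value used, and the loyalty card number is looked up from the data rather than hard-coded.

```python
# Find the loyalty card behind the unusually large purchases, inspect its
# transactions, and then exclude that customer from the analysis.
outlier_cards = transactions.loc[
    transactions["PROD_QTY"] > 100, "LYLTY_CARD_NBR"  # illustrative threshold
].unique()
print(transactions[transactions["LYLTY_CARD_NBR"].isin(outlier_cards)])

transactions = transactions[~transactions["LYLTY_CARD_NBR"].isin(outlier_cards)]
```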
For the final touch, let's save the cleaned data! All right, we have cleaned the transaction data and are ready to jump into the next analysis. See you in the next part!