Retail Strategy for Chip Products Part 1: Data Cleaning

Tools used in this project
Retail Strategy for Chip Products Part 1: Data Cleaning

About this project

This project is part of extracurricular activity program with Quantium, a company who helps clients to maximize their data assets. In this opportunity, my role as part of retail analytics team and I've been assigned to help the Category Manager for Chips to understand the types of customers who purchase chips and their purchasing behavior within the region.

About The Data

There are two data that will be used here, Transaction Data and Customer Behavior. Let's dive in!

Transaction Data

The transaction data comprehensively records all details of each transaction. The data, presented in Excel format, includes the following information:

  1. DATE: contains transaction date
  2. STORE_NBR: Store Number or Store ID
  3. LYLTY_CARD_NBR: Loyalty Card Number, similar to Customer ID
  4. TXN_ID: Transaction ID
  5. PROD_NBR: Product Number or Product ID
  6. PROD_NAME: Product Name
  7. PROD_QTY: The purchase quantity in a single transaction
  8. TOT_SALES: Total sales of the product (in AUD)

Here is a snippet of the transaction data, offering a glimpse into the details of a few sample transactions: undefined

Customer Behavior

Customer Behavior data encapsulates comprehensive information about customers. Presented in CSV format. the dataset includes several key fields, as outlined below:

  1. LYLTY_CARD_NBR: Loyalty Card Number, more similar to Customer ID
  2. LIFESTAGE: Customer attribute that identifies whether a customer has a family or not and what point in life they are at
  3. PREMIUM_CUSTOMER: Customer segmentation involves distinguishing shoppers based on the price range and product categories they prefer. This strategy helps determine whether customers prioritize spending more for quality or brand, or if they opt for more budget-friendly options

Presented here is a snippet of the customer behavior data:


My Approach

Before proceeding with data preparation, let's see what information we can extract from the dataset:

  1. Top chips brands that generate high sales.
  2. The most popular chips brand.
  3. The most popular packing size.
  4. Average chips purchased per customer.

To accomplish this, we need to extract 3 elements from "PROD_NAME" column: the brand name, chip variant, and packaging size. Before diving into the extraction process, let's see the preceding steps.

Importing Libraries and Modules

In this project, we employ some libraries and a module:

  1. Pandas: Library for data manipulation and analysis.
  2. RegEx (re): The re module provides support for regular expressions, which are powerful tools for pattern matching and string manipulation. In this project, we'll use the 're' module to clean up text.
  3. Seaborn: Data visualization library for Python. In this project, we'll use Seaborn to create a box plot to identify outliers.
  4. NumPy: Library for numerical computing in Python.


Reading Transaction Data

The next step is reading transaction data. We can use Pandas function: pd.read_excel() to read the excel file.


Data Formatting

After reading the data, we need to verify the correctness of the format for each column. We can use .dtype to check data type of each column. undefined

As you can see in the image above, the Date data type is presented in integer format. Therefore, we need to convert it into date type.


'unit' parameter tells the function how to interpret the numeric values in the input data. In this case, the numeric values represent the number of days ('D') since a specific origin.

Cleaning Product Name

Now, we're delving into the challenging task of cleaning up the product names. At this stage, we aim to extract the three key elements: the brand name, variant, and packaging size. However, we encounter the challenge of dealing with inconsistent text. In the following image, we see several issues are apparent, such as:

  1. Multiple paces -> The spaces that separate the text vary, with instances having 6 whitespaces, 4 whitespaces, and so on.
  2. Inconsistent brand name -> For instance, as depicted in the image below, we observe products like "Natural Chip Compny" and "NCC". After a bit of research on Google, it revealed that these are the same brand. Another case includes "Woolworth" and "WW", which, upon investigation, were found to represent the same brand.


To remove the multiple spaces, we can use re.sub() to replace one or more consecutive spaces ('\s+') with a single space. It is applied to each element in the 'PROD_NAME' column using .apply() method with a lambda function. Then, we can map the inconsistent brand names to the corrected versions using the 'replace_item' dictionary. undefinedAfter running the code, we have successfully removed the spaces in our product name. We’ve also replaced the abbreviation and inconsistent text with the correct one. Let’s take a look at the output!

undefinedYou might be thinking that our job is done. But hold on! Do you see a red underline in the picture above? That’s our next issue. Some products have variant names and packaging that are not separated by space. To address this concern, we need to add a space between the variant and packaging size as illustrated in the code below.

undefinedThe initial step is using 'compile()' function to compile a regular expression pattern that matches specific pattern. In this case, the pattern is a single alphabetical character ([a-zA-Z]) followed by a single digit (\d). Then, we create a replacement function called “add_space” to add a space between the letter and the digit in the matched substring. To apply the replacement function, we use ‘re.sub()’ method and list comprehension to iterate through each product name. Here is the output produced by the code above. undefinedAfter ensuring brand name, variant, and size are separated by space, we’re ready to extract them from the 'PROD_NAME'.


We choose to extract brand name first because the pattern is easy to spot. Our brand name is located at the first word. To extract it, we can create a ‘BRAND’ column and take the first word using ‘str.split()’ method and ‘str[0]’ (selecting first element). Afterward, we update our brand name using 'update_brand()' function. This function joins the extracted words with a space. For example, we want to add a space for “RedRockDeli” or “NaturalChipCo”. The output will be “Red Rock Deli” and “Natural Chip Co”.

Once the brand name is addressed, our attention turns to the ‘SIZE’ column. As illustrated in the code above, the fourth line aims to extract the packaging size, identified by the pattern of digit (\d) followed by a letter ([a-zA-Z]) (125g, 175g, etc) from the product name. However, we might encounter an anomaly where the packaging size is positioned in the middle of the text, as shown in the image below.


To anticipate this issue, the fifth line of code above attempts to extract the packaging size from the middle of the text. Additionally, we introduce a new column, ‘PACKING_AT_END’ to indicate whether the packaging size is located at the end of the product name. You might be wondering why we add this column, but don’t worry—I will explain it later. Moving on to the next step, we create an empty ‘VARIANT’ column.


Now, we're entering the final step of word extraction: the variant name. Remember our underrated guy, ‘PACKING_AT_END’? Well, it's time for him to shine. Keep in mind that we have two conditions:

  1. Packaging size that is located in the end of the text.
  2. Packaging size that is located in the middle of the text.

Based on these conditions, we’ve created two lines of code to handle them specifically. As you can see in the code above, the first line of code enables the ‘VARIANT’ to capture any strings between the brand name and packaging size if the packaging size is located in the end of the text. On the other hand, the second line of code helps ‘VARIANT’ to concatenate strings between ‘BRAND’ and ‘SIZE’ with strings after ‘SIZEif the packing size is located in the middle of the text. The last two lines of code standardize the size unit in lowercase form and replace blank ‘VARIANT’ with “Original”.

undefinedWell, let’s take a short break! I need to grab a cup of coffee first...

I'm back!

After ensuring the clarity of the variants, the next step involves reshaping our data frame through the processes of filtering, removing, and ordering columns. Initially, we’d like to create a new column, which, we can call it “Category”. This column assists us in classifying products into chips and non-chips categories. For your information, we have unknown variants that contain “Salsa” word. According to Google, Salsa is a common ingredient in Mexican cuisine, often served as a condiment. You can see a list of “Salsa” products in the image below.


You might be tempted to classify Salsa as a non-chip product. I think you shouldn’t do that. As shown in the list above, we have two chips that contain the word “Salsa”: Smiths Crinkle Cut Tomato Salsa 150g and Red Rock Deli SR Salsa & Mzzrlla 150g. Instead of using “Salsa”, I prefer to use Dip as a condition to classify the product as a non-chip. After searching it in Google, I found that dip is similar to sauce.


Based on the list of dip product above, Smiths Crinkle Cut French OnionDip is a chip. Therefore, we need to create a condition: if a product name contains “Crinkle”, then categorize it as “Chips”. On the other hand, if a product name contains “Dip”, then set the value as “Non-Chips”.

undefinedThe next thing to do is filtering only chips category. The first line of code below demonstrates how to achieve this. Now, our DataFrame exclusively contains chips product. However, it’s important to note that “Salsa” products still exist. To ensure that “Salsa” products are not included in our next analysis, we can filter out variant names that contain either “Mild” or “Medium”, indicating Salsa spiciness level.


Once we’ve done, we can proceed into the next steps, removing and ordering columns. Since we have created brand, variant, and size columns, the product name column becomes unnecessary. Other than that, ‘PACKING_AT_END’ and ‘CATEGORY’ are no longer needed because we’re already sure that our DataFrame includes only chip product. To remove the columns, we need to specify columns we want to remove in a list. Afterwards, we use ‘drop()’ method to remove the columns, as depicted in the image below.

undefinedThen we finally determine the desired column order using a list and creating a new DataFrame with the specified column order.


Here is our final DataFrame: undefined

Checking for outliers in our DataFrame is crucial. We want to ensure that outliers do not influence our next analysis. Therefore, we can use the Seaborn library to visualize a Box Plot and identify any outliers. Let’s call boxplot() function to help us drawing a Box Plot!


We use ‘PROD_QTY’ to identify any unusual purchase amounts in a single transaction. As observed in the box plot above, we have transactions where customers bought chips in unusual quantities. Consequently, we will examine the customer ID using ‘LYLTY_CARD_NBR’ to further investigate.


Upon the provided code above, we determined that these transactions were associated with the same customer. Upon inspecting the transaction records, we discovered that the customer consistently buys the same product in the same quantity. Subsequently, we can exclude this entry from our upcoming analysis using the following code.

undefinedFor the final touch, let's save the cleaned data! undefinedAll right, we have cleaned transaction data and ready to jump into the next analysis. See you in the next part!

Discussion and feedback(0 comments)
2000 characters remaining