__STYLES__
Dataset Description
A link to the FIFA 21 messy raw dataset can be found here. Note: The dataset has approximately 18979 rows and 77 columns and also includes a data dictionary to give you insight into your cleaning process.
Data Cleaning Process
Pay Attention: Before every data cleaning process, be sure to make a duplicate of the dataset. This is to maintain originality and also serves as backup just in case you may need it again.
1. Cleaning the ID, Name, Long Name,FIFA 21 and Club column: After successfully loading data into my Excel worksheet, I noticed the name, long name, and club column had special character’s. The name column contains the players short name while the long name column contains the players full name. I’m guessing you already know the contents of the club column. I was able to clean these columns using power query. Once I loaded the data in UTF-8 type on the query editor it automatically eliminated all the special characters and afterwards I closed the editor and loaded the data on my existing worksheet. Voila! All done.
before
after
2. Cleaning the OVA, POT, BOV columns: The Overall Players’ Analysis (OVA) and Player Potential (POT) and Best Overall (BOV) columns were converted to percentages as written in the data dictionary. A custom column was used to multiply the value by 0.01 (100) to record the answer in percentage (%).
before
after
3. Cleaning the Contract, Joined, and Loan Date End column: While filtering the contract column, I realised it contained different values such as 2004~2021, Free, and Jun 30, 2021, On Loan. Also, the loan date end column had the same values as some of the filtered values in the contract column, so I went on to split the contract column to eliminate the repeated values (Jun 30, 2021) and also give a better analysis, as you can see below. Did you just look below? Yeah, you did.
before
after
before
after
5. Cleaning the value, wage, and release clause columns: Upon loading the data into my worksheet, the columns had data type issues and inconsistent values. The values have suffixes M and K, where M is millions and ‘K’ is thousands. In order to normalise the values, data type, and current currency type, I used the IF function to eliminate the suffixes M and K, where values that end with M are multiplied by a million and values that end in K are multiplied by a thousand.
before
after
6. Cleaning the W/F, SM, IR, and Hits columns: The following columns contain ratings on a scale of 1–5 with a special character in front of each value. The find and replace tool was used to eliminate the special characters. While filtering the Hits column, I found out some values had K meaning thousand in them, so I used the If function to eliminate the K and multiply the value by a thousand.
before
after
Conclusion: Cleaning up the data was a bit of a challenge, but you know what they say—practice makes perfect. It's not the easiest thing, but hey, we're getting the hang of it! You can find the clean data here. Thank you for reading.