Data cleaning identifies and fixes errors, duplicates, and irrelevant data in a raw dataset. As part of the data preparation process, data cleaning yields accurate, defensible data that supports reliable visualizations, models, and business decisions.
On average, organizations believe that nearly 30% of their data is inaccurate. This dirty data costs companies 12% of their overall revenue, and they lose more than just money. Cleaning produces consistent, structured, accurate data, which allows for informed and intelligent decisions.
This project uses a publicly available dataset of New York City (NYC) parking violations, which can be found here.
The NYC Department of Finance collects data on every parking ticket issued in NYC (~10M per year!). This data is made publicly available to aid in ticket resolution and to guide policymakers.
The aim of this project is to identify potential errors related to data entry, data inconsistency and data formatting, and then handle these errors using the various functions available in PostgreSQL. I have selected 24 of the 43 available columns in the dataset, the ones whose meaning and significance I understand best. I have attempted to fix the various errors identified in the dataset and then extract some valuable insights from the cleaned data.
The dataset consists of one table named “NYC2016_17” and contains the following columns:
If we execute the following query (a sketch of which is shown after this paragraph), we get a total of 527 different vehicle body types. Below are some of the results obtained. To me, this definitely appears unusual, because 527 is a very large number of vehicle body types. Based on careful inspection and some Google searching, it turns out that nearly identical vehicle body types are assigned different categories based on trivial details that may not matter much during further analysis.
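A minimal sketch of such a query, assuming the table is named nyc2016_17 and the body type column is named vehicle_body_type (actual column names may differ):

-- Number of distinct body type categories
SELECT COUNT(DISTINCT vehicle_body_type) FROM nyc2016_17;

-- Frequency of each body type category
SELECT vehicle_body_type, COUNT(*) AS total
FROM nyc2016_17
GROUP BY vehicle_body_type
ORDER BY total DESC;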
One of the vehicle body types is ‘P-U’, which stands for Pick-Up Truck. There is another category named ‘TRK’, which stands for Truck. A Pick-Up Truck can easily be categorized under the general category Truck, which covers anything that is a truck. The same logic can be applied to other categories as well, based on an individual’s understanding of the various abbreviations used for categorizing vehicle body types.
The query below shows the total number of records for both “TRK” and “P-U” before the changes are made. To make the relevant changes, I use the REPLACE function, one of the various string functions available in PostgreSQL. It takes the input string in which the replacement needs to be done, the string to replace, and the string with which it should be replaced. The syntax is as follows:
REPLACE (string, old_string, new_string)
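A minimal sketch of both steps, again assuming the body type column is named vehicle_body_type:

-- Counts before the change
SELECT vehicle_body_type, COUNT(*) AS total
FROM nyc2016_17
WHERE vehicle_body_type IN ('TRK', 'P-U')
GROUP BY vehicle_body_type;

-- Fold the Pick-Up Truck category into the general Truck category
UPDATE nyc2016_17
SET vehicle_body_type = REPLACE(vehicle_body_type, 'P-U', 'TRK')
WHERE vehicle_body_type = 'P-U';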
If we execute the previous query again, the results are as below: the new count for “TRK” is 2534, which is the sum of the 1924 and 610 records obtained before the update statement. Other changes I made using the same syntax are listed below as a, b, where a has been replaced by b:
Similarly, someone with knowledge of other vehicle body types can apply the same kind of change to other values using the query above, which may be beneficial when reporting grouped summary statistics or other insights.
In the step above, we knew exactly which category to replace and what to replace it with. Often, however, we have a group of very similar-sounding text strings representing identical values that differ only in spelling, as shown in the next data cleaning step.
When I query the DISTINCT vehicle colors, I get 756 rows, which again is a surprisingly large number of colors. Using the following query, a portion of the results is displayed which shows the data entry errors. As can be seen, for the color black we have B L, B LAC, BACK, BAK, B ACK, etc. Dealing with these types of errors with the REPLACE function can become cumbersome because there are a lot of values involved. So here I prefer the SOUNDEX algorithm approach, which uses the fuzzystrmatch extension in PostgreSQL.
The SOUNDEX algorithm produces a 4-character code for an input string based on how the string sounds. The DIFFERENCE function then computes the Soundex codes of two strings and reports how many of the four code positions match. If the difference is 4, the strings compared sound alike when pronounced and only their spellings differ, which may be due to human error during data collection or entry. For example, the following query returns all the colors whose sound matches GRAY.
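A sketch of that check, assuming the color column is named vehicle_color and that the fuzzystrmatch extension has been installed:

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Colors whose Soundex code fully matches that of 'GRAY'
SELECT DISTINCT vehicle_color
FROM nyc2016_17
WHERE DIFFERENCE(vehicle_color, 'GRAY') = 4;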
Therefore, all these rows are updated using the following query.
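A sketch of that update, using the same assumed column name:

UPDATE nyc2016_17
SET vehicle_color = 'GRAY'
WHERE DIFFERENCE(vehicle_color, 'GRAY') = 4;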
This implies that 66,747 rows of data have been updated to the color GRAY.
I have used the same query to update WHITE, RED, BLUE, YELLOW, GREEN, PINK, SILVER and BROWN, and a total of 314,152 rows have been updated.
Running the query again to find the distinct colors in the data, we get 645 colors, so I have been able to reduce the number of distinct colors by 111. Using the same approach, many other colors that are actually the same but have different representations in the dataset can be updated.
Potential drawbacks of using the above approach: it may go wrong in certain cases. For example, DIFFERENCE(‘GRAY’, ‘GR’) gives 4, so the query above will replace GR with GRAY. But GR could just as well stand for GREEN, and a similar argument applies to GREE and GGREE.
So the alternative approach is to filter out the values ‘GR’, ‘GREE’ and ‘GGREE’ and then apply the update statement. The deeper one goes into the details, the more arduous the task of getting the correct colors becomes.
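A sketch of the filtered update, under the same column name assumption:

UPDATE nyc2016_17
SET vehicle_color = 'GRAY'
WHERE DIFFERENCE(vehicle_color, 'GRAY') = 4
  AND vehicle_color NOT IN ('GR', 'GREE', 'GGREE');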
The column plate_id contains the registration plate number of the vehicle booked for the violation. Sometimes we may want to protect the privacy of parking violation recipients, for example in a report to be released on the web. In that case, it is better to replace all letters in the plate_id column with a dash (-) to mask the true license plate number. Here I used the REGEXP_REPLACE function, one of the pattern matching functions in PostgreSQL. The syntax is as follows:
REGEXP_REPLACE (source, pattern, replace, flags)
For example, REGEXP_REPLACE(xyzzy, '\d', '_', 'g') would replace any digit character (\d) in the column xyzzy with an underscore (_). The g ("global") flag ensures every match is replaced. So, as we can see in the examples above, the REGEXP_REPLACE function gives us greater flexibility than the simple REPLACE function, which only works on exact matches.
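A sketch of masking the letters in plate_id with dashes for a privacy-friendly report; a SELECT is shown here so that the original values are not overwritten:

SELECT plate_id,
       REGEXP_REPLACE(plate_id, '[A-Z]', '-', 'g') AS masked_plate_id
FROM nyc2016_17
LIMIT 10;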
Dealing with missing data can follow several approaches, depending on the data that is actually missing. If data is missing from a numerical column, one approach is to replace it with the mean or median of the other values in the same column. Many other approaches are possible for the same missing value; the choice depends on the data and on the domain knowledge of the person dealing with it.
Here we deal with missing values in the Vehicle Body Type column; there are 3993 records missing this value. To decide how to handle them, I execute the following query to check the most commonly owned vehicle body types. About 35% of people own SUBN (suburban: a car with windows along the sides in the rear and rear seats that can be folded or removed so the vehicle can carry cargo) and 29% own 4DSD (4-door sedan), which is similar to SDN and 2DSD (2-door sedan). Intuition suggests replacing all the NULL vehicle body types with the most commonly owned one, so suburban or sedan becomes the obvious choice; in that case, we may use a query like the one sketched after this paragraph to make the change. On the other hand, if we want more accuracy and are ready to spend more resources, we can look up each car's information from its registration plate number, which is present in almost all the records, and then fill in each missing value manually. The car records database can be accessed after discussion with, and permission from, the relevant authorities.
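A sketch of the frequency check and the possible most-common-category imputation, with vehicle_body_type as the assumed column name and SUBN used for illustration:

-- Share of each vehicle body type among all records
SELECT vehicle_body_type,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct
FROM nyc2016_17
GROUP BY vehicle_body_type
ORDER BY pct DESC;

-- Possible imputation with the most commonly owned body type
UPDATE nyc2016_17
SET vehicle_body_type = 'SUBN'
WHERE vehicle_body_type IS NULL;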
For the time being, the best approach is to replace the missing Vehicle Body Type with ‘Unknown’, so that these records can be dealt with later. The COALESCE function returns the first non-null value among its arguments, so for all null vehicle body types it will return Unknown.
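A sketch of the COALESCE-based update, under the same column name assumption:

UPDATE nyc2016_17
SET vehicle_body_type = COALESCE(vehicle_body_type, 'Unknown');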
Since the summons number is the primary key in the dataset, it cannot be duplicated. But if the other columns are duplicated across different summons numbers, it would imply that a person has been issued multiple tickets for the same violation, which can certainly become grounds for dispute by the owner of the vehicle.
The official NYC website lists duplicate tickets as one of the reasons for disputing a ticket. It clearly mentions the following criteria under which a ticket shall be considered duplicate and deemed to be cancelled:
Here, the ROW_NUMBER() window function is used, partitioned by all the relevant factors. If all the factors hold the same values across rows, it returns 1, 2, 3 and so on within that partition. I have subtracted 1 from it, so that any output greater than 0 automatically marks a duplicate record. There are a total of 61 duplicate records. For clarity, one can look at the duplicate column where the values 1 and 2 appear consecutively: for the same location, same violation, same violation type and same time, multiple tickets were issued.
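A sketch of the duplicate-detection query; the columns used in the PARTITION BY are an assumption about which factors define a duplicate ticket:

SELECT summons_number,
       ROW_NUMBER() OVER (
           PARTITION BY plate_id, issue_date, violation_code,
                        violation_time, street_name, house_number
       ) - 1 AS duplicate
FROM nyc2016_17;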
So, in order to get rid of those records, we can simply use the query defined above, as shown below. All the disputable tickets have now been cleared from the system, preventing potential disputes from people charged with multiple violations.
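A sketch of the deletion, using the same assumed duplicate definition:

DELETE FROM nyc2016_17
WHERE summons_number IN (
    SELECT summons_number
    FROM (
        SELECT summons_number,
               ROW_NUMBER() OVER (
                   PARTITION BY plate_id, issue_date, violation_code,
                                violation_time, street_name, house_number
               ) - 1 AS duplicate
        FROM nyc2016_17
    ) numbered
    WHERE numbered.duplicate > 0
);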
Invalid values are those that deviate from the values normally expected in a column, for example a string value appearing in a column that should only contain integers. Database constraints can help prevent invalid data from entering the database, but in that process we may end up losing valuable information because the database rejects the invalid records. In the queries below, I have tried to identify invalid values in some columns. One query returns 3376 records that do not match the standard criterion for the Registration State column, namely that it must contain two upper-case letters only; all of these invalid records contain 99 as the value for Registration State. Another query returns 5222 records that are invalid, because the 999 plate type is not defined by New York’s Department of Motor Vehicles. More information about plate types can be found here.
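Sketches of the two validity checks, with registration_state and plate_type as assumed column names:

-- Registration State should contain two upper-case letters only
SELECT COUNT(*)
FROM nyc2016_17
WHERE registration_state !~ '^[A-Z]{2}$';

-- Plate type 999 is not among the plate types defined by New York's DMV
SELECT COUNT(*)
FROM nyc2016_17
WHERE plate_type = '999';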
Other columns can also be checked for invalid values using regular expressions, or range constraints for values like Vehicle Year. But dealing with these invalid values requires domain knowledge or information from the NYC Open Data team, so that one can handle them correctly.
I have contacted the NYC Open Data department via their online query form to see whether they can answer my question regarding these invalid data points. The proof is attached below:
A quick look at Violation Time shows that it is stored in the following format: the hour and minute digits concatenated together, followed by A for AM or P for PM.
The following query converts this string data into a readable time for those string values that can be converted without errors. The others cannot be converted because they represent invalid time values, and as of now I do not know how to clean the values that are out of range.
Below is an attempt to convert the values into a valid format. Some string values of Violation Time are not valid and need to be filtered out so that the TO_TIMESTAMP function can produce an output. For example, the two digits representing the hour should not go beyond 12, because that makes no sense when AM and PM are used along with them; similar logic applies to values that have 0 for the hour. Time values whose hour is out of range, such as 4473P, need to be converted to NULL because they contain misleading and inconsistent information, so the two queries below can be used to make the necessary changes to the dataset.
In this way, one can clean the invalid Violation Time values and convert the valid values into a proper time format. The same approach can be applied to the other time columns, such as Time First Observed, From Hours In Effect and To Hours In Effect, and valuable insights can then be extracted from the data. I have not actually applied the update that sets values to NULL, because I do not want to lose the time information they contain. Both steps are sketched below, after which I wrap up the data cleaning and proceed to extract some insights from the data.
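A sketch of the conversion and the proposed NULL-ing, assuming violation_time is stored as text such as 0945A; the update is shown for illustration only, since I did not apply it:

-- Convert values that form a valid 12-hour time (hour 01-12, minute 00-59)
SELECT violation_time,
       TO_TIMESTAMP(violation_time || 'M', 'HH12MIAM')::TIME AS violation_time_clean
FROM nyc2016_17
WHERE violation_time ~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$';

-- Proposed (not applied) update to NULL out values that are not valid times
UPDATE nyc2016_17
SET violation_time = NULL
WHERE violation_time !~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$';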
Is there any difference between the frequency of parking violations during daytime and nighttime?
Here the query searches for times ending in A (AM) and assigns them to the morning category, and times ending in P (PM) to the night category.
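A sketch of this comparison, again assuming violation_time holds values ending in A or P:

SELECT CASE WHEN violation_time LIKE '%A' THEN 'Morning (AM)'
            ELSE 'Night (PM)'
       END AS time_of_day,
       COUNT(*) AS violations
FROM nyc2016_17
WHERE violation_time ~ '[AP]$'
GROUP BY time_of_day;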
Insight: Parking violations occur with almost equal frequency at both times of day, without any underlying pattern at the aggregate level.
What may be the reason for parking violations? Is it the case that violators attempt to park in restricted areas just before the parking restrictions end? Let's investigate this phenomenon.
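One way to sketch this check, assuming violation_time and to_hours_in_effect share the hour-minute-A/P text format; violations that straddle midnight are ignored here:

SELECT COUNT(*) AS violations_within_last_hour
FROM nyc2016_17
WHERE violation_time ~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$'
  AND to_hours_in_effect ~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$'
  AND TO_TIMESTAMP(to_hours_in_effect || 'M', 'HH12MIAM')
      - TO_TIMESTAMP(violation_time || 'M', 'HH12MIAM')
      BETWEEN INTERVAL '0 hours' AND INTERVAL '1 hour';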
Insight: Only a few parking violations occur within one hour before the parking restriction is supposed to end. In other words, there are only a few cases where people park in restricted areas within an hour of the restriction being lifted, so the data does not suggest that violators attempt to park in restricted areas just before the restrictions end.
As seen in the above two cases, a lot of problems during analysis are caused by inconsistent data in the columns Violation Time, From Hours In Effect and To Hours In Effect. If these columns were filled in a proper format, they would lead to better analytical results.
Are there any particular ticket-issuing agencies that fill the time-related columns with inconsistent data?
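A sketch of how such agencies could be identified, assuming issuing_agency holds the agency code and the three time columns share the same format (NULL time values are not counted by these pattern checks):

SELECT issuing_agency,
       COUNT(*) AS inconsistent_records
FROM nyc2016_17
WHERE violation_time !~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$'
   OR from_hours_in_effect !~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$'
   OR to_hours_in_effect !~ '^(0[1-9]|1[0-2])[0-5][0-9][AP]$'
GROUP BY issuing_agency
ORDER BY inconsistent_records DESC;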
Insight: The majority of the data inconsistency is created by agency P. Based on this analysis, members of agency P should be contacted and given proper training sessions so that they fill in the data correctly, avoiding potential barriers to analysis and ensuring data quality.
In the data, there are some parking violations at the intersection of two streets (records where the Intersecting Street column is not NULL). Obviously, parking violations at street corners can sometimes cause accidents.
So, which street corners have the greatest number of parking violations?
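A sketch of the corner ranking, with street_name and intersecting_street as assumed column names:

SELECT street_name,
       intersecting_street,
       COUNT(*) AS violations
FROM nyc2016_17
WHERE intersecting_street IS NOT NULL
GROUP BY street_name, intersecting_street
ORDER BY violations DESC
LIMIT 10;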
Insight: Here we have a list of the intersections where most of the parking violations occur. Based on this data, the authorities can take further steps to prevent improper parking there in the future.
Data cleaning is a very vast topic and there are many different ways to deal with inconsistent data. What I have learnt from this project is that to be efficient at data cleaning, one needs a good understanding of the data itself, of what every column means, and of the domain knowledge surrounding the particular dataset. If one does not have that domain knowledge, one should try to get in touch with someone who does, because directly applying cleaning methods may seem easy, but one may lose crucial information simply due to a lack of the proper knowledge required to work with the dataset.