__STYLES__
Cooperators General Insurance Company (CGIC) has collected quotes from customers which gives estimate information on the auto insurance coverage premium based on the information provided by the customers There are many times where a customer applies but may not end up purchasing insurance from that firm. Due to large amount of data, CGIC is therefore looking for a way to determine the quotes should be followed up, based on the probability that the quote would become "bound / binding." CGIC is interested in exploring whether the client details provided during the quoting process can be used to determine whether a quote can be bound. To do this, CGIC wants to use data mining techniques to better understand the differences in quotes among their highly diversified client base. The aim is to have a statistical model in place that can determine the probability of a quote being bound and will help assist the company in evaluating the quotes.
From the provided table of statistics generated using MS-Excel, we observe that attributes such as “TRACKING_SYSTEM”, “MARKING_SYSTEM”, “YEARS_AS_PRINCIPAL_DRIVER”, “OCCUPATION”, “VEHICLE_OWNERSHIP” and “VEHICLE_VALUE” have more than 50% of missing values. Thus, as there are more missing values, we neglect them in the analysis as this might result to type-2 error and invalid conclusions.
Descriptive Statistics table have been generated using SPSS for all the numerical attributes where the total records are 101891.
ANNUAL_KM: There is no missing value for this attribute. There are a few clients who have not driven the vehicle and the other clients have driven the vehicle for a maximum of 20M kilometers.
COMMUTE_DISTANCE: There are 35413 out of 101891 missing values for this attribute. During commute, clients travel distance range from 0 to 98000 km.
VEHICLE_VALUE: There are only 37218 data available out of 101891. The estimate of the vehicle value from the client ranges from $0 to $0.8M.
YEAR_LICENSED: There are 101883 data values for this variable. Minimum and maximum values are 0 and 135 respectively. This means that there a few clients with no license and others with license up to 135 years.
YEARS_AS_PRINCIPAL_DRIVER: There are 612 data values existing for this variable. There are clients with no driving experience as a principal driver on their specific vehicle and other clients with up to 50 years of experience.
Many values in the makes column have spelling mistakes, as a result, error or noise can be seen in the analysis. Also, confusion between the Vehicle Make and Model could be noticed. We have taken necessary steps to clean these values considering our knowledge about car.
Firstly, we will replace wrong names with possible known brand names. Few entries in this column contains Trucks and Vans for the purpose of learning. Therefore, we created new column named “vehicle category”. We could also use this dimension for further analysis.
We deleted the undisclosed data in vehicle make column. Irrelevant details were also sorted. In Vehicle model column, duplicate data was rectified as well as the spelling mistakes were corrected.
In ANNUAL_KM column, blank values were replaced by ‘unknown’. Threshold value for the Annual values are taken as 100k – 140k, these values are taken from the TD Insurance company. Mean of the entries were taken that fall within the threshold range and unknown values were then replaced by the mean value.
In COMMUTE_DISTANCE column, All the ‘0’ values are software error and these values were replaced by the mean calculated which was approximately 14.2Km. Also, all unknown and blank values were replaced by 14 Km. Besides this, we applied average of annual km to treat all the unknown values for commute category. We used 16416km for commute category, similarly 23090 Km for Business category, and 11549 Km for Pleasure category.
In case of GENDER, we assumed that majority fall under the male category, hence we treated all unknown and blanks as the “Male” class.
In case of YEAR_OF_BIRTH, we replaced all unknowns with the formula i.e. “[(2016) – (Mean of Age)]”. For AGE we used Mean to replace all outliers and noise.
To analyse POSTAL CODE, we created three columns named Province, Region and City. Then, deleted all area code that do not belong to Canada. We deleted those entries that have unknown in above stated columns. We also deleted records for Area codes.
In case of AREA CODE, all unknown area codes were replaced with province’s majority cities. And for CITY column, we used Postal code and area code, but in case of conflict, we prioritized based on Province.