__STYLES__
THE DATA
The data comes from Kaggle "Used Car prices in India" . This data set contains about 20k cars data with 13 features. In this Project I am going to use the SQL (Structured Query Language) particularly POSTGRESQL as a tool to clean and analyze the dataset.
The First Step I have used is to clean the Data so that we can analyze it in a better way. In this step I have used the data.io as a cloud platform where I uploaded the data and cleaned the data using the REPLACE, CASE statement, CONCAT, SUBSTR, STRPOS etc. functions.
Then in the next step I have analyzed this dataset using GROUP BY, Windows function, Case statements features.
Data Cleaning Steps:
WITH First_step_cleaning AS
(
SELECT
*,
REPLACE("new_price",'New Car (On-Road Price) : Rs.','') as "New Price",
REPLACE("mileage",'Mileage','') as "Mileage()",
REPLACE(REPLACE("engine",'Engine',''),'CC','') as Engines,
REPLACE(REPLACE("max_power",'Max Power',''),'bhp','') as "Max Power" ,
UPPER(SUBSTR("full_name",1,STRPOS(full_name,' '))) as Brand,
REPLACE(REPLACE("selling_price",'*','') ,',','') as cleaned_price,
REPLACE("seats",'Seats','') as "No. Of Seats",
REPLACE(REPLACE("km_driven",'kms',''),',','') as kms_driven
FROM "cardekho_updated"
),
second_step_cleaning AS
(
SELECT
CASE
WHEN Brand = 'LAND' THEN 'LAND-ROVER'
ELSE Brand
END as "Brand_Name",
INITCAP(full_name) as "Car Name",
bought_year, seller_type,
fuel_type, transmission_type,
kms_driven, "No. Of Seats",
"Mileage()", Engines AS "Engines(CC)",
"Max Power" as "Max Power(bph)",
CASE
WHEN cleaned_price ILIKE '%Lakh%' THEN (REPLACE(cleaned_price, 'Lakh', '')::numeric*100000)
WHEN cleaned_price ILIKE '%Cr%' THEN (REPLACE(cleaned_price, 'Cr', '')::numeric*10000000)
ELSE cleaned_price::numeric
END::integer as "Selling_Price",
SUBSTR("New Price", 1, STRPOS("New Price",'-')) as "Start_Price_Range",
SUBSTR("New Price", STRPOS("New Price",'-')+1, LENGTH("New Price")) as "End_Price_Range"
FROM first_step_cleaning
),
Third_step_cleaning AS
(
SELECT
*,
CASE
WHEN "Start_Price_Range" ILIKE '%Cr-%'
THEN (REPLACE("Start_Price_Range",'Cr-','')::numeric*10000000)
WHEN "Start_Price_Range" LIKE '%-%'
THEN (REPLACE("Start_Price_Range",'-','')::numeric*100000)
END as START_Price_Range,
CASE
WHEN "End_Price_Range" ILIKE '%Cr*%'
THEN (REPLACE("End_Price_Range",' Cr*','')::numeric*10000000)
WHEN "End_Price_Range" LIKE '%Lakh%'
THEN (REPLACE("End_Price_Range",' Lakh*','')::numeric*100000)
END as END_Price_Range
FROM second_step_cleaning
WHERE "End_Price_Range" LIKE '%Lakh%'
order by "Selling_Price"
)
SELECT
"Brand_Name",
"Car Name",
bought_year, seller_type,
fuel_type, transmission_type,
kms_driven, "No. Of Seats",
"Mileage()", " Engines(CC)",
"Max Power(bph)", "Selling_Price",
ROUND((START_Price_Range+END_Price_Range)/2,2) AS "Average Price For Same New Car"
FROM Third_step_cleaning;
I have analyzed the Used Car dataset to focus on the Main Business Question which I have explored are below:
1. Top 5 highest total number of cars from Brand.
2. Top 5 expensive Cars.
3. Bottom 5 Cheap Cars.
4. Top 2 expensive Cars in each Brand.
5. Top 5 Highest Average Selling Price Car Brand.
6. Bottom 5 Lowest Average Selling Price Car Brand.
7. Total Number of Vehicles having different Fuel Type category and their Average Price.
8. Total Number of Vehicles having different Seat number category and their Average selling Price.
9. Total Number of Vehicles having different Transmission type category and their Average selling Price.
10. Total Number of Vehicles having different Seller type category and their Average selling Price.
11. Total Number of Vehicles having different Engine Category and their Average selling Price.
12. Total Number of Vehicles having different Max Power(bph) Category and their Average selling Price.
13. Total Number of Vehicles having different Old years Category and their Average selling Price.
14. Total Number of Vehicles having different Kms driven Category and their Average selling Price.
15. Total Number of Vehicles having different Mileage Category(kmpl) and their Average selling Price.
Now let's Analyze each business Questions asked above.
1. Top 5 highest total number of cars from Brand.
SELECT
brand_name,
COUNT(*) AS "Total Number of Cars",
CONCAT(ROUND(COUNT(*)*100.0
/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),' %') as "% in Total"
FROM "Cardekho_Updated_dataset"
GROUP BY 1
ORDER BY COUNT(*) desc
LIMIT 5
Maruti Brand has highest about 30% of the total in car dekho application, while second is Hyundai with about 20% then Honda about 10%
2. Top 5 expensive Cars on cardekho Application.
SELECT
brand_name,
"Car Name",
CONCAT(ROUND(("selling_price"*1.0/10000000),2),' Cr') as "Top 5 Expensive Car"
FROM "Cardekho_Updated_dataset"
ORDER BY "selling_price" desc
LIMIT 5
FERRARI Gtc4lusso T was the car having the highest Price of 3.95 Cr while 2 cars from BENTLY are in Top 5.
3. Bottom 5 Cheap Cars.
SELECT
brand_name,
"Car Name",
CONCAT(ROUND(("selling_price"*1.0)/1000,1),' K') as "Bottom 5 Cheap Car"
FROM "Cardekho_Updated_dataset"
ORDER BY "selling_price"
LIMIT 5
The cheapest Car was from HYUNDAI named Santro Xing XI of price 25K whereas 3 of 5 cheapest cars from TATA Brand.
4. Top 2 expensive Cars in each Brand.
WITH CTE AS
(
SELECT
brand_name, "Car Name",
selling_price,
DENSE_RANK() OVER(partition by brand_name order by "selling_price" desc) as dr
FROM "Cardekho_Updated_dataset"
GROUP BY 1,2,3
)
SELECT
brand_name, "Car Name",
CONCAT(ROUND(selling_price*1.0/100000,2),' Lakhs') as "Selling Price"
FROM CTE where dr<=2
5. Top 5 Highest Average Selling Price Car Brand.
SELECT
brand_name,
CONCAT(ROUND(AVG(selling_price)/100000,2),' Lakhs') as "Average_selling_Price"
FROM "Cardekho_Updated_dataset"
GROUP BY 1
ORDER BY AVG(selling_price) desc
LIMIT 5
Highest Average Selling Price is for FERRARI, then ROLLS-ROYCE, LAMBORGHINI, BENTLEY and last MESARATI.
6. Bottom 5 Lowest Average Selling Price Car Brand.
SELECT
brand_name,
CONCAT(ROUND(AVG(selling_price)/100000,2),' Lakhs') as "Average_selling_Price"
FROM "Cardekho_Updated_dataset"
GROUP BY 1
ORDER BY AVG(selling_price)
LIMIT 5
DAEWOO is the brand that has the Lowest Average Selling Price 78,000 while AMBASSADOR and OPENCORSA are other brands.
7. Total Number of Vehicles having different Fuel Type category and their Average Price.
SELECT
fuel_type,
COUNT(*) as "Vehical Count",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Avg_Selling_Price (in lakh)",
CONCAT(ROUND(COUNT(*)*100.0
/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),'%') as "Total Vehical %"
FROM "Cardekho_Updated_dataset"
GROUP BY 1
ORDER BY 2 desc;
There are about 49.14% of Total cars are having Diesel fuel type with Average Selling Price 9.42 Lakhs, 48..88% of Total cars are having Petrol fuel type with Average Selling Price 5.49 Lakhs However Electric Vehicles about 0.07% of Total cars are having Average Selling Price is highest which is 11.55 Lakhs.
8. Total Number of Vehicles having different Seat number category and their Average selling Price.
SELECT
"No. Of Seats",
COUNT(*) as "Total Vehicle",
CONCAT(ROUND(COUNT(*)*100.0
/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),'%') as "Total Vehicles(%)",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Average selling Price"
FROM "Cardekho_Updated_dataset"
WHERE "No. Of Seats" IS NOT NULL
GROUP BY "No. Of Seats"
ORDER BY COUNT(*) desc
LIMIT 5
There are 83% of Vehicles are having 5 seats having Average Selling Price about 6.7 Lakhs, 12 % of Vehicles are having 7 seats with Average Selling Price 10.93 Lakhs and 4 Seats cars have the Highest Average Selling Price 20 Lakhs.
9. Total Number of Vehicles having different Transmission type category and their Average selling Price.
SELECT
transmission_type,
COUNT(*) AS "Total Number",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Average Selling Price",
CONCAT(ROUND(COUNT(*)*100.0/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),'%')
as "% Total"
FROM "Cardekho_Updated_dataset"
GROUP BY 1
There are 80% of Cars which are driven manually whereas remaining are Automatic.
10. Total Number of Vehicles having different Seller type category and their Average selling Price.
SELECT
seller_type,
COUNT(*) as Total_Count,
CONCAT(ROUND(COUNT(*)*100.0
/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),'%') as "% Total"
FROM "Cardekho_Updated_dataset"
GROUP BY 1
ORDER BY 2 desc
There are 60% of Total Vehicles are from dealers whereas 39% are from Individuals.
11. Total Number of Vehicles having different Engine Category and their Average selling Price.
SELECT
CASE
WHEN "Engines(CC)"<=1000
THEN '1000 CC - or below'
WHEN "Engines(CC)">1000 AND "Engines(CC)"<=1500
THEN '1000 CC - 1500 CC'
WHEN "Engines(CC)">1500 AND "Engines(CC)"<=2000
THEN '1500 CC -2000 CC'
WHEN "Engines(CC)">2000 AND "Engines(CC)"<=3000
THEN '2000 CC - 3000 CC'
WHEN "Engines(CC)">3000 AND "Engines(CC)"<5000
THEN '3000 CC - 5000 CC'
WHEN "Engines(CC)">=5000
THEN '5000 CC or more'
END as "Engine(CC) Category",
COUNT(*) as "Total Count",
CONCAT(ROUND(COUNT(*)*100.0/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),'%')
as "% Total",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Average Selling Price"
FROM "Cardekho_Updated_dataset"
WHERE "Engines(CC)" IS NOT NULL
GROUP BY 1
ORDER BY 1
There are about 59% of the Total Vehicles which are having the Engine capacity 1,000 CC - 1,500 CC where as the 14% of the Total Vehicles are having below 1,000 CC. We can also see that the Average Selling Price is directly dependent on the Engine capacity.
12. Total Number of Vehicles having different Max Power(bph) Category and their Average selling Price.
SELECT
CASE
WHEN "Max Power(bph)"<=100 THEN '100 bph - or below'
WHEN "Max Power(bph)">100 AND "Max Power(bph)"<=150 THEN '100 bph - 150 bph'
WHEN "Max Power(bph)">150 AND "Max Power(bph)"<=200 THEN '150 bph -200 bph'
WHEN "Max Power(bph)">200 AND "Max Power(bph)"<=250 THEN '200 bph - 250 bph'
WHEN "Max Power(bph)">250 THEN '250 bph or more'
END as "Max Power(bph) Category",
COUNT(*) as "Total Count",
CONCAT(ROUND(COUNT(*)*100.0/(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),'%')
as "% Total",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Average Selling Price"
FROM "Cardekho_Updated_dataset"
WHERE "Max Power(bph)" IS NOT NULL
GROUP BY 1
ORDER BY 1
There are 66% of Cars are having Max Power having 100 bph or below then 22% of Cars are having Max Power between 100 bph - 150 bph. However the as the Max Power is directly dependent on Average Selling Price.
13. Total Number of Vehicles having different Old years Category and their Average selling Price.
WITH CTE AS
(
SELECT
*,
(SELECT MAX(bought_year) from "Cardekho_Updated_dataset")
- bought_year AS "How much old(years)"
FROM "Cardekho_Updated_dataset"
ORDER BY bought_year desc
)
SELECT
CASE
WHEN "How much old(years)"<=3 THEN '3 years - or below'
WHEN "How much old(years)">3 AND "How much old(years)"<=7 THEN '4 years - 7 years'
WHEN "How much old(years)">7 AND "How much old(years)"<=11 THEN '8 years -11 years'
WHEN "How much old(years)">11 AND "How much old(years)"<=20 THEN '12 years - 20 years'
WHEN "How much old(years)">20 THEN '20 year or more'
END as "How much old(years) Category",
COUNT(*) as "Total Number of Cars",
CONCAT(ROUND(COUNT(*)*100.0/(SELECT COUNT(*) FROM CTE),2),' %') as "% Total",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Avg selling price"
FROM CTE GROUP BY 1
order by 2 desc
As the years old increases the Price of the car decreases. 47% of the total cars are in between 4 to 7 years old category
14. Total Number of Vehicles having different Kms driven Category and their Average selling Price.
SELECT
CASE
WHEN "kms_driven"<=10000 THEN '10000 kms - or below'
WHEN kms_driven>10000 AND kms_driven<=30000 THEN '10000 kms - 30000 kms'
WHEN kms_driven>30000 AND kms_driven<=70000 THEN '30000 kms - 70000 kms'
WHEN kms_driven>70000 AND kms_driven<=100000 THEN '70000 kms - 100000 kms'
WHEN kms_driven>100000 THEN '100000 kms or more'
END as "kms_driven Category",
COUNT(*) as "Total Number of Cars",
CONCAT(ROUND(COUNT(*)*100.0/
(SELECT COUNT(*) FROM "Cardekho_Updated_dataset"),2),' %') as "% Total",
CONCAT(ROUND(AVG("selling_price")/100000,2),' Lakhs') as "Avg selling price"
FROM "Cardekho_Updated_dataset"
WHERE kms_driven IS NOT NULL
GROUP BY 1
order by 2 desc
There are highest 48% of the Vehicles are in the range of 30,000 - 70,000 kms driven Range. However as the kms driven increases the AVG Selling Price decreases.
15. Total Number of Vehicles having different Mileage Category and their Average selling Price.
WITH CTE AS
(
SELECT
*,
REPLACE(REPLACE("Mileage(kmpl/kmkg)",' kmpl',''),' km/kg','')::numeric as new_mileage
FROM "Cardekho_Updated_dataset"
WHERE fuel_type IN('Diesel','Petrol')
)
SELECT
CASE
WHEN "new_mileage"<=10 THEN '10 kmpl - or below'
WHEN "new_mileage">10 AND "new_mileage"<=15 THEN '10 kmpl - 15 kmpl'
WHEN "new_mileage">15 AND "new_mileage"<=20 THEN '15 kmpl -20 kmpl'
WHEN "new_mileage">20 AND "new_mileage"<=25 THEN '20 kmpl - 25 kmpl'
WHEN "new_mileage">25 THEN '25 kmpl or more'
END as "mileage kmpl Category",
CONCAT(ROUND((COUNT(*)*100.0)/(SELECT COUNT(*) FROM CTE),2),' %') as "% Total"
FROM CTE
WHERE "new_mileage" IS NOT NULL
GROUP BY 1
ORDER BY (COUNT(*)*100.0)/(SELECT COUNT(*) FROM CTE) desc;
There are about 43% of the Total (Diesel/Petrol)vehicles are having Mileage between 15kmpl - 20 kmpl whereas 20kmpl - 25 kmpl range is having 34% of total. However the Average Selling Price is higher if the mileage is Lower.
INSIGHTS
After looking at the data, I noticed that CarDekho Platform is a very good platform to buy/sell the Cars of Brand Like Maruti, Hyundai, Honda etc. Also If you Want to Buy a car of Fuel Type Diesel/Petrol then here you can buy those Cars. The Price of the Car Ranges between 5 Lakhs to 20 Lakhs. It also majorly contain 83% cars which are of 5 Seater.
There are many factors that affect the Price of Used cars like if mileage is lower then Selling price would be higher, kms driven are lesser then Selling price would be more, Fuel Type is also a major factor like if the Car is electric then it's price would be higher, It also depends on the car brand like FERRARI, LAMBORGHINI ROLLS-ROYCE, BENTLEY, MESARATI etc. are of higher price. It also depends on the Engine Capacity like if the Engine CC is more then Selling price would be more. also if the Max Power is more then the Selling Price would be more.