__STYLES__
Click here to view this project on my GitHub.
Explore cancer mortality rates around the world:
I started with a dataset of cancer death counts, by country, year (1990-2016) and cancer type (20 types).
The first problem I encountered was that the numbers were death counts, and since my goal was to compare mortality rates across countries and time, of course I had to take into account population size. I found a dataset which contained population counts by country and year, and then I was able to start working.
The second problem was that the cancer death table was pivoted – each cancer type data was in a different column, instead of a different row. Tableau has a great unpivotting feature, but since it doesn't actually alter the source files, it couldn't help me here. I unpivotted tables in SQL in the past, but here, with 20 columns to unpivot, I looked for a different solution, and eventually used Excel and ~10 minutes of manual work :)
Third problem – the population dataset was pivoted as well, but unlike the cancer dataset, here it wasn't just 20 columns, but 170... so manual cut-and-paste wasn't an option. Instead, I copied the population table to the cancer Excel file and used VLookup() to get the right population size in each row. Since I needed to lookup not only for the right year (lookup by row) but for the right country as well (lookup by column), I combined VLookup() with Match() – it looked like this:
= VLOOKUP (C2, PopulationTable, MATCH (A2, PopulationTableHeader, 0), false)
It was my first time ever combining VLookup with Match, and I loved how it worked like magic!
Fourth problem – so yes, VLookup and Match worked great and most countries got their population sizes, but many didn't, simply because their names were different in the cancer and population tables. There was no easy fix for that – I had to go one by one and check what the problem was. Sometimes it was just due to different spelling, but sometimes I had to look up these countries and find their alternate names, change them in one of the tables, and then the VLookup-Match functions were able to take it from there.
Once this was done, I had one table with both the cancer death counts and population sizes, but I decided to break it up to two, so that I'll be able to practice Join on SQL.
I loaded the tables to SQL Server and started querying.
First things first, joining the cancer and population tables and calculating the mortality rate per capita. This query results were going to be the basis for the rest of the project, so I turned them into a View. I love how it made all the next queries so much more readable and clear, than if I used a subquery instead.
When the View was ready, I wrote various queries to explore the data, such as:
Other queries helped me answer specific questions that arose when looking at the previous queries' results, such as:
One of the things these queries showed, is that liver cancer has one of the highest mortality rates worldwide (it's among the top five), but in Israel it's only at number 12-14. It might be useful to understand what causes this positive difference in Israel.