What to consider when scaling with Populations?

Tools used in this project
What to consider when scaling with Populations?

Comparison Dashboard

About this project

The use of absolute numbers doesn't always lead to the right decisions, prompting a common practice of scaling them. For instance, consider the count of sunburn cases in summer. When fewer people swim, this count decreases; with more people enjoying the sun, it rises. Just observing this number might suggest an increased risk of sunburn, but in reality, the risk remains the same when scaled relative to the number of sunbathers. Gaining insights on this requires relative values, which aren't always easily obtainable as they're tied to the number of people involved.

The problem with populations

Populations are not static; they change over time. Comparing different states or cities reveals varying rates of change in population. Some areas may see minimal change, while others might experience substantial increases. Dividing past cases by a higher population than the actual one in this year results in an underestimation of the original values, introducing a deviation in the data.

undefinedTable with populations when all years are selected.

To solve this issue, thre are three potential methods for scaling, assuming one knows the cases per year. And last but not least in addition the best practice - in my opinion.

The most accurate one:

One approach involves inputting the population data for each year in my case, the anual population. Then, in calculations, the average number of people within the chosen period is used. However, this method isn't entirely accurate.

To achieve high precision, weighted ratios would be necessary. This process entails calculating the ratio for each year and multiplying it by the reported cases. This calculation must be performed for every year within the selected period. Subsequently, these values are summed and divided by the total reported cases.

However, due to its time-intensive nature, I opted for a closer assumption, which involves using the average of the selected period. This approach was then compared to the time-saving alternatives.

The maybe time saving ones:

Numer One: You can scale the cases by using the most recent amount of people

This method is the "quick and dirty" solution, involving pulling in the population data for a single year—in my case, the population of 2022. The approach entails taking all reported cases and dividing them by the population of 2022. However, a significant caveat arises here. In my scenario, relying solely on the most recent population figures led to less accurate results, particularly for earlier years like 2004, with deviations reaching around 39%.

undefinedNumber Two: You take the average of the amount at the start/end of the period

I expanded on this method by calculating the average between the population at the start and end points of the period. When observing a specific year, like 2004 as depicted in the image above, this approach yields better results than employing the population of 2022. However, it falls short of the accuracy achieved by using the precise average population for that specific year, which is the population of 2004 in this instance.

With multiple years selected, the results improve gradually. The deviation lessens as more years are taken into account. and as states are affected, that had small incremental changes.

undefinedSo,I hope you can see, that it is not that easy to normalize values to populations. At least it is not done as fast as you may think in the first approach as the most accurate solution would be pulling in the annual population for each state.

But what would be an experiment with the best solution, that is both? Accurate enough and time saving?

My favorite one

For this one you also just have to pull in the populations of 2002 and 2022 and then you adjust the population for the selected years. So you make an linear interpolation between the year 2002 and 2022. Technically speaking this is not as eassy as it sounds, because by selecting more than one year, you first need to calculate an average year.

Besides of that you need:

Population 2002 = CALCULATE

Population 2002 keepfiltrs = 
    Population[Population 2002],
        Population[Year] = 2002

The same has to be done with the year 2022.

And in my case the formula for adjusting the population looks like that:

Adj Pop = Population[Population 2002 keepfiltrs]+((Population[Population 2022 keepfiltrs]-Population[Population 2002 keepfiltrs])/20*([Average Year]-2002))

As the poluation is growing more or less linear, for me a linear interpolation was enough. If you can see that your sample is not growing linear, you also need to adapt the interpolation.

By using this formula I get a population that is pretty one to the acutal annual population:

undefinedSo I hope this was helpful! And to get more into the issue, you can use the published Power BI dashboard.

Best, Jasmin

Population Data taken from Census.gov

Additional project images

Discussion and feedback(2 comments)
Chris Dutton
Chris Dutton
5 months ago
Thanks for sharing Jasmin! This is such an important and nuanced topic that many analysts may lose sight of, and a great resource for you to share with the community. Keep it up!
2000 characters remaining