Tools used in this project
Power Outages

Power Outages

About this project


This dataset comes from Maven Analytics’ monthly challenges. It contains data all the way from 2002-2023 on power outages in the US (provided by the Department of Energy). Some of the information included are demand (electricity) loss, region affected, number of people affected, cause of outage, and restoration time. The task was to clean the data, create a dashboard, and then talk about the trends while also finding the weak points in the grid.

Cleaning the Data

  1. First step was to look at all the sheets using sort and filter to get an idea of the data. I deleted the “alert criteria” column because it was irrelevant for my analysis and wasn’t included from 2002-2014. Then in the date column I removed the months rows which had text so that only the actual dates were left.
  2. I used formulas to combine the date and time into one column for the outage event as well as the restoration time using a yyyy-mm-dd hh:mm:ss format. In the process using find and replace to fix errors in the time (ex. turning a.m. to AM), also correcting time/dates which were in text such as “noon”.
  3. Next step was to tackle the area which I wanted to view at a state level. I used ChatGPT to create a formula using all the states and put them in a list using commas. Specifically, =IF(ISNUMBER(SEARCH("Alabama'', E4)), "Alabama", "") & …. all the way to Wyoming. There were some repeated areas that weren’t states such as Houston, TVA (Tennessee Valley Authority), Duke Energy Carolinas Balance Authority (North Carolina). I used Google to see which state these areas corresponded to and included them in the equation to make the formula work better. There were 576 records (outages) that involved 2 or more states.
  4. I did these steps for each page and then combined them into a single page, the following steps were done after this step.
  5. Values that said “over #” or “under #” I replaced it with the number. Values that said between, i replaced it with the mean of those two numbers.
  6. I consolidated the type of disturbance column into fewer categories. For example, vandalism included theft, sabotage, and physical attacks. Natural disasters included tropical storm, hurricane, flood, and earthquake. Fuel supply emergency included coal, hydro, natural gas, water, and petroleum.
  7. Then, I consolidated the regions into the 6 main NERC regions, there were several specific regions that were not part of the main 6 such as FRCC (Florida) which belonged to the SERC region. There was also SPP which belonged to the MRO region and other state specific regions which I looked up and put them in the appropriate region. There were also a few errors such as Florida in the TRE (Texas) region and Wisconsin in the RFC instead of MRO region so I fixed those as well.
  8. There were outages in Canada such as BC, Alberta, and Manitoba as well as Hawaii and Puerto Rico which i deleted since I wanted to focus on mainland US for my analysis (there were 89 outages that weren’t mainland US, 65 of them from Puerto Rico).
  9. I deleted rows where the restoration or outage time/date were missing. I figured that was the most important piece of information in this dataset (if not preventing these outages, seeing if they’re being fixed faster than before).
  10. Looking at the restoration time and outage time, some differences were negative. The problem was: some outages began 12/30 and ended 1/1 but the year was the same, it didn't switch to the next year so I fixed this. Also some of the restoration times finished at midnight or past midnight yet the date didnt change as it was supposed to so i fixed this
  11. After cleaning the original 3974 records, I now had 3496 records.
  12. One last thing I did was realize how any map visual would be slightly inaccurate since 576/3496 (16.48% of the data) involved 2 or more states. The way the data was collected, there's no way to know specifically how many people from each state were affected by an outage or how the loss was divided between those 2 or more states, but this additional step made it less inaccurate.
  13. I wanted to include a map visual of the # of people affected by an outage per state. My solution was to split the # of people affected by however many states the outage involved. (ex. If an outage affected 7500 people in the states Georgia, Alabama, and Louisiana I would make a new row for each state and put 2500 for the # of people affected). This was a bit tricky in excel, especially to keep them in order, I used formulas to count the number of states per outage in a new column. Another formula to separate the states horizontally so that each outage had their own row (some outages involved 10 states). Then i used power query to divide the total # of people by the count of states, and also unpivoting the separated states (putting them into a single column). I added this data to the bottom of my cleaned combined sheet, and made sure to delete the rows from the “area” and “# of people affected” column it corresponded to. Now, the map visual would be less inaccurate.


Some of the key metrics I wanted to look at were # of outages, demand loss, downtime, and # of people affected. For my analysis, I decided to focus on the year 2022, comparing it to the rolling average of the past 5 years (2018-2022). Nonetheless the dashboard is fully interactive including a “year” filter and being able to filter by clicking on the visuals. Other areas I wanted to compare were the causes of these outages as well as the line graphs for # of people affected and downtime. Total downtime by region is also on the dashboard. The last thing I wanted to analyze was the total people affected, on a state level with the map, and on a monthly level with the line graph.

Analyzing the Data

Looking at the # of outages, the average for the past 5 years has been 285.4. In 2022, there were 335 total power outages which is above average. The demand loss however, was significantly lower at a total of 63,110 megawatts. Comparing that to 2018-2022 which had an average demand loss of 93,362 MW annually, it shows that more outages doesn't necessarily mean more demand loss.

The same thing goes for downtime, despite having more outages, 2022 experienced 3,565 hours of downtime compared to the 2018-2022 average of 5,281.6 hours. This is the main reason I chose to put the stacked bar and line graph on my dashboard. To show that more electricity lost doesn’t correlate strongly with more downtime (more loss doesn’t automatically mean it takes longer to fix). Then there is the # of people affected which in 2022 was 11,285,742 compared to the 2018-2022 average of 14,125,518. Clearly there were less people affected than usual which is a positive thing.

Moving on to the causes, the top causes for outages between 2018-2022 were severe weather at #1, followed by system operations, and vandalism. In the year 2022 it was slightly different. Vandalism was the top cause, then system operations, severe weather, and the last big cause was interruption. There is a pattern here with the top 3 causes, just not in the same order. It experienced more vandalism than it normally has in the past.

Looking at the # of people affected from 2018-2022, there are peaks in February and September with low points in May and November. Comparing that to just 2022, the peak and fall off is similar in September-October but with a sharper decrease. However, there is no peak in February for the year 2022 unlike the past 5 years which suggests something must’ve changed and it worked. On a state level, the 5-year period and 2022 were similar for # of people affected. In both, states like FL, CA, TX, and NV were affected the hardest (logically given their larger population). But what stands out is Michigan, it was one of the most affected areas in that 5 year period as well as overall. This could be due to its location which after further examination seems to be the case because most of the state’s outages were from severe weather.

Lastly I wanted to look at the dashboard and analyze the results for all the years. What we see is that August is definitely the worst month for outages. Severe weather is the #1 cause for power outages (1,313) followed by vandalism (796). The RFC region, which is in the midwest, had the most downtime which is surprising. I would've thought a coastal region like the south (SERC) region would’ve been the most affected but it turns out the midwest is more affected. Maybe the support for outages could be less advanced in the midwest causing restoration time to take longer.

Key Takeaways

From this analysis, we can glean numerous key takeaways. The first being a need for better infrastructure. Looking at the top 3 causes, measures need to be taken to make the infrastructure more weather proof, vandal proof, and reliable. By making the infrastructure more resistant to weather, it can reduce the number of outages. By adding more surveillance or increasing the security of the infrastructure, it will deter vandalism. Then, to minimize system operation failures, the Department of Energy can invest in newer technology or frequently do maintenance to make the system less prone to failures.

Moreover, given that August is the worst month for outages, it would be beneficial to increase support during that time. Whether it be monitoring power plants, generators, or whatever it may be more closely. Allocating more staff during that month would also be a good idea for fixing outages more efficiently. In regards to the “area” portion of each record, using zip code instead of city/state could be a better way to streamline the process. The last takeaway from this is that the data collection process needs to be improved. There should be just one template used to record all outages because in the original dataset, the sheets varied significantly. And there was a lot of incomplete information so getting as much detail from each power outage occurrence would help improve the integrity of the data.

Additional project images

Discussion and feedback(0 comments)
2000 characters remaining