__STYLES__

Where do the lights go out most often? - Power Outages in US [Finisher]

Tools used in this project
Where do the lights go out most often? - Power Outages in US [Finisher]

Dashboard Power Outages U.S.

About this project

OBJECTIVE

I'll be playing the role of a Senior Analytics Consultant hired by the U.S. Department of Energy (DOE).

As Electricity outages are a growing concern entering an age of unprecedented energy demand and climate disasters, we would like to have a dashboard to understand patterns and trends around outages, quantify their impact on our communities, and identify possible weak points in the grid. But as our data are messy, you first need to consolidate and clean up the raw data.

We have event-level power outage data going back from 2023 to 2002.

Finally, please I’ll explicitly call out any caveats or assumptions I make in regard to data quality issues or missing values.

DATA SET

The provided data contains:

  • An Excel file with a data dictionary
  • An Excel file with the single events from 2002 to 2023. Providing the date when it occurred, the area, the restoring date, the Power Loss and the affected customers as well as the reason for the failure (3974 #of records with 8 fields)
  • OE417 E-filling system tracking
  • OE417 Form Instructions
  • OE417 Survey Form

DATA CLEANING

This time the data needed a lot of cleaning. And I really mean a lot 😉

Number of columns/ Order of rows

All the sheets adapted to show the same number of columns in the same order as the sheet from 2023. Rows that summarized more incidents were split in singe rows. Empty rows were deleted. Titles were only kept in the first row of each sheet. Rows with Areas outside of US were also deleted. Like 2005 the Area Hamburg. Entries with no given region were also deleted.

Restoration Dates/ Dates

Dates and Time were given in one column in a not proper way. So, I created two columns out of the given columns showing the date and the time in a proper way. Restoration dates with the year 2001 are changed to 2007 or 2011, it’s assumed that they are not written in a readable way. But as the data start at 2002 2001 makes no sense.

To do so I used Left, Right, Mid and Search in combination with DATE().

For time data I extracted the hours and minutes and calculated the time with them on the basis that 24h equals 1,0. In some cases, time was given with noon or midnight. I replaced them with the exact time 12:00 PM and 12:00 AM.

For time calculation I had 4 cases by using text like 14:50 pm

If (AND(Left(text;2) = “12”; Right(text;2)=”pm”) ) Value = 0,5

If (AND(Left(text;2) = “12”; Right(text;2)=”am”) ) Value = 0

If (Right(text;2)=”pm”) ) Value = 1 / ( 24 * 60) * ((hour +12) * 60 + minutes)

If (Right(text;2)=”am”) ) Value = 1 / ( 24 * 60) * ((hour +0) * 60 + minutes)

Sometimes 12AM was written on the same day. So, the restoration would have been before the event. This was corrected by changing the day on the next day or checking if there was meant PM instead of AM.

Also PM and AM was sometimes mixed up. It was changed in the way that the restoration was after the event.

Mega Watt Loss

  • The term “approximately” before numbers was deleted. So Approximately 300 was changed into 300.
  • Rows with “greater than” were deleted because they do not give exact values.
  • And rows with a number range were replaced with the average of the minimum and maximum value.
  • All NA Values are shown N/A.

Type of Disturbance – Event Categories

The types are reduced to the following categories by using mostly ISNUMBER(FIND()) because by using find I do not need the exact term.

  • Snow/ Ice/ Wind/ Rain
  • Hurricanes/ Tornado
  • Fire
  • Vandalism/ Sabotage
  • Earthquake
  • Cyber Threat/ Attack
  • Tech. Defects & Load Shedding
  • Unknown
  • Blackout (All events on 2003-08-14 Northeast Blackout of 2003)

Fun fact: helicopter landing was also a cause for an outage.

NERC Regions

NERC is the abbreviation for North American Electric Reliability Corporation, a nonprofit organization, which is responsible for coordinating the electrical power grids. The supra-regional power grids in US are characterized by 7 larger, mutually, asynchronous interconnected networks, the NERC Regions. In addition to the 7 there is PR for Puerto Rico and HICC for Hawaii.

NERC Regions were checked. Because the list also showed wrong/old ones. For example, WSCC. This should mean WECC. The areas were checked with this.

https://atlas.eia.gov/datasets/eia::nerc-regions/explore

and this pdf

https://www.nerc.com/pa/RAPA/ra/Reliability%20Assessments%20DL/NERC_SRA_2023.pdf

As I was not sure if SPP is still existing or not, I kept it. Maybe the pdf states that it was also included in MRO. But as the exact area slit between MRO and SERC was not given, I decided to keep it.

The definitions are:

  • MROMidwest Reliability Organization (MRO)
  • NPCCNortheast Power Coordinating Council (NPCC)
  • RFCReliability First Corporation (RFC)
  • SERCSERC Reliability Corporation (SERC)
  • SPPSouthwest Power Pool (SPP)
  • TRETexas Reliability Entity (TRE)
  • WECCWestern Electricity Coordinating Council (WECC)

And

PR for Puerto Rico and

HICC for Hawaii

Missing Data; N/A, Blanks

If values were missing and I could extract them from the text, I took them from there. For megawatt loss and affected customers:

  • I replaced “-“ with 0 because it’s likely that “-“ means there were none
  • “none” is replaced with 0
  • N/A in affected customers and N/A in the megawatt low is considered as zero
  • unknown was replaced with a NULL value.

Tables Used

As there were some rows with several NERC Regions affected and also several States I created two separate tables for NERC Regions and for States.

And I also created a date table with dates from 2002 to 2023 to have a calendar table.

Creating a table with affected States

To create a table with the affected states I copied out the column with the Event_Id and the column with the areas. Afterwards I created for each State a column next to them. So I had 52 Columns (+Canada + Puerto Rico).

Then I used IF(ISNUMBER(FIND())) to check if the state name was given in the Area description. If yes, Excel was creating a 1 in the column, if not excel created a 0. For Washington D.C. I did manual corrections to set it on District of Columbia.

Then I axtended the code with CONCAT() and IF(). If 1 was the result of the step befre, the Event ID and the Name of the state were written separated witha comma in the column.

Then I used UNIQUE() and VSTACK() to create a list out of the matrix. And with TEXTSPLIT() I splitted up the concatenated Text again in Event number and State.

By doing so, I was able to filter out a list where I have single rows with Event_ID and a single state name. The intention of that was to be able to find out later which states were affected most often.

BUILDING THE DASHBOARD

Following the objective, I wanted to create a dashboard where you can

  • Filter by Time (year,month,weekday,daytime)
  • And also filter by NERC Region and by Event Category affected.
  • See the total affected customers, the total Number of Events and the Status of the Restoration, means how many Events were restored until today
  • See the number of Events in the different Regions
  • But also the number of Events in the Affected States in the specific Regions, because this gives me a feeling where I have weak points in the grid
  • I wanted to see the main reasons for the outages in the NERC Regions as well as in the states.
  • And the customer affected or the demand loss. Because Vandalism/ Sabotage for example happens often, but it is restored quickly and there are not many customers affected.
  • To get a feeling how severe the damage is I also gave the average restoration time for the different categories in h
  • And I wanted to see if all months are affected in the same way and if there were specific times or days of weeks that are more affected.
  • And I also wanted to give the possibility to compare the years

This should give a good overview of what’s going on.

Customer affected and Demand Loss can’t be shown on State Level, because the input Data were not in this detail level. But I used a Mouseover with the Map to show the State, the number of the events in the State and a Table with the different Event Categories and the # of events.

For the sum of affected customers and the average demand loss I used bookmarks and buttons. I really appreciate this way because it gives me more flexibility and it saves space.

Design & Visualization

The design is a typical dashboard design, that can be used to get a feeling for what’s going on.

The header gives a quick overview over the total happened Events, the affected customers, and the restoration Status.

The rest is dominated by bar and column charts, because in my opinion they serve best for comparing categories and this is what the dashboard mainly does.

I went with yellow, because this represents for me energy, combined it with a dark purple, to create a contrast and combined it with a light shade of blue and orange to differentiate between customer and Demand Loss.

Definition of technical Terms

The Amount of Demand Involved (Peak Megawatts), is the amount of the peak demand involved over the entire incident. I gave the average of this dimension for each category to give a feeling how severe the event was.

Used KPIs

  • Average Amount of demand loss
  • Sum of affected customers
  • Average number of events by month
  • Total Events

Insights

  • The most affected year is 2022 the less affected 2002
  • The most affected NERC Regions are WECC and RFC
  • The less affected SPP,PR and HICC
  • The most affected States are California and Texas
  • In Texas the reason for that is heavy weather, in California the outages are caused by technical defects
  • The major thread to the grid is severe weather (winter/summer months) but since 2011 there is also a growing number of events due to vandalism
  • Hurricanes happen most often in September and October
    • Hurricane Isabel 2003
    • Hurricane Charly 2004
    • Hurricane Katrina 2005
    • Hurricane Ike 2008
    • Hurricane Sandy 2012
  • Vandalism causes a lot of events but they are mostly suspicious, so the number of affected customers and the duration until restoration is very low
  • The highest restoration durations occure due to Hurricanes/ Tornados & Fire because there is a lot of infra structure damaged and due to Cyber Threats (less cases)
  • Most of the events were restored, as the overall restoration rate is 91%. 2017 has the lowest restoration rate with only 79%.
  • The restoration takes place very quickly. Apart from the causes mentioned above it takes only 42h which is less than 2 days.
  • During the weekends there are lesser events in average per day than during weekdays

SUMMARY, PERSONAL LEARNINGS & CHALLENGES

In general, it was a nice project, that covered many topics of real world data and I really enjoyed working on it.

Even if Data cleaning challenged me a lot this time. But somehow, I enjoy it to figure out tricky ways how to automate most of it. I also made some time consuming mistakes during the cleaning process that I had to correct. But for the future I know, where the potential for mistake lies by cleaning dates and times. It will certainly be faster the next time.

I also found out, that I must be very careful and rechecking data several times, for not overlooking illogical things. Like the duration of restoration that is minus. Because my time also included the day, but I didn’t see it, because the cell was formatted as time only. This can give you really grey hair.

Also this time I experienced that the dashboard building process is much more efficient if I sketch out a concept before starting to touch Power BI.

And this time, the project really challenged me a lot, so it helped to talk to others who were also working on it. The Power of the community really gets visible, when things get complicated. Especially for keeping the motivation and laughing together about the crazy cleaning process. 😉 Everybody who went through it this time can be really proud.

As always, I am open for any feedback or comments :)

Best, Jasmin

Additional project images

Insigths Part 1
Insigths Part 2
Discussion and feedback(5 comments)
comment-535-avatar
Akhand Pratap Singh
Akhand Pratap Singh
11 months ago
Hi Jasmin, I just want to ask how you managed to clean Restoration column. I mean in that column there are multiple format of date & time how you manage that? E.g: 1/26/03,2:00a.m. , 11:00p.

comment-541-avatar
Chandrakant Jadhav
Chandrakant Jadhav
11 months ago
Awesome Dashboard.....

comment-546-avatar
yousafkhan datascientist
yousafkhan datascientist
11 months ago
how you handled unknown values?
2000 characters remaining
Cookie SettingsWe use cookies to enhance your experience, analyze site traffic and deliver personalized content. Read our Privacy Policy.