Maven Power Outage Challenge
Challenge Goal
To consolidate and clean the raw data, and create a dashboard or report to explore electric grid outages across the US.
Challenge Objectives
- Analyze the data set and identify cleanup actions
- Perform cleanup actions and consolidate date
- Create an exploratory dashboard
Data Set
US Electric Grid Outages
- Electric outage incidents in the US power grid from January 2002 to July 2023, including details related to the event start and end time, location, alert criteria, demand loss, and estimated number of people affected.
- 3974 Records in 8 Fields
- Geospatial and Time Series datatypes
Tools
- Microsoft Excel for data cleanup and consolidation, and exploratory dashboard
Challenge Execution
Analyze the data set and identify cleanup actions
- Analysis insights
- Report format changes in 2011 and then again in 2015. Changes include adding new fields and changing their order. 2023 also adds a separate Year field, but this has no real effect on the rest of the 2015-2023 period.
- Date/Time formats differ greatly between reports in some fields. For example we have "11:13 a.m. January 05" (datatyped as text) in the Restoration field in 2002-2010, and Date and Time (correctly datatyped) in separate fields in 2015-2023. Some fields only have Date and not the Time component.
- "Type of Disturbance" in 2002-2010 is more granular than the corresponding "Event Type" in 2011-2014 and 2015-2023. For example we have "Ice Storm" or "Major Storm" in 2002-2010, while 2015-2023 has only "Severe Weather". "Event Type" in 2011-2014 is also more granular, for example "Voltage Reduction due to Severe Weather - Cold"
- "Number of Customers Affected" in 2002-2011 and 2011-2014 has blanks (-) or N/A in several of the reports.
- "Loss (megawatts)" in 2002-2011 and "Demand Loss (MW)" in 2011-2014 has blanks (-) or N/A in several of the reports.
- "Area Affected" has inconsistent granularity (sometimes just State, sometimes Area or County). Also there are sometimes "secondary" States that are affected. Area and County aren't required when reporting incidents.
- "Month" in 2015-2023 (and "Year" in 2023") are not present in the earlier periods.
- "Alert Criteria" only exists in 2015-2023.
- Identified cleanup actions
- Remove empty rows and put the results in separate Excel sheets (2002-2010, 2011-2014, 2015-2023)
- Harmonize Date/Time as separate fields in 2002-2010 and 2011-2014 in accordance with 2015-2023 and ensure consistent Date/Time formatting in all periods. If Time is missing from Date/Time, set as Unknown
- Harmonize "Type of Disturbance" in 2002-2010 and "Event Type" in 2011-2014 to fit "Event Type" in 2015-2023
- Change blanks (-) and N/A in "Number of Customers Affected" in all periods to Unknown
- Change blanks (-) and N/A in "Loss (megawatts)" in 2002-2011 and "Demand Loss (MW)" in 2011-2014 to Unknown
- Add "Month" and "Year" fields to all periods after Date/Time fields have been harmonized
- Rename any inconsistent fields in 2002-2010 and 2011-2014 to fit 2015-2023
- Add "Alert Criteria" and leave the field as empty for 2002-2010 and 2011-2014
- Consolidate all periods into a single report
- Add a State/Incident Matrix to identify which incidents affected which states
Perform cleanup actions and consolidate date
- Remove empty rows and put the results in separate Excel sheets (2002-2010, 2011-2014, 2015-2023)
- 2002-2010: Moved all incident data to a single sheet. Removed rows with empty "NERC Region" field, as this is a required/filled in. Found some spill data in a few of the other fields. Added this data to the primary row of the affected incident. Data will be cleaned up at a later stage in the project.
- 2011-2014: Better data quality here, the only "empty" rows are the month designations. These rows were removed.
- 2015-2023: No empty rows in data.
- Result is three sheets: 2002-2010 (800 rows), 2011-2014 (891 rows), 2015-2023 (2251 rows)
- Harmonize Date/Time as separate fields in 2002-2010 and 2011-2014 in accordance with 2015-2023 and ensure consistent Date/Time formatting in all periods. If Time is missing from Date/Time, set as Unknown
- 2002-2010
- "Date" field renamed to "Date Event Began"
- Data in "Date Event Began" alright throughout, no changes needed
- "Time" field renamed to "Time Event Began"
- Data in "Time Event Began" formatted as text like "2:00 p.m." or "11:00 a.m." for the majority. The p.m./a.m parts where extracted and used to correct Time to a 24-hour format.
- Noticed 12am/pm got calculated incorrectly. Filtered these out and corrected them before updating all the data in "Time Event Began".
- A few outliers (for example "Midnight") changed to appropriate times manually.
- 2 rows showing "ongoing" in "Time Event Began". These rows have been removed.
- "Restoration Time" has Date and Time in the same field, in different formattings (for example "2002-02-07 12:00:00" or "1/26/03, 2:00 a.m.".
- Replaced "Ongoing" values in "Restoration Time" with "Unknown".
- Corrected "2002-02-07 12:00:00" by duplicating the data changing formatting. Changed "Restoration Time" to "Date of Restoration" and added "Time of Restoration" field.
- Corrected "1/26/03, 2:00 a.m." first by separating the Date and Time components and correcting the formatting to YYYY-MM-DD format.
- Several cells formatted as text, "1/26/03" and couldn't be automatically formatted as Date. Filtered these out with an IFERROR() statement.
- Extracted Year, Month and Day
- Year: "20"&RIGHT(H25;2)
- Month: LEFT(H25;SEARCH("/";H25)-1)
- Day: MID(H25;SEARCH("/";H25)+1;2)
- Combined the result into YYYY-MM-DD format and formatted as Date. Put the corrected data in "Date of Restoration"
- Data in "Time of Restoration" has leading spaces after last extraction. Correcting this with a TRIM().
- In "Time of Restoration" 12:00 is now shown as "12:00 midnight" or "12:00 noon". Correcting these to "12:00 p.m." and 12:00 a.m." respectively, to ease next action.
- Separated "12:00 p.m." into "12:00" and "p.m.". Corrected some of the results manually (misspellings and the like)
- Made the same transformation and replacement as for "Time Event Began"
- The next section of dates is formatted as "5:30 p.m. January 16".
- Started by replacing ". " with "._" and then using _ as a delimiter when splitting up the data to a Time component and a Date component.
- Added a Year column (based the rows other Date data) for the coming Date formatting action.
- The date is now formatted as "MM-DD", used "-" as delimiter to split this data, and get a field each for Year, Month and Day. Then combining this with a DATE() in order to get the correct data and formatting.
- Performing the same type of correction on "Time of Restoration" as in previous actions to the final section. Some manual corrections of outliers made as well.
- 2011-2014
- "Date Event Began" has correct formatting and no outliers or empty cells.
- "Time Event Began" has correct formatting and no outliers or empty cells. Some cells still had the Date component. This was removed to correspond with the rest of the data.
- "Date of Restoration" and "Time of Restoration" had a few "Ongoing". Changed these to "Unknown".
- "Date of Restoration" have two dates in 2001 and 2077. Replaced these (and the "Time of Restoration" data) with "Unknown" as the correct date is unknown.
- 2015-2023
- "Time Event Began" and "Time of Restoration" are good quality. There is no incorrect data and no empty cells.
- "Date Event Began" and "Date of Restoration" have good quality (YYYY-MM-DD format) in some of the data, but ´far from all. In 1964 out of 2259 cells the format is MM/DD/YYY. Excel sees the data as text so it can't be automatically reformatted.
- To correct this "/" was used as a delimiter to split the data into Month, Day and Year columns. These are then combined using DATE() to get the correct formatting. This correction was made to both fields.
- All Date and Time fields are now corrected and harmonized.
- Harmonize "Type of Disturbance" in 2002-2010 and "Event Type" in 2011-2014 to fit "Event Type" in 2015-2023
- 2015-2023
- This action began with securing the data quality in "Event Type" in 2015-2023 using UNIQUE() to identify misspellings, leading and trailing spaces, and overlapping or conflicting categories. Incorrect data was corrected using the Replace function in Excel.
- "Vandalism", for example, appeared in several forms (like "- Vandalism - Theft", "- Vandalism" or "Physical Attack"). These where harmonized into "Vandalism".
- The same line of thinking was applied to other overlapping categories.
- One category in "Alert Criteria" sometimes had different Event Types. This was handled by treating the fields as unrelated. A loss of electric service can be due to both severe weather and system operations. "Alert Criteria" can be used as a details field in the dashboard later in the project to provide more insight into individual incidents.
- This cleanup resulted in 5 defined categories, down from around 30 less well defined ones.
- 2002-2010
- "Type of Disturbance" had 257 unique categories. Many of these appeared less than 5-10 times. The categories were worked through to harmonize them with the category list from 2015-2023.
- Used IFERROR(IF(SEARCH("Storm";E2);"Severe Weather");E2) methodically to more quickly replace categories with the correct ones.
- Field renamed to "Event Type"
- 2011-2014
- "Event Type" had 135 unique categories. These are more consistent than in 2002-2010.
- The same process as for 2002-2010 was applied here.
- Change blanks (-) and N/A in "Number of Customers Affected" in all periods to Unknown
- 2002-2010
- Several N/A, NA and similar. Replaced these with "Unknown".
- A few of the cells are in text ("Under 50'000" for example). Went through these manually and corrected them to the values represented in the text.
- 2011-2014
- Several N/A and UNK. Replaced these with "Unknown".
- Found a "None", replaced it with 0.
- 2015-2023
- No N/A or similar. One empty cell replaced with "Unknown".
- Formatted field as Number. Observed inconsistencies in the data series (numbers not in proper order), so a portion of the data was Text strings. Corrected this issue by using TRIM() and then formatting results as Number.
- Change blanks (-) and N/A in "Loss (megawatts)" in 2002-2011 and "Demand Loss (MW)" in 2011-2014 to Unknown
- 2002-2010
- Several N/A, NA or similar, replaced with "Unknown".
- Several "unknown", replaced with "Unknown".
- Some cells with a format like "500-600", replaced with the highest value in each range.
- Some cells with a format like "Approx. 500", removed "Approx. " part using Excel Replace function.
- Other outliers also reformatted as numbers
- 2011-2014
- Several N/A, NA or similar, replaced with "Unknown".
- No other outliers found.
- 2015-2023
- No N/A or similar found.
- Several cells are text strings, used TRIM() to remove any trailing or leading spaces. This corrected the number sequence.
- Add "Month" and "Year" fields to all periods after Date/Time fields have been harmonized
- 2002-2010
- Year field added using YEAR() function
- Month field added using TEXT(C2;"MMMM") in order to get the month name, matching 2015-2023
- 2011-2014
- Year field added using YEAR() function
- Month field added using TEXT(C2;"MMMM") in order to get the month name, matching 2015-2023
- 2015-2023
- Year field added using YEAR() function
- Rename any inconsistent fields in 2002-2010 and 2011-2014 to fit 2015-2023
- 2002-2010
- Fields renamed and reordered
- 2011-2014
- Fields renamed and reordered
- Add "Alert Criteria" and leave the field as empty for 2002-2010 and 2011-2014
- 2002-2010
- "Alert Criteria" field added
- 2011-2014
- "Alert Criteria" field added
- Consolidate all periods into a single report
- Data consolidated into sheet "Consolidated Data"
- Add a State/Incident Matrix to identify which incidents affected which states
- Added a unique Incident ID to all rows in Consolidated Data like INC0001
- Added sheet "State Incident Matrix"
- Added Incident IDs to the X-axis of the State Incident Matrix using SORT(UNIQUE()) function
- Fetched list of US states from the internet and added them to the Y-axis of the State Incident Matrix
- Found a few "North and South Carolina" strings in Area Affected, and replaced with "North Carolina and South Carolina"
- Created a flag using IFERROR(IF(FIND(B$1;'Consolidated Data'!$H2);1;0);0) showing which states were affected in which incidents. Area Affected was used for this flagging. 1 = affected and 0 = not affected
- Added some conditional formatting to the State Incident Matrix to easier spot affected states
- Added Puerto Rico to the matrix as it is a US territory
- Added a "State Not Specified" field flagging which incidents had no state specified in Area Affected, using IF(SUM(B2:AY2)=0;1;0). 1 = No state affected. Also added conditional formatting to this field
- Identified rows containing state abbreviations (like AL for Alabama) and replaced them with the full state names
Create an exploratory dashboard
- Added field "Incident Duration in Days" field to Consolidated Data, using IFERROR(F2-D2;"Unknown")
- Identified 16 events that got "resolved" before they started (negative values in Incident Duration in Days). Changed Date of Restoration and Time of Restoration to Unknown in these cases
- Added "Calc" sheet for calculations and summaries
- Added "Dashboard" sheet for the dashboard
- Appended the State Incident Matrix data to Consolidated Data and formatted it as a table called IncidentDataStates
- Added calculation table for Incidents by State and Year to the Calc sheet
- Added State list for filtering to the Dashboard sheet
- Added a named range StateIncidents using OFFSET(MATCH())
- OFFSET(Calc!$C$4;0;MATCH(Dashboard!$B$12;Calc!$C$3:$BB$3;0)-1;22;1)
- Added a named range StateFlagMatrix using OFFSET(MATCH())
- OFFSET('Consolidated Data'!$O$2;0;MATCH(Dashboard!H12;IncidentDataStates[[#Headers];[Alabama]:[Wisconsin]];0)-1;3948;1)
- Added a chart for incidents per year, dynamically selecting state based on the filter and the StateIncidents named range
- Created some additional state level measures to Calc sheet
- Shortest Incident
- Finding the correct state via the StateFlagMatrix, and then finding the minimum Incident Duration in Days. If shorter than 1 day, show as "< 1" instead. Also has catch for "All States" to find shortest incident for all states.
- =IFNA(IF(MINIFS(IncidentDataStates[Incident Duration in Days];StateFlagMatrix;"1")<1;"< 1";MINIFS(IncidentDataStates[Incident Duration in Days];StateFlagMatrix;"1"));IF(MIN(IncidentDataStates[Incident Duration in Days])<1;"< 1";MIN(IncidentDataStates[Incident Duration in Days])))
- Longest Incident
- Finding the correct state via the StateFlagMatrix, and then finding the maximum Incident Duration in Days. Also has catch for "All States" to find longest incident for all states.
- IFNA(MAXIFS(IncidentDataStates[Incident Duration in Days];StateFlagMatrix;"1");MAX(IncidentDataStates[Incident Duration in Days]))
- Most Customers Affected
- Finding the correct state via the StateFlagMatrix, and then finding the maximum maximum customers affected. Also has catch for "All States" to find longest incident for all states.
- IFNA(MAXIFS(IncidentDataStates[Number of Customers Affected];StateFlagMatrix;"1");MAX(IncidentDataStates[Number of Customers Affected]))
- Highest MW Loss
- Finding the correct state via the StateFlagMatrix, and then finding the maximum MW Loss. Also has catch for "All States" to find longest incident for all states.
- IFNA(MAXIFS(IncidentDataStates[Demand Loss (MW)];StateFlagMatrix;"1");MAX(IncidentDataStates[Demand Loss (MW)]))
- Added the measures to the Dashboard sheet
- Added US level measures to the Calc sheet
- Shortest Incident
- IF(MIN(IncidentDataStates[Incident Duration in Days])<1;"< 1";MIN(IncidentDataStates[Incident Duration in Days]))
- Longest Incident
- MAX(IncidentDataStates[Incident Duration in Days])
- Most Customers Affected
- MAX(IncidentDataStates[Number of Customers Affected])
- Highest MW Loss
- MAX(IncidentDataStates[Demand Loss (MW)])
- Added the measures to the Dashboard sheet
- Added a matrix for Incidents by Event Type to Calc sheet
- COUNTIFS(IncidentDataStates[Event Type];Calc!$B42;IncidentDataStates[Alabama];1)
- Added named range IncidentEvent using OFFSET(MATCH())
- OFFSET(Calc!$C$42;0;MATCH(Dashboard!$B$12;Calc!$C$43:$BB$43;0)-1;5;1)
- Added a chart for incidents per event type, dynamically selecting state based on the filter and the IncidentEvent named range
- Arranged and formatted dashboard to final version
Project Results and Insights
- The reporting tool has improved over time, the data in 2015-2023 has a higher and more unified quality than the 2002-2010 period.
- The reporting tool could still use more data validation in order to avoid misspellings and the like.
- Affected US state or territory isn't specified in 136 out of the 3948 incidents.
- Alaska is the only state to have no incidents recorded.
- Severe Weather is the most common incident cause.
- Severe Weather incidents are much more common in the Gulf states (f.e. Florida and Mississippi)