Securing the Future Grid

by John Stewart, EPRI, USA

The article describes EPRI’s Cybersecurity Vision 2030

My first personal experience with a utility cyber event began with an unusual email. This email was not, however, part of any phishing campaign. When it arrived in my inbox in the early 2000s, the now widespread practice of spamming employees with phishing tests was still years in the future.

When I opened this email, I read through a forwarded request from the chief dispatcher in the transmission operations department. He had reached out to the lead SCADA engineer within my design group and recounted a puzzling event that had been reported and logged through the SCADA/EMS system.

Early the previous morning, the SCADA RTU at a medium-sized substation had reported a number of alarms associated with the operation of multiple breakers at the same time. These breakers were normally closed during typical conditions, but the SCADA single-line display for that station showed that they had all tripped open simultaneously. The dispatchers responsible for that geographic area acknowledged the alarms and initiated a procedure to verify the condition of the system just before everything abruptly returned to the closed state. After a short discussion, a maintenance ticket was generated with a request for the local system engineer to visit the site, retrieve any locally archived data, and check for any equipment issues.

This was a time of significant technical transition that extended the reach of SCADA control to multiple stations. Many of the operations personnel had started their careers when most switching was still performed by local electricians. When the system was reconfigured, dispatchers would manually update a large map board that covered an entire wall of the control center. Throughout the 1990s, multiple capital projects had driven the deployment of new RTUs and communications circuits to increase system visibility and control. This transition was not without issues, but most of the problems had been observed previously and the failure modes were somewhat consistent. The email described an event that did not seem to align with past failures.

Working at the direction of the SCADA system specialist, we began to explore the problem. The first step was focused on reviewing logs and historical data from the system to analyze the event against a baseline. After our initial review, we noticed that the RTU had been experiencing communications failures at an elevated rate over recent months. Any time that a remote site fails to respond to polling from the SCADA master system, a communications outage is logged within the system. These outages were also occurring during the overnight shift in the operations center.

After the preliminary observation about the periodic communications failures, we examined the logs that were preserved in the master system database. This event was definitely abnormal since none of these status indications had been in alarm since the SCADA commissioning process. Looking beyond the alarms for the associated breakers, we also noticed that other points seemed to remain in a normal state before and after the event. Finally, we reviewed the reported analog values throughout the duration of the event. Surprisingly, they seemed to show power flowing through the tripped breakers without interruption. Since this wasn’t physically possible, we agreed that the SCADA system was exhibiting some strange behavior. After making some progress, but without determining a root cause, we had to move on to other priorities. If this event had been an isolated issue, we might not have had the opportunity to continue the analysis. As these things happen, we soon learned that it was not an isolated issue, and the pressure to explain it was about to increase significantly.
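The cross-check described above, comparing each reported breaker status against the analog measurement for the same circuit, can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the logic of any particular SCADA/EMS product. The point names, the `Snapshot` structure, and the 5 A dead-band are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed threshold: current below this is treated as measurement noise.
DEADBAND_AMPS = 5.0


@dataclass
class Snapshot:
    """One scan of a breaker's status point and its associated analog point."""
    breaker: str          # hypothetical point name
    is_open: bool         # reported status indication
    current_amps: float   # reported analog value through the breaker


def implausible(snapshots):
    """Return the breakers reported open while still carrying load current."""
    return [s.breaker for s in snapshots
            if s.is_open and abs(s.current_amps) > DEADBAND_AMPS]


scan = [
    Snapshot("CB-101", is_open=True, current_amps=412.0),   # contradictory
    Snapshot("CB-102", is_open=True, current_amps=0.3),     # plausible trip
    Snapshot("CB-103", is_open=False, current_amps=388.5),  # normal operation
]
print(implausible(scan))  # ['CB-101']
```

A contradiction of this kind does not say whether the status or the analog value is wrong. It only flags that the two cannot both be true, which is the observation that kept the investigation open.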

Anyone involved with troubleshooting SCADA systems has probably encountered multiple similar events where a misconfiguration or application error causes erroneous reporting. These systems are complex and multiple variables can impact the accuracy of data and displays. In most cases, the underlying issue can be identified by finding patterns associated with common failure modes in collaboration with the system developers. Historically, data anomalies in SCADA systems were approached similarly to other types of unexpected system failures and treated as an engineering failure. The root cause would typically be some sort of user misconfiguration or error in the underlying software. That is still the most common resolution, but these days we are much more aware that some of the same indications could be evidence of a cyber-attack.

At the time this event occurred, we were still trying to place cyber risk into context along with other more well-known risks to the system. The prosecution of a disgruntled ex-contractor in Queensland, Australia had made international news as one of the first widely known examples of someone misusing critical infrastructure control systems for malicious purposes. In that case, the defendant had used his knowledge of the control systems and some private radio equipment to flood 200,000 gallons of sewage into a protected ecosystem. While most utility engineers were already aware of this risk in theory, it provided a concrete example of an attack and raised the profile of discussions around critical infrastructure security.

With an increased awareness of cyber risks and the failure to identify a plausible root cause, the discussions around the event began to shift. We transitioned from exploring common system failures toward the possibility that someone had compromised SCADA communications circuits to intercept and proxy reported data. This explanation could account for both the periodic nature of the event and the delivery of valid data alongside the manipulated points. As we discussed this new perspective, additional factors bolstered the assessment, including the use of a leased communications circuit and the specific customers being served through this station. (Figure 1).

This realization raised a number of questions about how to proceed. If someone had compromised one or more SCADA communications channels, the potential impact could be massive. At any point, control operations could be issued to interrupt key transmission lines and cause system-wide outages. If the attackers had a more advanced understanding of the power system configuration, they might be capable of specific sequences of controls that could be used to disable protection systems and intentionally damage critical components. We had just started to discuss how to recommend that the lead engineer notify utility leadership about this suspected attack. Since this predated the current cybersecurity regulatory environment, there were no dedicated security personnel or processes for the OT organizations. In the middle of these discussions, the lead engineer returned and we paused the discussion. He had been working independently to analyze the event data and had developed a theory about the event that did not involve anything as dramatic as we had imagined.

After an extended lesson on the design of the communications path, SCADA protocols, and the remote initialization process, we were finally able to understand the lead engineer’s explanation. Following a few simple tests in the development environment, the event was replicated and the theory was confirmed. The origin of this event is representative of the types of failures that can occur in grid control systems and provides a good example of how difficult it can be to identify a cyber attack in OT systems.

SCADA system communications lines can be a precious commodity when the system scales beyond the projected number of remote sites. To expand capacity above the number of installed communications lines, multiple remotes with unique addressing can share a single communications line. There are multiple ways to share the channel, but in this case an analog bridge had been installed just after the modem to allow additional stations to be joined to the circuit in the future.

The effect was to create a telephone party line, an arrangement that was commonly used in early phone circuits. While these bridges are generally passive devices with high reliability, they may also reflect signals when there are impedance issues. When the master system generates a message, the reflections are returned to the master as an inverted version of the original message. (Figure 2).

In almost all cases, these reflected messages are discarded because the inverted string of bits is not structured correctly to appear as a valid response message. Additionally, other measures such as response delay timers are sometimes used to ensure that the master does not mistake an outbound poll for an inbound response. Finally, cyclic redundancy checks (CRC) are inserted into the message to ensure integrity. While investigating this event, it became clear that these protections against reflected messages were not effective for a very specific subset of conditions.
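The three safeguards named above (message structure, response delay timers, and a CRC) can be illustrated with a short Python sketch. It does not reproduce the legacy protocol involved in this event, whose frame format is not described here. The frame layout, the 10 ms minimum delay, the use of the CRC-CCITT routine `binascii.crc_hqx`, and the reading of "inverted" as a bit-for-bit flip are all assumptions made for the example.

```python
import binascii

# Assumed minimum turnaround time for a genuine remote to answer a poll.
MIN_RESPONSE_DELAY_S = 0.010


def add_crc(payload: bytes) -> bytes:
    """Append a 16-bit CRC (CRC-CCITT via binascii.crc_hqx) to the payload."""
    crc = binascii.crc_hqx(payload, 0)
    return payload + crc.to_bytes(2, "big")


def crc_ok(frame: bytes) -> bool:
    """Check that the trailing two bytes match the CRC of the rest."""
    if len(frame) < 3:
        return False
    payload, received = frame[:-2], int.from_bytes(frame[-2:], "big")
    return binascii.crc_hqx(payload, 0) == received


def invert(frame: bytes) -> bytes:
    """Flip every bit, modelling an inverted reflection (an assumption)."""
    return bytes(b ^ 0xFF for b in frame)


def accept_response(poll: bytes, response: bytes, delay_s: float) -> bool:
    """Apply the checks in order; any failure discards the frame."""
    if delay_s < MIN_RESPONSE_DELAY_S:
        return False                  # arrived too quickly to be a remote
    if response in (poll, invert(poll)):
        return False                  # looks like the poll coming back
    return crc_ok(response)           # integrity check on what remains


poll = add_crc(b"\x05\x01")               # hypothetical poll to remote 5
reply = add_crc(b"\x05\x81\x00\x0f")      # hypothetical status response
print(accept_response(poll, reply, 0.025))         # True
print(accept_response(poll, invert(poll), 0.025))  # False
print(accept_response(poll, reply, 0.001))         # False
```

Checks like these catch a clean reflection. They say nothing about an imperfect reflection that happens to pass a parser with a defect of its own, which is the narrow case this event fell into.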

The legacy protocol configured for this SCADA remote was originally developed for use in early SCADA applications. It was designed to be simple and efficient to minimize communications bandwidth and system resources. When the master generated a polling message and that message was imperfectly reflected back, the master accepted the reflection as a valid response from the remote. When the master parsed the message, it seemed to provide updated indications for a range of status points that were configured to represent the breakers at that location. Additional testing confirmed that a software error caused the master to accept improperly formed messages in some cases.

This event and the subsequent investigation into its origin provided a strong lesson about how an improbable combination of errors and failures may initially appear almost identical to an attack. To investigate the event, the lead SCADA engineer had converted the ASCII characters into a string of 1s and 0s and reversed the order before converting back to ASCII. Once this translation was performed, the message appeared to be a valid response. This level of analysis made me realize how a deep understanding of OT systems is required to recognize a legitimate attack.
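The translation described above is easy to reproduce. The following Python sketch converts a message to a string of bits, reverses the order of that string, and converts the result back to bytes. The two-character message is a placeholder, since the actual poll from the event is not given, and the sketch assumes that "reversed the order" refers to the whole bit string rather than to each byte separately.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


def from_bits(bits: str) -> bytes:
    """Pack a bit string (length a multiple of 8) back into bytes."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def reverse_bit_order(data: bytes) -> bytes:
    """Reverse the entire bit string, then repack it into bytes."""
    return from_bits(to_bits(data)[::-1])


message = b"AB"                          # placeholder, not the real poll
print(to_bits(message))                  # 0100000101000010
print(reverse_bit_order(message).hex())  # 4282
```

The second byte of the result, 0x82, falls outside the 7-bit ASCII range, so the output is kept as raw bytes here instead of being decoded as text.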

Cybersecurity for OT Systems

In the years since this improbable combination of errors was misinterpreted as an attack, the industry has invested a significant amount of resources toward mitigating cyber risk. Innovative technologies have been used to develop new security controls and equipment vendors have improved protection against a range of threats, but there are still significant gaps.

One example of these gaps is especially relevant to the event described above. The imagined attack appeared to indicate a compromised SCADA communications circuit where an attacker is able to directly control SCADA devices with minimal effort. (Figure 3).

This weakness has been widely discussed within the industry and a number of individuals have volunteered their time to develop standards-based solutions to support authenticated SCADA controls. One potential solution involves an addition to the widely used DNP3 protocol that is called DNP3 Secure Authentication (DNP3-SA). Multiple equipment vendors have updated their products to support this method of securing SCADA communications, but utility adoption of this solution has been slow. Based on discussions with utility engineers, most are aware of the potential for unauthorized controls and have taken some steps to mitigate this risk. The slow pace of DNP3-SA adoption can often be tied to the difficulties associated with making significant changes to critical systems. If a specific utility decides to implement this type of security control, the deployment must be carefully planned to avoid inadvertent impacts. In some cases, the utility may need to coordinate the rollout with a significant master system upgrade or replacement. Ultimately, this situation requires utility personnel to plan projects years in advance to take advantage of opportunities to effect change. Where the typical enterprise IT security program can adjust course rapidly based on emerging threats, OT security programs must be much more deliberate in their response. To borrow an analogy, IT security is a speed boat while OT security is more of a super tanker.
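The general idea behind authenticated controls can be shown with a generic challenge-response exchange built on a keyed hash (HMAC). The Python sketch below is not an implementation of the Secure Authentication specification and does not follow its message formats or key management. The `Outstation` class, the command strings, and the pre-shared key are hypothetical and exist only to show why a party without the key cannot issue or replay a control.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, challenge: bytes, command: bytes) -> bytes:
    """Compute the tag the master sends along with a control command."""
    return hmac.new(key, challenge + command, hashlib.sha256).digest()


class Outstation:
    """Hypothetical remote that only executes authenticated controls."""

    def __init__(self, key: bytes):
        self._key = key
        self._challenge = None
        self.executed = []

    def issue_challenge(self) -> bytes:
        """Hand the master a fresh random value to bind into its request."""
        self._challenge = secrets.token_bytes(16)
        return self._challenge

    def operate(self, command: bytes, tag: bytes) -> bool:
        """Execute the command only if the tag matches the open challenge."""
        if self._challenge is None:
            return False
        expected = sign(self._key, self._challenge, command)
        self._challenge = None        # single use, so a replay fails
        if not hmac.compare_digest(expected, tag):
            return False
        self.executed.append(command)
        return True


key = secrets.token_bytes(32)         # assumed to be shared in advance
rtu = Outstation(key)
tag = sign(key, rtu.issue_challenge(), b"TRIP CB-101")
print(rtu.operate(b"TRIP CB-101", tag))   # True: authentic request
print(rtu.operate(b"TRIP CB-101", tag))   # False: replayed request
```

Even in this toy form, both ends of the link must change together and share key material, which hints at why deployment on a live system has to be coordinated so carefully.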

To close current security gaps and mitigate future threats, those tasked with OT security must plan extensively and be prepared to take advantage of opportunities to deploy solutions. Along with master system upgrades, security personnel should also track upcoming contract awards for critical components to insert key security requirements into the specification. Likewise, planned revisions to relevant design standards can also be leveraged to integrate security controls. Opportunities like these can be thought of as potential OT security inflection points where security needs are embedded into core grid systems. At utilities where security personnel are not involved in these activities, opportunities for change are extremely limited.

With complex problems and limited opportunities for change, utilities must maximize every opportunity to secure systems. That is why roadmaps and planning are so critical in this space, to take advantage of all available security inflection points and avoid being left out of key decisions.

EPRI Cybersecurity Vision 2030

To aid the industry in accelerating the adoption of security in OT systems, EPRI has launched an initiative to coordinate with key stakeholders and project the evolution of security requirements over the next 10 years.

As an organization, EPRI has some unique capabilities to address these issues by leveraging research and roadmaps from multiple programs. Each individual program area provides critical insights regarding anticipated technology and process changes for key parts of the grid. This allows the cybersecurity strategy to be aligned with the future evolution of P&C systems, SCADA, communications, and other subsystems.

In the absence of a comprehensive strategy, cyber security can become tactical and event-driven, changing its focus every few months to react to the latest federal and state directives and regulations or publicized security compromises. SolarWinds is just the latest in a string of incidents that remind us that the current cyber security approaches are not working.

We need a new vision to guide cohesive OT cyber security strategies and tactics to protect the electricity subsector. As a starting point, let’s examine four important trends contributing to an OT cyber security inflection point. (Figure 4).

Influential Trends and Impacts

Decarbonization

Utilities are reprioritizing long-term resource plans to support low-carbon grid initiatives, driving reliance on more distributed and renewable energy sources. This transition significantly impacts electricity subsector transmission and distribution grids and the data communications networks that monitor and control distributed energy resources (DER) as aggregated generation assets. The industry-wide result is an expansion of attack surfaces with less accountability or control of integration points. This trend presents serious challenges for OT cyber security teams because there are few tools, standards, and proven practices to help manage and protect these distributed resources.

Transformation

More grid-connected devices (GCDs) and third-party services create a multi-party grid and increase cyber security risks on a continuum ranging from supply chain to access authorization. The responsibilities for multi-party OT cyber security are not well defined, leaving GCD vendors, system integrators, service providers, and utilities deploying piecemeal approaches to sensible risk reductions.

FERC Order 2222 drives new market activities by enabling DER aggregators to compete in regional wholesale electric markets. This Order may produce new data capacity, integrity, and availability requirements for distribution grid networks that transport data and support authorized access to operational networks. It may create new cyber security vulnerabilities that must be mitigated by OT cyber security teams.

The ongoing digital transformation within utilities will continue to drive changes to security policies, practices, and technologies. These data-centric initiatives can increase the number and type of vulnerabilities that OT cyber security resources must address. For instance, digital transformations often include deployment of sensors enabling remote monitoring and control capabilities for grid operations. The virtualization of protection and control systems may become much more common with the increasing use of emerging digital substation standards such as IEC 61850 Sampled Values. (Figure 5).

Value

The practice of defining OT cyber security as a cost center is unsustainable given the current growth trajectories for attack surfaces and risk mitigation costs. Senior leadership in electric utilities must recognize that cyber security vulnerabilities present a material risk to all utility initiatives and must be addressed with appropriate and timely investments. However, executive action alone is not enough to address this challenge.

The problem of how we value and pay for security must be addressed simultaneously on four fronts:

1) Incentives for voluntary cyber security investments

2) Reducing the cost of cyber security through automation and interoperability

3) Determining the economic value of grid resiliency, and

4) Changing the utilities’ internal value proposition for cyber security. Utility funding mechanisms tend to discourage investment in the latest technologies because cyber security is not always a capital expense eligible for cost recovery.

In utilities, operating expenses are expected to adhere to long depreciation cycles and expenses like software updates or training are often cut from constrained budgets. Digital transformation initiatives must include cyber security considerations upfront, when solutions, policies, and practices can be “baked in” during design and deployment phases. (Figure 6).

Additionally, utilities must continue to mature the role and leadership of cyber security across their entire enterprise. While OT security started as a cost center in most utilities, it has progressed to the “Security as Compliance” stage in many companies that must comply with the NERC Critical Infrastructure Protection (CIP) standards. While some companies have moved beyond a compliance mindset in OT security, the goal is to reach a state of Intrinsic Security, where security is fully integrated in the business’s mission, processes, technology, and culture. (Figure 7).

Intrinsic Security is a key enabler for the future energy system and will be a guiding principle of the 2030 Security Vision.

Resilience

The North American Transmission Forum’s (NATF) public document, “Transmission Resilience Overview,” defines resilience as: “the ability of the system and its components (i.e., both the equipment and human components) to minimize damage and improve recovery from non-routine disruptions, including high impact, low frequency (HILF) events, in a reasonable amount of time.” Resiliency embeds graceful degradation into products, systems, and processes to support the most critical users during disrupted performance and to enable faster restoration. A resilient grid has embedded cyber security into the electricity systems’ DNA.

Cyber security must be understood to be more than a critical building block of grid resiliency; it is the mortar that strengthens the foundation of grid resiliency. Incorporating intentional and risk-informed security into the DNA of a product or system increases its resiliency, in contrast to cyber security that is bolted on as an afterthought. (Figures 8 and 9).

Conclusion

Developing and implementing a cybersecurity vision is not easy in this industry. The restricted pace of change and complexity of critical systems can obfuscate the correct path. Long technology refresh cycles may force personnel to design creative solutions to secure technologies that span 20 years of change.

This approach is not the most convenient in the near term, but it is necessary to stay in front of emerging risks.

Biography:

John Stewart, Principal Technical Leader within EPRI’s Cybersecurity Program, has over 20 years of experience with power delivery systems. John has held a number of engineering and architecture roles involving control systems, communications, and cybersecurity. He has also been engaged with a range of industry efforts focused on complex technical topics and has served as an advisor to multiple academic consortiums. He currently leads EPRI’s Cybersecurity for Transmission and Distribution Systems Task Force and directs multiple research teams focused on emerging technologies and risk mitigation for electric utilities.