Smart Grid

Combining Engineering Asset Management, Reliability and Resilience:

A New Way to Approach the Uncertainty and the Complexity of Future Grid Design

by Jean Raymond and Dragan Komljenovic, Hydro-Québec, Canada

Modern power grids are transforming into highly integrated networks of power generation, transmission, and distribution systems and distributed energy resources (DER) located at customer premises. The energy transition is also experiencing rapid transformations involving technological advancements, changing consumer preferences, and new policies. Several factors and sources of risks contribute to the new complexity of the power system. 

For that reason, we are seeing more dynamic operating environments with decentralized equipment control at facilities where autonomous action is required. As a result, situational awareness is becoming a priority for future Hydro-Québec generating stations and substations. Current reliability studies have produced several findings, and all point to the need to use and improve the methodologies and concepts best suited to the many challenges of digital transformation and energy transition. This entails developing and implementing a structured reliability program as an integral part of the overall asset management system.

Power grids must deliver electricity and maintain their service level for long periods of time, necessitating maintenance operations, the upgrading and integration of complex new technologies, and capacity increases, all while operating in a hostile environment. 

This situation introduces deep uncertainties and emerging as well as systemic risks. Power grids should therefore be considered complex systems composed of numerous interacting elements and/or systems designed to provide optimal performance and safe, reliable operations. This situation requires flexibility and adaptability in responding to new contexts (technological, social, economic, legislative, political, climate-related, etc.), which determine the demand for services and expected performance. As a result of these factors, new concepts such as resilience and asset management have been introduced. 

Emerging advanced features are bringing about a number of performance improvements. These include the ability to adapt quickly and optimally to rapidly changing conditions and to use faster, real-time predictive analytics. The power grid is evolving into a very large-scale and non-converged system as it becomes more decentralized and integrated with a variety of heterogeneous components, which often have different needs and purposes. Understanding these emerging dynamics requires a new view of the electricity landscape (Figure 1).

The energy transition brings with it a variety of challenges, including generating asset sustainability, generation unit cost optimization, and operating cost reductions. Addressing these challenges requires adjusting to the new grid-related challenges driven by the shift in electricity supply and demand as a result of greater DER penetration and changes in market behavior (Figure 2).

To succeed despite all the challenges and unknown risks posed by the digital transformation and energy transition shown in Figure 2, Hydro-Québec (HQ) is planning to modernize its generating stations and substations based on situational awareness.

The energy transition requires autonomy, flexibility, adaptability, predictability, and resilience

To meet future expectations, several operating criteria must be met to achieve situational awareness and respond to changes associated with the energy transition.

Autonomy:  The system collects a set of information, processes the data obtained and makes the subsequent decisions.

Flexibility: With standardized and interoperable technologies and advanced applications, new functions and/or trends can be added at any time without requiring any modifications. Operational flexibility also refers to the ability to react to unpredictable fluctuations in supply and demand and to maintain stability and reliability.

Adaptability: This refers to the ability to make real-time adjustments to automation system operating parameters (protection and other systems) based on data analysis conclusions (autonomy).

Predictability: The environment’s constant variability calls for solutions that are probabilistic and dynamic rather than static. Data must be analyzed in real time to predict and simulate behavioral trends and anticipate failures.

Risk-informed decision-making (RIDM):  A process that provides a formalized, rational, and systematic methodology for identifying, assessing, and communicating the factors that support risk-informed decisions (CSA N290.19:18).

Resilience:  Power system resilience is the ability to limit the extent, severity, and duration of system or component degradation following a disruptive event so as to maintain its essential function, identity, and structure. Power system resilience is achieved through a set of actions taken before, during, and after extreme disruptive events, such as anticipating, preparing for, and coping with disruptions; maintaining critical system operations; ensuring rapid recovery and adaptation; and applying the lessons learned (Figure 4).

Resilient systems accept that failures are inevitable and are designed and operated to promote recovery and adaptation rather than merely to resist the initial disruption. Resilience thinking complements risk management by expediting system recovery when risk management strategies fail to prevent a disruption.

Resilience management goes beyond risk management to address the complexities of large integrated systems and the uncertainty of future threats, especially those associated with climate change.
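
One way to make the extent, severity, and duration in the definition above concrete is to summarize a performance curve recorded around a disruptive event. The sketch below is illustrative only: the function name `resilience_summary`, the normalized performance scale (1.0 = nominal service), the sample data, and the three summary values are assumptions, not a metric prescribed by the authors or by a standard.

```python
from typing import Dict, Sequence


def resilience_summary(
    times: Sequence[float],
    performance: Sequence[float],
    nominal: float = 1.0,
) -> Dict[str, float]:
    """Summarize a sampled performance curve around a disruptive event.

    times       -- strictly increasing sample times (hypothetical unit: hours)
    performance -- delivered service level at each sample (nominal = full service)
    """
    if len(times) != len(performance) or len(times) < 2:
        raise ValueError("need at least two matching samples")

    delivered = 0.0      # area under the performance curve (trapezoidal rule)
    degraded_time = 0.0  # total length of intervals touching a degraded sample
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        delivered += 0.5 * (performance[i] + performance[i - 1]) * dt
        if min(performance[i], performance[i - 1]) < nominal:
            degraded_time += dt

    expected = nominal * (times[-1] - times[0])
    return {
        "severity": nominal - min(performance),    # deepest loss of function
        "duration": degraded_time,                 # approximate time below nominal
        "resilience_index": delivered / expected,  # 1.0 means no loss at all
    }


# Hypothetical event: service drops to 40 % at t = 2 h and is restored by t = 4 h.
summary = resilience_summary([0, 1, 2, 3, 4, 5], [1.0, 1.0, 0.4, 0.7, 1.0, 1.0])
print(summary)
```

Under these assumptions, actions taken before an event mainly reduce the severity, while actions taken during and after it mainly shorten the duration; both raise the index.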

Hydro-Québec has developed a roadmap covering the current and future activities depicted in Figure 3 to ensure that its generating stations and substations are able to meet future expectations.

Ultimately, Phase 3 ensures generating station and substation operations are autonomous, predictive, prescriptive, and adaptive to the environment and that they integrate DERs. Thus, artificial intelligence, digital twins, operations research, IoT and other applications and technologies will be omnipresent and play a dominant role in optimizing and automating operations. Figure 6 schematically illustrates the generating station of the future and the relationship with the new control center concept.

However, the energy transition is not fully defined. It may therefore introduce new (emergent) risks that are not anticipated because there are no past records or lessons from experience to draw on. In addition, a set of new, previously unrecognized or significant events is being added to the architectural, operating, and maintenance constraints, including the tension between growing demand for energy and the stress on existing facilities, breaks in the supply chain, and retirements with the resulting loss of knowledge. Consequently, it is important to provide a framework that can minimize such risks. The concepts of resilience and engineering asset management have strong potential for providing such a framework, in which new reliability methodologies and parameters are used.

Guaranteeing generating station and substation autonomy amidst the challenges of the energy transition

Modern utilities are capital-intensive organizations with relatively complex internal structures, operations, and technologies. They also operate in an increasingly complex business and operating environment characterized by significant uncertainties (evolving markets and customers, changing regulatory framework, new technologies, energy transition, malicious human actions, climate change, etc.). In addition, electricity companies must replace a large part of their assets when they reach the end of their useful life or are rendered obsolete by technological changes. The increased demands on their assets affect their operation and maintenance: profitability/performance requirements are higher, and the need for increased capacity necessitates major upgrades. In addition, high-impact, low-probability events such as extreme weather events, natural disasters, major geomagnetic disturbances, pandemics and cyberattacks are already significant factors. These factors pose systemic and emergent risks.

Meanwhile, few studies have thoroughly examined these risks and their effects on the overall performance and vulnerability of utilities exposed to them. Traditional analysis and management methods have recently proven to be less effective (too linear) in adequately managing complexity in an ever-changing and barely predictable environment. This situation is generally due to a lack of knowledge about the type and extent of uncertainties, the nature of the interconnections, the level of complexity, and our limited ability to predict future events. In such a context, the ability of electricity companies to implement innovative concepts is decisive in meeting performance and competitiveness requirements while minimizing their risks and costs and optimizing their resources. Emergent risks associated with the energy transition are arising on a societal scale never seen before. What potential surprises lie ahead? This question deserves thorough study.

In such circumstances, numerous challenges and issues must be considered when carrying out reliability studies. Every future plant should be considered as a “system of systems.” In addition, over the years, other challenges and issues will have to be addressed and resolved through the reliability program. Here are a few observations justifying the development of a structured reliability program:

  • The behavior of the network, and consequently of the fleet of generating stations and substations, is becoming stochastic, reflecting the dynamic and unpredictable nature of self-generation and customer load
  • Traditional system reliability models are typically able to describe the static logical structure of a system, but not its dynamic and dependent behaviors or those of its constituent elements
  • In traditional reliability models, the system and all its elements are assumed to have only two states: working perfectly or failed completely. While this assumption simplifies complex reliability assessment issues, it does not reflect the reality that most systems degrade gradually and exhibit a wide range of states in both function and performance. Degradation of the system and its elements over time results in different levels of functional performance
  • Utilities have to replace a large portion of their assets when they reach the end of their useful life or become obsolete. Increased stress on the condition of these assets affects their operation and degrades their performance
  • Software is an integral part of various elements and systems and contributes directly to the success of their different functions. It is an essential asset for the use of intelligent electronic devices (IEDs), and with wide-scale digitalization it will become the critical core component of future real-time analysis, prediction, prescription and decision-making for generating stations and substations. As a result, software systems are continuously growing in size and complexity. The inherent criticality of software means that it should be considered in all future studies, in light of the associated risks
  • The behavior of a complex system cannot be easily deduced from the characteristics and behavior of its constituent parts
  • A system of systems has different capabilities than the mere sum of its constituent parts. While each system is able to operate independently, they also work together to achieve the desired additional capabilities
  • Human reliability plays a direct role in the maintenance and operation of systems (such as control centers) and in software development. It should therefore be incorporated into all studies
  • Transition rates between degradation states are usually assumed to be constant. However, when variable external factors influence the degradation process, as is often the case, transition rates can no longer be considered time independent
  • In general, extreme meteorological and natural events cause elements and systems to behave abnormally, degrading their performance. Since they create a more hostile operating environment, these events can impact asset life
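
Two of the observations above (more than two working states, and transition rates that vary over time) can be illustrated with a small multi-state Markov sketch. Everything specific here is an assumption made for illustration: the three states (good, degraded, failed), the performance levels, the rate values, the linear aging law, and the simple Euler integration. None of these come from the reliability program described in this article.

```python
from typing import Callable, List

Rate = Callable[[float], float]  # transition rate as a function of time


def state_probabilities(
    good_to_degraded: Rate,
    degraded_to_failed: Rate,
    horizon: float,
    steps: int = 10_000,
) -> List[float]:
    """Integrate the Kolmogorov forward equations with explicit Euler steps.

    Returns [P(good), P(degraded), P(failed)] at t = horizon, starting
    from an asset that is certainly in the good state at t = 0.
    """
    p_good, p_degraded, p_failed = 1.0, 0.0, 0.0
    dt = horizon / steps
    for k in range(steps):
        t = k * dt
        out_of_good = good_to_degraded(t) * p_good * dt
        out_of_degraded = degraded_to_failed(t) * p_degraded * dt
        p_good -= out_of_good
        p_degraded += out_of_good - out_of_degraded
        p_failed += out_of_degraded
    return [p_good, p_degraded, p_failed]


def expected_performance(probabilities: List[float], levels: List[float]) -> float:
    """Probability-weighted performance across all states."""
    return sum(p * level for p, level in zip(probabilities, levels))


LEVELS = [1.0, 0.6, 0.0]  # assumed output in each state, as a fraction of rated

# Constant rates (per year): the traditional assumption.
constant = state_probabilities(lambda t: 0.10, lambda t: 0.20, horizon=10.0)

# Time-dependent rate: degradation speeds up as the asset ages (assumed law).
aging = state_probabilities(
    lambda t: 0.10 * (1 + 0.1 * t), lambda t: 0.20, horizon=10.0
)

print("constant rates:", constant, expected_performance(constant, LEVELS))
print("aging rates:   ", aging, expected_performance(aging, LEVELS))
```

With constant rates, the probability of remaining in the good state decays as a simple exponential. With the assumed aging law it falls faster, an effect that a constant-rate model fitted to early-life data would miss.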

The challenges discussed above call for a holistic approach able to simultaneously consider key influence factors and create an adequate system-level framework for working and decision-making. Shifting our attention from an element-oriented view to an interaction-oriented, holistic view will provide a better understanding of complex future power systems and the emergent phenomena characterizing them. This paradigm shift will pave the way for new solutions to both long-standing and emergent problems. We believe that the concepts of structured Asset Management (AM) and resilience, combined, may provide an efficient framework in this regard. AM is a part of the larger concept of resilience management and is associated with all four phases of resilience shown in Figure 5.

Figure 5 shows the proposed holistic methodology, which integrates the relevant functions and activities and connects them with the four main phases of the resilience concept.

This approach incorporates and aligns the following functions and activities (Figure 5): 

1.  Operation, maintenance, monitoring and inspection management: A solid performance in these corporate functions ensures a safe and secure environment and contributes positively to all four resilience phases (plan, absorb, recover, and adapt)

2.  General management, risk and emergency management, operating experience feedback, and review of lessons learned: These functions are essential to overall resilience management, and solid performance in them increases resilience in all four phases. For risk assessment and management, following the general prescriptions of ISO 31000 is recommended, but other approaches may also be used. Solid performance in these functions and activities also enhances a company’s overall performance by improving its ability to control and manage its main functions and processes. In this regard, attention should be paid to securing and improving human and organizational performance, since weaknesses there may also be major contributors to accidents and disasters

3.  Establishing adequate equipment and system design criteria; R&D and innovation: Given assets’ long-term operation in a dynamic environment, it is crucial to foresee future operating conditions as accurately as possible and include them in the design criteria. Because uncertainties regarding future conditions are significant, sufficient safety margins should be provided. For that purpose, R&D and innovation are vital to improving and gathering knowledge that can contribute to a better understanding and modeling of lesser-known phenomena

4.  Legal and regulatory framework: This element is not under any individual company’s control. However, an adequate legal and regulatory framework is of key importance because it contributes greatly to safe operation, monitoring and inspections through legal and regulatory provisions, guidelines and requirements

The relationships between these four functions and activities are multiple and of different natures (physical, spatial, informational, resource-related, etc.). Consequently, changes in their nature or characteristics may have unexpected impacts that are often non-linear due to the complexity of their interrelationships.

The overall resilience strategy is therefore composed of an array of interacting and interdependent activities and functions, both within the multilevel structure of a company and outside its walls. The cumulative effect of these actions is superior to the sum of their individual effects. It is important to ensure a positive overall performance in all the activities presented in Figure 5, because good or excellent performance in a single activity usually cannot make up for serious weaknesses in other areas.

AM is defined by ISO Standard 55000 as a set of coordinated activities implemented by an organization to realize the value of its assets. These activities have their own unique characteristics and models, which interact to translate raw data into knowledge and then into appropriate actions. Complete AM includes six main groups of activities: strategy and planning, decision-making, life-cycle management, asset information, organization and human resources, and risks and review. The reliability program corresponds to the life-cycle management (LCM) portion of the AM process.

The proposed reliability program leverages the nuclear power industry’s positive experience in this area. The program’s fundamental objective is to bring together all its activities, including reliability studies, to demonstrate the fitness for service of key assets and the gains in productivity, effectiveness, efficiency, and cost optimization throughout the asset life cycle for existing and future generating stations and substations, including the contribution to situational awareness. The reliability program on its own is not very meaningful since it cannot demonstrate its full value and benefits without being incorporated into a larger framework such as AM and resilience. 

The reliability program should encompass the following key activities:

  • Develop a methodology to identify critical assets and systems
  • Determine critical assets and systems
  • Identify reliability objectives for critical assets
  • Identify and describe possible failure modes
  • Determine and specify the minimum capacities and minimum performance levels of assets and their systems
  • Identify performance gaps and define remediation activities
  • Provide information on maintenance and inspection programs
  • Carry out reliability modeling, including developing a reliability database
  • Monitor the performance and reliability of assets, systems and people
  • Assess the reliability of critical assets and systems
  • Implement the reliability program
  • Document reliability program results to demonstrate its effectiveness and produce the relevant reports
  • Document the reliability program

The previous sections set out several criteria to be met. Meeting them is made harder by additional challenges and issues that further complicate traditional reliability studies. This set of challenges and issues is the main justification for developing a reliability program. Table 1 shows some of these challenges and issues and their tie-ins with the reliability program, life-cycle management and asset management.


HQ aspires to make every generating station and substation autonomous, based on the concept of situational awareness, while implementing the new control center concept to maintain the overall stability of the electricity grid.

Achieving this objective will involve three main phases of work. At all times, they will reflect changes in international standards, industry best practices, and the potential establishment of a canonical data model currently under consideration. 

Various challenges, issues, and events require the establishment of a reliability program within a global AM framework to demonstrate productivity, effectiveness, efficiency, and cost optimization gains for all aspects of an innovation initiative or project, right up to maintenance and operations. 

This article presents the key elements of the concept’s long-term development.


Jean Raymond has worked at Hydro-Québec for more than 31 years, both as a telecom engineer in several divisions of the company and as an engineer responsible for the evolution and long-term development of its transmission and electricity generation systems. He has developed considerable expertise in system reliability over the years. Jean holds a bachelor’s degree in Electrical Engineering from Université Laval and a master’s degree and PhD in Electrical Engineering (Telecommunications) from Université Laval, in collaboration with the Defense Research Establishments. He also has nearly a dozen publications to his credit. Jean is an active member of international working groups (IEC, IEEE) and the convener of IEC TC57/AG22, “Prepare for the Future.”

Dragan Komljenovic received a BSc from the University of Tuzla, an MSc from the University of Belgrade, a first PhD from Université Laval in 2002 and a second PhD from Université du Québec à Trois-Rivières in 2018. He works as a researcher at Hydro-Québec’s research institute (IREQ) in the field of reliability, asset management and risk analysis. He has also worked as a reliability and nuclear safety engineer at the Gentilly-2 nuclear power plant as well as at Hydro-Québec. Dragan collaborates with several universities and has published more than 100 refereed journal and conference papers. He is a Fellow of the International Society of Engineering Asset Management (ISEAM).