Resilience is defined as the ability to recover quickly from a setback or other adversity: literally, the ability to bounce back. So how should we think about resilience in a computer network environment?
This article discusses four factors to weigh when planning for network resiliency, as well as how businesses can build redundancy into their network infrastructure.
1. Everything Fails
The first step in designing a resilient network is understanding the reality that everything fails: routers, switches, circuits, cables, small form-factor pluggable (SFP) modules, and even interconnects. Regular network maintenance is therefore necessary. It keeps systems at proper software levels, allows security patches to be applied, and provides a window for scheduled hardware maintenance and replacement.
2. Operating hours
Second, network teams need to think about the operating hours of the environment. For example, an office network might not have users after hours or on weekends. This type of network may have strict requirements for reliability and availability during normal hours, but it can be taken down for maintenance after hours. Other environments, such as data centers or life safety and security systems (for example, 911 centers and hospitals), need to operate 24/7. The design of these networks must therefore account for both failures and the ability to keep operating during maintenance.
3. Virtualization, cloud and SaaS applications
The next step is to consider the effect of virtualization, cloud, and SaaS application suites. It may seem that the availability of cloud-based applications is beyond the control of IT, but that is not the case. For example, AWS goes to great lengths to advise customers on the availability their applications can achieve. Applications offer significantly different service-level agreements (SLAs) to users depending on how they are hosted: in a single Availability Zone, across multiple Availability Zones, or across multiple Regions. How companies and their customers connect to cloud or SaaS providers is also important.
4. Reliable remote connectivity
Finally, in the age of the COVID-19 pandemic, businesses need to think about the reliability of their remote connectivity. Does connectivity depend on a primary and a secondary VPN concentrator, or is it distributed across a group of systems with enough capacity to take individual members out of service for maintenance?
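The capacity question above can be made concrete with a quick check. The following is a minimal sketch, not a sizing tool: the function name, the session counts, and the worst-case assumption that the largest members are the ones taken offline are all hypothetical.

```python
def survives_maintenance(capacities, peak_sessions, offline=1):
    """Return True if the pool still carries peak load with `offline` members out.

    Worst case is assumed: the largest members are the ones taken offline.
    """
    if offline >= len(capacities):
        return False  # nothing left to carry traffic
    remaining = sorted(capacities)[: len(capacities) - offline]
    return sum(remaining) >= peak_sessions


# Hypothetical figures: sessions each concentrator can terminate.
PEAK_SESSIONS = 1500
active_pair = [1000, 1000]            # two concentrators sharing the load
pool_of_four = [500, 500, 500, 500]   # same total capacity, spread wider

print(survives_maintenance(active_pair, PEAK_SESSIONS))   # False: 1000 < 1500
print(survives_maintenance(pool_of_four, PEAK_SESSIONS))  # True: 1500 >= 1500
```

Both layouts have the same total capacity, but only the wider pool can lose a member to maintenance and still carry the assumed peak.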
Create redundancy at all layers
So how do teams go about building a resilient network design? Ultimately, it is important to understand that redundancy is just a tool to create resiliency.
Dozens of books are filled with advice on resilient network design techniques; I recommend Computer Networking Problems and Solutions by Russ White and Ethan Banks. But the bottom line with resiliency is that companies need to apply redundancy to all layers of their infrastructure. This means designing with modularity and maintaining physical and logical separation between functional elements.
While site availability and resiliency can be established with circuit and component redundancy, applications that require continuous availability should be designed to be distributed across multiple data centers and Availability Zones. This allows operation of the application during AWS, VMware, or other maintenance at any given location.
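A rough way to see why distributing an application pays off is the standard availability arithmetic for redundant components. The sketch below assumes failures are independent, which real sites sharing power, carriers, or software often violate, and the 99.9% per-site figure is a hypothetical input, not a vendor SLA.

```python
import math


def parallel_availability(availabilities):
    """Availability when any one of the redundant components is sufficient.

    Assumes independent failures: the system is down only if all are down.
    """
    all_down = math.prod(1.0 - a for a in availabilities)
    return 1.0 - all_down


def serial_availability(availabilities):
    """Availability when every component in the chain must be up."""
    return math.prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


SITE = 0.999  # hypothetical availability of a single site

one_site = SITE
two_sites = parallel_availability([SITE, SITE])

print(f"one site:  {downtime_minutes_per_year(one_site):.1f} min/year")
print(f"two sites: {downtime_minutes_per_year(two_sites):.2f} min/year")
```

Under these assumptions a single site is down roughly 526 minutes a year, while two independent sites together are down well under one minute. The serial function shows the opposite effect: every non-redundant element in the path lowers the total.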
The most important component of this paradigm is network automation. Automation is how teams reduce the exposure of changes to human error. Script sets require rigorous review, and all changes require proper documentation and testing. Any given change requires a minimum set of scripts: one to apply the change and another to test and validate it. Finally, teams need a plan to handle exceptions and a backup script to roll the environment back to its pre-change baseline.
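The apply, validate, and roll back script set described above can be sketched as a small workflow. This is an illustration only: the ChangeSet structure, the run_change function, and the in-memory dictionary standing in for a device configuration are all hypothetical, and a real implementation would call whatever automation tooling the team uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChangeSet:
    """The minimum script set for one change (hypothetical structure)."""
    name: str
    apply: Callable[[], None]      # makes the change
    validate: Callable[[], bool]   # tests that the change took effect
    rollback: Callable[[], None]   # restores the pre-change baseline


def run_change(change: ChangeSet) -> str:
    """Apply a change, validate it, and roll back on failure or exception."""
    try:
        change.apply()
        ok = change.validate()
    except Exception as exc:
        print(f"{change.name}: exception during change: {exc}")
        ok = False
    if ok:
        return "applied"
    change.rollback()
    return "rolled_back"


# Stand-in for a device: a dictionary instead of a real configuration.
device = {"mtu": 1500}
baseline = dict(device)  # captured before any change is made


def apply_mtu() -> None:
    device["mtu"] = 9000


def validate_mtu() -> bool:
    return device["mtu"] == 9000


def rollback_mtu() -> None:
    device.clear()
    device.update(baseline)


result = run_change(ChangeSet("raise-mtu", apply_mtu, validate_mtu, rollback_mtu))
print(result, device)
```

Keeping the three scripts together in one reviewed unit means the rollback path is written and tested before the change window, not improvised during an outage.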