Human error in network operations and how to deal with it

Network outages can often be traced to four error-prone activities: fault analysis and response, configuration changes, scaling and failover, and security policies.

You might have been alarmed to read recently that half of all network problems are due to human error. Well, the bad news is that’s only true of the number of problems. If you look at the hours of degraded or failed operation, three-quarters of it is due to human error. Furthermore, the great majority of degraded or failed operation can be traced to four specific activities:

  • Fault analysis and response, which network professionals and their management say creates 36% of error-induced outage time
  • Configuration changes (attributed to 27% of error-induced outage time)
  • Scaling and failover tasks (attributed to 19% of error-induced outage time)
  • Security policies (attributed to 18% of error-induced outage time)

Not surprisingly, network professionals are eager to find remedies for each of the four primary culprits. Before that can happen, it’s important to understand why the human error occurs.

My research points to a handful of specific errors that are committed, and these errors are associated with more than one of the four activities. In fact, almost all the common errors can impact all of the activities, but it’s best to focus on those error conditions that are the major contributors to outage time. They are:

  • Events overwhelm the operations staff
  • Operations staff “loses the picture”
  • Cross-dependencies between IT/software configuration and network configuration
  • Incorrect, incomplete, and dated documentation
  • Troublesome gear
  • Under-qualified and under-trained staff

Event flood

The first of our error causes, cited as a problem by every enterprise I’ve talked with, is that events overwhelm the operations staff. Most planned improvements to network operations centers (NOC) focus on trying to reduce “event load” through things like root cause analysis, and AI tools (not generative AI) hold a lot of promise here. However, enterprises say that most of these overload errors are caused by lack of a single person in charge. Ops centers often go off on multiple tangents when there’s a flood of alerts, and this puts staff at cross-purposes. “If you divide your NOC staff by geographic or technical responsibility, you’re inviting colliding responses,” one user said. A NOC coordinator sitting at a “single pane of glass” and driving the overall response to a problem is the only way to go.
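To make the idea concrete, here’s a minimal sketch, in Python, of the kind of event-load reduction such tools aim for: raw alerts are grouped by a probable-cause key (here, device plus symptom) so a NOC coordinator reviews a handful of incidents rather than hundreds of events. The alert fields and the grouping rule are illustrative assumptions, not a description of any particular product.

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical alert record; real NOC tools carry far richer fields.
@dataclass
class Alert:
    device: str
    symptom: str   # e.g. "link-down", "bgp-flap"
    detail: str

def correlate(alerts):
    """Group raw alerts by an assumed probable-cause key (device, symptom)
    so the coordinator reviews incidents, not individual events."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[(alert.device, alert.symptom)].append(alert)
    return incidents

flood = [
    Alert("core-sw-1", "link-down", "Gi1/0/1"),
    Alert("core-sw-1", "link-down", "Gi1/0/2"),
    Alert("edge-rtr-3", "bgp-flap", "peer 203.0.113.7"),
]
for (device, symptom), events in correlate(flood).items():
    print(f"{device} {symptom}: {len(events)} related events")

Even this crude grouping shows why one coordinator looking at the consolidated view is better positioned than several people chasing individual alerts.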

Losing the picture

Event floods relate to the second of our error causes: the operations staff “loses the picture,” which is reported by 83% of enterprises. In fact, NOC tools that filter errors or suggest root causes can contribute to this problem by disguising some potential issues or creating tunnel vision among the NOC staff. According to enterprises, people making “local” changes regularly forget to consider the impact of those changes on the rest of the network. They suggest that before any configuration changes are made anywhere, even in response to a fault, the rest of the NOC team should be consulted and should sign off on the approach.
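One hypothetical way to make that sign-off rule enforceable is a simple change-record gate like the sketch below: a local change can’t be applied until every required reviewer on the NOC team has acknowledged it. The record fields and reviewer names are assumptions for illustration.

from dataclasses import dataclass, field

# Hypothetical change record; the fields are illustrative, not a standard schema.
@dataclass
class ChangeRequest:
    device: str
    description: str
    required_reviewers: set
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        if reviewer in self.required_reviewers:
            self.approvals.add(reviewer)

    def may_apply(self) -> bool:
        # Block the change until every required NOC reviewer has signed off.
        return self.required_reviewers <= self.approvals

change = ChangeRequest("edge-rtr-3", "raise OSPF cost on Gi0/1",
                       required_reviewers={"noc-lead", "routing-team"})
change.approve("noc-lead")
print(change.may_apply())   # False until routing-team also signs off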

Network/IT dependencies

Just over three-quarters of enterprises say that cross-dependencies between IT/software configuration and network configuration are a significant source of errors. Almost all of these users say that they’ve experienced failures because application hosting or configuration was changed without checking whether the changes could impact the network (the reverse, a network change made without checking its impact on applications, is reported by only half as many). Overall, this source of human error is responsible for nearly all the problems with configuration changes and most of the problems with scaling and failover. Enterprises think that the best solution to this problem is to coordinate explicitly between IT and network operations teams on any changes in application deployment or network configuration.

That can reduce problems but won’t do much to find and fix some that slip through. The solution to that is to improve application observability within the NOC, something only a quarter of enterprises say they support. If there’s an overall NOC coordinator with a network single-pane-of-glass, then that pane should also provide an overview of application state, at least in terms of input/output rates. Users also suggest that any time steps are taken to change a network/IT configuration, parallel steps to reverse the changes should be prepared.
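The advice to prepare reversal steps in parallel can be reduced to a simple rule: no change plan is accepted unless every apply step carries a matching rollback step, and rollback runs in reverse order. The sketch below is one minimal way to express that; the step format is an assumption.

# Minimal sketch: each apply step must carry a rollback step, so a failed
# network/IT change can be reversed in the opposite order it was applied.
def validate_plan(steps):
    """steps is a list of (apply_cmd, rollback_cmd) pairs; reject the plan
    if any rollback is missing."""
    for apply_cmd, rollback_cmd in steps:
        if not rollback_cmd:
            raise ValueError(f"no rollback defined for: {apply_cmd}")
    return steps

def roll_back(steps):
    for _, rollback_cmd in reversed(steps):   # undo in reverse order
        print(f"would run: {rollback_cmd}")

plan = validate_plan([
    ("add vlan 120 on core-sw-1", "delete vlan 120 on core-sw-1"),
    ("move app-frontend to cluster-b", "move app-frontend back to cluster-a"),
])
roll_back(plan)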

Documentation

The next error cause is one most users sympathize with, even though only 70% say it results in significant network outages. Incorrect, incomplete, and dated documentation on operations software and network equipment is sometimes a root cause in itself, but it more often contributes to operations confusion. A third of enterprises say that their operations library “should be better organized and maintained,” and I suspect that’s true of almost every operations library. A little less than ten percent of enterprises say they really don’t have a formal library at all.

For a problem that’s reported this often, the solution is fairly easy: enterprises need both a formal technical library and a technical librarian responsible for checking regularly with vendors to keep it up to date. One in five enterprises say they have a “procedure” for library maintenance, but fewer than half that number say they have even a part-time librarian, and frankly I don’t believe the real number is even that high. The library should also collect anecdotal sources like tech media, filing stories and documents under the proper vendor/product information. That means having anyone who follows tech publications feed appropriate material to the tech librarian.
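A formal library doesn’t have to start with elaborate tooling; even a simple index that records when each vendor document was last verified tells the librarian what’s going stale. The index format and the 180-day review interval below are assumed purely for illustration.

from datetime import date, timedelta

# Hypothetical library index: document title -> (vendor, date last verified).
LIBRARY = {
    "core-sw-1 upgrade guide": ("VendorA", date(2023, 1, 10)),
    "edge-rtr-3 config reference": ("VendorB", date(2022, 6, 2)),
}

REVIEW_INTERVAL = timedelta(days=180)   # assumed review cadence

def stale_entries(today=None):
    """Return documents the librarian should re-verify with the vendor."""
    today = today or date.today()
    return [(title, vendor) for title, (vendor, verified) in LIBRARY.items()
            if today - verified > REVIEW_INTERVAL]

for title, vendor in stale_entries():
    print(f"re-verify '{title}' with {vendor}")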

Troublesome gear

Next on our list is a troublesome piece of equipment or service connection. Remember the old “cry wolf” story? Repeated problems that generate events not only tend to immunize operations people to the specific problem but also can desensitize them to the event type overall. A repeated line-error problem, for example, may cause the staff to overlook line errors elsewhere. Only 23% of enterprises say this is a significant problem, but everyone who has a device or service that constantly generates attention-demanding events says it has caused their staff to overlook something else. The solution is to change out gear that creates repeated alerts and to report service issues to the provider, escalating the complaint as needed. NOC procedures should require that a digest of faults be prepared at least once per shift and reviewed to spot trouble areas.
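The per-shift digest can be as simple as a count of faults per device or circuit, sorted so the repeat offenders stand out rather than fading into background noise. The fault records and the flagging threshold below are illustrative assumptions.

from collections import Counter

# Hypothetical shift log: (source, event_type) pairs collected during one shift.
shift_log = [
    ("wan-ckt-7", "line-error"), ("wan-ckt-7", "line-error"),
    ("wan-ckt-7", "line-error"), ("core-sw-1", "fan-fail"),
]

NOISY_THRESHOLD = 3   # assumed cutoff for "constantly generating events"

def shift_digest(log):
    """Count faults per source so repeat offenders get reviewed, not tuned out."""
    counts = Counter(source for source, _ in log)
    for source, n in counts.most_common():
        flag = "  <-- review / escalate" if n >= NOISY_THRESHOLD else ""
        print(f"{source}: {n} faults{flag}")

shift_digest(shift_log)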

Staff, skills and training

Last on our list is under-qualified and/or under-trained staff, but it’s not last because it’s least. This problem is cited by just under 85% of enterprises, and I suspect from my longer-term exposure that it’s even more widespread than that. There are two faces to this problem. First, the staff may not be able to handle their jobs properly because they lack general skills and training. Second, the staff may have issues with a new technology that’s been introduced, whether a feature, a package, or a piece of equipment.

Addressing the first face of the problem, according to enterprises, requires thinking in terms of “apprenticeship.” A new employee should serve a period under close supervision, during which they’re trained in an organized way on the specific requirements of your own network, its equipment, and its management tools. The apprenticeship might be extended to add formal training if required, and it doesn’t end until the mentor signs off. Certifications, which enterprises say are helpful for the second face of the problem, aren’t as useful for the first. “Certifications tell you how to do something. Mentoring tells you what to do,” according to one network professional.

Mapping errors to error-prone activities

What’s the impact of errors on the four error-prone activities? Below is a breakdown of the four activities, the specific errors committed, and enterprise IT professionals’ views on how often the errors happen and how serious they are. (For my research, a common occurrence is one that’s reported at least monthly, an occasional occurrence four to six times a year, and a rare occurrence once a year or less. A serious impact refers to a major disruption, and a significant impact refers to an outage that impacts operations.)

Fault analysis and response

Event flood: Common occurrence, serious impact
Losing the picture: Common occurrence, serious impact
Network/IT dependencies: Occasional occurrence, serious impact
Documentation: Common occurrence, serious impact
Troublesome gear: Occasional occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact

Configuration changes

Event flood: Rare occurrence, significant to serious impact
Losing the picture: Common occurrence, significant impact
Network/IT dependencies: Common occurrence, serious impact
Documentation: Occasional occurrence, significant impact
Troublesome gear: Rare occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact

Scaling and failover

Event flood: Occasional occurrence, serious impact
Losing the picture: Occasional occurrence, significant impact
Network/IT dependencies: Common occurrence, serious impact
Documentation: Occasional occurrence, significant impact
Troublesome gear: Occasional occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact

Security policies

Event flood: Rare occurrence, serious impact
Losing the picture: Occasional occurrence, serious impact
Network/IT dependencies: Occasional occurrence, serious impact
Documentation: Occasional occurrence, significant impact
Troublesome gear: Rare occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact

Gauging the impact

How can enterprises organize the solutions to all these issues? The first step is to plot your own network problems in a similar way. Focus on the areas where the problems have the greatest impact. The second step is to look for tools and procedures that address specific problems, not ones that “improve” management or serve some other vague mission. Layers of tools with marginal value can become a problem in their own right. The third step is to test any changes systemically, even though you’ve justified them with a specific problem in mind. It’s not uncommon to find that a solution to one problem can exacerbate another.
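As a minimal sketch of that first step, the snippet below tallies a hypothetical incident log against the same activity and error-cause categories used above; your real input would be your own trouble tickets or outage reports.

from collections import Counter

# Hypothetical incident records: (activity, error cause, impact).
incidents = [
    ("configuration changes", "network/IT dependencies", "serious"),
    ("fault analysis and response", "event flood", "serious"),
    ("configuration changes", "network/IT dependencies", "significant"),
]

# Rank activity/cause combinations by how often they appear.
by_cell = Counter((activity, cause) for activity, cause, _ in incidents)
for (activity, cause), count in by_cell.most_common():
    print(f"{activity} / {cause}: {count} incidents")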

Don’t fall into a simplification trap here. “Top-down” or “certification” or “single-pane-of-glass” aren’t fail-safe. They may not even be useful. Your problems are a result of your situation, and your solutions have to be tuned to your own operations. Take the time to do a thoughtful analysis, and you might be surprised at how quickly you could see results.

Copyright © 2023 IDG Communications, Inc.