Use Machine Learning to Enhance & Advance Network Operations
Network telemetry is the lifeblood of IT operations teams that need to provide great customer and user experiences across a complex mix of cloud, Web-enabled, and traditional applications.
However, common practices rely on static approaches and manual conﬁguration to capture telemetry data. Operations teams must manually deﬁne and conﬁgure thresholds per device and, in many cases, per interface, all based on their understanding of the network environment and policies.
This creates enormous complexity as the enterprise environment scales in personnel, locations, applications, and network connectivity.
IT teams may also struggle to make sense of all the data they capture. Whether it’s ﬂow records, SNMP traps, or log data, IT must take raw information and turn it into actionable, contextual insights.
The entire IT Operations team remains in reactive mode if all it has to work with are simple ‘if, then’ statements associated with threshold crossings that generate an alert after an event has occurred.
Dashboards can aggregate statistics and provide graphs and charts, while alarms and alerts can warn of immediate problems, but engineers and admins still need help to distinguish what’s important from what’s interesting. At gigabit speeds across multiple infrastructures, owned by several service providers, what is normal behavior?
Monitoring and management tools need to evolve to meet these new demands and to do a better job of transforming the ﬁrehose of data into operator-relevant information.
That’s the promise of machine learning. Machine learning is a set of techniques that enable computers to “learn”—that is, identify patterns and establish relationships in a data set—without being explicitly programmed.
These techniques can be applied to computer networks; rather than setting an alert threshold on each interface, let the machine learning algorithms learn behaviors from the aggregated data across the entire enterprise.
A machine learning system uses algorithms to identify patterns. These algorithms ingest data, build a model or models, and then identify baseline behaviors and anomalies. There are multiple techniques for detecting anomalies, such as Isolation Forests, clustering, and so on.
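The ingest, model, and detect steps described above can be illustrated with a small sketch. It is not an Isolation Forest or a clustering method, and it is not LiveAction's algorithm; it uses a much simpler robust baseline (median and median absolute deviation) as a stand-in. The sample values, function names, and the 3.5 cutoff are assumptions made for illustration.

```python
from statistics import median

def build_baseline(samples):
    """Learn a robust baseline: the median and the median absolute deviation."""
    center = median(samples)
    mad = median([abs(x - center) for x in samples])
    return center, mad

def find_anomalies(samples, center, mad, cutoff=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the cutoff."""
    if mad == 0:
        # Degenerate baseline: anything different from the center stands out.
        return [(i, x) for i, x in enumerate(samples) if x != center]
    return [(i, x) for i, x in enumerate(samples)
            if 0.6745 * abs(x - center) / mad > cutoff]

# Hypothetical per-minute throughput (Mbit/s) for one application.
history = [48, 52, 50, 49, 51, 53, 47, 50, 52, 49]
center, mad = build_baseline(history)

# New observations are scored against the learned baseline.
new_samples = [51, 48, 95, 50]
anomalies = find_anomalies(new_samples, center, mad)
print(anomalies)  # [(2, 95)]
```

A production system would likely replace the median-based model with one of the techniques named above, but the flow stays the same: learn from history, then score new data against what was learned.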
A key advantage of machine learning is that it can eliminate the ‘barking dog’ eﬀect of alerts; an event crosses a threshold, then falls back, then crosses again, all the while generating more alerts for the IT team to sort through.
Machine learning techniques can be applied to all kinds of data sets, including ﬂow records generated by network devices such as routers, switches, and access points, as well as synthetic data from monitoring tools.
Note that machine learning doesn’t exclude human interaction or make a professional’s expertise irrelevant. In fact, many machine learning algorithms can incorporate expert feedback to help the system reﬁne its models. The method is commonly known as semi-supervised learning.
Machine Learning and AI (Artificial Intelligence)
Machine learning is a subset of the broader ﬁeld of artiﬁcial intelligence (AI).
While the distinctions between machine learning, AI, and other data science techniques (deep learning, neural networks, and so on) are outside the scope of this paper, IT vendors are incorporating such techniques into their software and solutions to help users make better use of the ever-growing volumes of data available to them.
In fact, Gartner predicts that “AI” technologies will be in every new vendor software release by 2020. (https://www.gartner.com/newsroom/id/3763265)
The analyst ﬁrm also says 30 percent of CIOs will make AI a top-ﬁve investment priority within the next two years.
Machine Learning vs. Traditional Monitoring: Three Differences
Traditional monitoring and management systems can spot anomalies and send alerts, but machine learning advances the state of the art in three key ways.
First, in a traditional monitoring or management system, it’s up to the administrator to select the rules and set the thresholds that will trigger alarms and alerts.
The administrator must decide what values to set, may have to spend a substantial amount of time learning the vendor’s ruleset, and then must cobble together a set of metrics to fit his or her own network. The monitoring system then correlates network events and matches them to the administrator-defined rules.
In a machine learning system, the algorithms automatically analyze aggregated data derived from the network’s own flow records. The system develops dynamic boundaries (e.g., min/max) based on observable patterns of normal and abnormal behavior for a given use case or scenario.
Thus, rather than working from a set of static rules, the machine learning system is trained to derive operational insight from the actual behavior of the network. The system then generates alerts based on deviations from the norm.
The key value propositions are:
- Automated pattern recognition on aggregated network data, rather than individual policies on each device and potentially each interface
- Learned behavior over time, allowing for richer insights into the data sets for the operations, engineering, and architecture teams
- A defined range (e.g., min/max) that describes what is normal, which averts the classic ‘barking dog’ problem so that only relevant insights are shared
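To make the contrast between a static threshold and a learned range concrete, here is a minimal sketch. The utilization figures, the 10 percent margin, and the function names are illustrative assumptions, not values or APIs taken from any product.

```python
def static_alerts(samples, threshold):
    """Traditional rule: raise an alert for every sample above a fixed threshold."""
    return [i for i, x in enumerate(samples) if x > threshold]

def learn_band(history, margin=0.10):
    """Learn a min/max band from history, widened by a small safety margin."""
    low, high = min(history), max(history)
    pad = (high - low) * margin
    return low - pad, high + pad

def band_alerts(samples, band):
    """Raise an alert only when a sample leaves the learned band."""
    low, high = band
    return [i for i, x in enumerate(samples) if x < low or x > high]

# Hypothetical link utilization (%) that normally hovers around 80.
history = [78, 82, 79, 83, 81, 77, 84, 80]
live = [79, 82, 78, 83, 81, 97, 80]

static = static_alerts(live, threshold=80)   # flaps around the threshold
learned = band_alerts(live, learn_band(history))

print(static)   # [1, 3, 4, 5]
print(learned)  # [5]
```

The static rule fires four times because normal traffic keeps crossing the line, while the learned band reports only the sample that is genuinely outside the usual range.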
A second crucial difference is human feedback. When the machine learning system identifies a deviation and alerts an operator, the operator can tell the system whether the alert is important. In essence, the operator teaches the system to focus on particular ranges, matches, or patterns of behavior.
If the operator tags an alert as not important, the system will capture this feedback and assign a special tag to all original data points that were used to generate these machine learning insights. From there the system can use this new data point to adjust; for example, relaxing min/max bounds for network traﬃc deviation alerts.
For example, if a ten-millisecond bandwidth spike of 1 Mbit is labeled as not important, the system will simply log a similar occurrence but not alert an operator. But a ten-millisecond spike of 2 Mbit might generate a new insight.
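The spike example can be sketched as a tiny feedback loop. The class name, the starting bound of 0.5 Mbit, and the rule of raising the bound to the dismissed spike size are all assumptions chosen for illustration; a real system would adjust its model in a more nuanced way.

```python
from dataclasses import dataclass, field

@dataclass
class SpikeDetector:
    max_spike_mbit: float                      # current upper bound on spike size
    log: list = field(default_factory=list)    # spikes recorded without alerting

    def observe(self, spike_mbit):
        """Alert on spikes above the bound; otherwise log them quietly."""
        if spike_mbit > self.max_spike_mbit:
            return "alert"
        self.log.append(spike_mbit)
        return "logged"

    def feedback(self, spike_mbit, important):
        """Relax the bound when an operator tags an alert as not important."""
        if not important:
            self.max_spike_mbit = max(self.max_spike_mbit, spike_mbit)

detector = SpikeDetector(max_spike_mbit=0.5)

first = detector.observe(1.0)             # above the bound, so it alerts
detector.feedback(1.0, important=False)   # operator: "not important"
second = detector.observe(1.0)            # the same spike is now only logged
third = detector.observe(2.0)             # a larger spike still alerts

print(first, second, third)  # alert logged alert
```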
If an operator indicates that a deviation is important, the system incorporates that feedback into its model to further reﬁne its insights and dig deeper into the data sets.
Thus, an operations team can train a machine learning system over time to develop a ﬁne-grained and ﬂexible “eye” for what’s relevant and important to engineers and administrators. In other words, a machine learning system’s rules are adaptive.
By contrast, traditional monitoring systems are static. Regardless of how traﬃc patterns or application behavior might change over time, or how SLAs or business demands are adjusted, the rules are ﬁxed and thresholds are locked in.
Rules will be triggered and alarms fired whether or not the information is relevant to a network engineer, which leads to the ‘barking dog’ syndrome discussed earlier.
In addition, the machine learning system may identify patterns and behaviors that an engineer or administrator hadn’t considered but that are still relevant. In comparison, a deterministic, rules-based system wouldn’t surface this information if a rule hasn’t been set.
In other words, a machine learning system may tell you things you didn’t know you wanted to know. For example, over time the system may learn that an alert coming from the executive office floor is probably more important than the same alert coming from an open space hosting a transient visitor.
This leads to a third diﬀerence. In a traditional monitoring or management system, an administrator has to enter or adjust rules and thresholds by hand every time the business adds a new application or service, or when performance requirements change.
This work also has to happen when new network segments come online (for example, when the company opens a branch office or acquires another business).
A machine learning system automatically incorporates new information into its model. The system identiﬁes new applications and new network paths, tracks performance, and begins to recognize patterns without engineers or administrators having to do anything but point ﬂow records to the cloud service.
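One way to picture this is a set of per-application models that are created the first time an application appears in the flow data, with no rule entered by hand. The sketch below uses running statistics (Welford's online algorithm) as the "model"; the record fields and application names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

class RunningStats:
    """Online mean and variance (Welford's algorithm), updated one sample at a time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stdev(self):
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

# A model is created automatically the first time an application is seen.
models = defaultdict(RunningStats)

flow_records = [
    {"app": "voip", "kbps": 64},
    {"app": "voip", "kbps": 66},
    {"app": "backup", "kbps": 9000},
    {"app": "voip", "kbps": 62},
    {"app": "new-crm", "kbps": 300},   # newly deployed application
]
for rec in flow_records:
    models[rec["app"]].update(rec["kbps"])

for app in sorted(models):
    print(app, models[app].n, round(models[app].mean, 1))
```

The newly deployed application gets its own baseline as soon as its first flow record arrives, which is the behavior the paragraph above describes.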
This capability saves time and reduces the manual grunt work that keeps engineers and operators from other, more meaningful, tasks.
Machine Learning Inside LiveAction
LiveAction’s machine learning platform, LiveInsight, is a cloud-based software module that aggregates Flow records collected via LiveNX Nodes (software collectors). The Flow data is sent to a cloud repository, and LiveInsight runs the data through its ML algorithms.
Generally speaking, LiveInsight can learn the global behavior of a customer’s network after a day’s worth of records. It can build application-speciﬁc insights on a week’s worth of data.
LiveAction has designed the product so that it can build multiple ML models to optimize pattern search and data analysis and can ﬂexibly integrate an inventory of other models over time.
Customers access LiveInsight as a module from within the main LiveNX dashboard. Information from LiveInsight is presented to users under the ‘Insights’ tab in the LiveNX Operations Dashboard. Insights are individual use-case-speciﬁc components in the broader LiveNX dashboard application.
Clicking on an insight gives the user details on the event, such as an application path change. From this insight, customers then drill into the LiveNX Operations Dashboard or deeper yet into the Engineering Console to get more details.
In addition, customers can tag the insight in three ways: “important,” “not important,” and “dismiss.” This feedback is the “human in the loop” feature, which lets engineers and administrators train the system and refine the insights that the service will generate.
At present, LiveInsight is geared to detecting anomalies in typical traﬃc and application performance patterns. Anomaly detection can identify problems that require immediate attention.
And because engineers and administrators can train the system on what’s relevant and what’s not, over time the insights become more actionable; if the system is throwing an alert, it’s likely worth further investigation. Examples of key insights include:
The system will send an alert if application behavior falls outside of its typical traﬃc pattern boundaries learned over time. The boundaries of minimum and maximum adjust according to the learned traﬃc patterns from the network and do not require additional manual policy conﬁgurations.
For organizations that have multiple network paths between sites, the system can alert administrators to changes in typical traﬃc patterns, which may have implications for QoS or pricing.
This is especially useful in SD-WAN or multi-access environments where several service providers and telecom services are preferred for specific applications; for example, an SD-WAN environment that directs voice traffic across an MPLS network and data transfers across the public internet. Being made aware of path changes could help prevent a missed SLA or a poor customer experience.
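A path-change insight of this kind could be approximated as follows: learn the path each application most often takes between two sites, then flag flows that use a different one. The record layout, site names, and function names are assumptions for illustration and do not reflect LiveInsight's internal design.

```python
from collections import Counter, defaultdict

def learn_usual_paths(history):
    """Map each (app, src, dst) to the path it has used most often."""
    counts = defaultdict(Counter)
    for rec in history:
        counts[(rec["app"], rec["src"], rec["dst"])][rec["path"]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

def path_changes(usual, records):
    """Return (key, expected_path, observed_path) for flows off their usual path."""
    changes = []
    for rec in records:
        key = (rec["app"], rec["src"], rec["dst"])
        expected = usual.get(key)
        if expected is not None and rec["path"] != expected:
            changes.append((key, expected, rec["path"]))
    return changes

def flow(app, path, src="HQ", dst="Branch-1"):
    """Build a hypothetical, already-aggregated flow record."""
    return {"app": app, "src": src, "dst": dst, "path": path}

# Voice normally rides MPLS; bulk data normally uses the public internet.
history = ([flow("voice", "MPLS")] * 3 + [flow("voice", "Internet")]
           + [flow("data", "Internet")] * 2)
usual = learn_usual_paths(history)

live = [flow("voice", "Internet"), flow("data", "Internet")]
changes = path_changes(usual, live)
print(changes)  # [(('voice', 'HQ', 'Branch-1'), 'MPLS', 'Internet')]
```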
This insight tallies up the applications in use and bandwidth consumption. It provides a general overview of network behavior and performance. This insight is ideal for identifying any spikes in peak traﬃc that were not expected and providing an early indicator that capacity planning may be required – before user experience degrades.
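The tallying described here amounts to grouping flow records by application and by time interval. A minimal sketch, assuming hypothetical records that carry an application name, a minute index, and a byte count:

```python
from collections import defaultdict

def tally_by_app(records):
    """Total bytes per application, largest consumer first."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["app"]] += rec["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def peak_interval(records):
    """Return (minute, total_bytes) for the busiest interval."""
    per_minute = defaultdict(int)
    for rec in records:
        per_minute[rec["minute"]] += rec["bytes"]
    return max(per_minute.items(), key=lambda kv: kv[1])

records = [
    {"app": "video", "minute": 0, "bytes": 500},
    {"app": "web", "minute": 0, "bytes": 200},
    {"app": "video", "minute": 1, "bytes": 900},
    {"app": "backup", "minute": 1, "bytes": 1500},
    {"app": "web", "minute": 2, "bytes": 100},
]

top_apps = tally_by_app(records)
peak = peak_interval(records)
print(top_apps)  # [('backup', 1500), ('video', 1400), ('web', 300)]
print(peak)      # (1, 2400)
```

Comparing each interval's total against a learned range would then indicate whether a peak was unexpected.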
Human In The Loop
The goal of LiveAction’s machine learning capability is to help engineers and administrators be more productive and focused on creating and maintaining a high-performance experience for their users.
In fact, LiveAction has built its machine learning platform to incorporate operator feedback—what it calls the “Human in the Loop”.
By receiving direct feedback via the LiveAction ML module on the LiveNX platform from engineers and operators, the machine learning system improves its ability to identify anomalous behaviors that an engineer wants to know about.
LiveInsight collects ﬂow records from its users’ network infrastructure and stores those records in the cloud. Then the company’s machine learning algorithms analyze the ﬂow data to build models and identify patterns in bandwidth consumption, latency, application usage, and so on to be presented on simple cards as insights to the operations and engineering teams.
These patterns form a baseline that can then be used to identify deviations. If the LiveInsight module spots a signiﬁcant deviation, it will proactively alert an administrator or engineer.
LiveAction is also delivering additional predictive analysis to the platform; that is, the ability to alert engineers about conditions that might lead to a problem before the problem occurs.
The machine learning system builds labeled data sets of problem periods. By examining data before those problems occurred, the system may be able to anticipate similar occurrences.
For example, if a voice call had jitter or delay, what happened in the system before that jitter or delay occurred? By examining network and application conditions before problems happen, predictive analysis will be able to recognize similar conditions and warn an administrator before those conditions lead to a problem.
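The idea of labeling problem periods and studying what preceded them can be sketched in a few lines. This is a deliberately naive illustration, not the predictive model described in this paper: the "model" is just the lowest average level seen in the windows that preceded a problem. The metric (queue depth), the lookback of three samples, and all function names are assumptions.

```python
def label_lead_up_windows(series, problem_indices, lookback=3):
    """Pair each window of `lookback` samples with whether a problem followed it."""
    problems = set(problem_indices)
    dataset = []
    for end in range(lookback, len(series)):
        window = series[end - lookback:end]
        dataset.append((window, end in problems))
    return dataset

def learn_warning_level(dataset):
    """Naive model: the lowest window average that preceded a problem."""
    pre_problem = [sum(w) / len(w) for w, followed in dataset if followed]
    return min(pre_problem) if pre_problem else None

def should_warn(recent, level):
    """Warn when current conditions resemble those seen before past problems."""
    return level is not None and sum(recent) / len(recent) >= level

# Hypothetical queue depth per interval; voice jitter was reported at 7 and 14.
queue_depth = [2, 3, 2, 3, 8, 9, 10, 3, 2, 2, 3, 7, 9, 11, 2]
dataset = label_lead_up_windows(queue_depth, problem_indices=[7, 14])
level = learn_warning_level(dataset)

print(level)                           # 9.0
print(should_warn([8, 10, 9], level))  # True
print(should_warn([2, 3, 2], level))   # False
```

A real predictive system would train a proper classifier on many features, but the labeled "before the problem" windows are the essential ingredient in both cases.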
Same Data, More Value
Machine learning greatly enhances the value of information that many enterprises already collect, such as ﬂow records. By analyzing this data using machine learning techniques, organizations can extract more value from it.
One area that holds enormous promise is intent-based networking, where the machine learning system continually learns about the state of the underlying network infrastructure to deliver service assurance end to end.
By building behavior-based models from real-world traffic, a machine learning system sets a baseline of typical application and network performance, and can automatically alert an engineer when a metric deviates from that baseline. With growing complexity and the need for more agility, ML offers a way to keep a pulse on the entire environment and guide proactive adjustments.
Engineers then help reﬁne the system by providing feedback on which insights are important and what’s less essential. Over time, this feedback tailors the learning to your speciﬁc environment and needs.
The upshot is that engineers and operations teams have a platform that refines and filters vast quantities of raw data into expert-trained insights that are relevant and actionable, helping them deliver service assurance across the enterprise while decreasing the time to resolution and improving NetOps operations.
For more information about LiveInsight and machine learning, go to https://www.liveaction.com/solutions/network-data-analytics/.