Tidy Data for Network Traffic Analysis: Data Science Behind the Scenes

Security use cases
Detecting phishing attacks
Uncovering data exfiltration
Discovering lateral movement
Revealing anomalies
Determining DNS/IP reputation

Tidy data is a powerful tool for network traffic analysis. In this paper, Live Action Chief Data Scientist Andrew Fast, explores the topic and its cybersecurity applications.

Summary

Data Scientists estimate that they spend 80% of their time finding and cleaning data.

“Tidy” data means that the “shape” of the data matches the assumptions required by the analysis algorithms.

Common data formats for analysis include:

Tables
Timeseries
Graphs

Raw network traffic is an irregular time series with complete events

Irregular time series contain events that arrive at variable intervals
Complex events contain multiple variables

What is Tidy Data?

Tidy data is data that has been fully prepared for input into algorithms for analysis. Each instance in the data has uniform structure with no missing data (unless allowed by the algorithm). The assumptions for each individual algorithm depends on the required “shape” of the data.

In its raw form (whether raw packets or network meta-data), network traffic is an example of an irregular time series with complex events. An irregular timeseries has variable intervals between events. Complex events contain more than one attribute.

Unfortunately, few algorithms are designed to work with irregular time series. Instead, a common strategy is to “project” the data into a simpler format by removing some of the additional complexity. Projection is a term that is used in the relational algebra underlying SQL.

Common formats for analysis include:

Regular timeseries, single variable
Table data (ignoring time)
Graph data

Common Action for Tidying Data

Data Scientists spend 80% of their time tidying data

Action	Purpose
Binning	Reduce dimensionality
Imputation	Fill-in missing data
Join	Combine data from different tables
Aggregation	Group instances by another variable
Projection	Remove variables from the data
Normalize	Reduce data redundancy and improve data integrity

Timeseries Data

Timeseries data is the most common data type for analysis of network traffic. Typically, these analyses are created by aggregating data for a time window (e.g., events per second). This creates a data structure with a single variable occurring at fixed interval. Timeseries analysis also include techniques for accounting for seasonality and temporal trends.

Algorithms and Tasks include

Forecasting
Autoregressive Moving Averages (ARMA/ARIMA)
Anomaly Detection

Security Use Cases

Detect phishing attack using domain burst detection
Detect exfiltration using change point detection in Producer/Consumer Ratio

Timeseries data consist of a single variable at fixed intervals.

Tabular Data

A data table is the most common format for traditional data science analysis. Commonly found in either a relational database or Excel spreadsheet, a table is characterized by multiple variables combined in a row representing a single event or entity. Tables are created using data normalization, joining tables, and projection.

Algorithms and Tasks include

Classification (Supervised Learning)
Clustering (Unsupervised Learning)
Outlier Detection (Unsupervised Learning)

Security Use Cases

Detection of Domain Generation Algorithms using classification
Detect anomalous behavior using unsupervised learning

Tabular data can be easily stored in spreadsheets or database tables.

Graph Data

Representing data as a graph is natural for network data. A graph is a general data structure containing nodes, representing entities in the network, and edges, representing relationships between entities. More advanced graph types support the addition of one or more attributes on both nodes and edges. This rich representation requires specialized algorithms to process effectively.

Algorithms and Tasks include

Community Detection
Collective Classification
Anomalous Link Detection

Security Use Cases

Determine DNS/IP reputation based on network associations
Detect lateral movement with anomalous link detection

Graph data represents the relationships between network entities. Multi-attributed graphs are often projected to simpler data formats for analysis.

Download White Paper

Tidy Data for Network Traffic Analysis: Data Science Behind the Scenes

Download

Tidy Data for Network Traffic Analysis: Data Science Behind the Scenes

Contents

Summary