Tuesday, February 12, 2019

How to monitor and analyze network capacity usage

I've recently been working on an interesting problem that I wanted to share: how to monitor and analyze network capacity usage. I was brought onto the problem because we face a difficult challenge at Optiver: our business is driven by financial markets, which can unexpectedly become busy when, for example, major news events occur. When these market spikes happen, applications across our network suddenly try to send data as quickly as possible to financial exchanges across the world where we trade. These synchronized flows can then collide, overloading the capacity of our network equipment and leading to packet drops and long delays as the entire system retransmits data. These delays can be very costly – packet drops can hold up the distribution of our theoretical prices by multiple seconds, far longer than the timescale at which markets move today.

Image 1: Example of incoming packet rates on a logarithmic scale to three machines (solid lines) vs historical baselines (dashed lines) around the US market open (15:30 Amsterdam) and an unexpected news event (~17:05)

From an infrastructure perspective, there are two ways we could try to solve this problem. On the one hand, we could make a major investment in the capacity of our infrastructure, buying new network switches with deep buffers, additional network lines, etc. On the other hand, it wasn't clear whether we were using our existing capacity wisely – for a long time, many people in the company had assumed our core infrastructure had infinite capacity. The process of gathering the data to check how we used our capacity turned out to be a very interesting and exciting project.

Gathering the Data

When I started this project, I was new to the world of network monitoring. Thankfully I work with a lot of talented engineers who were happy to teach me about the equipment we used, the metrics they thought were important, and what sort of data we could extract. We already had a lot of operational data on the load on our Unix machines and switches, and knew that many of our systems were highly loaded. However, the biggest missing piece was insight into what traffic was flowing over the network. We set out to gather new data to address this question.

One way to see what is traveling over the network is to directly tap and record the traffic. We already use tapping extensively to keep track of exactly what data we send and receive from exchanges. For routine monitoring of our broader production trading network, however, this approach does not scale easily – we use 10 Gbps networking equipment nearly everywhere, which would lead to huge amounts of traffic to store and process. Thankfully, in the networking world, switch vendors have come up with two competing services to provide scalable monitoring of network flows: Sflow and Netflow. Both of these services run on switches, which then report information on their flows to a server. Sflow works by taking random samples of the overall traffic passing through the switch and forwarding those packets to the reporting server. Netflow reports statistics on individual flows, summarizing, for example, that a flow from one IP and port to another carried N packets with a total size of S over 1 second. However, many implementations of Netflow also allow random sampling of packets like Sflow, making the two services largely similar in practice.
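
To work with these reports, the sampled counts have to be scaled back up by the sampling rate to estimate the real traffic. As a minimal sketch of that idea (the field names, sampling rate, and reporting interval here are illustrative, not our actual schema):

    # Rough sketch of turning sampled flow reports into estimated traffic rates.
    # The sampling rate and reporting interval are illustrative values.
    SAMPLING_RATE = 65535        # 1 in 65535 packets sampled
    REPORT_INTERVAL_S = 60       # length of the aggregation window in seconds

    def estimate_flow_rate(sampled_packets, sampled_bytes):
        """Scale sampled counts up to an estimate of the true flow rate."""
        est_packets = sampled_packets * SAMPLING_RATE
        est_bytes = sampled_bytes * SAMPLING_RATE
        return {
            "packets_per_s": est_packets / REPORT_INTERVAL_S,
            "bits_per_s": est_bytes * 8 / REPORT_INTERVAL_S,
        }

    # e.g. 12 sampled packets totalling 18 kB seen in a one-minute window
    print(estimate_flow_rate(12, 18_000))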

We have a heterogeneous networking environment at Optiver, which led to some challenges: many of our switches supported Sflow, but some only supported Netflow, and others didn't support any packet reporting service at all! As a result, we had to handle both services. In the end, the team I work with built a custom packet processing agent in C to capture the packet reports from switches, batch the results, and forward them in a more standardized JSONL format. These JSONL packet reports were then imported by a Python Flask-based microservice that stored the records in a custom Postgres database. An important advantage of this architecture is that the packet processing agent allows us to collect metrics from switches in remote datacenters without overloading the limited network capacity between them.
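
As a minimal sketch of the import side (the endpoint, table, and column names here are hypothetical, and the real service does more batching and validation), the Flask microservice boils down to something like:

    # Minimal sketch of a Flask endpoint that accepts a batch of JSONL packet
    # reports and stores them in Postgres. Endpoint, table, and column names
    # are hypothetical.
    import json
    import psycopg2
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    conn = psycopg2.connect("dbname=netflows user=flows")

    @app.route("/packet-reports", methods=["POST"])
    def import_packet_reports():
        lines = request.get_data(as_text=True).splitlines()
        rows = [json.loads(line) for line in lines if line.strip()]
        with conn, conn.cursor() as cur:
            cur.executemany(
                """INSERT INTO sampled_packets
                   (ts, switch, src_ip, src_port, dst_ip, dst_port, packets, bytes)
                   VALUES (%(ts)s, %(switch)s, %(src_ip)s, %(src_port)s,
                           %(dst_ip)s, %(dst_port)s, %(packets)s, %(bytes)s)""",
                rows,
            )
        return jsonify(imported=len(rows)), 201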

The initial trials of this architecture resulted in a LOT of data to handle. 10 Gbps of traffic corresponds to up to roughly 15 million minimum-size packets per second. Even at the most frequent sampling rate supported by our Sflow-capable switches (1 in 65535 packets), that still amounts to hundreds of sampled packets per second per link, and thousands across a busy switch. Netflow without sampling produces a record per flow (IP-port source/target pair) per second, with a typical switch seeing tens of thousands of distinct flows. We currently collect data from ~30 switches and want to expand this to the hundreds of switches in our environment, so managing this data required carefully deciding what we wanted to see and fine-tuning the parameters.
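
For a back-of-the-envelope feel for those numbers (assuming worst-case minimum-size 64-byte Ethernet frames plus 20 bytes of preamble and inter-frame gap per packet):

    # Back-of-the-envelope packet rates for a fully loaded 10 Gbps link,
    # assuming minimum-size 64-byte frames plus 20 bytes of preamble and
    # inter-frame gap per packet on the wire.
    LINK_BPS = 10e9
    MIN_FRAME_BITS = (64 + 20) * 8            # 672 bits per packet on the wire
    SFLOW_SAMPLING_RATE = 65535               # 1 in 65535 packets sampled

    max_pps = LINK_BPS / MIN_FRAME_BITS       # ~14.9 million packets per second
    sampled_pps = max_pps / SFLOW_SAMPLING_RATE
    print(f"{max_pps / 1e6:.1f} Mpps line rate, ~{sampled_pps:.0f} samples/s per link")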

The biggest challenge with this flow data was that most of the traffic is not interesting – flows that occur rarely or carry low numbers of packets generally neither strain our environment nor need our attention. Unsampled Netflow, however, was in our case dominated by exactly these sorts of flows, producing massive amounts of uninteresting data and slowing down the overall system. Sflow, on the other hand, could be used to provide good average statistics over longer time periods, and by its nature focuses on the biggest flows. The trade-off is that it is difficult to zoom into specific events, since any individual packet is rarely sampled. In the end, we settled on the most frequent sampling rate for Sflow (1 in 65535 packets), which generally gave us enough packets to analyze major flows over relevant time windows, and the least frequent sampling rate for Netflow (1 in 1000 packets) to reduce the amount of data reported.

Image 2: Raw imported sampled packet flow records

Integrating Datasets

This sampled packet data provided a lot of insight into what our network was doing, but on its own it was hard to turn into actions our engineers could take to improve the infrastructure. What does it mean to see that a switch reported a handful of sampled packets from one IP and port to another? The data also means something different to a network engineer than to an application engineer. To make it easier to interpret, we integrated the raw packet metrics with metadata on IPs and ports. We found several types of metadata relevant:

  • Physical connectivity: which switch an IP is connected to and which datacenter the machine is in
  • Machine info: hostname, interface, and functional name
  • Applications: labeling IP-port pairs with applications using our production application monitoring service
  • Traders: a lot of traffic goes to the Windows workstations of our traders, which we could label with their usernames

It was actually quite a bit of work to gather this metadata, since it was scattered across many isolated teams and exposed in different ways – git repositories, REST APIs, Microsoft SQL databases, and Postgres databases. We also wanted to track this metadata over time, since our production environment changes frequently and we would be analyzing flow patterns over longer timescales. In the end, we built a set of Python daemons that regularly poll these data sources, saving records in a custom Postgres database of the time ranges during which a given set of metadata was associated with an IP and port.
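
As a simplified sketch of that idea (the real daemons, source systems, and table layout are more involved; the fetch function and table name here are made up), each poller closes the previous validity range when the metadata for an IP and port changes and opens a new one:

    # Simplified sketch of one metadata-polling daemon: fetch the current
    # IP/port-to-application mapping from a source system and record validity
    # time ranges in Postgres. The fetch function and table name are made up.
    import time
    import psycopg2

    def fetch_current_app_labels():
        """Placeholder for querying a source system (REST API, SQL database, ...)."""
        return {("10.1.2.3", 9100): "theo-price-distributor"}

    def poll_once(conn):
        with conn, conn.cursor() as cur:
            for (ip, port), app in fetch_current_app_labels().items():
                # Close the currently open range if the label has changed...
                cur.execute(
                    """UPDATE ip_port_application SET valid_to = now()
                       WHERE ip = %s AND port = %s AND valid_to IS NULL
                         AND application <> %s""",
                    (ip, port, app),
                )
                # ...and open a new range if none is open for this IP and port.
                cur.execute(
                    """INSERT INTO ip_port_application (ip, port, application, valid_from)
                       SELECT %s, %s, %s, now()
                       WHERE NOT EXISTS (
                           SELECT 1 FROM ip_port_application
                           WHERE ip = %s AND port = %s AND valid_to IS NULL)""",
                    (ip, port, app, ip, port),
                )

    if __name__ == "__main__":
        connection = psycopg2.connect("dbname=metadata user=metadata")
        while True:
            poll_once(connection)
            time.sleep(300)  # poll every five minutes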

This imported metadata turned out to be a useful dataset on its own, so we kept it as a separate microservice. In the process we ran into a common problem with microservice architectures – how to deal with distributed data? I didn't want to reimplement the data model yet again, so I took advantage of Postgres's ability to add tables from remote Postgres servers and imported a view from this metadata microservice. In the end, we were able to convert raw networking metrics into something more generally usable by people across the company, as the data was now labeled clearly with servers, applications, and even users.
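
This is the kind of thing Postgres's postgres_fdw foreign data wrapper provides. A one-off setup along these lines (sketched here via psycopg2; the host, database, credential, and view names are placeholders) makes a view from the metadata service queryable as if it were a local table:

    # Rough sketch of attaching a view from the metadata service's Postgres
    # database as a local foreign table using postgres_fdw. Host, database,
    # credential, and view names are placeholders.
    import psycopg2

    SETUP_SQL = """
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER metadata_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'metadata-db', dbname 'metadata');
    CREATE USER MAPPING FOR CURRENT_USER SERVER metadata_srv
        OPTIONS (user 'reader', password 'secret');
    IMPORT FOREIGN SCHEMA public LIMIT TO (ip_port_labels)
        FROM SERVER metadata_srv INTO public;
    """

    with psycopg2.connect("dbname=netflows user=flows") as conn:
        with conn.cursor() as cur:
            cur.execute(SETUP_SQL)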

Image 3: Sampled packet flow records with some metadata labels shown

Data Visualization

Time Series KPIs

At Optiver, we widely use the time series database Influx to store operational metrics for the whole company and display them in custom dashboards with Grafana. To let application engineers directly track the relevant metrics, I wrote a small script to query the Postgres database and export daily application traffic metrics to Influx (now possible thanks to the labeling with application metadata!). We could then easily track long-term trends in how much traffic different types of applications were sending. This also revealed one of the first big insights – one particular class of applications, involved in distributing theoretical pricing data for our trading systems, was consistently at the top of the list of applications sending the most traffic. Over the course of the project we also saw that this class of application could easily breach our capacity limits after new releases or configuration changes.
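
The export script itself is small. A simplified version (the database, table, measurement, and column names are illustrative, and the real query aggregates per application per day) looks roughly like this:

    # Simplified sketch of the daily export: aggregate yesterday's labeled flow
    # data per application in Postgres and write the totals to Influx. Database,
    # table, measurement, and tag names are illustrative.
    import psycopg2
    from influxdb import InfluxDBClient

    pg = psycopg2.connect("dbname=netflows user=flows")
    influx = InfluxDBClient(host="influx.example.internal", database="network_metrics")

    with pg, pg.cursor() as cur:
        cur.execute(
            """SELECT application, sum(est_bytes) * 8 AS bits_sent
               FROM labeled_flows
               WHERE ts >= date_trunc('day', now() - interval '1 day')
                 AND ts <  date_trunc('day', now())
               GROUP BY application"""
        )
        points = [
            {
                "measurement": "application_traffic",
                "tags": {"application": app},
                "fields": {"bits_sent": float(bits)},
            }
            for app, bits in cur.fetchall()
        ]

    influx.write_points(points)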

Image 4: Grafana dashboard showing the outgoing traffic recorded for two applications over a month, with a noticeable uptick after a new configuration was deployed.

While it made sense to us that theoretical pricing data would account for some of the bigger flows in our systems, we also saw some surprising things: for example, theoretical pricing components that distributed data to GUI clients often used more bandwidth than the ones that sent critical data to remote datacenters. What we really needed to see was where these flows were going. Of course, that data was in the database we built, but we needed a tool to make it easy to investigate these issues. Grafana is very useful for letting people make custom time series visualizations, but it is difficult to natively visualize relationships in the data there. We also frequently use Jupyter Python notebooks for interactive data analysis within Optiver, but this network flow dataset was too large to explore interactively in a notebook:

Image 5: A heatmap of network flows between IPs and ports, showing sparse network flows but not much else.

Building an Interactive Data Visualization Webapp

I realized that to quickly solve ongoing problems with this network flow dataset, what we needed was a way to interactively click through the data. I needed to dynamically control the level of aggregation at which the data was shown, as well as the hierarchical structure of the flows (datacenters → switches → functional names → hosts → applications). In thinking about the problem, I was inspired by many of the great data visualization examples from the D3 framework. In addition, from my work on a previous project within the company, I was impressed by the data visualization results you can achieve with a custom webapp in the Javascript framework React. I decided to build a D3-style network visualization of this dataset in React, with dynamic aggregation and highlighting of the network flows.

Within Optiver, we have been using the Redux framework to structure our React applications. In Redux, the application state is modeled as a central store that is updated in a deterministic way by reducer functions that process action objects emitted by React components. In addition, since this webapp needed to fetch data exposed by web APIs within the company, we built a custom asynchronous reactive data-fetching process with RxJS and Redux-Observable that handles data-fetching actions and emits data-received or error actions upon completion. The actual visualization is then handled by React components, which are either classes with some local state and a render method, or pure functions. These components use their properties (and local state) to render other components and eventually HTML. React then takes care of updating the browser with the desired view.

In integrating D3 with the React/Redux architecture, there are some competing ideas that make it hard to directly copy example D3 code. The original architecture of D3 used the concept of binding HTML elements to data and expressing the properties of these HTML elements as functions of the data. D3 would then take care of adding, removing, or updating the browser elements as necessary. In my opinion, the React approach of mapping from data (React properties) to HTML with pure functions is much more elegant than the D3 method in which you must define how to handle creating and removing elements. However, D3 has a lot of powerful visualization functions: convenient mapper functions, for example to normalize numerical scales, turn numbers into colors, and calculate SVG paths. Most importantly for the visualization I needed, it also has a powerful library to create dynamic network visualizations with a force simulation.

I therefore integrated the two libraries by writing a stateful React component that runs a D3 force simulation, saves the positions from the simulation state into the React component, and then renders a visualization based on the data and the current simulation state. The actual visualization could be written in a standard React way, with components for nodes and links in the graph that render based on their properties (data). I also used D3's useful helper functions to control the styles of the rendered elements, and wrote functions to convert the labeled flow data into nodes at each hierarchical level, with the aggregated flow between each pair of nodes. The React component also maintained state on which node was highlighted and/or being dragged. For the force simulation, after much experimentation, I settled on a combination of forces to produce a good layout: standard collision and charge forces, with link strength a function of the aggregated flow data, to generally arrange the nodes and links; a centering force to keep the whole network in the middle of the page; and a radial force to arrange each layer of the hierarchy on circles of increasing radius. Dynamic aggregation of the data was handled at the Redux level: I added actions and reducers to implement the concept of "bundling" flows, which resulted in dynamic aggregations of the raw data.

GIF 1: Anonymized example of the interactive network flow visualization, showing dynamic highlighting of aggregated flows and the ability to interactively dig into the data.

What We Learned

  • Before we began this project, it was hard to get application engineers to act on networking performance issues. The problem was that it wasn't easy to assign responsibility – it already took quite a bit of manual work just to convert an IP into a machine. Once we labeled the data with application names, people immediately took interest. In particular, we found developers were very interested to see how their applications were behaving in the production environment.
  • Our past networking equipment purchases were driven by networking performance requirements. As we built this dataset, we realized we had made an unintentional trade-off: some switches at interesting locations in our network lacked analytical features that we now found very useful. As we look to buy new networking equipment, monitoring capabilities will be high on our list of priorities.
  • The labeled network flow data revealed many deployment issues caused by the constantly evolving production environment – there were proxies in incorrect locations, applications sending TCP traffic to growing numbers of clients in remote parts of the network without proxies in between, and closely communicating applications located in different server racks that had to talk across the core network. With this data now regularly available, we are looking at automatically recommending optimal deployment configurations based on network traffic patterns.

Stephen Helms,
Engineer at Optiver


