Change Rate Anomaly in Service Mesh

Overview

A classic Service Mesh architecture uses sidecar proxies and load balancers such as Envoy, HAProxy, or a proprietary solution.

The Load Balancer performs routing for different services or microservices and can track some obvious metrics: latencies, errors, availability.

One popular approach to anomaly detection is to compare a service’s metrics with their historical values.

But there is another interesting pattern worth tracking that can give you insight into how services behave: the Load Balancer can track the number of available instances per service and detect services whose behavior is unusual compared with the other services.

Note that this pattern only makes sense if you have many services. Otherwise, what counts as “normal” and what counts as an “anomaly” if you only consider 5 services? Anomaly detection can produce false positives and false negatives on a small data set. In my example I analyzed a system with several thousand microservices.

What?

By design, the Load Balancer must be aware of the number of instances available for routing for each service. The number of available (healthy) instances is not constant; it changes over time, and the Control Plane usually delivers these updates to the Load Balancer. There are several reasons why the number of instances changes:

  • The service owner can manually add or remove instances
  • Deployment/Continuous Delivery affects the number of available instances
  • Auto-scaling can change it automatically
  • Service errors or crashes can make instances unavailable

This behavior can be tracked using the following anomaly scores per service:

Change Rate  – Number of instance changes per second

Normalized Change Rate – Number of instance changes per second, normalized by the number of active instances at the time of change.

How?

The “Anomaly” module should maintain an Anomaly Table data structure. This essentially maps each service name to its anomaly scores and anomaly metadata. The Anomaly Table should receive real-time updates containing the following information (a sketch of this structure follows the list):

  • Number of added instances
  • Number of removed instances
  • Time of update
  • Number of active instances after the update
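
Here is a minimal Python sketch of such a table. The class and method names (InstanceUpdate, AnomalyScores, AnomalyTable, record_update) are illustrative, not taken from any particular load balancer implementation:

from dataclasses import dataclass

@dataclass
class InstanceUpdate:
    added: int          # number of added instances
    removed: int        # number of removed instances
    timestamp: float    # time of update, in seconds
    active_after: int   # number of active instances after the update

@dataclass
class AnomalyScores:
    change_rate: float = 0.0
    normalized_change_rate: float = 0.0
    active_instances: int = 0
    last_update_ts: float = 0.0

class AnomalyTable:
    """Maps each service name to its anomaly scores and metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, AnomalyScores] = {}

    def record_update(self, service: str, update: InstanceUpdate) -> None:
        entry = self._entries.setdefault(service, AnomalyScores())
        # The score adjustment formulas below are applied here.
        entry.active_instances = update.active_after
        entry.last_update_ts = update.timestamp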

All anomaly scores need this information to properly adjust. Below are adjustment formulas for each score.

Change Rate

Where ɑ ∈ [0, 1] controls how strongly an incoming update affects the current score.
See Exponential Moving Average
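
The score itself is an exponential moving average. A minimal sketch of one plausible update, assuming the instantaneous rate is the number of added plus removed instances divided by the time since the previous update (alpha plays the role of ɑ above):

def update_change_rate(prev_score: float, alpha: float,
                       added: int, removed: int,
                       now: float, last_update: float) -> float:
    dt = max(now - last_update, 1e-9)        # guard against a zero time delta
    instant_rate = (added + removed) / dt    # instance changes per second
    return alpha * instant_rate + (1 - alpha) * prev_score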

Normalized Change Rate

Note that we normalize with max(curActive, oldActive) because we don’t want to divide by zero, and one of those values is guaranteed to be non-zero.
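
A matching sketch for the normalized score, under the same assumptions, with the instantaneous rate divided by max(curActive, oldActive):

def update_normalized_change_rate(prev_score: float, alpha: float,
                                  added: int, removed: int,
                                  cur_active: int, old_active: int,
                                  now: float, last_update: float) -> float:
    dt = max(now - last_update, 1e-9)
    # Normalize by the active instance count; max() keeps the divisor non-zero.
    instant_rate = (added + removed) / dt / max(cur_active, old_active)
    return alpha * instant_rate + (1 - alpha) * prev_score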

Classifying Anomalies

Change Rate

I collected ~5000 scores per pattern, one for every service. The next step was to look at a score and be able to determine whether it is too high. A relatively high Change Rate score would mean that a particular service is updating too frequently. A quick plot of the distribution of the non-zero Change Rate scores yielded interesting results.

It looks like an exponential distribution. There are many stable services (left) that don’t change their number of available instances very often. As we look at higher change rates, the number of services that fall within each bin drops off exponentially, with very few, possibly anomalous, names that change often (right).

(Could it be a Poisson distribution instead? A Poisson distribution describes discrete counts, while the change rate is continuous; the continuity is really a side effect of how the x axis is normalized. Multiply the buckets by the time window the scores were measured over and you get back a more typical count of discrete updates during that window.)

In statistics, data points are considered outliers if they fall above an upper threshold of Q3 + 1.5 * IQR. This definition works decently for general data sets, but since we know our dataset follows an exponential distribution, we can apply that definition and arrive at a closed formula:

threshold = Q3 + 1.5 * IQR = 𝜇 (ln 4 + 1.5 ln 3) ≈ 3.03 𝜇

Where 𝜇 is the mean (see the appendix for the derivation).

To determine whether a specific service is updating too fast and too often (an anomaly), the Load Balancer should make the above calculation under the hood and check whether the change rate score for that service falls above the threshold.
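
As a minimal sketch (the helper names are illustrative), the check could look like this:

import math

def anomaly_threshold(scores: list[float]) -> float:
    """Closed-form outlier threshold, assuming exponentially distributed scores."""
    mean = sum(scores) / len(scores)
    return mean * (math.log(4) + 1.5 * math.log(3))   # Q3 + 1.5 * IQR, about 3.03 * mean

def is_anomalous(score: float, all_scores: list[float]) -> bool:
    return score > anomaly_threshold(all_scores)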

Pictorially, all services to the right of the orange line are considered anomalies.

Normalized Change Rate

The normalized change rate score follows a similar distribution, and is computed in the same manner.

Appendix

Exponential Distribution

Probability density function of an exponential distribution:

f(x) = λe^(−λx) for x ≥ 0 (and 0 for x < 0)

Where λ = 1/μ and μ is the mean.
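
For reference, a small sketch that evaluates this density for a given sample mean (the function name is illustrative):

import math

def exp_pdf(x: float, mean: float) -> float:
    """Exponential PDF with rate lam = 1/mean; zero for negative x."""
    lam = 1.0 / mean
    return lam * math.exp(-lam * x) if x >= 0 else 0.0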

Outlier Threshold for Exponential Distributions

Step 1: Compute Q1 and Q3

For an exponential distribution with mean μ, the quantile function is F⁻¹(p) = μ ln(1/(1 − p)), which gives:

Q1 = μ ln(4/3) ≈ 0.288μ
Q3 = μ ln(4) ≈ 1.386μ

Step 2: Compute and apply the IQR

IQR = Q3 − Q1 = μ ln(3) ≈ 1.099μ
threshold = Q3 + 1.5 · IQR = μ (ln 4 + 1.5 ln 3) ≈ 3.03μ

Plotting with Python

All plots in this document were rendered with the Python library Matplotlib. It’s a powerful graphing tool for visualizing all sorts of datasets in Python. Pip install it if you haven’t already.

Collect the dataset you would like to plot into a convenient format using whatever method you wish. Personally, I wrote an ad hoc function to make the Load Balancer backend spit out a csv file containing the scores of all the names, and SCP’ed it down to my local machine. Next, I parsed the csv into a list within Python and fed the list into Matplotlib. The full Matplotlib documentation can be found at https://matplotlib.org, but these are the handy functions I used (a complete example follows the list):

hist(x, bins=): Takes in an array of numbers and plots a histogram with the specified number of bins.

plot(x, y): Takes in an array of x and y values (of identical length) and plots the 2d line.

xlabel(label): Labels the x-axis

ylabel(label): Labels the y-axis

axvline(x=, color=): Plots a vertical line at the specified x value.

show(): Render the graph. To be called last.
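
Putting those together, a minimal sketch of the plotting script (the file name scores.csv and its one-score-per-row layout are assumptions about the export format):

import csv
import math
import matplotlib.pyplot as plt

# Load the per-service change rate scores from the exported csv (one score per row).
with open("scores.csv") as f:
    scores = [float(row[0]) for row in csv.reader(f) if row]

# Keep the non-zero scores and compute the closed-form outlier threshold.
nonzero = [s for s in scores if s > 0]
mean = sum(nonzero) / len(nonzero)
threshold = mean * (math.log(4) + 1.5 * math.log(3))

plt.hist(nonzero, bins=100)               # histogram of change rate scores
plt.axvline(x=threshold, color="orange")  # anomaly threshold
plt.xlabel("Change Rate")
plt.ylabel("Number of services")
plt.show()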
