## Overview

A classic Service Mesh architecture uses sidecar proxies and load balancers such as Envoy, HAProxy, or another, possibly proprietary, solution.

The Load Balancer routes traffic for different services or microservices and can track some obvious metrics: latencies, errors, availability.

One popular approach to anomaly detection is to compare a service's metrics with its historical values.

But there is another interesting pattern worth tracking that can give you insight into how services behave. The Load Balancer can *track the number of available instances for each service to detect whether some services show unusual patterns compared with other services*.

Note that this pattern makes sense only if you have many services. Otherwise, what is “normal” and what is an “anomaly” if you only have 5 services? Anomaly detection can give false positives and false negatives on a small data set. In my example I analyzed a system with several thousand microservices.

## What?

By design, the Load Balancer must know how many instances of each service are available for routing. The number of available (healthy) instances is not constant; it changes over time. Usually the control plane delivers these updates to the Load Balancer. There are different reasons for changes in the number of instances:

- A service owner can manually add or remove instances
- Deployment/Continuous Delivery affects the number of available instances
- Auto-scaling can change it automatically
- Service errors or crashes can make instances unavailable

This behavior can be tracked using the following **anomaly scores** per service:

**Change Rate** – Number of instance changes per second.

**Normalized Change Rate** – Number of instance changes per second, normalized by the number of active instances at the time of change.

## How?

The “Anomaly” module should support an anomaly table data structure. It essentially maps each service name to its anomaly scores and anomaly metadata. The anomaly table should receive real-time updates that contain information on:

- Number of added instances
- Number of removed instances
- Time of update
- Number of active instances after the update

All anomaly scores need this information to properly adjust. Below are adjustment formulas for each score.
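The anomaly table described above can be sketched in Python as follows. This is a minimal illustration, not the Load Balancer's actual implementation: the class names, field names, and the choice to average over the whole observation window (starting at a hypothetical `start_time`) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ServiceScores:
    """Running totals for one service (field names are illustrative)."""
    start_time: float              # start of the observation window, seconds
    last_update: float             # time of the most recent update, seconds
    active: int = 0                # active instances after the latest update
    changes: int = 0               # total instances added + removed
    weighted_changes: float = 0.0  # sum of (added + removed) / active

    def _elapsed(self) -> float:
        return self.last_update - self.start_time

    @property
    def change_rate(self) -> float:
        elapsed = self._elapsed()
        return self.changes / elapsed if elapsed > 0 else 0.0

    @property
    def normalized_change_rate(self) -> float:
        elapsed = self._elapsed()
        return self.weighted_changes / elapsed if elapsed > 0 else 0.0


class AnomalyTable:
    """Maps each service name to its anomaly scores and metadata."""

    def __init__(self, start_time: float) -> None:
        self._start_time = start_time
        self._rows: Dict[str, ServiceScores] = {}

    def update(self, service: str, added: int, removed: int,
               timestamp: float, active_after: int) -> None:
        row = self._rows.setdefault(
            service, ServiceScores(self._start_time, self._start_time))
        changed = added + removed
        row.changes += changed
        # Assumption: normalize by the instance count after the update,
        # clamped to 1 so a service scaled to zero does not divide by zero.
        row.weighted_changes += changed / max(active_after, 1)
        row.last_update = timestamp
        row.active = active_after

    def scores(self, service: str) -> ServiceScores:
        return self._rows[service]
```

Each update carries exactly the four fields listed above, so the table never needs to re-read history to adjust a score.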

#### Change Rate
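One way to write this score, assuming it is a running average over the observation window (the symbols here are illustrative): let update $i$ arrive at time $t_i$ with $a_i$ added and $r_i$ removed instances, and let $t_0$ be the start of observation. After $k$ updates:

$$
\mathrm{CR}_k = \frac{\sum_{i=1}^{k} (a_i + r_i)}{t_k - t_0}
$$

Each new update adds $a_k + r_k$ to the numerator and moves the denominator to $t_k - t_0$.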

#### Normalized Change Rate
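One way to write this score, assuming it is a running average over the observation window (the symbols here are illustrative): let update $i$ arrive at time $t_i$ with $a_i$ added and $r_i$ removed instances, leaving $n_i > 0$ active instances, and let $t_0$ be the start of observation. Each change is divided by the instance count before it is summed:

$$
\mathrm{NCR}_k = \frac{\sum_{i=1}^{k} \dfrac{a_i + r_i}{n_i}}{t_k - t_0}
$$

With this weighting, adding 2 instances to a service that ends up with 4 counts as much as adding 50 to one that ends up with 100.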

## Classifying Anomalies

#### Change Rate

I collected ~5000 scores per pattern, one for every service. The next step was to look at a score and determine whether it is too high. A relatively high Change Rate score means that the particular service is updating too frequently. A quick plot of the distribution of the non-zero Change Rate scores yielded an interesting result: the scores roughly follow an exponential distribution.

In statistics, data points are commonly considered outliers if they fall above an upper threshold of **Q3 + 1.5 * IQR**, where Q3 is the third quartile and IQR is the interquartile range (Q3 - Q1). This definition works decently for general data sets, but since we know our dataset follows an exponential distribution, we can apply that definition and arrive at a closed formula:
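For an exponential distribution with rate parameter $\lambda$ (mean $1/\lambda$), the quartiles have closed forms, so the threshold reduces to:

$$
T = Q_3 + 1.5 \cdot \mathrm{IQR} = \frac{\ln 4 + 1.5 \ln 3}{\lambda} \approx \frac{3.03}{\lambda}
$$

In practice $\lambda$ has to be estimated from the data. A common choice, assumed here, is the reciprocal of the mean of the observed non-zero scores.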

To determine whether a specific service is updating too fast and too much (an anomaly), the Load Balancer should make the above calculation under the hood and check whether the Change Rate score for that service falls above the threshold.
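A minimal sketch of that check, assuming the rate parameter is estimated as the reciprocal of the mean of the non-zero scores (the function names and the estimator are illustrative choices, not the Load Balancer's real code):

```python
import math
from typing import Dict, Iterable, List

# Q3 + 1.5 * IQR for an exponential distribution equals
# (ln 4 + 1.5 * ln 3) / rate, which is about 3.03 times the mean.
OUTLIER_FACTOR = math.log(4) + 1.5 * math.log(3)


def exponential_outlier_threshold(scores: Iterable[float]) -> float:
    """Upper outlier threshold, fitted to the non-zero scores."""
    nonzero = [s for s in scores if s > 0]
    if not nonzero:
        raise ValueError("need at least one non-zero score")
    mean = sum(nonzero) / len(nonzero)  # estimate of 1 / rate
    return OUTLIER_FACTOR * mean


def find_anomalies(scores_by_service: Dict[str, float]) -> List[str]:
    """Names of services whose score falls above the threshold."""
    threshold = exponential_outlier_threshold(scores_by_service.values())
    return sorted(name for name, score in scores_by_service.items()
                  if score > threshold)
```

Zero scores are excluded from the fit because the distribution was observed on the non-zero scores only.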

#### Normalized Change Rate

The Normalized Change Rate score follows a similar distribution, and its threshold is computed in the same manner.

## Appendix

#### Exponential Distribution

Probability density function of exponential distributions:
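For a rate parameter $\lambda > 0$ (so the mean is $1/\lambda$):

$$
f(x; \lambda) =
\begin{cases}
\lambda e^{-\lambda x}, & x \ge 0 \\
0, & x < 0
\end{cases}
$$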

#### Outlier Threshold for Exponential Distributions

Step 1: Compute q1 and q3
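The quartiles follow from inverting the cumulative distribution function $F(x) = 1 - e^{-\lambda x}$. Setting $F(q) = p$ gives $q = -\ln(1 - p)/\lambda$, so:

$$
Q_1 = \frac{-\ln(1 - 0.25)}{\lambda} = \frac{\ln(4/3)}{\lambda},
\qquad
Q_3 = \frac{-\ln(1 - 0.75)}{\lambda} = \frac{\ln 4}{\lambda}
$$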

Step 2: Compute and apply IQR
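Given the quartiles of an exponential distribution with rate $\lambda$, namely $Q_1 = \ln(4/3)/\lambda$ and $Q_3 = \ln 4/\lambda$, the interquartile range and the upper outlier threshold are:

$$
\mathrm{IQR} = Q_3 - Q_1 = \frac{\ln 4 - \ln(4/3)}{\lambda} = \frac{\ln 3}{\lambda}
$$

$$
T = Q_3 + 1.5 \cdot \mathrm{IQR} = \frac{\ln 4 + 1.5 \ln 3}{\lambda} \approx \frac{3.03}{\lambda}
$$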

#### Plotting with Python

All plots in this document were rendered with the Python library **Matplotlib**. It’s a powerful graphing tool for visualizing all sorts of datasets in Python. Install it with pip if you haven’t already.

Collect the dataset you would like to plot into a convenient format using whatever method you wish. Personally, I wrote an ad hoc function to make the Load Balancer backend dump a CSV file containing the scores for all service names, and copied it to my local machine with SCP. Next, I parsed the CSV into a list in Python and fed the list into Matplotlib. The full Matplotlib documentation can be found at https://matplotlib.org, but these are the handy functions I used:

**hist(x, bins=):** Takes in an array of numbers and plots a histogram with the specified number of bins.

**plot(x, y):** Takes in arrays of x and y values (of identical length) and plots the 2D line.

**xlabel(label):** Labels the x-axis.

**ylabel(label):** Labels the y-axis.

**axvline(x=, color=):** Plots a vertical line at the specified x value.

**show():** Renders the graph. To be called last.
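Put together, the workflow can be sketched as below. The CSV layout (one `service_name,score` row per service, no header) and both function names are assumptions made for illustration; only the Matplotlib calls listed above are used.

```python
import csv
from typing import List


def load_scores(path: str) -> List[float]:
    """Read the non-zero scores from a CSV of `service_name,score` rows."""
    scores = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            score = float(row[1])
            if score > 0:
                scores.append(score)
    return scores


def plot_scores(scores: List[float], threshold: float, bins: int = 50) -> None:
    """Histogram of the scores with the outlier threshold marked in red."""
    # Imported here so load_scores works even without Matplotlib installed.
    import matplotlib.pyplot as plt

    plt.hist(scores, bins=bins)
    plt.axvline(x=threshold, color="red")
    plt.xlabel("Change Rate (changes per second)")
    plt.ylabel("Number of services")
    plt.show()
```

A typical call would be `plot_scores(load_scores("scores.csv"), threshold)`, where `scores.csv` is a placeholder file name and `threshold` is the outlier threshold computed for that score.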