Measurements are taken regularly on the 363 nodes of 8 Grid'5000 clusters to track how they evolve over time. Three main metrics are collected: the average CPU performance (in Gflop/s), the average CPU frequency (in GHz) and the average CPU temperature (in °C).

In [1]:

```
cluster = 'yeti'
factor = 'mean_gflops'
confidence = 0.9999
```

In [2]:

```
# Parameters
cluster = "chetemi"
factor = "nk_residual"
```

In [3]:

```
%load_ext autoreload
%autoreload 2
import requests
import pandas
import io
import plotnine
plotnine.options.figure_size = 10, 7.5
plotnine.options.dpi = 100
from cashew import non_regression_tests as nrt
import cashew
print(cashew.__git_version__)
```

697ee1f882418b003dd238ab6f7a1997176ac874

In [4]:

```
%%time
csv_url = nrt.DEFAULT_CSV_URL_PREFIX + nrt.DATA_FILES[factor]
df = nrt.format(nrt.get(csv_url))
```

2021-06-24 10:08:11,240 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 168840 rows and 39 columns

CPU times: user 2.2 s, sys: 113 ms, total: 2.31 s Wall time: 9.88 s

In [5]:

```
changelog = nrt.format_changelog(nrt.get(nrt.DEFAULT_CHANGELOG_URL))
outlierlog = nrt.format_changelog(nrt.get(nrt.DEFAULT_OUTLIERLOG_URL))
```

In [6]:

```
df = nrt.filter(df, cluster=cluster)
```

In [7]:

```
df = nrt.filter_na(df, factor)
```

In [8]:

```
%%time
nrt.plot_latest_distribution(df, factor)
```

2021-06-24 10:08:13,349 - non_regression_tests - INFO - Filtered the dataframe, there remains 30 rows

CPU times: user 122 ms, sys: 20.5 ms, total: 142 ms Wall time: 607 ms

Out[8]:

<ggplot: (8778204558333)>

In [9]:

```
%%time
marked = nrt.mark_weird(df, changelog, outlierlog, nmin=10, keep=5, window=5,
                        naive=False, confidence=confidence, cols=[factor])
nb_weird = len(marked[marked.weird.isin({'positive', 'negative'})])
nb_total = len(marked[marked.weird != 'NA'])
print(f'{nb_weird/nb_total*100:.2f}% of measures are abnormal ({nb_weird}/{nb_total})')
```

/usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:872: RuntimeWarning: invalid value encountered in greater /usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater /usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less /usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal /usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign

0.44% of measures are abnormal (24/5488) CPU times: user 11.9 s, sys: 33.2 ms, total: 11.9 s Wall time: 47.9 s

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign

In [10]:

```
%%time
import plotnine
nb_unique = len(marked[['node', 'cpu']].drop_duplicates())
height = max(6, nb_unique/8)
old_sizes = tuple(plotnine.options.figure_size)
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_raw_data(marked, changelog))
plotnine.options.figure_size = old_sizes
```

CPU times: user 1.78 s, sys: 12.1 ms, total: 1.79 s Wall time: 7.69 s

The goal of the following cells is to detect possible anomalies in the considered metric (performance, frequency or temperature).

Suppose that we have run 20 experiments with a given CPU on a given node and measured its average temperature each time. We therefore have a list of 20 values. We can now compute:

- $\mu$ the sample mean of the 20 measures
- $\sigma$ the sample standard deviation of the 20 measures

For instance, we may have $\mu \approx 64.7°C$ and $\sigma \approx 3.2°C$.

Now, suppose that we perform a new experiment. This time, the CPU has an average temperature of $70°C$. This new measurement is higher than the mean of the 20 previous ones, but is it *significantly* higher? What was the probability of observing a temperature at least this high if nothing had changed on the CPU?

In the evolution plots, we show the observed values together with a prediction region $\mu \pm \alpha\times\sigma$, where the factor $\alpha$ is determined by the chosen confidence. With a confidence of 99.99%, if nothing has changed on the CPU, then 99.99% of the measures are expected to fall within the prediction region. In other words, if a measure falls *outside* of this region, then something unusual probably happened on this CPU at that time. The factor $\alpha$ is computed using the quantile function of either the normal distribution or the F distribution.

Back to our example: with the normal distribution and a 99.99% confidence, $\alpha \approx 3.89$ and the associated prediction region is $[52.3°C, 77.1°C]$. Our latest observation of $70°C$ falls within this region, so we consider that there is nothing unusual here.
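The numbers above can be reproduced with a short sketch. It is not the `cashew` implementation: it assumes the normal distribution only, uses `NormalDist` from Python's standard library, and the helper name `prediction_region` is hypothetical. It also assumes the region is two-sided, which is consistent with $\alpha \approx 3.89$ for a 99.99% confidence.

```python
from statistics import NormalDist

def prediction_region(mu, sigma, confidence):
    """Return (alpha, low, high) for the region mu ± alpha*sigma.

    Hypothetical helper, normal distribution only (assumption).
    """
    # Two-sided region: (1 - confidence) / 2 of the probability is left in each tail.
    alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return alpha, mu - alpha * sigma, mu + alpha * sigma

# Values taken from the example in the text.
alpha, low, high = prediction_region(mu=64.7, sigma=3.2, confidence=0.9999)
print(f"alpha = {alpha:.2f}, region = [{low:.1f}°C, {high:.1f}°C]")

# The new observation is flagged only if it falls outside the region.
new_measure = 70
is_unusual = not (low <= new_measure <= high)
print(f"unusual: {is_unusual}")
```

Raising the confidence widens the region, so fewer measures are flagged; lowering it has the opposite effect.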

In the overview plots, the question is asked the other way around. We estimate the probability of observing a value as high (or as low) given the prior knowledge we had ($\mu$ and $\sigma$). First, we compute this probability (also called *likelihood*) using the cumulative distribution function of either the normal distribution or the F distribution. This probability can be very low, so for easier visualization we take its logarithm. This new value, called *log-likelihood*, is always negative. We then take its absolute value and give it a sign (positive if the new observation is higher than the mean, negative otherwise). We also bound it to reasonable values so as not to distort the color scale too much.

Back to our example: with the normal distribution, the probability of observing a value at least as high as $70°C$ was $L \approx 0.049$. The log-likelihood (natural logarithm) is thus $LL \approx -3.02$. Finally, the new observation was higher than the mean, so we give it a positive sign: the final value is $3.02$.
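The same computation can be sketched in a few lines. Again, this is only an illustration under stated assumptions, not the `cashew` code: it covers the normal distribution only, the function name `signed_log_likelihood` is hypothetical, and the default `bound` of 10 is an arbitrary placeholder since the actual bound is not given here.

```python
import math
from statistics import NormalDist

def signed_log_likelihood(value, mu, sigma, bound=10.0):
    """Signed, bounded log-likelihood of a new observation.

    Hypothetical helper; `bound` is an assumed placeholder value.
    """
    # Probability of a value at least as extreme, on the same side of the mean.
    # By symmetry of the normal distribution, both tails are given by the CDF
    # evaluated at the point mirrored below the mean.
    tail = NormalDist(mu, sigma).cdf(mu - abs(value - mu))
    # log(0) is undefined: an extremely unlikely value saturates at the bound.
    magnitude = bound if tail == 0 else min(-math.log(tail), bound)
    # Positive if the observation is above the mean, negative otherwise.
    sign = 1 if value >= mu else -1
    return sign * magnitude

# Values taken from the example in the text.
result = signed_log_likelihood(70, mu=64.7, sigma=3.2)
print(f"{result:.2f}")
```

An observation equally far *below* the mean (here $59.4°C$) gets the same magnitude with a negative sign, which is what lets a diverging color scale distinguish unusually high values from unusually low ones.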

In [11]:

```
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
```