Measurements are taken regularly on the 363 nodes of 8 Grid'5000 clusters to track how they evolve over time. Three main metrics are collected: the average CPU performance (in Gflop/s), the average CPU frequency (in GHz) and the average CPU temperature (in °C).
cluster = 'yeti'
factor = 'mean_gflops'
confidence = 0.9999
# Parameters
cluster = "troll"
factor = "m_residual"
%load_ext autoreload
%autoreload 2
import requests
import pandas
import io
import plotnine
plotnine.options.figure_size = 10, 7.5
plotnine.options.dpi = 100
from cashew import non_regression_tests as nrt
import cashew
print(cashew.__git_version__)
697ee1f882418b003dd238ab6f7a1997176ac874
%%time
csv_url = nrt.DEFAULT_CSV_URL_PREFIX + nrt.DATA_FILES[factor]
df = nrt.format(nrt.get(csv_url))
2021-06-24 10:08:14,196 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 168840 rows and 39 columns
CPU times: user 2.21 s, sys: 150 ms, total: 2.36 s Wall time: 8.88 s
changelog = nrt.format_changelog(nrt.get(nrt.DEFAULT_CHANGELOG_URL))
outlierlog = nrt.format_changelog(nrt.get(nrt.DEFAULT_OUTLIERLOG_URL))
2021-06-24 10:08:14,478 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 24 rows and 5 columns 2021-06-24 10:08:14,486 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 52 rows and 5 columns
df = nrt.filter(df, cluster=cluster)
2021-06-24 10:08:14,669 - non_regression_tests - INFO - Filtered the dataframe, there remains 1726 rows
df = nrt.filter_na(df, factor)
2021-06-24 10:08:14,845 - non_regression_tests - INFO - Filtered the dataframe, there remains 1726 rows
%%time
nrt.plot_latest_distribution(df, factor)
2021-06-24 10:08:15,200 - non_regression_tests - INFO - Filtered the dataframe, there remains 8 rows
CPU times: user 55.9 ms, sys: 15.6 ms, total: 71.6 ms Wall time: 238 ms
<ggplot: (-9223363260205822478)>
%%time
marked = nrt.mark_weird(df, changelog, outlierlog, nmin=10, keep=5, window=5,
                        naive=False, confidence=confidence, cols=[factor])
nb_weird = len(marked[marked.weird.isin({'positive', 'negative'})])
nb_total = len(marked[marked.weird != 'NA'])
print(f'{nb_weird/nb_total*100:.2f}% of measures are abnormal ({nb_weird}/{nb_total})')
/usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:872: RuntimeWarning: invalid value encountered in greater /usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater /usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less /usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal /usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign
16.15% of measures are abnormal (240/1486) CPU times: user 3.61 s, sys: 5.29 ms, total: 3.62 s Wall time: 14.4 s
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign
%%time
import plotnine
nb_unique = len(marked[['node', 'cpu']].drop_duplicates())
height = max(6, nb_unique/8)
old_sizes = tuple(plotnine.options.figure_size)
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_raw_data(marked, changelog))
plotnine.options.figure_size = old_sizes
CPU times: user 1.05 s, sys: 12.9 ms, total: 1.06 s Wall time: 4.15 s
The goal of the following cells is to detect possible anomalies in the considered metric (performance, frequency or temperature).
Suppose that we have run 20 different experiments with a given CPU on a given node and measured its average temperature each time. We therefore have a list of 20 values, from which we can compute the mean $\mu$ and the standard deviation $\sigma$.
For instance, we may have $\mu \approx 64.7°C$ and $\sigma \approx 3.2°C$.
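As a minimal sketch of this first step, the two statistics can be computed with the Python standard library. The 20 temperatures below are made-up illustrative values, not real measurements, and the use of the sample standard deviation (`statistics.stdev`) is an assumption: the notebook does not say which estimator `cashew` uses.

```python
import statistics

# Hypothetical temperatures (in °C) from 20 past experiments on one CPU.
# These numbers are invented for illustration only.
temperatures = [
    61.2, 66.8, 63.5, 68.1, 64.0, 59.9, 67.3, 65.4, 62.7, 70.2,
    64.9, 66.1, 60.8, 63.3, 69.0, 65.7, 62.1, 67.8, 64.4, 66.5,
]

mu = statistics.mean(temperatures)      # mean of the past observations
sigma = statistics.stdev(temperatures)  # sample standard deviation (assumed estimator)

print(f"mu = {mu:.1f} °C, sigma = {sigma:.1f} °C")
```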
Now, suppose that we perform a new experiment. This time, the CPU has an average temperature of $70°C$. This new measurement is higher than the mean of the 20 previous ones, but is it significantly higher? What was the probability of observing a temperature at least as high if nothing had changed on the CPU?
In the evolution plots, we show the observed values with a prediction region $\mu \pm \alpha\times\sigma$, where the factor $\alpha$ is defined for a given confidence. With a confidence of 99.99%, if nothing has changed on the CPU, then 99.99% of the measurements will fall in the prediction region. In other words, if a measurement falls outside of this region, then something unusual probably happened on this CPU at that time. The factor $\alpha$ is computed using the quantile function of either the normal distribution or the F distribution.
Back to our example: if we use the normal distribution with a 99.99% confidence, then $\alpha \approx 3.89$ and the associated prediction region is $[52.3°C, 77.1°C]$. Our latest observation of $70°C$ falls in this region, so we consider that there is nothing unusual here.
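The prediction region of this example can be reproduced with a short sketch. It uses `statistics.NormalDist` from the Python standard library as a stand-in for the quantile function and assumes a two-sided region; the helper name `prediction_region` is hypothetical and is not part of `cashew`, whose actual implementation may differ.

```python
from statistics import NormalDist

def prediction_region(mu, sigma, confidence):
    """Return (alpha, (low, high)) for a two-sided normal prediction region.

    Hypothetical helper, assuming the confidence is split evenly
    between the two tails of the standard normal distribution.
    """
    # Quantile of the standard normal at 1 - (1 - confidence) / 2
    alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return alpha, (mu - alpha * sigma, mu + alpha * sigma)

# Values taken from the example in the text
alpha, (low, high) = prediction_region(mu=64.7, sigma=3.2, confidence=0.9999)
print(f"alpha = {alpha:.2f}, region = [{low:.1f}, {high:.1f}]")

# The new observation is flagged only if it falls outside the region
new_value = 70.0
print("unusual" if not (low <= new_value <= high) else "nothing unusual")
```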
In the overview plots, the question is asked the other way around. We estimate the probability of observing a value as high (or as low) given our prior knowledge ($\mu$ and $\sigma$). First, we compute this probability (also called the likelihood) using the cumulative distribution function of either the normal distribution or the F distribution. This probability can be very small, so for easier visualization we take its logarithm. This new value, called the log-likelihood, is always negative. We then give it a sign (positive if the new observation is higher than the mean, negative otherwise). We also bound it to reasonable values to avoid distorting the color scale too much.
Back to our example: if we use the normal distribution, the probability of observing a value at least as high as $70°C$ was $L \approx 0.049$. The log-likelihood is thus $LL \approx -3.02$. Finally, the new observation was higher than the mean, so we give it a positive sign: the final value is $3.02$.
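The signed log-likelihood of this example can be sketched as follows, again with the standard-library `statistics.NormalDist`. The function name `signed_log_likelihood` and the default bound of 10 are assumptions made for illustration; the text only says that the value is bounded "to reasonable values", and the real `cashew` code may proceed differently.

```python
import math
from statistics import NormalDist

def signed_log_likelihood(x, mu, sigma, bound=10.0):
    """Signed, bounded log of the tail probability of x under N(mu, sigma).

    Hypothetical helper: `bound` is an assumed cap, not a value from the text.
    """
    dist = NormalDist(mu, sigma)
    if x > mu:
        # Probability of a value at least as high as x
        p, sign = 1 - dist.cdf(x), 1
    else:
        # Probability of a value at least as low as x
        p, sign = dist.cdf(x), -1
    if p <= 0:
        # The tail probability underflowed: saturate at the bound
        return sign * bound
    magnitude = min(-math.log(p), bound)  # log(p) <= 0, so this is >= 0
    return sign * magnitude

# Values taken from the example in the text
print(f"{signed_log_likelihood(70.0, mu=64.7, sigma=3.2):.2f}")
```

With the natural logarithm, this reproduces the order of magnitude given in the text for an observation of $70°C$; an observation symmetric below the mean gets the same magnitude with a negative sign.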
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
CPU times: user 1.11 s, sys: 12.2 ms, total: 1.12 s Wall time: 4.52 s
%%time
node_limit = None if factor.startswith('mean') else 1
tmp = nrt.plot_evolution_cluster(marked, changelog=changelog, node_limit=node_limit)
troll-1
2021-06-24 10:08:50,765 - non_regression_tests - WARNING - To save space, only plotted the evolution of 1 node
CPU times: user 1.8 s, sys: 12.6 ms, total: 1.82 s Wall time: 7.5 s
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_windowed(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
CPU times: user 1.08 s, sys: 52.4 ms, total: 1.13 s Wall time: 4.64 s
%%time
import warnings
warnings.filterwarnings("ignore")
node_limit = None if factor.startswith('mean') else 1
tmp = nrt.plot_evolution_cluster_windowed(marked, changelog=changelog, node_limit=node_limit)
troll-1
2021-06-24 10:09:04,728 - non_regression_tests - WARNING - To save space, only plotted the evolution of 1 node
CPU times: user 1.9 s, sys: 26.4 ms, total: 1.93 s Wall time: 8.04 s