Regular measurements are taken on the 363 nodes of 8 Grid'5000 clusters to keep track of their evolution over time. Three main metrics are collected: the average CPU performance (in Gflop/s), the average CPU frequency (in GHz) and the average CPU temperature (in °C).
cluster = 'yeti'
factor = 'mean_gflops'
confidence = 0.9999
# Parameters
cluster = "gros"
factor = "mean_gflops_2048"
%load_ext autoreload
%autoreload 2
import requests
import pandas
import io
import plotnine
plotnine.options.figure_size = 10, 7.5
plotnine.options.dpi = 100
from cashew import non_regression_tests as nrt
import cashew
print(cashew.__git_version__)
/usr/local/lib/python3.7/dist-packages/scipy/__init__.py:149: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.16.2
a745d0dbd6b1fef88adcceaa507d5584085a755e
%%time
csv_url = nrt.DEFAULT_CSV_URL_PREFIX + nrt.DATA_FILES[factor]
df = nrt.format(nrt.get(csv_url))
2021-09-29 09:40:58,473 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 168840 rows and 23 columns
CPU times: user 1.04 s, sys: 60.3 ms, total: 1.1 s Wall time: 2.97 s
changelog = nrt.format_changelog(nrt.get(nrt.DEFAULT_CHANGELOG_URL))
outlierlog = nrt.format_changelog(nrt.get(nrt.DEFAULT_OUTLIERLOG_URL))
2021-09-29 09:40:58,666 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 24 rows and 5 columns
2021-09-29 09:40:58,674 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 52 rows and 5 columns
df = nrt.filter(df, cluster=cluster)
2021-09-29 09:40:58,819 - non_regression_tests - INFO - Filtered the dataframe, there remains 28998 rows
df = nrt.filter_na(df, factor)
2021-09-29 09:40:58,979 - non_regression_tests - INFO - Filtered the dataframe, there remains 28998 rows
%%time
nrt.plot_latest_distribution(df, factor)
2021-09-29 09:41:02,537 - non_regression_tests - INFO - Filtered the dataframe, there remains 124 rows
CPU times: user 1.08 s, sys: 244 ms, total: 1.32 s Wall time: 3.52 s
<ggplot: (-9223363280936775157)>
%%time
marked=nrt.mark_weird(df, changelog, outlierlog, nmin=10, keep=5, window=5, naive=False, confidence=confidence, cols=[factor])
nb_weird = len(marked[marked.weird.isin({'positive', 'negative'})])
nb_total = len(marked[marked.weird != 'NA'])
print(f'{nb_weird/nb_total*100:.2f}% of measures are abnormal ({nb_weird}/{nb_total})')
/usr/local/lib/python3.7/dist-packages/scipy/stats/_distn_infrastructure.py:967: RuntimeWarning: invalid value encountered in greater
/usr/local/lib/python3.7/dist-packages/scipy/stats/_distn_infrastructure.py:1956: RuntimeWarning: invalid value encountered in greater_equal
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign
1.03% of measures are abnormal (247/24042)
CPU times: user 1min 5s, sys: 529 ms, total: 1min 6s Wall time: 1min 56s
%%time
import plotnine
nb_unique = len(marked[['node', 'cpu']].drop_duplicates())
height = max(6, nb_unique/8)
old_sizes = tuple(plotnine.options.figure_size)
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_raw_data(marked, changelog))
plotnine.options.figure_size = old_sizes
CPU times: user 5.62 s, sys: 128 ms, total: 5.74 s Wall time: 5.74 s
The goal of the following cells is to detect possible anomalies in the considered metric (performance, frequency or temperature).
Suppose that we have run 20 different experiments with a given CPU on a given node and measured its average temperature each time. We therefore have a list of 20 values. We can now compute their mean $\mu$ and their standard deviation $\sigma$.
For instance, we may have $\mu \approx 64.7°C$ and $\sigma \approx 3.2°C$.
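As a minimal sketch of this first step, the mean and standard deviation of a list of measurements can be computed with the Python standard library. The 20 temperature values below are hypothetical and chosen only for illustration, so they will not reproduce the exact figures quoted above. The helper name `summarize` is also an assumption, as is the use of the sample standard deviation (denominator $n - 1$); the library may compute it differently.

```python
import statistics

def summarize(values):
    """Return (mean, sample standard deviation) of a list of measurements."""
    mu = statistics.fmean(values)
    # Assumption: sample standard deviation, i.e. denominator n - 1.
    sigma = statistics.stdev(values)
    return mu, sigma

# Hypothetical temperatures (in °C) from 20 experiments on one CPU.
temperatures = [
    64.1, 61.8, 66.9, 68.2, 60.3, 63.7, 65.5, 70.1, 62.4, 64.8,
    59.6, 67.3, 66.0, 63.2, 68.9, 61.1, 65.9, 62.7, 69.4, 64.0,
]

mu, sigma = summarize(temperatures)
print(f"mu = {mu:.1f} °C, sigma = {sigma:.1f} °C")
```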
Now, suppose that we perform a new experiment. This time, the CPU has an average temperature of $70°C$. This new measurement is higher than the mean of the 20 previous ones, but is it significantly higher? What was the probability of observing a temperature at least as high if nothing changed on the CPU?
In the evolution plots, we show the observed values with a prediction region $\mu \pm \alpha\times\sigma$, where the factor $\alpha$ is defined for a given confidence. With a confidence of 99.99%, if nothing has changed on the CPU, then 99.99% of the measures will fall in the prediction region. In other words, if a measure falls outside of this region, then something unusual probably happened on this CPU at this time. The factor $\alpha$ is computed using the quantile function of either the normal distribution or the F distribution.
Back to our example, if we use the normal distribution with a 99.99% confidence, we get $\alpha \approx 3.89$ and the associated prediction region is $[52.3°C, 77.1°C]$. Our latest observation of $70°C$ falls in this region, so we consider that there is nothing unusual here.
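The prediction region of this example can be checked with a short sketch. It covers the normal-distribution case only and assumes a two-sided region, so the excluded probability mass $1 - 0.9999$ is split equally between both tails. The function name `prediction_region` is hypothetical and is not part of the library used in this notebook.

```python
from statistics import NormalDist

def prediction_region(mu, sigma, confidence):
    """Return (alpha, (low, high)) for a two-sided normal prediction region."""
    # Two-sided region: half of the excluded mass lies in each tail.
    upper_quantile = 1 - (1 - confidence) / 2
    alpha = NormalDist().inv_cdf(upper_quantile)
    return alpha, (mu - alpha * sigma, mu + alpha * sigma)

alpha, (low, high) = prediction_region(mu=64.7, sigma=3.2, confidence=0.9999)
print(f"alpha = {alpha:.2f}, region = [{low:.1f}, {high:.1f}]")
print("70°C inside region:", low <= 70 <= high)
```

With $\mu = 64.7$ and $\sigma = 3.2$, this gives the factor $\alpha \approx 3.89$ and the region bounds quoted above, up to rounding.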
In the overview plots, the question is the other way around. We estimate the probability of observing a value as high (or as low) given the prior knowledge we had ($\mu$ and $\sigma$). First, we compute this probability (also called likelihood) using the cumulative distribution function of either the normal distribution or the F distribution. This probability can be very low, so for an easier visualization we take its logarithm. This new value, called log-likelihood, is always negative. For a better visualization, we then give it a sign (positive if the new observation is higher than the mean, negative otherwise). We also bound it to reasonable values so that the color scale is not distorted too much.
Back to our example, if we use the normal distribution, the probability of observing a value at least as high as $70°C$ was $L \approx 0.049$. The (natural) log-likelihood is thus $LL \approx -3.02$. Finally, the new observation was higher than the mean, so we give it a positive sign: the final value is $3.02$.
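The computation in this example can be sketched as follows, again for the normal-distribution case only. The function name `signed_log_likelihood` and the bound of 10 are assumptions made for illustration; the actual bound used for the color scale is not stated here.

```python
import math
from statistics import NormalDist

def signed_log_likelihood(x, mu, sigma, bound=10.0):
    """Signed, bounded log-likelihood of observing x given mu and sigma."""
    cdf = NormalDist(mu, sigma).cdf(x)
    above = x > mu
    # Probability of a value at least as extreme on the same side of the mean.
    tail = 1 - cdf if above else cdf
    # The natural logarithm of a probability is never positive; a tail that
    # underflows to zero is treated as reaching the (hypothetical) bound.
    magnitude = bound if tail <= 0 else min(-math.log(tail), bound)
    # Positive sign if the observation is above the mean, negative otherwise.
    return magnitude if above else -magnitude

value = signed_log_likelihood(70, mu=64.7, sigma=3.2)
print(f"signed log-likelihood = {value:.2f}")
```

Bounding the magnitude keeps a few extreme observations from stretching the color scale so far that moderate deviations become indistinguishable.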
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
CPU times: user 5.04 s, sys: 228 ms, total: 5.27 s Wall time: 5.19 s
%%time
node_limit = None if factor.startswith('mean') else 1
tmp = nrt.plot_evolution_cluster(marked, changelog=changelog, node_limit=node_limit)
[Output: one evolution plot per node, from gros-1 to gros-124]
CPU times: user 1min 38s, sys: 1.19 s, total: 1min 39s Wall time: 1min 38s
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_windowed(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
CPU times: user 2.53 s, sys: 160 ms, total: 2.69 s Wall time: 2.59 s
%%time
import warnings
warnings.filterwarnings("ignore")
node_limit = None if factor.startswith('mean') else 1
tmp = nrt.plot_evolution_cluster_windowed(marked, changelog=changelog, node_limit=node_limit)
[Output: one windowed evolution plot per node, from gros-1 to gros-124]
CPU times: user 1min 10s, sys: 811 ms, total: 1min 11s Wall time: 1min 10s