An Exception was encountered at 'In [10]'.
Measurements are taken regularly on the 363 nodes of 8 Grid'5000 clusters to track how they evolve over time. Three main metrics are collected: the average CPU performance (in Gflop/s), the average CPU frequency (in GHz) and the average CPU temperature (in °C).
cluster = 'yeti'
factor = 'mean_gflops'
confidence = 0.9999
# Parameters
cluster = "paravance"
factor = "m"
%load_ext autoreload
%autoreload 2
import requests
import pandas
import io
import plotnine
plotnine.options.figure_size = 10, 7.5
plotnine.options.dpi = 100
from cashew import non_regression_tests as nrt
import cashew
print(cashew.__git_version__)
697ee1f882418b003dd238ab6f7a1997176ac874
%%time
csv_url = nrt.DEFAULT_CSV_URL_PREFIX + nrt.DATA_FILES[factor]
df = nrt.format(nrt.get(csv_url))
2021-06-24 10:08:22,492 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 168840 rows and 39 columns
CPU times: user 2.3 s, sys: 85 ms, total: 2.39 s
Wall time: 9.1 s
changelog = nrt.format_changelog(nrt.get(nrt.DEFAULT_CHANGELOG_URL))
outlierlog = nrt.format_changelog(nrt.get(nrt.DEFAULT_OUTLIERLOG_URL))
2021-06-24 10:08:22,806 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 24 rows and 5 columns
2021-06-24 10:08:22,822 - non_regression_tests - INFO - Loaded (from cache) a dataframe with 52 rows and 5 columns
df = nrt.filter(df, cluster=cluster)
2021-06-24 10:08:23,128 - non_regression_tests - INFO - Filtered the dataframe, there remains 38490 rows
df = nrt.filter_na(df, factor)
2021-06-24 10:08:23,421 - non_regression_tests - INFO - Filtered the dataframe, there remains 38490 rows
%%time
nrt.plot_latest_distribution(df, factor)
2021-06-24 10:08:27,430 - non_regression_tests - INFO - Filtered the dataframe, there remains 144 rows
CPU times: user 975 ms, sys: 197 ms, total: 1.17 s
Wall time: 3.86 s
<ggplot: (-9223363308250295822)>
%%time
marked = nrt.mark_weird(df, changelog, outlierlog, nmin=10, keep=5, window=5, naive=False, confidence=confidence, cols=[factor])
nb_weird = len(marked[marked.weird.isin({'positive', 'negative'})])
nb_total = len(marked[marked.weird != 'NA'])
print(f'{nb_weird/nb_total*100:.2f}% of measures are abnormal ({nb_weird}/{nb_total})')
/usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:872: RuntimeWarning: invalid value encountered in greater
/usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
/usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
/usr/lib/python3/dist-packages/scipy/stats/_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in sign
100.00% of measures are abnormal (34026/34026)
CPU times: user 1min 2s, sys: 609 ms, total: 1min 2s
Wall time: 2min 53s
Execution using papermill encountered an exception here and stopped:
%%time
import plotnine
nb_unique = len(marked[['node', 'cpu']].drop_duplicates())
height = max(6, nb_unique/8)
old_sizes = tuple(plotnine.options.figure_size)
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_raw_data(marked, changelog))
plotnine.options.figure_size = old_sizes
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/plotnine/scales/scale_xy.py in map(self, series, limits)
     79             try:
---> 80                 seq = seq[idx]
     81             except IndexError:

IndexError: arrays used as indices must be of integer (or boolean) type

During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)
<ipython-input-10-938313fee005> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', "import plotnine\nnb_unique = len(marked[['node', 'cpu']].drop_duplicates())\nheight = max(6, nb_unique/8)\nold_sizes = tuple(plotnine.options.figure_size)\nplotnine.options.figure_size = (10, height)\nprint(nrt.plot_overview_raw_data(marked, changelog))\nplotnine.options.figure_size = old_sizes")

/usr/lib/python3/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2115             magic_arg_s = self.var_expand(line, stack_depth)
   2116             with self.builtin_trap:
-> 2117                 result = fn(magic_arg_s, cell)
   2118             return result
   2119

<decorator-gen-53> in time(self, line, cell, local_ns)

/usr/lib/python3/dist-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    186         # but it's overkill for just that one bit of state.
    187         def magic_deco(arg):
--> 188             call = lambda f, *a, **k: f(*a, **k)
    189
    190             if callable(arg):

/usr/lib/python3/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
   1191         else:
   1192             st = clock2()
-> 1193             exec(code, glob, local_ns)
   1194             end = clock2()
   1195             out = None

<timed exec> in <module>()

/usr/local/lib/python3.7/dist-packages/plotnine/ggplot.py in __str__(self)
     86         Print/show the plot
     87         """
---> 88         self.draw(show=True)
     89
     90         # Return and empty string so that print(p) is "pretty"

/usr/local/lib/python3.7/dist-packages/plotnine/ggplot.py in draw(self, return_ggplot, show)
    203         self = deepcopy(self)
    204         with plot_context(self, show=show):
--> 205             self._build()
    206
    207             # setup

/usr/local/lib/python3.7/dist-packages/plotnine/ggplot.py in _build(self)
    295         # to ranges and all positions are numeric
    296         layout.train_position(layers, scales.x, scales.y)
--> 297         layout.map_position(layers)
    298
    299         # Apply and map statistics

/usr/local/lib/python3.7/dist-packages/plotnine/facets/layout.py in map_position(self, layers)
    112                                  set(data.columns))
    113             SCALE_Y = _layout['SCALE_Y'].iloc[match_id].tolist()
--> 114             self.panel_scales_y.map(data, y_vars, SCALE_Y)
    115
    116     def get_scales(self, i):

/usr/local/lib/python3.7/dist-packages/plotnine/scales/scales.py in map(self, data, vars, idx)
    156             for i, sc in enumerate(self, start=1):
    157                 bool_idx = (i == idx)
--> 158                 results = sc.map(data.loc[bool_idx, col])
    159                 if use_df:
    160                     df.loc[bool_idx, col] = results

/usr/local/lib/python3.7/dist-packages/plotnine/scales/scale_xy.py in map(self, series, limits)
     84                     seq = np.hstack((seq.astype(object), np.nan))
     85                 idx = np.clip(idx, 0, len(seq)-1)
---> 86                 seq = seq[idx]
     87             return seq
     88         return series

IndexError: arrays used as indices must be of integer (or boolean) type
The goal of the following cells is to detect potential anomalies for the considered metric (performance, frequency or temperature).
Suppose that we have run 20 different experiments with a given CPU on a given node and measured its average temperature each time. We therefore have a list of 20 values. From this list we can compute the sample mean $\mu$ and the sample standard deviation $\sigma$.
For instance, we may have $\mu \approx 64.7°C$ and $\sigma \approx 3.2°C$.
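As a minimal sketch of this first step, the two statistics $\mu$ and $\sigma$ can be computed with Python's standard library. The 20 temperatures below are made-up values chosen only for illustration, and the use of the sample standard deviation (with an $n-1$ denominator) is an assumption, not necessarily what `cashew` does internally.

```python
from statistics import mean, stdev

# Hypothetical average temperatures (in °C) from 20 earlier experiments.
# These values are invented for illustration; they are not Grid'5000 data.
temperatures = [
    64.1, 66.3, 61.8, 68.9, 63.5, 65.2, 60.7, 67.4, 64.8, 62.9,
    69.3, 63.0, 66.8, 61.2, 65.9, 64.4, 67.9, 62.3, 66.0, 63.7,
]

mu = mean(temperatures)      # sample mean
sigma = stdev(temperatures)  # sample standard deviation (n - 1 denominator, an assumption)
print(f"mu = {mu:.1f} °C, sigma = {sigma:.1f} °C")
```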
Now, suppose that we perform a new experiment. This time, this CPU has an average temperature of $70°C$. This new measurement is higher than the mean of the 20 previous ones, but is it significantly higher? What was the probability of observing a temperature at least as high if nothing changed on the CPU?
In the evolution plots, we show the observed values with a prediction region $\mu \pm \alpha\times\sigma$, where the factor $\alpha$ is defined for a given confidence. With a confidence of 99.99%, if nothing has changed on the CPU, then 99.99% of the measures will fall in the prediction region. In other words, if a measure falls outside of this region, then something unusual probably happened on this CPU at this time. The factor $\alpha$ is computed using the quantile function of either the normal distribution or the F distribution.
Back to our example, if we use the normal distribution, with a 99.99% confidence $\alpha \approx 3.89$ and the associated prediction region is $[52.3°C, 77.1°C]$. Our latest observation of $70°C$ falls in this region, so we consider that there is nothing unusual here.
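The normal-distribution case of this computation can be sketched as follows. This is an illustration using the standard library's `statistics.NormalDist`, not the code used by `cashew`; the helper name `prediction_region` is hypothetical, and the two-sided split of the excluded probability is an assumption that is consistent with $\alpha \approx 3.89$.

```python
from statistics import NormalDist

def prediction_region(mu, sigma, confidence):
    """Return (alpha, low, high) for a two-sided normal prediction region.

    Hypothetical helper: the excluded probability (1 - confidence) is
    assumed to be split evenly between the two tails.
    """
    alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return alpha, mu - alpha * sigma, mu + alpha * sigma

# Values from the example in the text.
alpha, low, high = prediction_region(mu=64.7, sigma=3.2, confidence=0.9999)
print(f"alpha = {alpha:.2f}, region = [{low:.1f}, {high:.1f}]")

# The new observation is flagged only if it falls outside the region.
new_value = 70.0
print("unusual" if not (low <= new_value <= high) else "nothing unusual")
```

With the example values, the computed factor and bounds should be close to those quoted above, and the observation of $70°C$ lies inside the region.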
In the overview plots, the question is reversed. We estimate the probability of observing a value as high (or as low) given our prior knowledge ($\mu$ and $\sigma$). First, we compute this probability (also called likelihood) using the cumulative distribution function of either the normal distribution or the F distribution. This probability can be very low, so to make it easier to visualize we take its logarithm. This new value, called log-likelihood, is always negative. We then take its magnitude and give it a sign (positive if the new observation is higher than the mean, negative otherwise). We also bound it to reasonable values so as not to distort the color scale too much.
Back to our example, if we use the normal distribution, the probability of observing a value at least as high as $70°C$ was $L \approx 0.049$. The log-likelihood is thus $LL \approx -3.02$. Finally, since the new observation was higher than the mean, we give it a positive sign: the final value is $3.02$.
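The signed log-likelihood for the normal case can be sketched like this. It is an illustration rather than the `cashew` implementation: the function name `signed_log_likelihood` and the `bound` value of 10 are hypothetical (the text does not state which bounds are used), and the natural logarithm is assumed because it matches $LL \approx -3.02$.

```python
import math
from statistics import NormalDist

def signed_log_likelihood(x, mu, sigma, bound=10.0):
    """Signed, bounded log-likelihood of observation x (normal case).

    Hypothetical helper; `bound` is an arbitrary illustrative value.
    """
    dist = NormalDist(mu, sigma)
    if x >= mu:
        # Probability of a value at least as high as x.
        p, sign = 1 - dist.cdf(x), 1
    else:
        # Probability of a value at least as low as x.
        p, sign = dist.cdf(x), -1
    # Natural logarithm (assumption); guard against p underflowing to 0.
    log_likelihood = math.log(p) if p > 0 else -math.inf
    value = sign * -log_likelihood
    # Clip to [-bound, bound] so extreme values do not dominate the color scale.
    return max(-bound, min(bound, value))

print(f"{signed_log_likelihood(70.0, mu=64.7, sigma=3.2):.2f}")
```

An observation equally far below the mean would yield the same magnitude with a negative sign, and very unlikely observations saturate at the chosen bound.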
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
%%time
node_limit = None if factor.startswith('mean') else 1
tmp = nrt.plot_evolution_cluster(marked, changelog=changelog, node_limit=node_limit)
%%time
plotnine.options.figure_size = (10, height)
print(nrt.plot_overview_windowed(marked, changelog, confidence=confidence, discretize=True))
plotnine.options.figure_size = old_sizes
%%time
import warnings
warnings.filterwarnings("ignore")
node_limit = None if factor.startswith('mean') else 1
tmp = nrt.plot_evolution_cluster_windowed(marked, changelog=changelog, node_limit=node_limit)