Case Study: Excavator Survival Analysis
Mining Excavator dataset case study, as originally presented in Sexton et al. [^1].
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import nestor
from nestor import keyword as kex
import nestor.datasets as dat
def set_style():
"""This sets reasonable defaults for a figure that will go in a paper"""
sns.set_context("paper")
sns.set(font='serif')
sns.set_style("white", {
"font.family": "serif",
"font.serif": ["Times", "Palatino", "serif"]
})
set_style()
df = dat.load_excavators()
df.head()
vocab = dat.load_vocab('excavators')
vocab
Knowledge Extraction
We already have vocabulary and data, so let's merge them to structure our data into a more useful format.
scikit-learn's `Pipeline` is a convenient way to use the `TagExtractor` to output a more usable format. Let's use the multi-index `binary` format for now. Other options include:
- list-of-tokens `multilabel`
- NER-trainer `iob`
# merge and cleanse NLP-containing columns of the data
from sklearn.pipeline import Pipeline
pipe = Pipeline([
('clean_combine', kex.NLPSelect(columns=["OriginalShorttext"])),
('tag', kex.TagExtractor(
thesaurus=vocab, # load up pre-trained vocab file from [^1]
group_untagged=True, # merge untagged tokens into a misc. `_untagged` label
filter_types=None, # keep all tag types
ngram_range=(1,2), # this vocab file includes compound tags, so make sure we find them
verbose=True, # report how much coverage our vocab file got on proposed tags (tokens)
output_type='binary', # could use `multilabel` or `iob` as well
)),
])
tags = pipe.fit_transform(df)
tags.sum().sort_values(ascending=False)
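For reference, the other output formats can be produced by swapping `output_type` in the same pipeline. A minimal sketch of the list-of-tokens `multilabel` variant (same components as above; only the output type changes):
# hedged sketch: identical pipeline, but emitting one list of tag aliases per work order
pipe_ml = Pipeline([
    ('clean_combine', kex.NLPSelect(columns=["OriginalShorttext"])),
    ('tag', kex.TagExtractor(
        thesaurus=vocab,
        group_untagged=True,
        filter_types=None,
        ngram_range=(1, 2),
        output_type='multilabel',  # or 'iob' for NER-trainer output
    )),
])
tags_ml = pipe_ml.fit_transform(df)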
We can also access the trained steps in the pipeline to make use of convenience functions.
tagger = pipe.named_steps['tag']
tags_read = tagger.tags_as_lists
relation_df = tags.loc[:, ['PI', 'SI']]
tag_df = tags.loc[:, ['I', 'P', 'S', 'U', 'X', 'NA']]
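Because the `binary` output uses a two-level column index (tag type, then alias), any individual tag is a plain column slice away. For example, pulling out two Item tags that appear later in this study:
# occurrence indicators for the `hyd` and `engine` Item tags
tags['I'][['hyd', 'engine']].head()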
Quality of Extracted Keywords
It's worth checking the quality of the extraction: how many tags a typical document gets, how the tags read next to the original text, and how completely the vocabulary covers each work order.
nbins = int(np.percentile(tag_df.sum(axis=1), 90))
print(f'90% of docs have at most {nbins} tags')
tags_read.assign(text=df.OriginalShorttext).sample(10)
# how many instances of each keyword class are there?
print('named entities: ')
print('I\tItem\nP\tProblem\nS\tSolution')
print('U\tUnknown\nX\tStop Word')
print('total tokens: ', vocab.NE.notna().sum())
print('total tags: ', vocab.groupby("NE").nunique().alias.sum())
vocab.groupby("NE").nunique()
# tag-completeness of work-orders?
tagger.report_completeness()
# sns.distplot is deprecated; histplot gives the same counts-per-bin view
sns.histplot(tagger.tag_completeness.dropna(), bins=nbins)
plt.xlim(0.1, 1.0)
plt.xlabel('precision (PPV)')
An Aside: Effectiveness of Tagging
What have we actually gained by using the `TagExtractor`? The `vocab` file functions as a thesaurus, where each default `alias` represents multiple disparate tokens. This means the dimensionality of our resulting matrix can be significantly reduced using this domain knowledge, which can improve model predictability, performance, applicability, etc.
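As a toy illustration of that thesaurus behavior (hypothetical spellings, not actual rows from the `vocab` file), several raw surface forms collapse into one alias:
# hypothetical thesaurus rows: three surface forms -> one canonical Item alias
toy_vocab = pd.DataFrame({
    'tokens': ['hyd', 'hydraulic', 'hydralic'],
    'NE': ['I'] * 3,                 # all tagged as Items
    'alias': ['hydraulic'] * 3,      # one concept, one matrix column
}).set_index('tokens')
toy_vocab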
# original token string frequencies
cts = np.where(tagger.tfidf.todense()>0., 1, 0).sum(axis=0)
sns.histplot(
cts, log_scale=True,
stat='probability',discrete=True,
label='Raw Tokens',
color='grey'
)
# tag frequencies
sns.histplot(
tag_df[['I', 'P', 'S']].sum(), log_scale=True,
stat='probability', discrete=True,
label='Tagged Aliases',
hatch='///',
color='dodgerblue',alpha=0.3,
)
plt.legend()
plt.title('Probability of a Token/Tag\'s frequency')
The entire goal, in some sense, is to remove low-occurrence, unimportant information from our data and to form concept conglomerates that allow more useful statistical inferences to be made. Tags mapped from `nestor-gui`, as the plot shows, have very few instances of 1x-occurrence concepts, compared to several thousand in the raw tokens (this is by design, of course). Additionally, high-occurrence concepts that might have had misspellings or synonyms drastically improve their average occurrence rate.
NOTE: This is without artificial thresholding of minimum tag frequency. Thresholding would simply "cut off" the blue distribution below some value, not change its overall shape.
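If a minimum-frequency cutoff were wanted anyway, it amounts to one filter on the tag counts (the threshold of 5 below is an arbitrary illustration):
# keep only tags that occur at least 5 times across the corpus
tag_counts = tag_df[['I', 'P', 'S']].sum()
tag_counts[tag_counts >= 5].sort_values(ascending=False)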
Survival Analysis
What do we do with tags?
One way is to rely on their ability to normalize raw tokens into consistent aliases, so that estimates of rare-event statistics become possible. Say you wish to know the median time-to-failure of an excavator subsystem (e.g. the engines in your fleet): this might help you understand how often "engine expertise" is needed, to plan hiring or shift schedules, etc.
To get the raw text occurrences into something more consistent for failure-time estimation, one might:
- build a rules-based algorithm that checks for known (a priori) pattern occurrences and categorizes/normalizes when matched (think: regex matching; see the sketch below)
- create aliases for raw tokens (e.g. using suggestions for "important" tokens from `TokenExtractor.thesaurus_template`)
This was done in [^1], and we demonstrate the technique below. See the paper for further details!
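As a sketch of the rules-based flavor (hypothetical patterns, not the actual rules used by Hodkiewicz et al.), regex matching might normalize known surface forms like so:
import re

# hypothetical normalization rules; a production rule set would be far larger
rules = {
    r'\beng(ine)?\b': 'engine',
    r'\bhyd(raulic)?s?\b': 'hydraulic',
}

def normalize(text):
    """Lower-case the text and rewrite any rule matches to their canonical form."""
    text = text.lower()
    for pattern, canon in rules.items():
        text = re.sub(pattern, canon, text)
    return text

normalize('Replace ENG oil filter')  # -> 'replace engine oil filter'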
Rules-Based
From Hodkiewicz et al., a rule-based method was used to estimate failure times for survival analysis. Let's see their data:
df_clean = dat.load_excavators(cleaned=True)
df_clean['SuspSugg'] = pd.to_numeric(df_clean['SuspSugg'], errors='coerce')
df_clean.dropna(subset=['RunningTime', 'SuspSugg'], inplace=True)
df_clean.shape
df_clean.sort_values('BscStartDate').head(10)
We once again turn to the lifelines library as the workhorse for estimating the survival function (in this context, the probability at time t since the previous MWO that a new MWO has not yet occurred).
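lifelines expects durations plus an event flag, where 1 means the failure was observed and 0 means the interval was censored (e.g. a suspension). A toy example with made-up durations:
from lifelines import KaplanMeierFitter

toy_T = [5, 8, 12, 12, 20]  # made-up days between consecutive MWOs
toy_E = [1, 1, 0, 1, 0]     # 1 = failure observed, 0 = censored
toy_km = KaplanMeierFitter().fit(toy_T, event_observed=toy_E)
toy_km.median_survival_time_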
from lifelines import WeibullFitter, ExponentialFitter, KaplanMeierFitter
mask = (df_clean.MajorSystem =='Bucket')
# mask=df_clean.index
def mask_to_ETclean(df_clean, mask, fill_null=1.):
    filter_df = df_clean.loc[mask]
    g = filter_df.sort_values('BscStartDate').groupby('Asset')
    # durations T: days between consecutive MWOs on the same asset
    T = g['BscStartDate'].transform(pd.Series.diff).dt.days
    # events E: 1 = actual failure observed, 0 = suspension (censored)
    E = (~filter_df['SuspSugg'].astype(bool)).astype(int)
    # keep only strictly positive, defined durations
    valid = ~((T <= 0.) | T.isna())
    return T.loc[valid], E.loc[valid]
T, E = mask_to_ETclean(df_clean, mask)
wf = WeibullFitter()
wf.fit(T, E, label='Rule-Based Weibull')
print('{:.3f}'.format(wf.lambda_), '{:.3f}'.format(wf.rho_))
# wf.print_summary()
wf.hazard_.plot()
plt.title('weibull hazard function')
plt.xlim(0,110)
wf.survival_function_.plot()
plt.xlim(0,110)
plt.title('weibull survival function')
print(f'transform: β={wf.rho_:.2f}\tη={1/wf.lambda_:.2f}')
# wf._compute_standard_errors()
to_bounds = lambda row:'±'.join([f'{i:.2g}' for i in row])
wf.summary.iloc[:,:2].apply(to_bounds, 1)
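A note on that transform: judging by the $\eta = 1/\lambda$ conversion, the lifelines version used here parameterizes the Weibull survival function as $S(t) = e^{-(\lambda t)^\rho}$, so the shape parameter is $\beta = \rho$ and the characteristic life is $\eta = 1/\lambda$. The error-propagation step inside `query_experiment` below, $se(\eta) \approx se(\lambda)/\lambda^2$, follows from the same change of variables (first-order delta method).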
Tag-Based Comparison
We now estimate the occurrence of failures using tag occurrences in the raw work orders.
import math
def to_precision(x,p):
"""
returns a string representation of x formatted with a precision of p
Based on the webkit javascript implementation taken from here:
https://code.google.com/p/webkit-mirror/source/browse/JavaScriptCore/kjs/number_object.cpp
"""
x = float(x)
if x == 0.:
return "0." + "0"*(p-1)
out = []
if x < 0:
out.append("-")
x = -x
e = int(math.log10(x))
tens = math.pow(10, e - p + 1)
n = math.floor(x/tens)
if n < math.pow(10, p - 1):
e = e -1
tens = math.pow(10, e - p+1)
n = math.floor(x / tens)
if abs((n + 1.) * tens - x) <= abs(n * tens -x):
n = n + 1
if n >= math.pow(10,p):
n = n / 10.
e = e + 1
m = "%.*g" % (p, n)
if e < -2 or e >= p:
out.append(m[0])
if p > 1:
out.append(".")
out.extend(m[1:p])
out.append('e')
if e > 0:
out.append("+")
out.append(str(e))
elif e == (p -1):
out.append(m)
elif e >= 0:
out.append(m[:e+1])
if e+1 < len(m):
out.append(".")
out.extend(m[e+1:])
else:
out.append("0.")
out.extend(["0"]*-(e+1))
out.append(m)
return "".join(out)
def query_experiment(name, df, df_clean, rule, tag, multi_tag, prnt=False):
def mask_to_ETclean(df_clean, mask, fill_null=1.):
filter_df = df_clean.loc[mask]
g = filter_df.sort_values('BscStartDate').groupby('Asset')
T = g['BscStartDate'].transform(pd.Series.diff).dt.days
E = (~filter_df['SuspSugg'].astype(bool)).astype(int)
return T.loc[~((T<=0.)|(T.isna()))], E.loc[~((T<=0.)|(T.isna()))]
def mask_to_ETraw(df_clean, mask, fill_null=1.):
filter_df = df_clean.loc[mask]
g = filter_df.sort_values('BscStartDate').groupby('Asset')
T = g['BscStartDate'].transform(pd.Series.diff).dt.days
        T_defined = (T > 0.) & T.notna()  # durations must be both positive and defined
T = T[T_defined]
# assume censored when parts replaced (changeout)
E = (~(tag_df.S.changeout>0)).astype(int)[mask]
E = E[T_defined]
return T.loc[~((T<=0.)|(T.isna()))], E.loc[~((T<=0.)|(T.isna()))]
experiment = {
'rules-based': {
'query': rule,
'func': mask_to_ETclean,
'mask': (df_clean.MajorSystem == rule),
'data': df_clean
},
'single-tag': {
'query': tag,
'func': mask_to_ETraw,
'mask': tag_df.I[tag].sum(axis=1)>0,
'data': df
},
'multi-tag': {
'query': multi_tag,
'func': mask_to_ETraw,
'mask': tag_df.I[multi_tag].sum(axis=1)>0,
'data': df
}
}
results = {
('query', 'text/tag'): [],
# ('Weibull Params', r'$\lambda$'): [],
('Weibull Params', r'$\beta$'): [],
        ('Weibull Params', r'$\eta$'): [],
('MTTF', 'Weib.'): [],
('MTTF', 'K-M'): []
}
idx = []
for key, info in experiment.items():
idx += [key]
results[('query','text/tag')] += [info['query']]
if prnt:
print('{}: {}'.format(key, info['query']))
info['T'], info['E'] = info['func'](info['data'], info['mask'])
wf = WeibullFitter()
wf.fit(info['T'], info['E'], label=f'{key} weibull')
        to_bounds = lambda row: r'$\pm$'.join([to_precision(row[0], 2),
                                               to_precision(row[1], 1)])
params = wf.summary.T.iloc[:2]
params['eta_'] = [1/params.lambda_['coef'], # err. propagation
(params.lambda_['se(coef)']/params.lambda_['coef']**2)]
params = params.T.apply(to_bounds, 1)
results[('Weibull Params', r'$\eta$')] += [params['eta_']]
results[('Weibull Params', r'$\beta$')] += [params['rho_']]
if prnt:
print('\tWeibull Params:\n',
'\t\tη = {}\t'.format(params['eta_']),
'β = {}'.format(params['rho_']))
kmf = KaplanMeierFitter()
kmf.fit(info["T"], event_observed=info['E'], label=f'{key} kaplan-meier')
results[('MTTF','Weib.')] += [to_precision(wf.median_survival_time_,3)]
results[('MTTF','K-M')] += [to_precision(kmf.median_survival_time_,3)]
if prnt:
print(f'\tMTTF: \n\t\tWeib \t'+to_precision(wf.median_survival_time_,3)+\
'\n\t\tKM \t'+to_precision(kmf.median_survival_time_,3))
info['kmf'] = kmf
info['wf'] = wf
return experiment, pd.DataFrame(results, index=pd.Index(idx, name=name))
bucket_exp, bucket_res = query_experiment('Bucket', df, df_clean,
'Bucket',
['bucket'],
['bucket', 'tooth', 'lip', 'pin']);
# renamed from `tags` to avoid shadowing the tags DataFrame built earlier
hyd_tags = ['hyd', 'hose', 'pump', 'compressor']
hyd_exp, hyd_res = query_experiment('Hydraulic System', df, df_clean,
'Hydraulic System',
['hyd'],
hyd_tags)
eng_exp, eng_res = query_experiment('Engine', df, df_clean,
'Engine',
['engine'],
['engine', 'filter', 'fan'])
frames = [bucket_res, hyd_res, eng_res]
res = pd.concat(frames, keys = [i.index.name for i in frames],
names=['Major System', 'method'])
res
exp = [bucket_exp, hyd_exp, eng_exp]  # match the order of `res` (Bucket, Hydraulic System, Engine)
f,axes = plt.subplots(nrows=3, figsize=(5,10))
for n, ax in enumerate(axes):
exp[n]['rules-based']['kmf'].plot(ax=ax, color='dodgerblue')
exp[n]['multi-tag']['kmf'].plot(ax=ax, color='xkcd:rust', ls=':')
exp[n]['single-tag']['kmf'].plot(ax=ax, color='xkcd:rust')
ax.set_xlim(0,110)
ax.set_ylim(0,1)
ax.set_title(r"$S(t)$"+f" of {res.index.levels[0][n]}")
sns.despine()
plt.tight_layout()
This next plot gives a better idea of the differences, using a log-log transform: a Weibull survival function appears as a straight line on these axes, so departures are easy to spot. The tags under-estimate death rates a little in the 80-130 day range, probably because there's a failure mode not captured by the [bucket, lip, tooth] tags (because it's rare).
f,axes = plt.subplots(nrows=3, figsize=(5,10))
for n, ax in enumerate(axes):
exp[n]['rules-based']['kmf'].plot_loglogs(ax=ax, c='dodgerblue')
exp[n]['single-tag']['kmf'].plot_loglogs(ax=ax, c='xkcd:rust', ls=':')
exp[n]['multi-tag']['kmf'].plot_loglogs(ax=ax, c='xkcd:rust')
if n != 2:
ax.legend_.remove()
# ax.set_xlim(0,110)
# ax.set_ylim(0,1)
ax.set_title(r"$\log(-\log(S(t)))$"+f" of {res.index.levels[0][n]}")
sns.despine()
plt.tight_layout()
f.savefig('bkt_logKMsurvival.png')