MSDN Magazine, May 2019

The data for the machines includes a history of failures, maintenance operations and sensor telemetry, as well as information about the model and age (in years) of the machines. This data is available in .csv files downloadable from the resource mentioned earlier. I’ll also provide a transformed data file (comp1_df.csv) that’s “survival analysis-ready” and will explain how to perform the transformations later on.
Each machine in the original example has four different components, but I’m going to focus on only one component. The component can either be maintained proactively prior to a failure, or maintained after failure to repair it.
Survival Analysis
In my previous article about survival analysis, I introduced important basic concepts that I’ll use and extend in this article. I encourage you to read that article to familiarize yourself with these concepts, including the survival and hazard functions, censoring and the non-parametric Kaplan-Meier (KM) estimator.
In this article, I’ll show how to extend the concept of the KM estimator to include covariates or variables (also known as features) that can have effects on survival, or, in this case, on machine components’ failure. In the example, I’ll use machine model, machine age and machine telemetry as covariates and use survival regression models to estimate the effects of such covariates on machine failure.
The notion of estimating the effects of covariates on a target variable, in this case time to failure, hazard rate, or survival probabilities, isn’t unique to survival analysis and is the basis for regression models in general.
When building statistical models, you see covariates of three primary data types: categorical, ordinal and continuous. Categorical data types are those that fall into a few discrete categories. Here, a machine model is a categorical data type: there are four different machine models. Ordinal data types are categorical data types that have some meaningful order. An example is movie ratings from one to 10, where 10 is the most entertaining and one the least. Finally, continuous data types are those that represent continuous numbers. Those would be the machine telemetry readings here, which are continuous numbers sampled at certain times (in this case, hourly).
After identifying the data types and the methodology to be used, you should encode the various data types into covariates. Typically, for regression models, continuous variables are naturally encoded as continuous covariates, while categorical data types will require some form of encoding. A popular option for such encoding, which I’ll use in this article, is to create N-1 covariates for a categorical data type with N categories, where a category i is represented by setting its specific covariate to one and all others to zero. The Nth category is represented by setting all covariates to zero. This is typically a good fit for regression models with an explicitly defined baseline, where all covariates can be equal to zero. This is also the format that the R programming language uses to encode categorical variables or factors.
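As a minimal sketch of the N-1 encoding just described, the following Python snippet encodes a machine model as three zero/one covariates, with the last category serving as the all-zeros baseline. The function name dummy_encode and the model labels are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def dummy_encode(value, categories):
    """Encode `value` as N-1 zero/one covariates.

    The last entry of `categories` is the baseline and maps to all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One covariate per non-baseline category (the first N-1 entries).
    return [1 if value == category else 0 for category in categories[:-1]]


# Hypothetical labels for the four machine models.
MODELS = ["model1", "model2", "model3", "model4"]

for model in MODELS:
    print(model, dummy_encode(model, MODELS))
```

With four categories, each machine model gets three covariates, and the fourth model is the baseline against which the effects of the other three are measured.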
This encoding for categoricals has a straightforward interpretation for what it means for some or all covariates to be set to zero. However, for continuous data types, setting a certain covariate to zero may not always be meaningful. For example, if a covariate represents machine height or width, setting that covariate to zero would be meaningless, because there are no such machines in reality.
One way around this problem is to use mean-centered continuous covariates, where for a given covariate, its mean over the training dataset is subtracted from its value. Then, when you set that transformed covariate to zero, it’s equivalent to setting the original covariate to its mean value. This technique is called “mean centering” and I’ll use it here for the machine age and telemetry covariates.
It’s important to remember that, following this transformation, you should always use mean-centered covariates as input to the model. This is also the case when applying the regression model to a new test dataset.
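Mean centering can be sketched in a few lines of Python. The example below assumes the test dataset is centered with the means computed on the training dataset, which is one reading of the advice above. The column names age and volt and all values are made up for illustration.

```python
from statistics import mean


def fit_means(rows, columns):
    """Compute each covariate's mean over the training rows."""
    return {col: mean(row[col] for row in rows) for col in columns}


def center(rows, means):
    """Return copies of the rows with the given means subtracted."""
    return [
        {col: (row[col] - means[col] if col in means else row[col]) for col in row}
        for row in rows
    ]


# Made-up values for illustration only.
train = [
    {"age": 10, "volt": 168.0},
    {"age": 14, "volt": 172.0},
    {"age": 18, "volt": 170.0},
]
test = [{"age": 20, "volt": 169.0}]

means = fit_means(train, ["age", "volt"])
train_centered = center(train, means)
# Assumption: reuse the training means for new data rather than recomputing them.
test_centered = center(test, means)
```

After centering, a value of zero for age corresponds to a machine of average age in the training data, so the model baseline describes a realistic machine.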
Once the data values are encoded as covariates, survival regression models take those covariates and a certain form of survival target variables (which I’ll talk about soon) and specify a model that ties such covariates to survival/time-to-event.
Transformation of the Data to Survival Format and Feature Engineering
In order to work with the survival regression models that I’ll describe, your data needs to have at least two fields: the time stamp of the event of interest (here, machine failure) and a Boolean field indicating whether censoring occurred. (Here, censoring describes a situation in which no failure occurred at or before a specified time. In my example, maintenance happening in a preventive manner, rather than as a response to failure, is considered to be censoring.)
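The two fields can be illustrated with a short Python sketch. It assumes each component lifetime is described by a start time, an end time and an event type. The function name, the event labels "failure" and "maintenance", and the sample dates are all hypothetical. Measuring the event time in hours since the start of the lifetime is also an assumption made here for convenience.

```python
from datetime import datetime


def to_survival_row(start, event_time, event_type):
    """Return (duration_hours, censored) for one component lifetime.

    event_type is "failure" or "maintenance" (hypothetical labels).
    Proactive maintenance ends the lifetime without a failure, so it is censored.
    """
    if event_type not in ("failure", "maintenance"):
        raise ValueError(f"unknown event type: {event_type!r}")
    duration_hours = (event_time - start).total_seconds() / 3600
    return duration_hours, event_type == "maintenance"


# Made-up lifetimes: one ends in failure, the next in preventive maintenance.
records = [
    (datetime(2015, 1, 1, 6), datetime(2015, 1, 5, 6), "failure"),
    (datetime(2015, 1, 5, 6), datetime(2015, 1, 20, 6), "maintenance"),
]
rows = [to_survival_row(*record) for record in records]
print(rows)
```

Some survival libraries expect the opposite flag, an indicator that the event was observed, so the Boolean may need to be inverted before fitting a model.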
The survival regression models I’ll discuss have different assumptions made to simplify their mathematical derivation. Some of these assumptions may not hold here, but it’s still useful to apply survival modeling to this example.
The survival analysis literature is very rich and many advanced survival regression models and techniques have been developed to