In this case, the R script generates the more elaborate output shown in Figure 6.
Before moving on to describe the output, I should mention that the Weibull parameterization in Spark MLLib and in survreg is a bit different from the parameterization I discussed.
A transformation is required and can be done as follows: denoting the reported intercept by m and the reported scale by s, then k = 1/s, lambda = exp(-m/s), and each coefficient should be multiplied by (-1/s). There’s an R package called SurvRegCensCov that can do this conversion automatically, using ConvertWeibull on the model that survreg estimated.
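Here’s a minimal sketch of both the manual transformation and the automatic conversion (assuming, as in the prediction code later in this article, that the fitted survreg model is stored in a variable named machineModel):

library(SurvRegCensCov)

# Manual conversion from the survreg parameterization:
# k = 1/scale, lambda = exp(-intercept/scale), and each
# coefficient is multiplied by (-1/scale)
s <- machineModel$scale
m <- coef(machineModel)[1]
k <- 1 / s
lambda <- exp(-m / s)
weibullCoefs <- coef(machineModel)[-1] * (-1 / s)

# Automatic conversion; ConvertWeibull produces the $vars table
# shown after Figure 6
ConvertWeibull(machineModel)$vars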
Figure 6 Output for the Weibull AFT Regression

Call:
survreg(formula = Surv(time_to_event, event) ~ model + age_mean_centered +
    volt_mean_centered + rotate_mean_centered + pressure_mean_centered +
    vibration_mean_centered, data = df, dist = "weibull")
                            Value Std. Error       z        p
(Intercept)              8.172991   0.119133 68.6040 0.00e+00
modelmodel2              0.040289   0.154668  0.2605 7.94e-01
modelmodel3              0.027225   0.129629  0.2100 8.34e-01
modelmodel4             -0.163865   0.136382 -1.2015 2.30e-01
age_mean_centered       -0.000753   0.007960 -0.0946 9.25e-01
volt_mean_centered      -0.019731   0.002583 -7.6391 2.19e-14
rotate_mean_centered    -0.000767   0.000821 -0.9334 3.51e-01
pressure_mean_centered   0.005173   0.003496  1.4795 1.39e-01
vibration_mean_centered -0.008214   0.008391 -0.9789 3.28e-01
Log(scale)              -0.508060   0.051963 -9.7773 1.41e-22

Scale= 0.602

Weibull distribution
Loglik(model)= -1710.3   Loglik(intercept only)= -1747.2
        Chisq= 73.73 on 8 degrees of freedom, p= 8.9e-13
Number of Newton-Raphson Iterations: 8
n= 709

$vars
                            Estimate           SE
lambda                  1.260459e-06 8.642772e-07
gamma                   1.662064e+00 8.636644e-02
modelmodel2            -6.696297e-02 2.569595e-01
modelmodel3            -4.524990e-02 2.155000e-01
modelmodel4             2.723541e-01 2.268785e-01
age_mean_centered       1.251958e-03 1.322780e-02
volt_mean_centered      3.279500e-02 3.947495e-03
rotate_mean_centered    1.274045e-03 1.365339e-03
pressure_mean_centered -8.598142e-03 5.807130e-03
vibration_mean_centered 1.365213e-02 1.391255e-02
Estimation of the coefficients for the AFT Weibull model in Spark MLLib is done using the maximum likelihood estimation algorithm. As with the Cox PH model estimation, the p column in the output of survreg provides information about the statistical significance of the estimated coefficients, though in this case the p-values are generally lower. There’s still room for feature engineering here, as described earlier for the Cox PH model.
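If you want to inspect those p-values programmatically rather than reading them off the printed output, one possible sketch (again assuming the fitted model is in machineModel):

# The coefficient table (Value, Std. Error, z, p) from the summary
coefTable <- summary(machineModel)$table
# Keep only the rows significant at the 5 percent level
coefTable[coefTable[, "p"] < 0.05, ]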
It’s also important to perform model diagnostics here, as with the Cox PH regression, to make sure the Weibull AFT model is a good fit for the data compared with, for example, other parametric models. While I won’t describe this process here, you can learn more about it in the “Survival Analysis” book I mentioned earlier.
In the $vars output, gamma is equal to k from the previous Weibull parameterization. (For more information on SurvRegCensCov, see bit.ly/2CgcSMg.) Given the estimated parameters, unlike with the Cox PH model, it’s now possible to directly obtain the survival function (it’s the Weibull AFT survival function) and use it to predict survival probabilities for any covariates.
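Concretely, under the Weibull parameterization that ConvertWeibull reports, the survival function is S(t | x) = exp(-lambda * t^gamma * exp(x'beta)), where beta is the vector of converted coefficients. Here’s a minimal sketch (the function name is mine; lambda, gamma and beta come from the $vars output shown earlier):

# Survival probability at time t (hours) for covariate vector x:
# S(t|x) = exp(-lambda * t^gamma * exp(sum(x * beta)))
weibullSurvival <- function(t, x, beta, lambda, gamma) {
  exp(-lambda * t^gamma * exp(sum(x * beta)))
}

That said, the predict function can compute failure-time quantiles for you. Assuming the first point in the dataset is a new data point, you can run the following: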
predict(machineModel, newdata=df[1,], type='quantile')
This yields the time to event (in hours) for the quantiles 0.1 and 0.9 (the defaults), like so:
807.967 5168.231
This means that, given the covariates of the first data point (listed here), the probability of failure is 10 percent at or just before 807.967 hours following a maintenance operation, and the probability of failure is 90 percent at or just before 5168.231 hours following the maintenance operation:

model                    model3
age                      18
volt_mean_centered       3.322762
rotate_mean_centered     51.8113
pressure_mean_centered   10.10773
vibration_mean_centered  11.4267
age_mean_centered        6.488011

You can also use the parameter p to get the survival time for any quantile between zero and one; for example, adding the parameter p=0.5 will give the median failure time, which, for the first data point, is 2509.814 hours after a maintenance operation.
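For instance, here’s a minimal sketch using the same machineModel and df as before (the p=0.5 call returns the 2509.814 figure just mentioned):

# Median failure time for the first data point
predict(machineModel, newdata=df[1,], type='quantile', p=0.5)

# Or several quantiles at once
predict(machineModel, newdata=df[1,], type='quantile', p=c(0.25, 0.5, 0.75))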
Wrapping Up
I’ve presented the use of predictive maintenance for the IIoT as a motivating example for the adoption of two survival regression models that are available in h2o.ai and Spark MLLib. I showed how to model a machine failure predictive maintenance problem in the survival analysis framework by encoding variables as covariates and transforming the time series data to survival format. I also described the two survival models, the differences between them and how to apply them to the data. Finally, I talked briefly about interpretation of the results and model diagnostics. It’s important to note that I only scratched the surface of this fascinating and very rich topic, and I encourage you to explore more. A starting point for doing so is the literature I mentioned in the article.

Zvi Topol has been working as a data scientist in various industry verticals, including marketing analytics, media and entertainment, and Industrial Internet of Things. He has delivered and led multiple machine learning and analytics projects, including natural language and voice interfaces, cognitive search, video analysis, recommender systems and marketing decision support systems. Topol is currently with MuyVentive LLC, an advanced analytics R&D company, and can be reached at zvi.topol@muyventive.com.

Thanks to the following Microsoft technical expert for reviewing this article: James McCaffrey