---
output:
  md_document:
    variant: markdown_github
always_allow_html: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
The PHM Data Challenge was part of the PHM08 conference. It was a competition open to all potential conference attendees. The goal was to estimate the remaining life of an unspecified component using data-driven techniques. Teams consisted of one or more researchers. Winners from three categories (determined on the basis of score) received a $2500 cash prize and were invited to present the successful methods and results in a special session. Results were due June 2, 2008.
Following the competition, the datasets were posted at the data repository of the [NASA Ames Prognostics Center of Excellence](https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan). Note that this repository hosts a number of interesting datasets that can be used in a prognostic context.
## Original description from the competition
Data sets consist of multiple multivariate time series. Each data set is further divided into training and test subsets. Each time series is from a different engine - i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with different degrees of initial wear and manufacturing variation which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data. The data are contaminated with sensor noise.
The engine is operating normally at the start of each time series, and starts to degrade at some point during the series. In the training set, the degradation grows in magnitude until a predefined threshold is reached beyond which it is not preferable to operate the engine. In the test set, the time series ends some time prior to complete degradation. The objective of the competition is to predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles after the last given cycle that the engine will continue to operate properly.
The data are provided as a zip-compressed text file with 26 columns of numbers, separated by spaces. Each row is a snapshot of data taken during a single operational cycle; each column is a different variable.
Data sets: train.txt, test.txt
* Train trajectories: 218
* Test trajectories: 218
* final_test trajectories: 435
Dataset structure:
* Unit: Engine number [-]
* t: time [cycles]
* op1: operational setting 1
* op2: operational setting 2
* op3: operational setting 3
* s1: Sensor measurement 1
* s2: Sensor measurement 2
* ...
* s21: Sensor measurement 21
## Usage
Researchers are expected to train their algorithms using data in the file named train.txt. They must then evaluate the RUL prediction performance on data provided in the file test.txt. The associated true RUL values are not revealed, just as in the competition. Researchers can use a web application to upload their results and get an aggregate score here: https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan
Note:
* Results should be formatted as a column vector of RULs in a text file (a minimal example follows this list).
* Evaluation is limited to only one trial per day.
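For example, a submission file could be produced as follows (a sketch only; `predicted_RULs` is a hypothetical vector with illustrative values, one RUL per test unit in unit order, and the same pattern is used at the end of this analysis):
```r
predicted_RULs <- c(112, 98, 69)  # illustrative values only
write.table(predicted_RULs, file = "predictions.txt", row.names = FALSE, col.names = FALSE)
```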
## Evaluation
The final score is a weighted sum of RUL errors. The scoring function is asymmetric, penalizing late predictions more heavily than early ones. A perfect algorithm would have a score of zero.
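Concretely, with $d_i = \widehat{RUL}_i - RUL_i$ for unit $i$, the per-unit penalty (the same form used in the `get_metric` function defined later in this analysis) is

$$ s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \ \text{(early prediction)} \\ e^{d_i/10} - 1, & d_i \ge 0 \ \text{(late prediction)} \end{cases} $$

and the penalties are aggregated over all units.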
## Best performing algorithms
Note that researchers can still submit their predictions to the link provided above, to compare their efforts against the winners of the 2008 challenge.
The best performing scores on the test set during the competition are the following:
1. 436.841
2. 512.426
3. 737.769
4. 809.757
5. 908.588
6. 975.586
7. 1,049.57
8. 1,051.88
9. 1,075.16
10. 1,083.91
11. 1,127.95
12. 1,139.83
13. 1,219.61
14. 1,263.02
15. 1,557.61
16. 1,808.75
17. 1,966.38
18. 2,065.47
19. 2,399.88
20. 2,430.42
## Final Submission
Once algorithms were trained to satisfaction, users had to apply them to the final test dataset contained in the file named final_test.txt, and send the vector of RULs for the final test set to the PHM Society for evaluation. Here we will limit ourselves to predicting on the test set, for which we can still receive a score.
# Exploratory Data Analysis
```{r read_data, warning=FALSE, message=FALSE}
library(corrplot)
library(reshape2)
library(gridExtra)
library(ggplot2)
library(plot3D)
library(dplyr)
library(fastICA)
library(broom)
library(fields)
library(caret)
library(caretEnsemble)
# Function to read the data
read_data <- function(filename)
{
dset <- read.table(filename)
colnames(dset)[1:5] <- c("unit_no", "t", "op1", "op2", "op3")
colnames(dset)[6:26] <- sprintf("s%s", 1:21)
dset$unit_no <- as.factor(dset$unit_no)
return(dset)
}
train <- read_data('train.txt')
summary(train)
print(paste('Missing data in the training set:', sum(is.na(train))))
print(paste('Missing data in the test set:', sum(is.na(read_data('test.txt')))))
print(paste('# of units:', nlevels(train$unit_no)))
```
## Target variable
At first, let's have a look at the total operating time of the units in the dataset:
```{r append_health, warning=FALSE, message=FALSE}
train %>%
group_by(unit_no) %>%
mutate(op.time = max(t)) %>%
ungroup() %>%
distinct(unit_no, op.time) %>%
select(op.time) %>%
mutate(op.time = as.numeric(op.time)) %>%
ggplot(aes(y = op.time)) +
geom_boxplot() +
ggtitle('Boxplot of unit lifetimes')
```
Assuming roughly normally distributed lifetimes, the boxplot flags three relatively long-lived units. At some point later on we might need to remove units with exceptionally long or short lifetimes, as they can affect predictions.
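For reference, such units could be flagged with the usual 1.5 × IQR boxplot rule (a sketch only; we do not remove any units here):
```r
lifetimes <- train %>%
  group_by(unit_no) %>%
  summarise(op.time = max(t))
fences <- quantile(lifetimes$op.time, c(0.25, 0.75)) + c(-1.5, 1.5) * IQR(lifetimes$op.time)
lifetimes$unit_no[lifetimes$op.time < fences[1] | lifetimes$op.time > fences[2]]
```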
Moving on, we need to append the target variable to the training set - Remaining Useful Lifetime (RUL [cycles]). According to the description, each engine starts with different degrees of initial wear and manufacturing variation which is unknown, and the degradation grows in magnitude until a predefined threshold is reached, beyond which it is not preferable to operate the engine (i.e. RUL = 0). It will be easier to consider the failure cycle (the cycle at which failure occurs) as point 0, and just use negative values for the cycles prior to failure. Let's call this time-to-failure:
```{r append_health2, warning=FALSE, message=FALSE}
ttf <- train %>%
group_by(unit_no) %>%
arrange(., t, .by_group = T) %>%
mutate(ttf = -(rev(t) - 1)) %>%
select(unit_no, ttf)
train$ttf <- ttf$ttf
```
## Sensor data
Let's have a closer look at the sensor recordings:
```{r sensor_timeseries, warning=FALSE, message=FALSE, fig.width=10, fig.height=14}
# Isolate the sensor columns
sensor.cols <- colnames(train)[grepl('s\\d', colnames(train))]
# Generate a plot list
plotlist <- lapply(sensor.cols,
function(col){
train %>%
select(col, ttf) %>%
ggplot(aes(x = ttf, y = train[, col])) +
geom_point(alpha = 0.01) + ylab(col)
})
n <- length(plotlist)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(plotlist, ncol = nCol))
```
There seem to be specific clusters of values for all the sensors, which could be related to RUL. We will test this hypothesis later on. Also, there are several sensors which seem to be correlated. For example, s11, s15 and s17 present the same patterns. This is normal in thermal or mechanical systems, as it could be the case that these sensors measure, for example, pressure and temperature at the same point of the process (e.g., after the compressor). It is expected that such values are correlated. For now, let's have a quick check on the unique values:
```{r sensor_data_unique_vals, warning=FALSE, message=FALSE, fig.width=10, fig.height=8}
train %>%
select(starts_with('s')) %>%
sapply(., function(x) length(unique(x)))
```
Indeed some sensors only have a small number of unique values. We should check what these values are. My guess is that if we see binary or ordinal values, then the sensor must be reporting some kind of on/off switch or a set of encoded integers to indicate operational state (auto/manual/alarm etc.). If it is so, we should treat these sensors as factors.
```{r sensor_data_unique_vals2, warning=FALSE, message=FALSE, fig.width=10, fig.height=8}
for (col in c('s1', 's5', 's10', 's16', 's18', 's19'))
{
print(col)
print(unique(train[, col]))
}
```
These don't look like the switches I imagined. Judging from the values, these might be environmental and engine properties. For example, the ratio of specific heats for air at different temperatures, or something like that. Sensor 18 could be enthalpy or something of that sort... Bottom line, let's skip the factor assumption (for now).
Alright, let's have a look at monotonic relations (Spearman coeffs.) between the sensors:
```{r sensor_data_4, warning=FALSE, message=FALSE, fig.width=7, fig.height=7}
corr.mtrx <- cor(train[, sensor.cols], method = 'spearman')
corrplot(corr.mtrx, method = 'pie', type = 'lower', diag = F)
```
There is quite some redundancy in the dataset. We should attempt variable selection or dimensionality reduction later on.
Before we go deeper into the analysis of the sensors, let's instead focus our attention on the operating conditions. The description says that they have a substantial effect on engine performance. Maybe if we - somehow - account for these effects, more patterns will start emerging.
## Operating Conditions
Let's start with a simple time-series plot for every setting.
```{r plot_2, warning=FALSE, message=FALSE}
train %>%
select(t, op1, op2, op3) %>%
melt(id = "t") %>%
ggplot(., aes(x = t, y = value)) +
geom_point(alpha = 0.0125) +
facet_wrap(~ variable, scales = "free") +
ggtitle('Operating condition settings')
```
This is interesting. There seem to be distinct operating conditions for all settings. Let's have a look at unique values:
```{r plot_3, warning=FALSE, message=FALSE}
#op1
for (setting in c('op1', 'op2', 'op3'))
{
print(paste('Unique values for', setting, ':', length(unique(train[, setting]))))
}
print(paste('Unique combinations :', dim(unique(train[, c('op1', 'op2', 'op3')]))[1]))
```
The number of unique values shows a different picture. Looking at the graphs above, each operational setting must be clustered around a small number of points, with very small variance around them. We could round them to the most frequent values and reduce the overall number of operating conditions. Let's plot all of them together:
```{r plot_4, warning=FALSE, message=FALSE}
scatter3D(x = train$op1,
y = train$op2,
z = train$op3,
bty = "b2", colkey = F,
pch = 19, cex = 1,
theta = 40, phi = 0,
main = '3D Scatterplot of operating conditions',
xlab = "op.setting 1",
ylab ="op.setting 2",
zlab = "op.setting 3")
```
This is convenient. The operating conditions are clustered around 6 distinct operating profiles. k-Means should easily identify the clusters without even scaling the data:
```{r cluster_ops, warning=FALSE, message=FALSE}
set.seed(1)
op_clusters <- kmeans(train[, c('op1', 'op2', 'op3')], centers = 6, nstart = 48)
op_clusters$centers
```
Here are our cluster centers. Let's append the cluster number to the training dataset, and use that to further explore the data.
```{r cluster_ops2, warning=FALSE, message=FALSE}
# Append cluster no. to the training set
train$op_cluster <- factor(op_clusters$cluster)
```
This is potentially the most useful information we have extracted from the data so far. By reducing the operating conditions to a small, manageable number of clusters, we gain the ability to completely remove their effects from the sensor measurements.
## Effect of operating conditions on sensor behavior
Now, let's see how the sensor values differ under different operating conditions, by recreating the initial plot and highlighting each operating cluster with a different color:
```{r sensor_data_5, warning=FALSE, message=FALSE, fig.height=14, fig.width=10}
# Generate a plot list
plotlist <- lapply(sensor.cols,
function(col){
train %>%
select(col, ttf, op_cluster) %>%
group_by(op_cluster) %>%
ggplot(aes(x = ttf, y = train[, col], color = op_cluster)) +
geom_point(alpha = 0.0075) + ylab(col) +
theme(legend.position = 'none')
})
n <- length(plotlist)
nCol <- floor(sqrt(n))
do.call("grid.arrange", c(plotlist, ncol = nCol))
```
This is really convenient. We can see that for most of the sensors the differences occur due to the operating conditions. Finding the clusters as we did earlier has indeed simplified things a lot. To remove the effect of the operating conditions, we can standardize the data per cluster. This will remove the highly volatile changes caused by transitioning from condition to condition. Let's begin:
```{r sensor_data_51, warning=FALSE, message=FALSE, fig.height=14, fig.width=10}
# Calculate summary statistics of each sensor per cluster
get_summary_stats <- function(dset)
{
sensor.cols <- colnames(dset)[grepl("s\\d", colnames(dset))]
sum.stats <- dset %>%
select(sensor.cols, op_cluster) %>%
group_by(op_cluster) %>%
summarize_all(list(mean = mean, sd = sd))
return (sum.stats)
}
# Function to standardize the sensors
standardize_sensors <- function(dset, summary_stats)
{
sensor.cols <- colnames(dset)[grepl("s\\d", colnames(dset))]
std.sensors <- left_join(dset, summary_stats, by = 'op_cluster')
std.sensors <- (std.sensors[, sensor.cols] - std.sensors[, paste0(sensor.cols,"_mean")]) / std.sensors[, paste0(sensor.cols,"_sd")]
return(std.sensors)
}
train.sum.stats <- get_summary_stats(train)
std.sensors <- standardize_sensors(train, train.sum.stats)
```
## Removing the effect of operating conditions
Note that we should use the *same* summary statistics (the ones calculated on the training set) to standardize the data per operating cluster on the test set, and not calculate them directly on the test set. Assume that a perfect linear predictor (sensor) exists for the training set, with a zero value when RUL = 100%, and a value of 100 when RUL = 0. Its mean value is 50. If we go and standardize it directly on a test set (and not use the summary statistics calculated on the training set above), which contains recorded data from engines with RULs varying from 0 to 50% - half their lifetime has already passed - we will standardize it with a mean value of 25, rendering it essentially useless for prediction purposes.
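Here is a toy numeric illustration of that pitfall (a hypothetical, perfectly linear sensor; not part of the actual pipeline):
```r
full_life <- 0:100   # "perfect" sensor over an entire lifetime (training scale)
test_window <- 0:50  # a test engine observed for only half of its life
# Correct: standardize the test window with the *training* mean and sd
correct <- (test_window - mean(full_life)) / sd(full_life)
# Incorrect: standardize with statistics computed on the truncated test window
wrong <- (test_window - mean(test_window)) / sd(test_window)
range(correct)  # stays on the training scale, showing the unit is early in life
range(wrong)    # re-centered around zero, hiding how early in life the unit is
```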
Let's see how our standardized sensor data looks:
```{r sensor_timeseries_stdized, warning=FALSE, message=FALSE, fig.height=6}
std.sensors %>%
mutate(ttf = train$ttf) %>%
melt(id.vars = 'ttf') %>%
ggplot(aes(x = ttf, y = value)) +
geom_point(alpha = 0.0075) +
geom_smooth(formula = y ~ s(x, bs = "cs"),
method = 'gam') +
facet_wrap(~variable, scales = 'free') +
ggtitle('Standardized sensor measurements')
```
Indeed, some patterns have already started to emerge. Note that the blank graphs correspond to sensors which present zero variation within operating clusters. When the cluster mean value is subtracted from these sensors, the resulting value is constantly zero, i.e. no predictive power for our purposes. Furthermore, a few sensors present zero variability within *some* operating clusters, for which the above code snippet yields NaNs, because we divide by a zero-valued standard deviation.
## A closer look on the sensor values
Let's have a look at how much missing data we get from this:
```{r sensor_timeseries_stdized_2, warning=FALSE, message=FALSE}
missing.data <- sapply(std.sensors, function(x) sum(is.na(x)) / length(x))
missing.data[which(missing.data > 0)]
```
We have zero within-operating-cluster-variability for sensors 1, 5, 18, 19. Sensor 16 presents zero variability for (I would guess) 4 clusters, and sensor 10 for around two clusters. Looking at the above graph, sensors 10 and 16 do not seem to have a pattern even when they do show some variability. I think we can safely drop them altogether.
Let's update the training set and check a violin plot:
```{r violin_plot, warning=FALSE, message=FALSE}
# Append the standardized sensors
train[, sensor.cols] <- std.sensors
sensor.cols.to_drop <- c('s1', 's5', 's10', 's16', 's18', 's19')
# Remove spares
train[, sensor.cols.to_drop] <- NULL
# Make a list with the usable sensors, for later on
usable_sensor.cols <- sensor.cols[!(sensor.cols %in% sensor.cols.to_drop)]
# Violin plot of the remaining sensors
train %>%
select(usable_sensor.cols) %>%
melt() %>%
ggplot(aes(y = value)) +
geom_violin(aes(x = variable))+
ggtitle('Violin plot of most sensor measurements')
```
Normal-looking data, apart from sensors 6 and 17. Sensors 8, 9, 13 and 14 also seem to have outliers. Let's take a closer look:
```{r violin_plot_2, warning=FALSE, message=FALSE}
train %>%
select(ttf, s6, s8, s9, s13, s14, s17) %>%
melt(id.vars = 'ttf') %>%
ggplot(aes(x = ttf, y = value)) +
geom_point(alpha = 0.02) +
facet_wrap(~variable, scales = 'free') +
ggtitle('Closer look at specific sensors')
```
Indeed, the scatterplots explain the irregularities for sensors 6 and 17, and the (mis)identification of outliers. The general trend for sensors 8, 9, 13, 14 seems to be legitimate. Sensors 6 and 17 seem to act as discrete variables. Let's check the number of unique values:
```{r violin_plot_3, warning=FALSE, message=FALSE}
# Count the unique values for sensors 6 and 17
for (sensor in c('s6', 's17'))
{
no_unique_vals <- length(unique(train[, sensor]))
print(paste('Unique values', sensor, ':', no_unique_vals))
}
```
Indeed, they have a very small number of unique values compared to the entire dataset (45k observations). By looking at the recorded values for sensor 6, it might not be a bad idea to drop it altogether. Before we decide on any treatment for these sensors, let's have a look at monotonic relations:
```{r sensor_data_corrplot, fig.width=7, fig.height=7}
# Plot a correlation matrix
corr.mtrx <- cor(train[, usable_sensor.cols], method = 'spearman')
corrplot(corr.mtrx, method = 'number', type = 'lower', diag = F)
```
Alright, there are several sensors that are correlated with each other. We need some form of dimensionality reduction.
## Independent Component Analysis
At this point, a bit of intuition is going to help. We have sensor data from a component. The values recorded from the sensors are an interaction of:
* Operating Conditions
* Degradation
* Sensor noise
If we assume that the interaction between these signals results in their superposition (linear addition), we are confronted by a linear mixture problem. This is something that ICA can, in principle, solve (its goal is to solve Blind Source Separation problems that arise from a linear mixture). Think about this in terms of the cocktail party problem (which is the prototypical problem addressed by ICA). Two sounds $s_1, s_2$ are generated by music and a voice, recorded simultaneously by multiple microphones. Since sound adds linearly, each microphone records a unique linear summation of the two sounds. The linear weights for each microphone reflect the proximity of each source to the respective microphone. The goal then is to recover the original sources (music and voice), solely using the microphone recordings.
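In symbols, a sketch of the assumed linear mixing model is

$$ x(t) = A\,s(t) + n(t) $$

where $x(t)$ are the observed recordings (here, the standardized sensor readings), $s(t)$ the latent sources, $A$ the unknown mixing matrix, and $n(t)$ measurement noise.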
In our case, the microphones correspond to the sensors, and the original sources are the ones mentioned above. Now, we have already removed the effect of operating conditions. This leaves two sources: degradation and sensor noise. If we also assume that the effect of sensor noise is relatively small compared to the effect of degradation (hopefully this is the case), we're only left with one source: degradation. This implies that we have a single basis vector, and therefore only *one* independent component. In this case, ICA boils down to *whitening* the dataset (decorrelating it through PCA and normalizing to unit variance). Let's apply this and see what happens:
```{r ica, message=FALSE, warning=FALSE}
set.seed(1)
ica <- fastICA(train[, usable_sensor.cols], n.comp = 1)
as.data.frame(ica$S) %>% # Grab the source matrix
mutate(ttf = train$ttf) %>%
ggplot(aes(x = ttf, y = V1)) +
geom_point(alpha = 0.05) +
ylim(-6, 3) +
ggtitle("Independent Component Analysis - IC 1")
```
Interestingly enough, this looks like part of the P-F curve. This is a positive indication that our assumptions do make sense after all.
## Modeling approaches
An interesting approach would be to map the sensor recordings into one health index, following the general trend shown above, but dimensionless. Such a health index would have a value of one when the component is brand new, and it would gradually move towards zero near the end. On the test set, this health index would have to be combined with a time-series predictor (possibly an LSTM or GRU network), to give an estimate of how many cycles ahead it will hit the zero point. Let's see how the above plot looks per unit, when smoothed (loess regression) and normalized between zero and one:
```{r hi}
train %>%
select(ttf, unit_no) %>%
mutate(IC = as.numeric(ica$S)) %>%
group_by(unit_no) %>%
do(loess_IC = predict(loess(IC ~ ttf, span = 0.4, data = .),
data = (.))) %>%
tidyr::unnest() %>%
ungroup() %>%
mutate(ttf = train$ttf) %>%
mutate(unit_no = train$unit_no) %>%
group_by(unit_no) %>%
mutate(norm_loess_IC = (loess_IC - min(loess_IC)) / (max(loess_IC) - min(loess_IC))) %>%
ungroup() %>%
ggplot(aes(x = ttf, y = norm_loess_IC, color = unit_no)) +
geom_smooth(method = "loess", se = F) +
ggtitle('Dimensionless Health Index per unit') +
theme(legend.position = "none")
```
Indeed, we could use a model to map the sensor readings to the curves shown above, and then use another model to forecast on them.
A different approach would be to track similarities, based on the independent component (the dimensionless health index cannot be determined directly on the test set). Let's apply a log transform on the IC, to linearize the model:
```{r ica2}
IC.min <- abs(min(as.numeric(ica$S))) # For log transforms
as.data.frame(ica$S + IC.min) %>% # Grab the source matrix
mutate(ttf = train$ttf) %>%
ggplot(aes(x = ttf, y = log1p(V1))) +
geom_point(alpha = 0.1) +
ylim(-6, 3) +
ggtitle("IC 1 - log transform")
```
Now, the points have much less spread, and we've almost linearized the independent component. Let's try to summarize the above plot (log-transformed IC) per unit:
```{r ica3, fig.width = 13, fig.height = 5, message=FALSE, warning=FALSE}
train %>%
select(ttf, unit_no) %>%
mutate(IC = ica$S + IC.min) %>%
group_by(unit_no) %>%
ggplot(., aes(x = unit_no, y = log1p(IC), color = unit_no)) +
geom_boxplot(notch = T, outlier.shape = NA) +
ylim(1.4, 2.25) +
ggtitle("log1p(IC) - Boxplots per unit") +
theme(legend.position = 'none', axis.text.x = element_blank())
```
Not the most successful visualization, but we can still see that there is considerable deviation among different units. Regarding the test set, we could determine the IC, compare its trajectory with all trajectories of the training set, and use a weighted-average approach to estimate when it will reach the end of its life.
To do such a comparison we need an appropriate distance metric. Given that we want to compare time series of different lengths (number of points), the most common method is Dynamic Time Warping. However, given the curves above, the shifting capabilities of DTW may be more of a hindrance than a help. An easier way to go about this is to fit a regression line to each log-transformed IC trajectory in the training and test sets, and compare Euclidean distances between their coefficients. Let's extract them for every trajectory (unit) in the training set, and visualize them:
```{r lr_fits}
lr.coeffs.train <- train %>%
select(t, unit_no) %>%
mutate(IC = log1p(as.numeric(ica$S) + IC.min)) %>%
group_by(unit_no) %>%
do(fit_IC = lm(IC ~ t, data = .)) %>%
mutate(tidys = list(broom::tidy(fit_IC))) %>%
tidyr::unnest(tidys) %>%
select(unit_no, term, estimate) %>%
mutate(intercept = ifelse(term == '(Intercept)', estimate, NA)) %>%
mutate(t = ifelse(term == 't', estimate, NA)) %>%
mutate(intercept = lag(intercept)) %>%
select(-term, -estimate) %>%
na.omit()
op.time <- train %>%
group_by(unit_no) %>%
mutate(op.time = max(t)) %>%
ungroup() %>%
distinct(unit_no, op.time) %>%
select(op.time)
lr.coeffs.train$op_time <- op.time$op.time
ggplot(lr.coeffs.train, aes(x = intercept, y = t, label = unit_no, color = op_time)) +
geom_point() +
geom_text(size = 2.5, check_overlap = T, nudge_y = 1e-4) +
ggtitle('log1p(IC) ~ t: Regression coeffs and total operating time') +
scale_color_gradient(low="blue", high="red")
```
Let's see if the method is valid. We can see (near the top right corner) that units 25 and 151 are relatively close together, but far away from unit 127. Let's plot their trajectories:
```{r lr_fits_2}
train %>%
select(t, unit_no) %>%
mutate(IC = as.numeric(ica$S)) %>%
filter(unit_no == 25 | unit_no == 151 | unit_no == 127) %>%
ggplot(aes(x = t, y = IC, color = unit_no)) +
geom_line() +
ggtitle("IC trajectory proximity between units 25, 151 and 127")
```
Indeed, the method works. Units 25 and 151 present similar trajectories, in contrast to unit 127. This means that a possible approach would be to apply k-Nearest Neighbor regression on the linear model's coefficients directly, as sketched below.
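A minimal, hand-rolled sketch of that idea (a hypothetical helper, not part of the pipeline that follows; for simplicity it uses the raw coefficients, which in practice should be standardized first, as discussed in the modeling section):
```r
# Hypothetical kNN sketch: predict a unit's total operating time as the mean
# op_time of its k closest neighbours in (intercept, slope) space.
knn_op_time <- function(coeffs, query, k = 5) {
  d <- sqrt((coeffs$intercept - query$intercept)^2 + (coeffs$t - query$t)^2)
  mean(coeffs$op_time[order(d)[seq_len(k)]])
}
# Leave-one-out style check for the first training unit
knn_op_time(lr.coeffs.train[-1, ], lr.coeffs.train[1, ], k = 5)
```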
# Predictive modeling
Let's summarize the approach that we will follow. For each unit in the test set:
* Assign the operating conditions to the 6 clusters identified on the training set.
* Drop the unnecessary sensors: 1, 5, 10, 16, 18, 19.
* Standardize the sensor values within each operating cluster.
* Calculate the independent component, as identified by ICA.
* Perform regression with the natural logarithm of the independent component as the dependent variable and time as the independent variable.
* Extract the coefficients of the regression model, and standardize them.
Then we'll try different algorithms and see from there on...
## Preprocessing steps
Now, let's make a few wrappers for the above steps, starting with a function to split the dataset:
```{r splitter, warning=FALSE, message=FALSE}
# Function to split the dataset into a train and a validation subset
split_dataset <- function(dset, ratio)
{
# Get the total number of units present in the dataset
total_units <- max(as.numeric(dset$unit_no))
# Randomly select a number of units
val_units <- sample(1:total_units, size = round(total_units * ratio))
# Get the indices from the dataset that belong to these units
val_indices <- which(dset$unit_no %in% val_units)
# Split into new datasets
val <- dset[val_indices, ]
train <- dset[-val_indices, ]
return(list(train, val))
}
```
Clustering the operating conditions:
```{r wrappers, warning=FALSE, message=FALSE}
# Function to return the operating cluster
get_cluster <- function(dset, centers)
{
# Isolate the operating settings
clusters <- dset %>%
select(op1, op2, op3)
# Calculate Euclidean distance between each operating setting and all cluster centers
clusters <- as.data.frame(t(apply(clusters, 1,
function(y) apply(centers, 1,
function(x) sqrt(sum((y - x) ^ 2)) )))) %>%
mutate(op_cluster = apply(., 1, function(x){names(x)[which.min(x)]})) %>%
mutate(op_cluster = as.numeric(op_cluster)) %>% # Convert from list
select(op_cluster)
# Return
return(clusters)
}
```
The functions to calculate summary statistics on the training set and to standardize the sensors within each cluster were developed earlier; there's no need to define them again.
Next, get the independent component from the ICA object. fastICA estimates the sources as $S = X K W$, where $K$ is the pre-whitening matrix and $W$ the un-mixing matrix; with a single component, $W$ is just a 1 × 1 rotation, so multiplying the data by the whitening matrix $K$ recovers the component up to scale and sign:
```{r wrappers_2, warning=FALSE, message=FALSE}
# Function to apply whitening
get_IC <- function(input_df, ica_obj)
{
# Multiply each row of the dataset with the whitening matrix.
ic <- as.matrix(input_df) %*% ica_obj$K
# Return it
return(ic)
}
```
Apply linear regression and extract the coefficients:
```{r wrappers_3, warning=FALSE, message=FALSE}
# Function to get the regression coefficients
get_coeffs <- function(dset, ica_obj)
{
IC.min <- min(as.numeric(ica_obj$S)) # For the log transform
lr.coeffs <- dset %>%
select(t, unit_no, IC) %>%
filter(IC > IC.min) %>%
mutate(IC = log1p(as.numeric(IC) + abs(IC.min))) %>%
group_by(unit_no) %>%
do(fit_IC = lm(IC ~ t, data = .)) %>%
mutate(tidys = list(broom::tidy(fit_IC))) %>%
tidyr::unnest(tidys) %>%
select(unit_no, term, estimate) %>%
mutate(intercept = ifelse(term == '(Intercept)', estimate, NA)) %>%
mutate(t = ifelse(term == 't', estimate, NA)) %>%
mutate(intercept = lag(intercept)) %>%
select(-term, -estimate) %>%
na.omit() %>%
ungroup()
return(lr.coeffs)
}
```
Note that the coefficients should be standardized, depending on the approach we'll follow, as the absolute values of the intercepts might dominate the time coefficients in distance-based algorithms:
```{r wrappers_4}
# Function to standardize the coefficients of the LR model
std_coeffs <- function(input.coeffs, train.coeffs)
{
# Get summary stats
mean.t <- mean(train.coeffs$t)
sd.t <- sd(train.coeffs$t)
mean.int <- mean(train.coeffs$intercept)
sd.int <- sd(train.coeffs$intercept)
# Normalize them
input.coeffs <- input.coeffs %>%
mutate(t = (t - mean.t) / sd.t) %>%
mutate(intercept = (intercept - mean.int) / sd.int)
return(input.coeffs)
}
```
Also, a function to calculate the error metric is needed:
```{r wrappers_5}
# Function to calculate the error metric
get_metric <- function(model.predictions, true.values)
{
d <- model.predictions - true.values
d <- ifelse(d <= 0, exp(-d / 13) - 1, exp(d / 10) - 1)
return(mean(d))
}
```
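A quick sanity check of the asymmetry: a prediction that is 10 cycles late is penalized more heavily than one that is 10 cycles early:
```r
get_metric(110, 100)  # late by 10 cycles:  exp(10/10) - 1, about 1.72
get_metric(90, 100)   # early by 10 cycles: exp(10/13) - 1, about 1.16
```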
We also need a few more helper functions: one to remove some data from the validation set for us to forecast, and a couple more to abstract away some of the manual work in the process:
```{r wrappers_6}
# Function to cut down the validation data below a certain time threshold
cut_dset <- function(dset, cutoff)
{
partial <- dset %>%
group_by(unit_no) %>%
filter(t < max(t) * cutoff) %>%
ungroup()
return(partial)
}
# Function to return the total operating time for each unit in a dataset
get_total_time <- function(dset)
{
dset.new <- dset %>%
group_by(unit_no) %>%
mutate(op.time = max(t)) %>%
ungroup() %>%
distinct(unit_no, op.time) %>%
select(unit_no, op.time) %>%
mutate(op.time = as.numeric(op.time))
return(dset.new)
}
# Function to determine the mode of a given set of data
get_mode <- function(x) {
# Estimate the probability density function of the data
dx <- density(x, na.rm = T)
# Get the value that maximizes the estimated density
mode <- dx$x[which.max(dx$y)]
return(mode)
}
```
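For reference, a quick illustration of what `get_mode` returns on a bimodal sample (the density-based mode picks the dominant peak rather than the mean; illustrative only, not part of the pipeline):
```r
set.seed(2)
x <- c(rnorm(1000, mean = 0), rnorm(200, mean = 5))
get_mode(x)  # close to 0 (the dominant peak), whereas mean(x) is around 0.8
```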
Next, let's wrap some of the wrappers themselves into two higher-level functions: one to preprocess the original training set, and one to preprocess the synthetic (truncated) validation and test sets, up to the point where the steps are common to both:
```{r wrappers_combo_1}
# Function to preprocess the training set
preprocess_original_dset <- function(dset)
{
# Train a k-means algorithm to estimate the operating clusters
op_clusters <- kmeans(dset[, c('op1', 'op2', 'op3')],
centers = 6,
nstart = 48)
# Append the clusters on the set
dset[, 'op_cluster'] <- op_clusters$cluster
# Drop redundant sensors
dset[, c('s1', 's5', 's10', 's16', 's18', 's19')] <- NULL
# Calculate summary statistics for each sensor per cluster on the training set
dset.sum.stats <- get_summary_stats(dset)
# Standardize the sensors
usable_sensor.cols <- colnames(dset)[grepl('s\\d', colnames(dset))]
dset[, usable_sensor.cols] <- standardize_sensors(dset, dset.sum.stats)
# Apply ICA on the data set
ica <- fastICA(dset[, usable_sensor.cols], n.comp = 1)
# Get the independent component
dset$IC <- as.numeric(get_IC(dset[, usable_sensor.cols], ica))
# Extract the linear regression coefficients
lr.coeffs <- get_coeffs(dset, ica)
# Return list with outputs
out <- list("train_summary_statistics" = dset.sum.stats,
"op_settings_clustering" = op_clusters,
"ICA_object" = ica,
"LR_coeffs" = lr.coeffs)
return(out)
}
# Function to preprocess the validation set (up to the point which is common for the two models)
preprocess_synthetic <- function(dset, params, for_test_set)
{
if (for_test_set == F)
{
# Get total operating time
y <- dset %>%
distinct(unit_no, op.time) %>%
select(op.time)
}
# Get operating cluster
clusters <- get_cluster(dset, params$op_settings_clustering$centers)
dset <- cbind(dset, clusters)
# Drop redundant sensors
dset[, c('s1', 's5', 's10', 's16', 's18', 's19')] <- NULL
# Standardize the sensors
usable_sensor.cols <- colnames(dset)[grepl('s\\d', colnames(dset))]
dset[, usable_sensor.cols] <- standardize_sensors(dset, params$train_summary_statistics)
# Get the independent component
dset$IC <- as.numeric(get_IC(dset[, usable_sensor.cols], params$ICA_object))
# Get the linear coefficients from the ICA object
lr.coeffs <- get_coeffs(dset, params$ICA_object)
# Generate the new dataset
if (for_test_set == F)
{
new_dset <- as.data.frame(std_coeffs(lr.coeffs, params$LR_coeffs)) %>%
mutate(target = y$op.time) %>%
select(-unit_no)
} else {
new_dset <- as.data.frame(std_coeffs(lr.coeffs, params$LR_coeffs)) %>%
select(-unit_no)
}
return(new_dset)
}
```
Alright, we can start building our models now. One key idea is to make the training/validation sets resemble the test set. Note that the test set contains truncated data, i.e. components whose sensor trajectories do not reach the end of their lifetime, in contrast to the training set we've been observing so far. Therefore, if we truncate the training set at random moments during a unit's lifetime, we'll be able to build a dataset more similar to the one our models will be evaluated on. This randomness gives us the ability to perform the truncation multiple times for each unit (with the same total operating time, of course), generating a multitude of datasets! Let's do this:
```{r generate_dset}
# Reproducibility
set.seed(1)
val_ratio <- 0.3 # 30% of the data will go to the validation set
# Clean read
data <- data.frame(read_data('train.txt'))
# Split into train and validation sets
data <- split_dataset(data, val_ratio)
train <- data[[1]]
val <- data[[2]]
# Get the objects needed for preprocessing the new datasets
preprocess_objects <- train %>% preprocess_original_dset
# Truncation limits and number of datasets to generate
no.datasets <- 80
min.lifetime <- 0.2
max.lifetime <- 0.9
# Generate the new datasets
new.train <- lapply(seq_len(no.datasets), function(x){
train %>%
group_by(unit_no) %>%
mutate(op.time = max(t)) %>%
filter(t <= round(max(t) * sample(seq(min.lifetime,
max.lifetime,
by = 0.05),
size = 1))) %>%
ungroup %>%
do(preprocess_synthetic(., preprocess_objects, F))}) %>%
do.call("rbind", .)
```
Now we have our processed datasets. We will generate a set of trained models on the data using caretEnsemble. The models to be generated are the following:
* Linear regression
* Regression tree
* Support vector regressor with radial kernel
* Rule-based model, using Cubist
* Random forest
* Stochastic gradient boosting regressor
Let's create the instances and train them using caret. We'll perform repeated cross-validation (5 folds, 3 repetitions), using the competition's metric as the objective function to minimize:
```{r pipeline, message=FALSE, warning=FALSE}
# Reproducibility
set.seed(1)
# Re-write this to the proper format for Caret
eval_metric <- function(data, lev = NULL, model = NULL) { # Compute the competition metric
y_hat = data$pred
y = data$obs
d <- y_hat - y
d <- ifelse(d <= 0, exp(-d / 13) - 1, exp(d / 10) - 1)
score_val <- mean(d)
c(S = score_val)
}
# Generate the models
model_list <- caretList(
target ~ .,
data = new.train,
metric = "S",
maximize = F,
trControl = trainControl(
method = "repeatedcv", # Don't know how stable each model will be
number = 5,
repeats = 3,
savePredictions = "final",
summaryFunction = eval_metric,
index = createResample(new.train$target, 5)
),
methodList = c("lm", "rpart", "svmRadial", "cubist", "rf"),
tuneList = list(gbm = caretModelSpec(method = "gbm", verbose = F))
)
# collect resamples
results <- resamples(model_list)
# Show results
bwplot(results)
```
We can see that the best-performing model is the random forest. Let's apply it to the prediction (test) set and generate the submission file:
```{r test_set, warning=FALSE, results="hide"}
prediction_set <- data.frame(read_data('test.txt'))
# Cycles already run by each unit in the test set
running.cycles <- get_total_time(prediction_set) %>% select(op.time)
# Predict each unit's total lifetime with the best model (random forest)
total.cycles <- prediction_set %>%
  do(preprocess_synthetic(., preprocess_objects, T)) %>%
  predict(model_list$rf, newdata = .) %>%
  round(.)
# RUL = predicted total cycles - cycles already run (clamped at zero)
data.frame(RUL = total.cycles - running.cycles$op.time) %>%
  mutate(RUL = ifelse(RUL < 0, 0, RUL)) %>% # In case we predict negative values
  write.table(file = "predictions.txt", sep = "", row.names = F, col.names = F)
```
# Conclusions
We got a score of 986.342, which would place us 7th on the original competition's test-set leaderboard.