For me, handling the spikes at column '21' seems to make some difference in model performance (I used the default sklearn RandomForest model for comparison).

My assumption was that some column shows a higher correlation with column '21', so I imputed the spikes at '21' with the corresponding values from that column (then dealt with the '0's in the following step).

When excluding the spikes at '21', the most correlated column was found by:

df[~df.index.isin(df_zscore[(df_zscore > 3).any(axis=1)].index)].iloc[:, 3:].corr()['21'].sort_values(ascending=False)

, which shows that '23' has a correlation of 0.951920 with '21'.
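The exclude-spikes-then-correlate step can be sketched on toy data as follows (the real `df` and `df_zscore` come from the earlier preprocessing; everything here, including the spike values, is a hypothetical stand-in):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for the data: feature columns named '21', '23', etc.,
# with '21' roughly 3x '23' plus noise
rng = np.random.default_rng(0)
base = rng.normal(0.5, 0.1, 200)
df = pd.DataFrame({
    "id": range(200),
    "f1": rng.normal(size=200),
    "f2": rng.normal(size=200),
    "21": base * 3 + rng.normal(0, 0.02, 200),
    "23": base + rng.normal(0, 0.02, 200),
})
# Inject a few artificial spikes into '21'
df.loc[[5, 50, 150], "21"] = 50.0

# z-scores over the feature columns (from column index 3 onward, as in the post)
df_zscore = df.iloc[:, 3:].apply(stats.zscore)

# Drop rows where any feature has z > 3, then rank correlations with '21'
mask = (df_zscore > 3).any(axis=1)
corr = (
    df[~df.index.isin(df_zscore[mask].index)]
    .iloc[:, 3:]
    .corr()["21"]
    .sort_values(ascending=False)
)
print(corr)
```

Without the mask, the spikes dominate the covariance and can hide the true '21'–'23' relationship, which is why the correlation is computed on the spike-free rows.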

So, I tried several ways of handling the spikes at '21', for example:

- df_spk.loc[idx_spk, '21'] = 0.0
- df_spk.loc[idx_spk, '21'] = df_spk.loc[idx_spk, '23']
- df_spk.loc[idx_spk, '21'] = 3*df_spk.loc[idx_spk, '23']
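The three variants above can be run side by side like this (a minimal sketch; `df_spk` and `idx_spk` are reconstructed on toy data here, and the spike-detection threshold is hypothetical):

```python
import pandas as pd

# Toy stand-in: a few rows with obvious spikes in column '21'
df_spk = pd.DataFrame({"21": [1.5, 50.0, 1.4, 60.0], "23": [0.5, 0.45, 0.48, 0.5]})
idx_spk = df_spk.index[df_spk["21"] > 10]  # hypothetical spike threshold

# Case 1: zero out the spikes
case1 = df_spk.copy()
case1.loc[idx_spk, "21"] = 0.0

# Case 2: copy the value of the most correlated column '23'
case2 = df_spk.copy()
case2.loc[idx_spk, "21"] = case2.loc[idx_spk, "23"]

# Case 3: scale '23' by the roughly 3x mean ratio of '21' to '23'
case3 = df_spk.copy()
case3.loc[idx_spk, "21"] = 3 * case3.loc[idx_spk, "23"]
```

Each case produces a separate candidate dataframe, so the downstream model can be scored on each imputation strategy independently.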

Then, the 3rd case (spike at '21' = 3x value at '23') gave me the highest score of 96.25. This might result from the fact that 'II_Marine' samples take a relatively larger portion of the test data, where their mean value ratio of '21' to '23' is roughly 3x (1.597 vs. 0.463, from the statistics in section 4. Generating Test Data).
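The per-class mean ratio behind the 3x factor could be checked with a groupby, roughly like this (the `class` column name and all values below are hypothetical, chosen only to reproduce the quoted means of 1.597 and 0.463):

```python
import pandas as pd

# Hypothetical sample with a class label; values picked so that the
# 'II_Marine' means match the 1.597 / 0.463 figures quoted above
df_test = pd.DataFrame({
    "class": ["II_Marine", "II_Marine", "Other", "Other"],
    "21": [1.6, 1.594, 0.8, 0.9],
    "23": [0.46, 0.466, 0.5, 0.4],
})
means = df_test.groupby("class")[["21", "23"]].mean()
ratio = means["21"] / means["23"]
print(ratio)  # II_Marine ratio is about 3.45, i.e. roughly 3x
```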

Thank you for reading this.