Hi everyone,
I have just submitted my result and I am very excited to get the current top score of 100! I only had a week of learning Python before I started this challenge, so I'm super chuffed to get this score. My background is in geology and geoscience, and I'm familiar with Petrel workflows. I'm also an Excel ninja, which really helped me conceptualise the problem, although learning to use Python instead was an uphill battle. It took me about a week full-time to do this challenge, so hopefully you can complete it faster!
Anyway, here are some tips on the tricky bits. I'm no Python expert, but here is where I got confused and how I solved it. (Please note my terminology may not be 100% correct; I'm still learning things like "passing through" and "keying in"!)
- Z-score cutoff:
This was a difficult one because I wasn't sure how the stats function worked. Eventually, I had to slice the DataFrame and convert it to a NumPy array to get it to work.
Additionally, by default stats.zscore works down columns (axis=0), so I had to transpose the matrix and then transpose it back again (passing axis=1 would be another way to do it).
This was how I did it:
import numpy as np
from scipy import stats
# Arranging samples in columns by transposing the matrix,
# to prepare for the stats.zscore function
df_transpose = np.array(df.iloc[:, 3:].T)
# Finding all spikes using z-score, then transposing back to the
# original orientation
zscore_full = stats.zscore(df_transpose, nan_policy='omit').T
Any value with a z-score above 3 (i.e. more than 3 standard deviations from the mean) was replaced with a NaN, as sketched below.
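Something along these lines works for that replacement step (names like df_despiked are just my placeholders; I've used the absolute z-score here so big negative spikes get caught too):
# Replace any value whose |z-score| exceeds 3 with NaN
# (assumes zscore_full lines up with the numeric columns of df)
values = df.iloc[:, 3:].to_numpy(dtype=float)
values[np.abs(zscore_full) > 3] = np.nan
df_despiked = df.copy()
df_despiked.iloc[:, 3:] = values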
- Simple Imputer:
I used the median value here, and I carried out the operation on the despiked data. Something like the snippet below does the trick.
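For reference, here's roughly what that looks like with scikit-learn's SimpleImputer (variable names are mine; it fills each feature column with that column's median):
from sklearn.impute import SimpleImputer
# Fill the NaNs left by despiking with the median of each feature column
# (assumes the first three columns are identifiers, and no column is entirely NaN)
imputer = SimpleImputer(strategy='median')
df_imputed = df_despiked.copy()
df_imputed.iloc[:, 3:] = imputer.fit_transform(df_despiked.iloc[:, 3:])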
- Scaling of data:
Oh wow, this was very difficult. Again, MinMaxScaler scales data along columns rather than rows, and I got SO many errors trying to work around this.
So, like the z-score cutoff, I transposed the data to run MinMaxScaler, then transposed it back to its original shape and put it into a pandas DataFrame:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Transpose so each sample sits in a column, because MinMaxScaler
# scales column-wise
source_data = df_despiked_new.copy()
data = source_data.iloc[:, 3:].T
scaler = MinMaxScaler()
# Scale, transpose back to the original orientation, and rebuild the DataFrame
a = pd.DataFrame(scaler.fit_transform(data)).T
df_scaled = pd.concat([source_data.iloc[:, :3], a], axis=1)
df_scaled.columns = source_data.columns
- Generating Test Data:
This took a while. My general workflow was (there's a rough sketch in code after this list):
a) Creating an empty NumPy array
b) Looping to query each of the families, similar to the III_Deltaic example
c) Looping within each family, getting the mean and standard deviation of each variable, then populating the rows with np.random.normal draws using that mean and SD, and appending the new columns
d) Appending each synthetic family to the bottom of the array
e) Converting to a DataFrame
f) Scaling the data, similar to above
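To make that concrete, here's a rough sketch of the loop. I've used pandas groupby and a list of DataFrames instead of appending to an empty NumPy array, but the idea is the same; 'FAMILY' and n_per_family are placeholders for whatever your label column and sample count actually are.
import numpy as np
import pandas as pd

n_per_family = 100                      # placeholder: synthetic samples per family
feature_cols = df_scaled.columns[3:]    # assumes first three columns are labels/IDs
synthetic_blocks = []

for family, group in df_scaled.groupby('FAMILY'):   # 'FAMILY' is a placeholder name
    means = group[feature_cols].mean()
    stds = group[feature_cols].std()
    # Draw every synthetic feature from a normal distribution defined by
    # this family's mean and standard deviation
    samples = np.random.normal(means.values, stds.values,
                               size=(n_per_family, len(feature_cols)))
    block = pd.DataFrame(samples, columns=feature_cols)
    block['FAMILY'] = family
    synthetic_blocks.append(block)

df_synthetic = pd.concat(synthetic_blocks, ignore_index=True)
# The synthetic data can then be scaled in the same way as above.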
- Building the machine learning model was fairly straightforward if you follow the link to Matt Hall's example on the Agile Scientific blog. There's even an example notebook in that link, which I found really helpful.
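If you want to see the bare bones before reading that example, the general shape of a supervised step in scikit-learn looks like the snippet below. The choice of RandomForestClassifier, and which table is used for training versus prediction, are my own assumptions here rather than what the linked example does.
from sklearn.ensemble import RandomForestClassifier

# Assumption, not the linked example's exact recipe: the labelled, scaled
# data trains the model and the synthetic set is what gets predicted;
# swap them around if your setup is the other way.
X_train = df_scaled[feature_cols]
y_train = df_scaled['FAMILY']        # 'FAMILY' is still a placeholder name
X_test = df_synthetic[feature_cols]

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)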
- I adjusted the predictions in the end; the erroneous data was fairly obvious!
Hope this helps!