Challenge tips from someone who achieved the top score of 100!

Hi everyone,

I have just submitted my result and I am very excited to have the current top score of 100! I only had a week of learning Python before I started this challenge, so I’m super chuffed with this score. My background is in geology and geoscience, and I’m familiar with Petrel workflows. I’m also an Excel ninja, which really helped me conceptualise the problem, although learning to use Python instead was an uphill battle. It took me about a week full-time to do this challenge, so hopefully you can complete it faster!

Anyway, here are some tips on the tricky bits. I’m no Python expert, but here is where I got confused and how I solved it. (Please note my terminology may not be 100% correct, I’m still learning things like “passing through” and “keying in”!)

  1. Zscore cutoff:
    This was a difficult one because I wasn’t sure how the stats function worked. Eventually, I had to slice the DataFrame and convert it to a NumPy array to get it to work.
    Additionally, the stats.zscore function works on columns by default, I believe? So you have to transpose the matrix, and then transpose it back again.

This was how I did it:

import numpy as np
from scipy import stats

# Arranging samples in columns by transposing the matrix, to prepare
# for the stats.zscore function
df_transpose = np.array(df.iloc[:, 3:].T)

# Finding all spikes using the z-score, then transposing back to the
# original shape
zscore_full = stats.zscore(df_transpose, nan_policy='omit').T

Any value with a z-score above 3 (i.e. more than 3 standard deviations above the mean) was removed and replaced with a NaN.
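
A minimal sketch of that replacement step, continuing from the code above; df_despiked is an illustrative name (not necessarily what I called it), and the measurement columns are assumed to be floats:

df_despiked = df.copy()
values = df_despiked.iloc[:, 3:].to_numpy(dtype=float)
values[zscore_full > 3] = np.nan   # spikes more than 3 standard deviations above the mean become NaN
df_despiked.iloc[:, 3:] = values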

  2. Simple Imputer
    I used the median value here, and I carried out the operation on the despiked data (a short sketch follows below).
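
A short sketch of that imputation step, assuming the despiked data is in a DataFrame called df_despiked with the first three columns as identifiers (as in the scaling code further down); df_imputed is an illustrative name:

from sklearn.impute import SimpleImputer

# Replace each NaN with the median of its column (variable)
imputer = SimpleImputer(strategy='median')
df_imputed = df_despiked.copy()
df_imputed.iloc[:, 3:] = imputer.fit_transform(df_despiked.iloc[:, 3:])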

  3. Scaling of data
    Oh wow, this was very difficult. Again, the MinMaxScaler function scales data along columns rather than rows. I got SO many errors trying to do this.
    So, like the zscore cutoff, I transposed the array to run MinMaxScaler, transposed it back to its normal shape afterwards, and put it back into a pandas DataFrame:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

source_data = df_despiked_new.copy()
# MinMaxScaler works down columns, so transpose first (samples become columns)
data = source_data.iloc[:, 3:].T

scaler = MinMaxScaler()
# Scale, then transpose back so samples are rows again
scaled = pd.DataFrame(scaler.fit_transform(data)).T

# Re-attach the first three (identifier) columns and restore the column names
df_scaled = pd.concat([source_data.iloc[:, :3], scaled], axis=1)
df_scaled.columns = source_data.columns

  4. Generating Test Data
    This took a while. My general workflow was (a rough sketch follows this list):
    a) Creating an empty numpy array
    b) Looping to query each of the families, similar to the III_Deltaic example
    c) Looping within each family, getting the mean and SD of each variable, then populating each row with the np.random.normal function using that mean and SD, and appending new columns
    d) Appending each synthetic family to the bottom of the array
    e) Converting to a DataFrame
    f) Scaling of data, similar to above.
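
A rough sketch of steps (a) to (e), under some assumptions: the class column is called 'Family' and 100 synthetic samples are drawn per family. It collects per-family DataFrames in a list rather than appending to an empty numpy array, but the idea is the same; step (f) then scales df_synthetic just like the real data above.

import numpy as np
import pandas as pd

synthetic_parts = []
for family in df_scaled['Family'].unique():
    family_data = df_scaled[df_scaled['Family'] == family].iloc[:, 3:]
    means = family_data.mean()
    stds = family_data.std()

    # Draw 100 synthetic samples for this family, one np.random.normal call per variable
    samples = np.column_stack(
        [np.random.normal(means[col], stds[col], size=100) for col in family_data.columns]
    )

    family_df = pd.DataFrame(samples, columns=family_data.columns)
    family_df.insert(0, 'Family', family)
    synthetic_parts.append(family_df)

df_synthetic = pd.concat(synthetic_parts, ignore_index=True)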

  5. Building the machine learning model was fairly straightforward if you follow the link to Matt Hall’s example on the Agile Scientific blog. There’s even an example notebook at that link, which I found really helpful. A minimal sketch of the fit is shown below.
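
A minimal sketch of fitting the RandomForestClassifier that the Notebook suggests, assuming df_synthetic holds the (scaled) synthetic training data from the previous sketch and df_scaled holds the scaled real samples; the 'Family' column name is an assumption carried over from above:

from sklearn.ensemble import RandomForestClassifier

X_train = df_synthetic.drop(columns='Family')
y_train = df_synthetic['Family']

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict a family for every real (scaled) sample
predictions = clf.predict(df_scaled.iloc[:, 3:])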

  6. I adjusted the predictions at the end; the erroneous data was fairly obvious!

Hope this helps!

@elizabeth.wong thank you very much for this thread. It helped me a lot with building my model!

Following your tips I was able to build a model that consistently got a score of 95% and above. As you mentioned, misclassified points were not difficult to spot, and I also agree with the note in the Notebook that our geological brain is much more robust than anything in Python. I thought, however, that attempting to build a model that gets a 100% score without needing to be hand-corrected would be a good exercise.

And I managed to achieve it after all! But first, some technical notes on what Liz said above.

Transposition
I believe transposition is a bit “computationally heavy”, which probably does not matter in a small dataset like ours, but I still wanted to avoid it. A lot of methods have an axis parameter that you can use, so for example stats.zscore(df, axis=1) would let you avoid transposing before and after.
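
For example (keeping the column slicing from Liz’s post, which is an assumption about the table layout):

from scipy import stats

# axis=1 computes the z-score along each row (each sample), so no transpose is needed
zscore_full = stats.zscore(df.iloc[:, 3:], axis=1, nan_policy='omit')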

Scaling
I found scaling crucial for achieving good predictions. And I agree with you, Liz, that MinMaxScaler() is not the best in our case, as it looks at columns rather than rows. Again, you can transpose, but I ended up just implementing the MinMaxScaler logic in two lines of code: first subtracting the row minima from the whole dataset, then dividing each row by its max. In effect you end up with each row (each sample) in the range 0 to 1 (inclusive), which is what we wanted to achieve. The rescaled chromatography looked right on the spot!
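
Something like this, using pandas to broadcast along rows (df_imputed and df_scaled are illustrative names, not necessarily the ones used here):

data = df_imputed.iloc[:, 3:]
shifted = data.sub(data.min(axis=1), axis=0)               # subtract each row's minimum
df_scaled = shifted.div(shifted.max(axis=1), axis=0)       # divide each row by its (shifted) max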

Classifier choice
The Notebook suggests using sklearn.ensemble.RandomForestClassifier, but I found it not very reliable: if I ran my whole notebook many times, it would occasionally show some pretty poor results. My first thought was to increase the number of estimators (RandomForestClassifier(n_estimators=500)) or the size of the synthetic dataset, but that was a dead end. The solution for me was to use a different supervised classification model. I don’t want to say which one, so as not to spoil the fun, but scikit-learn has several available, so you can pick and choose what works best for you. Check for example this comparison or this article if you are looking for inspiration.
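
One quick way to shop around without giving the answer away is cross-validation over a few candidates; the models listed below are just examples from scikit-learn, not necessarily the one used here, and X_train / y_train are assumed to be the synthetic training set:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

candidates = {
    'random forest': RandomForestClassifier(n_estimators=500),
    'k-nearest neighbours': KNeighborsClassifier(),
    'support vector machine': SVC(),
}

# 5-fold cross-validation accuracy for each candidate
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")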

Building a 100% model!
Finally, my end goal :wink: Everything mentioned above by Liz and myself allowed me to build a model that predicted sample families with at least 95% accuracy every time I ran it. But my goal was to build a model that gives 100% accuracy every time.

Note: that may not be such a good idea in a real-life scenario, as you risk overfitting, but I thought it would be a good exercise just for practice.

As Liz noticed, those few misclassified samples were easy to spot when plotted on the map, as they were usually placed in a cluster of points with a different class. Therefore, after the prediction based on chromatography, I put the results through a clustering algorithm and looked for outliers in every cluster formed. There is good documentation of clustering algorithms on the scikit-learn page if you want to read up. Basically, I chose one that was supposed to be good with distances between points, defined a threshold distance instead of a number of clusters (as it wasn’t realistic to assume any number of clusters beforehand), and unlike in the previous steps I used df[['x','y']] to fit it. After that, all you have to do is look through every cluster and, if you find more than one class in it, assign the mode to all of its points.
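
A rough sketch of that cleanup step. The post does not name the clustering algorithm, so DBSCAN is used here purely as an example of a distance-threshold clusterer; the eps value and the 'family_pred' prediction column are assumptions:

from sklearn.cluster import DBSCAN

# Cluster on map coordinates only; eps is the threshold distance (a guess here)
labels = DBSCAN(eps=1500, min_samples=3).fit_predict(df[['x', 'y']])
df['cluster'] = labels

# In every real cluster (DBSCAN labels noise as -1), overwrite the predictions
# with that cluster's most common class
for cluster_id, group in df[df['cluster'] != -1].groupby('cluster'):
    df.loc[group.index, 'family_pred'] = group['family_pred'].mode()[0]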

And that’s it! I ran this model about 20 times (just by hand, not any rigorous testing, sorry) and each time I got a 100% final result, with preliminary results (before clustering) varying between 95 and 98%.

Hope that helps! It was a really good couple of hours of fun and learning. Thank you Xeek!