Impressions from a Kaggle Noob. What I learned from my very first… | by Jonas Schröder | Jan, 2021
As briefly mentioned, I work as a Junior Data Analyst for an FMCG giant. In my job I create business reports, dig into CRM and digital media data, and talk to colleagues about ways to make their work life more productive. I am neither an ML engineer nor a computer scientist. I studied Philosophy (B.A.) and Management (M.Sc.).
To create more knowledge from data I constantly work on improving my data science skills, but I am only at the beginning of my journey. I can code in R and Python, am familiar with reporting tools like Power BI, and query data with SQL, including in BigQuery. I know a few ML algorithms and work on regression problems; however, my experience is limited (hence the Junior).
In short: While I create business value from data in my job, I don’t expect to have the tools necessary to perform well in a Kaggle competition yet. But that’s not the point for me anyway. The goal is to gain more experience by applying machine learning algorithms to datasets outside of work. Jump out of the plane and assemble the parachute while falling down, right?
There are some datasets no one can escape when starting to read about data science. What the MNIST dataset is to image classification, the Titanic dataset is to Kaggle starters. The task of the Kaggle Titanic competition is to predict which passengers survived the sinking of the Titanic.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc). — Kaggle Website
You start with around 900 instances and 10 features, build your model, predict the survival state for roughly another 400 instances, and upload your predictions to Kaggle to have their accuracy scored.
So everybody starts off with identical data but the models in the competition perform very differently. As you know, it’s not just the model that is important. It’s probably equally or even more important to find new features (feature engineering) and to decide which features to keep and which to ignore in the training phase.
Still, since this is a rather small dataset and the competition has been around for a while, there are a number of common strategies. However, since my goal was to learn as much as possible by trying out, I decided to not read anything before my first upload.
In the initial feature engineering phase I only created the feature “Deck”, which is based on the character in the cabin number (e.g. “C” in cabin “C103”). A little research revealed that cabins of the upper decks (A, B, and C) were the most luxurious and expensive ones, nicely decorated and as far away as possible from the rumbling machines at the bottom of the ship. This must play a role, right? Rich survive? Turns out, having a cabin at all was already a sign of wealth.
After that I started taking a look at the data in the EDA phase to see whether I could find relationships between the features visually. I don’t want to focus on it here, and I plan on writing a Kaggle notebook on my process and findings soon (link will be added here, but feel free to follow my Kaggle profile). I created a summary graph which should provide you with a clear picture: primarily upper-class passengers and women survived the tragedy.
Next, I preprocessed the data to prepare it for ML algorithms. I did pretty standard stuff like replacing missing values (e.g. Age and Fare) with the variable’s median value, standardizing the numeric values using sklearn’s StandardScaler, and turning categories into “numeric values” using sklearn’s OneHotEncoder, all put into a ColumnTransformer. If you’re interested in the code, check out my GitHub repo for it.
The train_df and test_df (which I need to predict) were ready for some models! Now comes the fun part, training the ML models! I was already quite familiar with basic Random Forest from my work experience, so naturally that is what I started with. By choice, I did not pay much attention to cross validation or accuracy scores since I just wanted to get a baseline. I uploaded my predictions and got … a score of 0.46… For a binary classification problem I probably would have gotten a better result using something like a coin flip (np.random.binomial(1, 0.5)) than with my “sophisticated” ML model.
This is the point where I started to research how other people dealt with the dataset and how their feature engineering and models work. My goal was learning more about the general process and I had already spent a few hours completely on my own, so I was fine with getting some inspiration.
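The preprocessing steps described here can be sketched roughly as follows. The toy values and the column lists are assumptions for illustration; the actual columns and settings in the repo may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a few Titanic-style columns (values are made up).
train_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, np.nan, 51.86],
    "Sex": ["male", "female", "female", "male"],
    "Deck": ["U", "C", "U", "E"],
})

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex", "Deck"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode; ignore categories unseen during fit.
categorical_pipe = Pipeline([
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Fit on the training data only; the test set would later go through
# preprocessor.transform(test_df) so both share the same medians and scales.
X_train = preprocessor.fit_transform(train_df)
print(X_train.shape)
```

Fitting the transformer once and reusing it for the test set keeps both datasets on the same scale and in the same column order.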
I particularly liked Ken Jee’s video Beginner Kaggle Data Science Project Walk-Through (Titanic). While sticking with my preprocessed dataset and features, I created a few more models like Logistic Regression, Naive Bayes, and a VotingClassifier based on these models after watching his video and reading his notebook on Kaggle.
Kaggle notebooks in general as well as the discussion section of the Titanic competition turned out to be great places to learn from others. While you could basically “steal” the complete code from others to score quite well in the competition, this was not a strategy that fit my goal. Nonetheless, it helped me to come up with a few models that reached on average a cross validation score of 0.82 (which screamed overfitting, but I decided to ignore it for now).
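A rough sketch of a voting ensemble evaluated with cross validation. The synthetic data from make_classification stands in for the preprocessed Titanic features, and the choice of GaussianNB, the inclusion of a Random Forest in the ensemble, and all hyperparameters are assumptions, not the settings used for the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the preprocessed features (not the real Titanic data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Hard voting takes the majority class; soft voting averages the
# predicted class probabilities of the individual models.
results = {}
for voting in ("hard", "soft"):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    results[voting] = scores.mean()
    print(f"{voting} voting: mean CV accuracy = {scores.mean():.3f}")
```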
I was ready for uploading some more predictions.
In general, while the Random Forest models looked great in the training phase, I found that they did not work well on the actual to-be-predicted-on dataset (or so I thought). The voting classifiers, however, scored noticeably better, with 0.62200 for both the hard and the soft voting version.
I was quite happy but also sleepy at that point. So I decided to call it a day and go to bed. The next day I went back to the feature engineering phase and incorporated a few techniques I had read about in the discussion section. My score got worse, so I reverted to where I had left off the night before.
A few models were now reaching the 0.62200 score and I decided to compare the prediction results. Turns out, the models all predicted the class identically: nobody survived, all 0. Of course, predicting that nobody survived is right quite often when the baseline survival rate for the Titanic is roughly 32%.
To put it in terms I learned later when reading about classification performance measures: Treating “did not survive” as the positive class, my recall was perfect (I identified all the non-survivors), but my precision was no better than the share of non-survivors in the data, because every actual survivor counted as a false positive. For the survivor class, recall was zero: I did not find a single one. This is the extreme end of what is known as the precision-recall tradeoff. Only the random forest classifier predicted a few survivors, but not the right ones, hence the worse score.
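To make these measures concrete, here is a small sketch that counts precision and recall by hand for an all-zero prediction. The ten labels are made up for illustration, and precision_recall is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the given positive label.

    A value is None when it is undefined (division by zero).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Illustrative labels: 6 non-survivors (0) and 4 survivors (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0] * len(y_true)  # the degenerate "nobody survived" prediction

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec_dead, rec_dead = precision_recall(y_true, y_pred, positive=0)
prec_surv, rec_surv = precision_recall(y_true, y_pred, positive=1)

print("accuracy:", accuracy)
print("non-survivors: precision", prec_dead, "recall", rec_dead)
print("survivors:     precision", prec_surv, "recall", rec_surv)
```

With an all-zero prediction, accuracy simply equals the share of non-survivors, recall for non-survivors is perfect, recall for survivors is zero, and precision for survivors is undefined because no survivor was ever predicted.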
I eventually found the problem in the preprocessing section of my code, fixed it, and reran all the models. This time I checked the prediction tables before uploading them to Kaggle, and it worked. Since the models were now trained on the right training data, my scores improved significantly.
As of the time of writing this article, the Random Forest model tuned with RandomizedSearchCV has been my best model, with a score of 0.77511, which I am quite happy with. It still lags behind Jee’s model with 0.79425, but it brought me up to the top 55% and gave me a better idea of how Kaggle works and what’s important when approaching such a classification problem.
Let’s summarize what I learned over the past two days.
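A minimal sketch of tuning a Random Forest with RandomizedSearchCV. The synthetic data, the parameter ranges, and the small n_iter are assumptions chosen to keep the example quick; they are not the settings behind the score above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed features (not the real Titanic data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions (or lists) to sample hyperparameters from; ranges are illustrative.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,           # number of sampled parameter combinations
    cv=3,               # 3-fold cross validation per combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Unlike an exhaustive grid search, the randomized search only tries n_iter sampled combinations, which keeps the runtime manageable on a laptop.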
Well, the most obvious take-away and Learning 1 has been to check the outputs before uploading the predictions to Kaggle and to think about whether they make sense independently of the model. It’s easy to get carried away when you’re in the flow state and excited (especially when lacking sleep), but this is low-hanging fruit!
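Such a check can be as simple as the following sketch. check_predictions is a hypothetical helper, and the expected survival rate and tolerance are assumed values, not fixed rules.

```python
from collections import Counter


def check_predictions(preds, expected_rate=0.32, tolerance=0.20):
    """Return the predicted survival rate and a list of warnings."""
    if not preds:
        raise ValueError("no predictions to check")
    counts = Counter(preds)
    positive_rate = counts.get(1, 0) / len(preds)
    problems = []
    # A model that predicts a single class for everyone is almost surely broken.
    if len(counts) < 2:
        problems.append("only one class predicted")
    # The predicted survival rate should be in the same ballpark as the known one.
    if abs(positive_rate - expected_rate) > tolerance:
        problems.append(
            f"predicted survival rate {positive_rate:.2f} "
            f"is far from expected {expected_rate:.2f}"
        )
    return positive_rate, problems


# The failure mode from this post: every passenger predicted as 0.
rate_bad, problems_bad = check_predictions([0] * 418)
# A more plausible prediction vector with a mix of both classes.
rate_ok, problems_ok = check_predictions([0] * 270 + [1] * 148)

print(rate_bad, problems_bad)
print(round(rate_ok, 3), problems_ok)
```

Running a check like this right before writing the submission file would have flagged the all-zero predictions immediately.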
The reason why I only found out about my mistake quite late was my obsession with the Kaggle score. The score of 0.62200 was higher than for my other models, so it must be better, right? Well, the model predicted everyone to be dead, which turns out to be right most of the time when the survival rate is roughly 32%. Learning 2: Don’t be too obsessed with the accuracy score or it will blind you.
Score values in the training phase turned out to be misleading too. Even though I used cross validation, having fewer than 1,000 instances and 30+ features resulted in overfitting. The Kaggle score turned out to be 3 to 8 percentage points lower than the cross validation score on the training data. Learning 3: Get your feet wet from time to time and upload a few predictions to see which models work well on the real data instead of overfitting on training data.
Kaggle turned out to be better than I expected for learning. The community is really active and people share their approaches and code in the Discussion and Notebook sections as well as outside of Kaggle in YouTube videos, on GitHub, or through Medium posts. I heard that this is true for real competitions too and that incorporating others’ approaches is even necessary to score high, which reminded me a bit of the Market Efficiency debate in stock valuations, but that’s another story. Learning 4: Working more within the Kaggle universe (i.e. reading and writing notebooks and participating in the Discussion section) is a great way to improve one’s skills.
Talking about Kaggle notebooks: I tend to work in Spyder and split my scripts into sections that I run one after another. The new feature of Spyder 4 where I can observe all plots in the new plot section was really a great extension. I knew before experimenting with Kaggle that Data Scientists love Jupyter Notebooks primarily because of the combination of coding and documenting in a publish-ready format. I cannot live without Spyder’s Variable Explorer, and df.head(5) just does not do the trick for me. However, Spyder alone is really not that great for EDA. Documenting the findings in #comments and '''docstrings''' is not as visually appealing as a great notebook. This leads me to my final Learning 5: Data Scientists love Jupyter/Kaggle/Google Colab notebooks for a reason and I should start using them more for that kind of work (EDA, documenting the approach, summarizing the findings, sharing with others, etc.).