The Kaggle Book

You're reading from The Kaggle Book: Data analysis and machine learning for competitive data science

Product type: Paperback
Published: Apr 2022
Publisher: Packt
ISBN-13: 9781801817479
Length: 534 pages
Edition: 1st

Authors (2): Luca Massaron, Konrad Banachewicz
Table of Contents (20)

Preface
1. Part I: Introduction to Competitions
2. Introducing Kaggle and Other Data Science Competitions
3. Organizing Data with Datasets
4. Working and Learning with Kaggle Notebooks
5. Leveraging Discussion Forums
6. Part II: Sharpening Your Skills for Competitions
7. Competition Tasks and Metrics
8. Designing Good Validation
9. Modeling for Tabular Competitions
10. Hyperparameter Optimization
11. Ensembling with Blending and Stacking Solutions
12. Modeling for Computer Vision
13. Modeling for NLP
14. Simulation and Optimization Competitions
15. Part III: Leveraging Competitions for Your Career
16. Creating Your Portfolio of Projects and Ideas
17. Finding New Professional Opportunities
18. Other Books You May Enjoy
19. Index

Building your portfolio with Kaggle

Kaggle’s claim to be the “home of data science” has to be put into perspective. As we have discussed at length, Kaggle is open to everyone willing to compete to build the best models for predictive tasks according to a given evaluation metric.

There are no restrictions based on where you are in the world, your education, or your proficiency in predictive modeling. Sometimes there are also competitions that are not predictive in nature, for instance, reinforcement learning competitions, algorithmic challenges, and analytical contests that accommodate a larger audience than just data scientists. However, making the best predictions from data according to a metric is the core purpose of Kaggle competitions.

Real-world data science, instead, has many facets. First, your priority is to solve problems, and the metric for scoring your model is simply a more or less exact measurement of how well it solves the problem. You may not only be dealing with a single metric but have to take into account multiple ones. In addition, problems are open to being solved in different ways and much depends on how you formulate them.

As for data, you seldom get specifications about the data you have to use, and you can modify any existing dataset to fit your needs. Sometimes you can even create your own dataset from scratch if you need to. There are no indications about how to put data together or process it. When solving a problem, you also have to consider:

  • Technical debt
  • Maintainability of the solution over time
  • Time and computational costs for running the solution
  • Explainability of the workings of the model
  • Impact on the operating income (if the real-world project is a business one, increasing profits and/or reducing costs is the leitmotif)
  • Communication of the results at different levels of complexity and abstraction

Often, all these aspects count more than raw performance against evaluation metrics.

Technical debt is a term more common in software development than in data science, though it is just as relevant there. Technical debt is whatever you do in order to deliver your project faster but will have to redo later at a higher cost. The classic paper Hidden Technical Debt in Machine Learning Systems by David Sculley and other Google researchers should enlighten you on the relevance of the problem for data science: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

Not all of this expertise can be gained from Kaggle competitions; most of it comes from direct practice and experience-building in an enterprise environment. Yet the knowledge and skills attached to Kaggle competitions are not completely separate from the considerations we discussed above, and they are a good complement to many enterprise-level data science processes. By competing on Kaggle, you are exposed to different types of data and problems; you need to execute extensive feature engineering and fast iterations of model hypotheses; and you have to devise ways of putting together state-of-the-art solutions using common open-source packages. This is a set of valuable skills, and you should promote it. The best way to do so is to build a portfolio: a collection of your solutions and work based on Kaggle competitions and other Kaggle resources.

In order to build a portfolio from Kaggle competitions, you can take multiple approaches. The easiest is to leverage the facilities offered by Kaggle, especially the Datasets, Notebooks, and Discussions.

Gilberto Titericz

https://www.kaggle.com/titericz

Before we proceed, we present a discussion of career opportunities derived from Kaggle in our interview with Gilberto Titericz. He is a Grandmaster in Competitions and Discussions, the former number 1 in the rankings, and the current number 1 in total gold medals from Kaggle competitions. He is also a Senior Data Scientist at NVIDIA and was recently featured in a Wired article on the topic (https://www.wired.com/story/solve-these-tough-data-problems-and-watch-job-offers-roll-in/).

What’s your favourite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

Since I started to compete on Kaggle in 2011, the types of competitions that I prefer are the ones with structured tabular data. The techniques that I use most on Kaggle are target encoding of categorical features (there are infinite ways to do it wrong) and stacking ensembles.
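Target encoding is easy to get wrong precisely because of target leakage: if a row is encoded using statistics that include its own target, validation scores become optimistic. A minimal sketch of a leakage-aware, out-of-fold implementation with smoothing follows (the function name, column names, and smoothing value are illustrative assumptions, not code from the book):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, cat_col, target_col, n_splits=5, smoothing=10):
    """Out-of-fold target encoding, smoothed toward the global mean.

    Each row is encoded with statistics computed on the *other* folds,
    which avoids the target leakage of naive implementations.
    """
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in kf.split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
        # Shrink category means toward the global mean for rare categories
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        encoded.iloc[valid_idx] = df.iloc[valid_idx][cat_col].map(smooth).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(global_mean)

# Toy usage on a hypothetical categorical feature
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c", "a", "c", "b", "a"],
    "y":    [1,   0,   1,   1,   0,   0,   1,   0,   1,   0],
})
df["city_te"] = oof_target_encode(df, "city", "y")
```

The smoothing term is one common safeguard among many; in practice, Kagglers also add noise or nest the encoding inside their cross-validation scheme.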

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

Kaggle is a great playground for machine learning. The main difference from real-life projects is that in Kaggle we already have the problem very well defined and formatted, the dataset created, the target variable built, and the metric chosen. So, I always start a Kaggle competition playing with EDA. Understanding the problem and knowing the dataset is one of the keys to an advantage over other players. After that, I spend some time defining a proper validation strategy. This is very important to validate your model correctly and in line with the way that Kaggle scores the private test set. While a stratified k-fold works for most binary classification problems, we must evaluate whether a grouped k-fold or a time-based split should be used in order to validate correctly, avoid overfitting, and mimic, as much as possible, the private test set. After that, it is important to spend some time running experiments on feature engineering and hyperparameter optimization. Also, I always end a competition with at least one Gradient Boosted Tree model and one deep learning-based approach. A blend of such diverse approaches is very important to increase diversity in the predictions and boost the competition metric.
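The three validation schemes Gilberto mentions are all available as splitters in scikit-learn. The sketch below uses a tiny illustrative dataset and simply verifies the property each splitter guarantees (the data and group assignment are assumptions for the example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)      # 20 samples, assumed ordered in time
y = np.array([0, 1] * 10)             # balanced binary target
groups = np.repeat(np.arange(5), 4)   # e.g. one group per customer

# Stratified k-fold: every validation fold preserves the overall class ratio
strat_ok = all(
    y[va].mean() == y.mean()
    for _, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)
)

# Grouped k-fold: no group (customer) appears in both train and validation
group_ok = all(
    set(groups[tr]).isdisjoint(groups[va])
    for tr, va in GroupKFold(n_splits=5).split(X, y, groups)
)

# Time-based split: the model is always validated on data later than its training data
time_ok = all(tr.max() < va.min() for tr, va in TimeSeriesSplit(n_splits=4).split(X))

print(strat_ok, group_ok, time_ok)
```

Choosing among these comes down to the question in the interview answer: which scheme best mimics how the private test set (or, in the real world, future data) relates to the training data.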

Has Kaggle helped you in your career? If so, how?

Yes, Kaggle was the main reason I changed the direction of my career. Up to 2016 I worked as an electronic engineer, and due to everything that I learned competing since 2011 I was able to switch to the data science area. Kaggle helped me to understand the concepts of machine learning and apply everything I learned from the theory. In addition, Kaggle is an excellent place for experimentation, where you can download a dataset and play with it to extract the maximum information possible from the data. That, combined with the competition environment, makes it perfect to learn coding and machine learning, and at the same time, it gets addictive and makes you want to learn more and more. Winning a few competitions puts your name at the top of the leaderboard and this is priceless for anyone’s career. Headhunters all around the world look at Kaggle to find good matches for their positions and the knowledge and experience gained from competitions can boost any career.

How have you built up your portfolio thanks to Kaggle?

Once I joined Kaggle, I spent some years learning all the techniques, algorithms, and tricks to extract more information from data and boost the metrics as much as possible. High accuracy is the main goal of most of the competitions, but to do that relying on luck alone is almost impossible; knowledge and experience play a big role when the goal is to win or at least finish in the gold medal zone. The number of medals I have in Kaggle competitions is my portfolio; up to now (11/2021) it’s 58 gold and 47 silver, which summarizes well the ML experience I got from Kaggle. Taking into account that each competition runs for at least 1 month, this is more than 105 consecutive months of experience doing competitive ML.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

Novices often overlook a proper validation strategy. That doesn’t just happen in Kaggle; I’ve seen data scientists all around the world building models and neglecting one of the most important things in experimentation theory. There is no general rule for setting a proper validation strategy, but the data scientist must take into account how the model is going to be used in the future, and make the validation as close as possible to that.

What mistakes have you made in competitions in the past?

Several mistakes; it is impossible to list them all. I have probably made all the possible combinations of mistakes. The good thing about mistakes is that you can learn from them. Once you make a mistake and you detect it, it is very likely that you won’t make it again. The main mistake people make in Kaggle is to trust in the leaderboard score and not in their local validation score. Overfitting to the leaderboard is a constant in Kaggle and this is the main difference from the real world. In a real project, we must build a strong validation strategy that we can trust, because in the real world the models will be tested on real data and you have only one chance to hit the mark, not multiple submissions per day.

Are there any particular tools or libraries that you would recommend using for data analysis and machine learning?

Some years ago I would have recommended R, but taking into account how fast Python is growing in the ML space and how generic and easy it is to use in production, I recommend that anyone starting out in ML learn it. In terms of libraries for tabular data, I recommend pandas for manipulation, and if you want speed then go with cuDF (the RAPIDS.ai GPU version of pandas). For EDA, I recommend pandas DataFrames with the Seaborn or Matplotlib libraries, and for machine learning Scikit-learn, SciPy, cuML (GPU), XGBoost, LightGBM, CatBoost, and PyTorch. Keep in mind that building a simple XGBoost model using the raw features is fast and can give you a good benchmark to compare with further models.

What’s the most important thing someone should keep in mind or do when they’re entering a competition?

Entering a Kaggle competition and submitting a public Notebook is easy, but finishing a competition in the gold zone can be extremely challenging. So the most important thing, at least for me, is to keep in mind that independent of the final ranking, we should use Kaggle to have fun and learn as much as possible from the discussion forums, from the public Notebooks, and even from the post-deadline winners’ posts describing their ideas and what worked.

Also keep in mind that what makes a competition winner is not just replicating what everyone else is doing, but thinking out of the box and coming up with novel ideas, strategies, architectures, and approaches.

Do you use other competition platforms? How do they compare to Kaggle?

I have won a couple of competitions on other competition platforms, but the main difference compared to Kaggle is the number of users. Kaggle has 171k active users as of November 2021, which makes the forums, Notebooks, and dataset interactions much richer in terms of content. Also, Kaggle offers something unique: Notebooks where you can write and run code for free using Google servers, which can be priceless if you don’t have access to good hardware.

Leveraging Notebooks and discussions

Besides rankings themselves, Notebooks are the way to get you noticed on Kaggle because they simultaneously demonstrate how you solve problems, how you present ideas, and how you code them. Conceived as a way to easily and openly share solutions and ideas among participants, Notebooks are the most important tool (after rankings) for demonstrating abilities that are appreciated by employers.

In fact, one of the most important changes in the world of data science in recent years has been its transition from a game of outstanding talents (unicorn data scientists) to a team game, where data scientists have to collaborate with each other and with other departments to ensure the success of a project. Consequently, in their hiring processes, companies often care more about you being able to communicate ideas and results, as well as coding in a clean and effective way.

In the previous section, we discussed how real-world projects require a wider range of skills, ranging from dealing with technical debt to designing cost-effective solutions. You can still demonstrate these skills on Kaggle, even if they are not the ones that will make you win a competition. Notebooks are the best tools for doing this.

Refer to Chapter 3, Working and Learning with Kaggle Notebooks, for an introduction to Kaggle Notebooks.

You will find different types of Notebooks on Kaggle. As a good approximation, we can group them into four categories:

  • Solutions and ideas for ranking in a competition
  • Exploratory data analysis (EDA) on the data
  • Tutorials explaining machine learning models or data science principles
  • Fresh implementations of models derived from papers or other original solutions

Each of these can provide you with an edge by means of an interesting set of skills. If solutions and ideas for competitions are the classic way to demonstrate that you know how to tackle a complex problem in data science, the other three can show the world that you can:

  • Manipulate, represent, and extract visual and non-visual insights from data (EDA), which is a skill deemed very important in every setting, from scientific research to business
  • Educate on data science, opening the door to roles in education, mentorship, and developer advocacy
  • Translate research into practice, a key skill at a time when innovations in data science (especially in deep learning) appear daily and need to be translated into working solutions quickly

Even if you don’t rank highly in Kaggle competitions or have astonishing solutions to present, these other three kinds of Notebooks (EDA, tutorials, and paper implementations) can provide you opportunities in the real world if you can promote them in the best way. To do so, you need to understand how to code readable and interesting Notebooks, which is something that you learn from practice and experience. Since it is an art, our suggestion is to learn from others, especially from the Notebooks Grandmasters who place high in the Notebooks user ranking (https://www.kaggle.com/rankings?group=notebooks&page=1&pageSize=20).

We recommend you look at what kind of Notebooks they have developed, how they have arranged their work using figures, how they have structured their code, and then, finally, based on your skills and interests, try to imitate one of their Notebooks. We also suggest that you do not bet your chances for success only on code and charts, but also on the narrative that you present. No matter whether you are showing off a solution, teaching, or implementing a neural architecture in TensorFlow, how you explain the Notebook’s cells with words is very important in terms of leaving a lasting positive impression.

Aside from browsing the Notebooks of high rankers, there is also a way to be notified about less mainstream – yet still finely crafted – Notebooks that have recently appeared on Kaggle. The astrophysicist and passionate Kaggle user Heads or Tails, Martin Henze (https://www.kaggle.com/headsortails), publishes on the discussion forums a weekly Notebooks of the Week: Hidden Gems post, a collection of the most interesting Notebooks around. At the moment, there are already over 100 volumes and the author continues to search Kaggle for anything that could prove interesting. If you would like to be updated about cool Notebooks, just follow Martin Henze’s profile on Kaggle or check if he has published something new under his discussions from time to time.

If you love digging through Notebooks looking for ideas and learning from them, we never tire of stressing that you should not brainlessly copy other people’s work. There are many Notebooks on Kaggle, and often someone copies one, makes some small changes, and re-presents the Notebook to other Kagglers as if it were their own original idea. It is also customary to cherry-pick a function, or part of the code from a Notebook, and insert it into your own. In both these cases, please remember always to quote the source and the author. If you cannot retrace something to the original author, even referring to the last Notebook where you found the code you used is enough. While the main purpose of a showcase is to display your own efforts and skills, it is very important to recognize that some parts of your code or some ideas are taken from elsewhere. Aside from being a sign of respect toward your fellow Kagglers, a source attribution highlights that you are knowledgeable enough to recognize other people’s efforts and inventions, and that you know how to employ them in your own work.

In a minor way, discussions on Kaggle’s forums can help you get noticed for specific roles in data science and software development. Initially, discussions on Kaggle were just for communicating with organizers or for asking pressing questions about the competition itself. At the end of competitions, participants seldom felt compelled to present or discuss their solutions. However, since discussions obtained their own user rankings and mastery grades, you have been able to find much more information on forums.

Refer to Chapter 4, Leveraging Discussion Forums, for an introduction to discussions on Kaggle.

In our experience, discussions on Kaggle can be split into four categories:

  • Competition solutions that explain in detail (sometimes with the help of an associated Notebook) how a team managed to reach a certain position on the private leaderboard
  • Help with and an explanation of requirements during a competition
  • Thanks, compliments, and chit-chat
  • Posts that help and tutor other competitors, explaining things to them

We have observed that excelling in the last type of post and being widely noticed for it can help you achieve the role of developer advocate, especially if you also have other active channels where you interact with your fellow data scientists (for instance, a Twitch or YouTube channel, a Twitter account, or a Medium blog).

With the growth of developer advocate roles in both large companies and start-ups, there is an important demand for experts skilled at helping other data scientists and developers in their projects. If you want to learn more about this role, the following article on draft.dev is quite explanatory and exhaustive: https://draft.dev/learn/what-is-a-developer-advocate.

Leveraging Datasets

Kaggle competitions are often criticized for presenting data that is already cleaned, well arranged, and far from representative of data found in the real world. Our point of view is slightly different; we find the data that Kaggle presents in competitions can also be quite messy or noisy. Sometimes the data presented will not actually suffice in terms of quality and quantity for getting a top score, and you will need to look around for additional data on the internet.

What Kaggle does miss with regard to data in a data science project is the process of collecting and gathering data into organized repositories and files, a process that, in real-world settings, is impossible to standardize because it differs from company to company and problem to problem. Data handling in the real world should mostly be learned in the field.

The introduction of datasets into Kaggle was aimed at mitigating the idea that Kaggle was just focused on modeling problems. Kaggle Datasets are very helpful in this sense because they allow you to create and upload your own data and document the features and their values; they also require you to manage your data over time by planning the frequency with which you are going to update or completely replace it.

Refer to Chapter 2, Organizing Data with Datasets, for an introduction to Kaggle Datasets.

More interestingly, with Kaggle Datasets you are also given the opportunity to attach different analyses and models built using Kaggle Notebooks, based on your uploaded data or on a competition. These models could be work you came up with during a competition, or something you devised because you studied the uploaded data attentively and found a set of interesting problems you could solve with it.

In addition, Kaggle Datasets offer you a template to check for the completeness of the meta-information accompanying your data. A description, tags, a license, sources, and the frequency of updates: these are only a few of the required pieces of information (used to calculate a usability score) that will help anyone using your data to understand how to use it. You may even point out (in the description or in discussions) tasks for the dataset that relate to pending work you would like to do with it. This is a good way to communicate your full understanding of the potential value of the data you have uploaded.

Previously, Tasks were part of the Kaggle Dataset functionality, but they have recently been removed: https://www.kaggle.com/product-feedback/292674. Nevertheless, you can use the data description and discussions to point out what you expect your data could be used for.

All these characteristics make Kaggle Datasets a very good way to show off your experience with problems on Kaggle and, in general, your ability with data and machine learning algorithms, because they allow you to:

  • Publish and maintain a dataset
  • Demonstrate that you have understood the value of the data with a tasks roadmap
  • Show coded and fully working solutions (since Kaggle Notebooks can immediately work on the same data, without any preparation), ranging from data preparation to exploratory data analysis to predictive modeling

We strongly recommend using Kaggle Datasets for showing off the work you have done during Kaggle competitions or on any other project, because they separate your work from others’ and integrate data and Notebooks. In short, Kaggle Datasets can demonstrate to anyone a working solution that you have implemented. There is a downside, though: you are mostly tied to a Notebook environment (even when you use scripting), which is not perfectly transparent in terms of the package and version requirements necessary for someone to know to run the code in other environments.

In fact, Kaggle Notebooks depend on a Docker environment (https://www.docker.com/) defined by a configuration file, a Dockerfile, that determines which package versions are installed. When browsing a Notebook, it is not immediately evident which versions of packages are being used until you inspect this configuration file. For this purpose, as well as for replicating the settings, the Dockerfile can be found in the Kaggle repository on GitHub (https://github.com/Kaggle/docker-python/blob/main/Dockerfile.tmpl), though it changes over time and you may need to keep track of the one used in your work.
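One lightweight habit that mitigates this opacity (our suggestion, not a Kaggle feature) is to log the exact package versions at the top of a Notebook, so that anyone reading the saved output can reconstruct the environment elsewhere; the package list below is illustrative:

```python
# Record the package versions a Notebook actually ran with, so the results
# can be reproduced outside Kaggle's Docker image without digging through
# its Dockerfile. Extend the list with whatever packages the Notebook uses.
from importlib.metadata import version

pinned = {pkg: version(pkg) for pkg in ["numpy", "pandas", "scikit-learn"]}
for pkg, ver in pinned.items():
    print(f"{pkg}=={ver}")
```

The printed `package==version` lines can be pasted straight into a requirements.txt for a local or production environment.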

Finally, in addition to this aspect, don’t forget that getting even a glimpse of a Dataset and its related Notebooks requires access to the Kaggle community.

Gabriel Preda

https://www.kaggle.com/gpreda

We had an inspiring career-oriented talk with Gabriel Preda, a Kaggle Grandmaster in Datasets, Notebooks, and Discussions, and Principal Data Scientist at Endava. Gabriel has a PhD in Computational Electromagnetics and had a long career in software development before devoting himself completely to data science. When he discovered Kaggle, he felt at home on the platform and invested a lot of time and effort into it, which paid dividends for him professionally.

Has Kaggle helped you in your career? How?

Kaggle helped me to accelerate my learning curve in data science. Before Kaggle, I was looking all around for sources of information or problems to solve, but it was not very methodical or effective. On Kaggle, I found a community of people interested in the same things as me. I was able to see the work of top experts in the field, learn from their published Notebooks with analyses or models, get insights from them, ask them questions, and even compete against them. I was mostly in data analysis at the time I joined Kaggle, but very quickly I started to compete; that means learning how to build, validate, and iteratively improve models. After around two years on Kaggle, I switched my career; I went from managing software projects to a full-time data science job. Kaggle also gave me some visibility, and during interviews with candidates at my present company they mentioned that they wanted to join because they saw that I worked there.

Have you ever used something you have done on Kaggle as part of your portfolio to show potential employers?

I use my Kaggle portfolio as the main source of information for potential employers; my LinkedIn profile points to my Kaggle profile. Also, in recent years, employers have become more aware about Kaggle, and some of them ask specifically about your Kaggle profile. There are also potential employers that make very clear that they do not consider Kaggle relevant. I disagree with this view; personally, before interviewing candidates, I normally check their GitHub and Kaggle profiles. I find them extremely relevant. A good Kaggle profile will demonstrate not only technical skills and experience with certain languages, tools, techniques, or problem-solving skills, but also how well someone is able to communicate through discussions and Notebooks. This is a very important quality for a data scientist.

You reached Grandmaster in Notebooks (Kernels) first, then in Discussions, and finally in Datasets. Can you tell us about your journey?

I became the seventh Kernels Grandmaster and I got as high as the third rank. For maybe two years I think I was in the top 10 in the Kernels hierarchy as well. I started writing Kernels primarily to improve my knowledge of the R language while analyzing datasets I found more interesting. I also experimented with all kinds of techniques, including polygon clips, building dual meshes of Voronoi polygons, and 2D Delaunay tessellation. I gradually started to focus on exploratory data analysis, followed by building models for datasets and then for competitions. Also, once I started to compete more, I started to write Kernels for competing in Python. About the same time, I began to notice that some of my Kernels attracted attention from Kagglers, primarily upvotes and forks but also favorable comments. Some of my Kernels written for exploration of data in active competitions reached a very wide audience and brought me many gold medals; therefore, I reached the Master and then Grandmaster tier. Currently, I do not publish many Kernels related to competitions; mostly I create starting Kernels related to datasets that I publish.

Next, I also obtained the Discussions Grandmaster level. I never anticipated that I would reach this tier in discussions. First, I started commenting on other people’s Kernels. Then, gradually, as I got more involved in competitions, most of my comments were in the discussion sections of active competitions, either asking questions about topics of interest in these competitions or starting new topics, for example, suggesting solutions for one problem in a competition or collections of resources to address various open issues related to the competition. I want to mention a special set of comments that I added. As a Kaggle Kernels Grandmaster (one of the first), I frequently upvoted new Kagglers’ Notebooks when I discovered very good content.

In such cases, I try to find a few moments to also praise the achievement of the author, especially if the content is of good quality. For beginners especially, not only expressing your appreciation by upvoting their work but also adding some positive feedback about their contribution might give them a boost of confidence, so that they will invest more in learning and contribute even more on Kaggle. I like to do this, and I hope it helps. I once compiled a list of recommendations about how to comment on Kaggle. This is the list: be short (but not too short); be specific; provide information, not opinions; praise other people’s work when you have the opportunity; keep calm and try to be helpful; do not tag people in your comments unless it makes sense (for example, if it is a discussion and you need to direct your comment to someone who addressed you in that thread).

The last Grandmaster tier I reached is in Datasets. This is also the tier where I reached the highest ranking, second. My progress through the ranks was slow. I started with something I liked. Getting a high profile in Datasets requires investment in curating, cleaning, and documenting the data. If it is not something that you really like, you most probably will not keep going. I pursued things that were important to me but also to a wider community: to my country, my continent, or the whole world. I published datasets about elections in my country, and about various social, demographic, and economic topics in Europe. I focused on subjects of actuality, that were both relevant and of high importance for the community. For example, during the pandemic, I published datasets on COVID-19 cases, about vaccinations, tests, and virus variants both from my country and worldwide. I captured data that went beyond simple numerical, tabular values. Text data, especially originating from direct contributions from people, provided important insights for many people. One of my most upvoted datasets consists of collections of Reddit posts and comments or Twitter posts (tweets) on subjects as diverse as vaccine myths, cricket, pandemics, sports events, and political personalities. I invested significantly in automating data collection, data cleaning, and data processing scripts. This saved me precious time (especially for datasets updated frequently – some of them were collected continuously, with scripts triggered every hour) but also made it possible to have better control of the process. Every time I publish a new dataset, I also write one or more starting Kernels. These Kernels are not intended to reach a large audience. I create them as helper Kernels for potential users of my Datasets, so that they find it easier to use the data. 
In many cases, I prefer to keep the original data (as I collected it, or downloaded from an alternative source) and include a Kernel for data cleaning, transformation, and preliminary analysis as well as the result of this process, the data in a more accessible format. In this way, I try to capture in the dataset more than the data itself; I also provide information about techniques for data transformation.
