The Kaggle Book: Data analysis and machine learning for competitive data science

Product type: Paperback
Published: Apr 2022
Publisher: Packt
ISBN-13: 9781801817479
Length: 534 pages
Edition: 1st Edition
Authors: Luca Massaron, Konrad Banachewicz

Introducing Kaggle

At this point, we need to delve more deeply into how Kaggle in particular works. In the following paragraphs, we will discuss the various aspects of the Kaggle platform and its competitions, and you’ll get a flavor of what it means to be in a competition on Kaggle. Afterward, we’ll come back to discuss many of these topics in much more detail, with more suggestions and strategies in the remaining chapters of the book.

Stages of a competition

A competition on Kaggle is arranged into different steps. By having a look at each of them, you can get a better understanding of how a data science competition works and what to expect from it.

When a competition is launched, it is usually announced on social media, for instance on the Kaggle Twitter profile (https://twitter.com/kaggle), and a new tab appears in the Active Competitions section of the Competitions page (https://www.kaggle.com/competitions). If you click on a particular competition’s tab, you’ll be taken to its page. At a glance, you can check whether the competition has prizes (and whether it awards points and medals, a secondary consequence of participating in a competition), how many teams are currently involved, and how much time is left for you to work on a solution:

Figure 1.2: A competition’s page on Kaggle

There, you can explore the Overview menu first, which provides information about:

  • The topic of the competition
  • Its evaluation metric (that your models will be evaluated against)
  • The timeline of the competition
  • The prizes
  • The legal or competition requirements

The timeline is often overlooked, but it should be one of the first things you check. It tells you not only when the competition starts and ends, but also the rule acceptance deadline, which usually falls between seven days and two weeks before the competition closes. The rule acceptance deadline marks the last day you can join the competition (by accepting its rules). There is also the team merger deadline: you can arrange to combine your team with another competitor’s team at any point before that deadline, but not afterward.

The Rules menu is also quite often overlooked (with people just jumping to Data), but it is important to check it because it tells you about the requirements of the competition. The key information you can get from the rules includes:

  • Your eligibility for a prize
  • Whether you can use external data to improve your score
  • How many submissions (tests of your solution) a day you get
  • How many final solutions you can choose

Once you have accepted the rules, you can download any data from the Data menu or directly start working on Kaggle Notebooks (online, cloud-based notebooks) from the Code menu, reusing code that others have made available or creating your own code from scratch.

If you decide to download the data, also consider that there is a Kaggle API that can help you run downloads and submissions in an almost automated way. It is an important tool if you are running your models on your local computer or on your own cloud instance. You can find more details about the API at https://www.kaggle.com/docs/api and you can get the code from GitHub at https://github.com/Kaggle/kaggle-api.
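As a minimal sketch of what that automation can look like, the following Python snippet assembles command lines for the `kaggle` command-line tool that comes with the API package. The competition slug `titanic`, the file names, and the helper function names are illustrative assumptions, not part of the original text. Actually running the commands also assumes that the tool is installed and that an API token has been configured as described in the documentation linked above.

```python
import shlex
import subprocess


def download_command(competition, target_dir="data"):
    """Build the CLI call that downloads a competition's data files."""
    return ["kaggle", "competitions", "download", "-c", competition, "-p", target_dir]


def submit_command(competition, submission_file, message):
    """Build the CLI call that uploads a submission file with a short note."""
    return ["kaggle", "competitions", "submit", "-c", competition,
            "-f", submission_file, "-m", message]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# Hypothetical example: 'titanic' and 'submission.csv' are placeholders.
run(download_command("titanic"))
run(submit_command("titanic", "submission.csv", "first baseline"))
```

The `dry_run` default only prints the commands, so the sketch can be inspected safely before anything is downloaded or submitted.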

If you check the Kaggle GitHub repo closely, you can also find all the Docker images they use for their online notebooks, Kaggle Notebooks:

Figure 1.3: A Kaggle Notebook ready to be coded

At this point, as you develop your solution, we warmly suggest that you do not work in isolation, but get in touch with other competitors through the Discussion forum, where you can ask and answer questions specific to the competition. Often you will also find useful hints about specific problems with the data or even ideas to help improve your own solution. Many successful Kagglers have reported finding ideas on the forums that have helped them perform better and, more importantly, learn more about modeling in data science.

Once your solution is ready, you can submit it to the Kaggle evaluation engine, in adherence to the specifications of the competition. Some competitions will accept a CSV file as a solution, others will require you to code and produce results in a Kaggle Notebook. You can keep submitting solutions throughout the competition.

Soon after each submission, the leaderboard will give you a score and a position among the competitors (the wait time varies depending on the computations necessary for the score evaluation). That position is only roughly indicative, because it reflects the performance of your model on just one part of the test set. This part is called the public test set, since your performance on it is made public during the competition.

Before the competition closes, each competitor can choose a number (usually two) of their solutions for the final evaluation.

Figure 1.4: A diagram demonstrating how data turns into scores for the public and private leaderboard

Only when the competition closes are the contestants’ chosen solutions scored on another part of the test set, called the private test set. This new leaderboard, the private leaderboard, constitutes the final, effective scores for the competition, but its rankings are not yet official and definitive. In fact, the Kaggle team will take some time to check that everything is correct and that all contestants have respected the rules of the competition.
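The public/private mechanism can be illustrated with a small simulation. This is only a sketch under invented assumptions: the 30% public share, the accuracy metric, and the synthetic labels are all made up for illustration, and real competitions choose their own split and metric.

```python
import random


def accuracy(predictions, labels, indices):
    """Share of correct predictions over the given test-row indices."""
    hits = sum(predictions[i] == labels[i] for i in indices)
    return hits / len(indices)


rng = random.Random(0)
n_rows = 1000

# Hidden ground truth for the whole test set (known only to the organizers).
labels = [rng.randint(0, 1) for _ in range(n_rows)]

# A hypothetical submission that is right about 80% of the time.
predictions = [y if rng.random() < 0.8 else 1 - y for y in labels]

# The organizers shuffle the test rows and reserve a fraction for the public score.
rows = list(range(n_rows))
rng.shuffle(rows)
public_share = 0.3  # assumed share; it differs between competitions
cut = int(n_rows * public_share)
public_rows, private_rows = rows[:cut], rows[cut:]

public_score = accuracy(predictions, labels, public_rows)    # shown during the competition
private_score = accuracy(predictions, labels, private_rows)  # revealed after it closes
print(f"public: {public_score:.3f}  private: {private_score:.3f}")
```

Because the two scores are computed on disjoint rows, they generally differ a little even for the same submission, which is why the public position is only roughly indicative.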

After a while (and sometimes after some changes in the rankings due to disqualifications), the private leaderboard will become official and definitive, the winners will be declared, and many participants will unveil their strategies, their solutions, and their code on the competition discussion forum. At this point, it is up to you to check the other solutions and try to improve your own. We strongly recommend that you do so, since this is another important source of learning in Kaggle.

Types of competitions and examples

Kaggle competitions are grouped into categories, and each category has different implications for how to compete and what to expect. The type of data, difficulty of the problem, awarded prizes, and competition dynamics vary considerably across categories, so it is important to understand beforehand what each implies.

Here are the official categories that you can use to filter the different competitions:

  • Featured
  • Masters
  • Annuals
  • Research
  • Recruitment
  • Getting Started
  • Playground
  • Analytics
  • Community

Featured competitions are the most common type, involving a business-related problem from a sponsor company and a prize for the top performers. The winners grant a non-exclusive license of their work to the sponsor company; they have to prepare a detailed report of their solution and sometimes even participate in meetings with the sponsor company.

There are examples of Featured competitions every time you visit Kaggle. At the moment, many of them are problems relating to the application of deep learning methods to unstructured data like text, images, videos, or sound. In the past, tabular data competitions were commonly seen, that is, competitions based on problems relating to structured data that can be found in a database. First by using random forests, then gradient boosting methods with clever feature engineering, tabular data solutions derived from Kaggle could really improve an existing solution. Nowadays, these competitions are run much less often, because a crowdsourced solution won’t often be much better than what a good team of data scientists or even AutoML software can do. Given the spread of better software and good practices, the increase in result quality obtainable from competitions is indeed marginal. In the unstructured data world, however, a good deep learning solution could still make a big difference. For instance, pre-trained networks such as BERT brought about double-digit increases in previous standards for many well-known NLP task benchmarks.

Masters competitions are less common now. They are private, invite-only competitions, created only for experts (generally competitors ranked as Masters or Grandmasters, based on Kaggle medal rankings).

Annuals are competitions that always appear during a certain period of the year. Among the Annuals, we have the Santa Claus competitions (usually based on an algorithmic optimization problem) and the March Machine Learning Mania competition, run every year since 2014 during the US College Basketball Tournaments.

Research competitions imply a research or science purpose instead of a business one, sometimes for serving the public good. That’s why these competitions do not always offer prizes. In addition, these competitions sometimes require the winning participants to release their solution as open-source.

Google has released a few Research competitions in the past, such as Google Landmark Recognition 2020 (https://www.kaggle.com/c/landmark-recognition-2020), where the goal was to label famous (and not-so-famous) landmarks in images.

Sponsors that want to test the ability of potential job candidates hold Recruitment competitions. These competitions are limited to teams of one and offer the best-placed competitors an interview with the sponsor as a prize. Competitors have to upload their CV at the end of the competition if they want to be considered.

Examples of Recruitment competitions have been:

Getting Started competitions do not offer any prizes, but they provide friendly and easy problems for beginners to get accustomed to Kaggle principles and dynamics. They are usually semi-permanent competitions whose leaderboards are refreshed from time to time. If you are looking for a tutorial in machine learning, these competitions are the right place to start, because you will find a highly collaborative environment and many Kaggle Notebooks showing you how to process the data and create different types of machine learning models.

Famous ongoing Getting Started competitions are:

Playground competitions are a little more difficult than the Getting Started ones, but they are also meant for competitors to learn and test their abilities without the pressure of a fully-fledged Featured competition (though the heat of the competition can sometimes turn quite high in Playground competitions too). The usual prizes for such competitions are just swag (an acronym for “Stuff We All Get,” such as a cup, a t-shirt, or socks branded by Kaggle; see https://www.kaggle.com/general/68961) or a bit of money.

One famous Playground competition is the original Dogs vs. Cats competition (https://www.kaggle.com/c/dogs-vs-cats), where the task is to create an algorithm to distinguish dogs from cats.

Mention should also be made of Analytics competitions, where the evaluation is qualitative and participants are required to provide ideas, drafts of solutions, PowerPoint slides, charts, and so on; and of Community (previously known as InClass) competitions, which are held by academic institutions as well as Kagglers. You can read about the launch of the Community competitions at https://www.kaggle.com/product-feedback/294337 and you can get tips about running one of your own at https://www.kaggle.com/c/about/host and at https://www.kaggle.com/community-competitions-setup-guide.

Parul Pandey

https://www.kaggle.com/parulpandey

We spoke to Parul Pandey, Kaggle Notebooks Grandmaster, Datasets Master, and data scientist at H2O.ai, about her experience with Analytics competitions and more.

What’s your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

I really enjoy the Data Analytics competitions, which require you to analyze the data and provide a comprehensive analysis report at the end. These include the Data Science for Good competitions (DS4G), sports analytics competitions (NFL etc.), and the general survey challenges. Unlike the traditional competitions, these competitions don’t have a leaderboard to track your performance compared to others; nor do you get any medals or points.

On the other hand, these competitions demand end-to-end solutions touching on multi-faceted aspects of data science like data cleaning, data mining, visualizations, and conveying insights. Such problems provide a way to mimic real-life scenarios and provide your insights and viewpoints. There may not be a single best answer to solve the problem, but it gives you a chance to deliberate and weigh up potential solutions, and imbibe them into your solution.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

My first step is always to analyze the data as part of EDA (exploratory data analysis). It is something that I also follow as part of my work routine. Typically, I explore the data to look for potential red flags like inconsistencies in data, missing values, outliers, etc., which might pose problems later. The next step is to create a good and reliable cross-validation strategy. Then I read the discussion forums and look at some of the Notebooks shared by people. It generally acts as a good starting point, and then I can incorporate things in this workflow from my past experiences. It is also essential to track the model performance.

For an Analytics competition, however, I like to break down the problem into multiple steps. For instance, the first part could be related to understanding the problem, which may require a few days. After that, I like to explore the data, followed by creating a basic baseline solution. Then I continue enhancing this solution by adding a piece at a time. It might be akin to adding Lego bricks one part at a time to create that final masterpiece.

Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.

As I mentioned, I mostly like to compete in Analytics competitions, even though occasionally I also try my hand in the regular ones too. I’d like to point out a very intriguing Data Science for Good competition titled Environmental Insights Explorer (https://www.kaggle.com/c/ds4g-environmental-insights-explorer). The task was to use remote sensing techniques to understand environmental emissions instead of calculating emissions factors from current methodologies.

What really struck me was the use case. Our planet is grappling with climate change issues, and this competition touched on this very aspect. While researching for my competition, I was amazed to find the amount of progress being made in this field of satellite imagery and it gave me a chance to understand and dive more deeply into the topic. It gave me a chance to understand how satellites like Landsat, MODIS, and Sentinel work, and how they make the satellite data available. This was a great competition to learn about a field I knew very little about before the competition.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

I will cite some of the mistakes that I made in my initial years on Kaggle.

Firstly, most of the newbies think of Kaggle as a competitions-only platform. If you love competitions, there are plenty here, but Kaggle also has something for people with other specialties. You can write code and share it with others, indulge in healthy discussions, and network. Curate and share good datasets with the community. I initially only used Kaggle for downloading datasets, and it was only a couple of years ago that I actually became active. Now when I look back, I couldn’t have been more wrong. A lot of people get intimidated by competitions. You can first get comfortable with the platform and then slowly start participating in the competitions.

Another important thing that I would like to mention is that many people work in isolation, lose motivation, and quit. Teaming up on Kaggle has many unseen advantages. It teaches you to work in a team, learn from the experiences, and work towards a common goal in a limited time frame.

Do you use other competition platforms? How do they compare to Kaggle?

While most of my current time is spent on Kaggle, in the past I have used Zindi, a data science competition platform focused on African use cases. It’s a great place to access datasets focused on Africa. Kaggle is a versatile platform, but there is a shortage of problem statements from different parts of the world. Of late, we have seen some diversified problems too, like the recently held chaii competition — an NLP competition focusing on Indian languages. I believe similar competitions concentrating on different countries will be helpful for the research and the general data science community as well.

Cutting across this taxonomy of Kaggle competitions, you also have to consider that competitions may have different formats. The usual format is the so-called Simple format, where you provide a solution and it is evaluated as we previously described. The more sophisticated two-stage competition splits the contest into two parts, and the final dataset is released only after the first part has finished and only to the participants of the first part. The two-stage format emerged in order to limit the chance of some competitors cheating and infringing the rules, since the evaluation is done on a completely untried test set that is available for a short time only. In contrast to the original Kaggle competition format, competitors in this case have much less time and far fewer submissions in which to figure out any useful patterns from the test set.

For the same reason, the Code competitions have recently appeared, where all submissions are made from a Kaggle Notebook, and any direct upload of submissions is disabled.

For Kagglers at different stages of their competition careers, there are no restrictions at all in taking on any kind of competition. However, we have some suggestions against or in favor of the format or type of competition depending on your level of experience in data science and your computational resources:

  • For complete beginners, the Getting Started or the Playground competitions are good places to begin, since you can easily get more confident about how Kaggle works without facing high competitive pressure. That being said, many beginners have successfully started from Featured and Research competitions, because being under pressure helped them to learn faster. Our suggestion is therefore to decide based on your learning style: some Kagglers need to learn by exploring and collaborating (and the Getting Started or the Playground competitions are ideal for that), others need the heat of a fast-paced competition to find their motivation.
  • For Featured and Research competitions, also take into account that these competitions are often about fringe applications of AI and machine learning and, consequently, you often need a solid background or the willingness to study all the relevant research in the field of application of the competition.

Finally, keep in mind that most competitions require you to have access to computational resources that are often not available to most data scientists in the workplace. This can turn into growing expenses if you use a cloud platform outside the Kaggle one. Code competitions and competitions with time or resource limitations might then be the ideal place to spend your efforts, since they strive to put all the participants on the same resource level.

Submission and leaderboard dynamics

The way Kaggle works seems simple: the test set is hidden to participants; you fit your model; if your model is the best in predicting on the test set, then you score highly and you possibly win. Unfortunately, this description renders the inner workings of Kaggle competitions in an overly simplistic way. It doesn’t take into account that there are dynamics regarding the direct and indirect interactions of competitors, or the nuances of the problem you are facing and of its training and test set.

Explaining the Common Task Framework paradigm

A more comprehensive description of how Kaggle works is given by David Donoho, professor of statistics at Stanford University (https://web.stanford.edu/dept/statistics/cgi-bin/donoho/), in his paper 50 Years of Data Science. It first appeared in the Journal of Computational and Graphical Statistics and was subsequently posted on the website of the MIT Computer Science and Artificial Intelligence Laboratory (see http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf).

Professor Donoho does not refer to Kaggle specifically, but to all data science competition platforms. Quoting computational linguist Mark Liberman, he refers to data science competitions and platforms as being part of a Common Task Framework (CTF) paradigm that has been silently and steadily progressing data science in many fields during the last decades. He states that a CTF can work incredibly well at improving the solution of a problem in data science from an empirical point of view, quoting the Netflix competition and many DARPA competitions as successful examples. The CTF paradigm has contributed to reshaping the best-in-class solutions for problems in many fields.

A CTF is composed of ingredients and a secret sauce. The ingredients are simply:

  1. A publicly available dataset and a related prediction task
  2. A set of competitors who share the common task of producing the best prediction for the task
  3. A system for scoring the predictions by the participants in a fair and objective way, without providing hints about the solution that are too specific (or limiting them, at least)

The system works best if the task is well defined and the data is of good quality. In the long run, the performance of solutions improves by small gains until it reaches an asymptote. The process can be sped up by allowing a certain amount of sharing among participants (as happens on Kaggle by means of discussions, shared Kaggle Notebooks, and extra data provided by the datasets found in the Datasets section). According to the CTF paradigm, competitive pressure in a competition suffices to produce always-improving solutions. When the competitive pressure is paired with some degree of sharing among participants, the improvement happens at an even faster rate, which is why Kaggle introduced many incentives for sharing.

This is because the secret sauce in the CTF paradigm is the competition itself, which, within the framework of a practical problem whose empirical performance has to be improved, always leads to the emergence of new benchmarks, new data and modeling solutions, and in general to an improved application of machine learning to the problem posed by the competition. A competition can therefore provide a new way to solve a prediction problem, new ways of feature engineering, and new algorithmic or modeling solutions. For instance, deep learning did not simply emerge from academic research, but it first gained a great boost because of successful competitions that signaled its efficacy (we have already mentioned, for instance, the Merck competition, won by Geoffrey Hinton’s team: https://www.kaggle.com/c/MerckActivity/overview/winners).

Coupled with the open software movement, which allows everyone access to powerful analytical tools (such as Scikit-learn, TensorFlow, or PyTorch), the CTF paradigm brings about even better results because all competitors are on the same level at the start. On the other hand, the reliance of a solution to a competition on specialized or improved hardware can limit achievable results, because it can prevent competitors without access to such resources from properly participating and contributing directly to the solution, or indirectly by exercising competitive pressure on the other participants. Understandably, this is the reason why Kaggle started offering cloud services free to participants of its competitions, the Kaggle Notebooks we will introduce in the Computational resources section. It can flatten some differences in hardware-intense competitions (as most deep learning ones are) and increase the overall competitive pressure.

Understanding what can go wrong in a competition

Given our previous description of the CTF paradigm, you may be tempted to imagine that all a competition needs is to be set up on a proper platform, and good results such as positive involvement for participants and outstanding models for the sponsor company will automatically come in. However, there are also things that can go wrong and instead lead to a disappointing result in a competition, both for the participants and the institution running it:

  • Leakage from the data
  • Probing from the leaderboard (the scoring system)
  • Overfitting and consequent leaderboard shake-up
  • Private sharing

You have leakage from data when part of the solution can be retraced in the data itself. For instance, certain variables could be posterior to the target variable, so they reveal something about it. This happens in fraud detection when you use variables that are updated after a fraud happens, or in sales forecasting when you process information relating to the effective distribution of a product (more distribution implies more requests for the product, hence more sales).

Another issue could be that the training and test examples are ordered in a predictable way or that the values of the identifiers of the examples hint at the solution. This happens, for example, when the identifier is based on the ordering of the target, or when the identifier value is correlated with the flow of time and time affects the probability of the target.

Such solution leakage, sometimes named golden features by competitors (because getting a hint of such nuances in the data can turn into gold prizes for the participants), invariably leads to a solution that is not reusable. This also implies a sub-optimal result for the sponsor, but they at least are able to learn something about leaking features that can affect solutions to their problem.
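One quick way to look for the identifier-based leakage described above is to check whether the row identifier alone is correlated with the target. The sketch below uses synthetic data in which the positive rate drifts upward with the identifier; the data, the drift rate, and the variable names are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
ids = list(range(2000))

# Leaky toy data: rows are in time order and the positive rate rises over time.
leaky_target = [1 if rng.random() < 0.1 + 0.6 * i / len(ids) else 0 for i in ids]

# Clean toy data: the target does not depend on the row order at all.
clean_target = [1 if rng.random() < 0.4 else 0 for _ in ids]

leaky_corr = pearson(ids, leaky_target)
clean_corr = pearson(ids, clean_target)
print(f"id vs. leaky target: {leaky_corr:.2f}")
print(f"id vs. clean target: {clean_corr:.2f}")
```

A correlation clearly away from zero for a column that should carry no information is a hint worth investigating, although a linear check like this one will miss non-linear patterns.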

Another problem is the possibility of probing a solution from the leaderboard. In this situation, you can take advantage of the evaluation scores shown to you and snoop the solution through repeated submission trials on the leaderboard. Again, in this case the solution is completely unusable in different circumstances. A clear example of this happened in the competition Don’t Overfit II. The winning participant, Zachary Mayers, submitted every individual variable as a single submission, gaining information about the possible weight of each variable that allowed him to estimate the correct coefficients for his model (you can read Zach’s detailed solution here: https://www.kaggle.com/c/dont-overfit-ii/discussion/91766). Generally, time series problems, or other problems where there are systematic shifts in the test data, may be seriously affected by probing, since it can help competitors to successfully define some kind of post-processing (like multiplying their predictions by a constant) that is most suitable for scoring highly on the specific test set.
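To make the post-processing case concrete, here is a toy simulation of probing for a multiplier. Everything in it is invented for illustration: the hidden targets, the 20% systematic shift, the RMSE metric, and the `leaderboard_score` function, which merely stands in for a public leaderboard that returns a single number per submission.

```python
import random


def rmse(predictions, truth):
    """Root mean squared error between two equally long sequences."""
    n = len(truth)
    return (sum((p - t) ** 2 for p, t in zip(predictions, truth)) / n) ** 0.5


rng = random.Random(7)

# Hidden public-test targets; the competitor never sees these values directly.
hidden_truth = [rng.uniform(50, 150) for _ in range(500)]

# The model was fit on older data, so it systematically under-predicts by about 20%.
model_predictions = [t / 1.2 + rng.gauss(0, 5) for t in hidden_truth]


def leaderboard_score(submission):
    """Stand-in for the public leaderboard: returns only a single score."""
    return rmse(submission, hidden_truth)


# Probing: resubmit the same predictions with different multipliers and keep the best.
candidates = [0.9, 1.0, 1.1, 1.2, 1.3]
scores = {c: leaderboard_score([c * p for p in model_predictions]) for c in candidates}
best_multiplier = min(scores, key=scores.get)

for c in candidates:
    print(f"multiplier {c:.1f} -> public RMSE {scores[c]:.2f}")
print("multiplier chosen by probing:", best_multiplier)
```

The chosen multiplier improves the score on this particular test set without the model having learned anything new, which is exactly why such a solution does not transfer to other circumstances.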

Another form of leaderboard snooping (that is, getting a hint about the test set and overfitting to it) happens when participants rely more on the feedback from the public leaderboard than their own tests. Sometimes this turns into a complete failure of the competition, causing a wild shake-up – a complete and unpredictable reshuffling of the positions on the final leaderboard. The winning solutions, in such a case, may turn out to be not so optimal for the problem or even just dictated by chance. This has led to the diffusion of techniques analyzing the potential gap between the training set and the public test set. This kind of analysis, called adversarial testing, can provide insight about how much to rely on the leaderboard and whether there are features that are so different between the training and test set that it would be better to avoid them completely.

For an example, you can have a look at this Notebook by Bojan Tunguz: https://www.kaggle.com/tunguz/adversarial-ieee.
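The idea behind this kind of analysis (often also referred to as adversarial validation) is to label every row by whether it comes from the training or the test set and then see how well that label can be predicted. In practice, a classifier is usually trained on all features at once; the sketch below is a deliberately simplified, single-feature variant that scores each feature by the AUC obtained when its raw value is used to tell the two sets apart. The data and feature names are synthetic assumptions.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum formula (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(1)
n_train, n_test = 1000, 1000

# Toy features: 'stable' has the same distribution in both sets, 'drifting' is shifted in test.
train = {"stable": [rng.gauss(0, 1) for _ in range(n_train)],
         "drifting": [rng.gauss(0, 1) for _ in range(n_train)]}
test = {"stable": [rng.gauss(0, 1) for _ in range(n_test)],
        "drifting": [rng.gauss(1.5, 1) for _ in range(n_test)]}

is_test = [0] * n_train + [1] * n_test  # the "adversarial" label

feature_auc = {}
for name in train:
    values = train[name] + test[name]
    feature_auc[name] = auc(values, is_test)
    print(f"{name}: AUC for telling train from test = {feature_auc[name]:.2f}")
```

An AUC near 0.5 means the feature looks the same in both sets, while a value far from 0.5 flags a feature whose distribution differs and that may be safer to drop or transform. A per-feature check like this cannot detect shifts that only show up in combinations of features.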

Another kind of defense against leaderboard overfitting is choosing safe strategies to avoid submitting solutions that are based too much on the leaderboard results. For instance, since (typically) two solutions are allowed to be chosen by each participant for final evaluation, a good strategy is to submit the best performing one based on the leaderboard, and the best performing one based on your own cross-validation tests.
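This selection rule is simple enough to write down as a small helper. The sketch below assumes that higher scores are better and that you keep a log with both the public leaderboard score and your local cross-validation score for each submission; the function name, the log format, and the fallback to the runner-up by cross-validation when one submission wins on both counts are our own illustrative choices.

```python
def pick_final_submissions(submissions):
    """Return the names of the best-by-leaderboard and best-by-CV submissions.

    Each submission is a dict with 'name', 'public_lb', and 'cv' scores,
    where higher is better (an assumption; flip the comparison for error metrics).
    """
    best_lb = max(submissions, key=lambda s: s["public_lb"])
    best_cv = max(submissions, key=lambda s: s["cv"])
    picks = [best_lb["name"]]
    if best_cv["name"] != best_lb["name"]:
        picks.append(best_cv["name"])
    else:
        # Same model wins on both: use the runner-up by CV as the second pick.
        others = [s for s in submissions if s["name"] != best_lb["name"]]
        if others:
            picks.append(max(others, key=lambda s: s["cv"])["name"])
    return picks


# Hypothetical submission log; all names and scores are made up for illustration.
log = [
    {"name": "gbm_v1", "public_lb": 0.912, "cv": 0.905},
    {"name": "gbm_v2_tuned_on_lb", "public_lb": 0.921, "cv": 0.899},
    {"name": "blend_v3", "public_lb": 0.915, "cv": 0.910},
]
print(pick_final_submissions(log))
```

In this made-up log, the submission that tops the public leaderboard has the weakest cross-validation score, so pairing it with the best cross-validated submission hedges against a shake-up.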

In order to avoid problems with leaderboard probing and overfitting, Kaggle has recently introduced different innovations based on Code competitions, where the evaluation is split into two distinct stages, as we previously discussed, with participants being completely blind to the actual test data so they are forced to consider their own local validation tests more.

Finally, another possible distortion of a competition is due to private sharing (sharing ideas and solutions in a closed circle of participants) and other illicit moves such as playing through multiple accounts or playing in multiple teams and stealing ideas. All such actions create an asymmetry of information between participants that can be favorable to a few and detrimental to most. Again, the resulting solution may be affected because sharing has been imperfect during the competition and fewer teams have been able to exercise full competitive pressure. Moreover, if these situations become evident to participants (for instance, see https://www.kaggle.com/c/ashrae-energy-prediction/discussion/122503), it can lead to distrust and less involvement in the competition or subsequent competitions.

Computational resources

Some competitions impose limitations so that the resulting solutions are feasible to put into production. For instance, the Bosch Production Line Performance competition (https://www.kaggle.com/c/bosch-production-line-performance) had strict limits on execution time, model file output, and memory for solutions. Notebook-based (previously known as Kernel-Only) competitions, which require both training and inference to be executed on Kaggle Notebooks, do not pose a problem in terms of the resources you need. This is because Kaggle will provide you with all the resources you need (and this is also intended as a way to put all participants on the same start line for a better competition result).

Problems arise when you have competitions that only limit the use of Notebooks to inference time. In these cases, you can train your models on your own machine and the only limit is then at test time, on the number and complexity of models you produce. Since most competitions at the moment require deep learning solutions, you have to be aware that you will need specialized hardware, such as GPUs, in order to achieve a competitive result.

Even in some of the now-rare tabular competitions, you’ll soon realize that you need a strong machine with quite a number of processors and a lot of memory in order to easily apply feature engineering to data, run experiments, and build models quickly.

Standards change rapidly, so it is difficult to specify a standard hardware configuration that you should have in order to compete at least in the same league as other teams. We can get hints about the current standard by looking at what other competitors are using, either as their own machine or as a machine in the cloud.

For instance, HP launched a program where it awarded an HP Z4 or Z8 to a few selected Kaggle participants in exchange for brand visibility. A Z8 machine has up to 72 cores, 3 TB of memory, 48 TB of storage (a good share by solid-state drive standards), and usually dual NVIDIA RTX GPUs. We understand that this may be a bit out of reach for many; even renting a similar machine for a short time on a cloud service such as Google’s GCP or Amazon’s AWS is out of the question, given the expense of even moderate usage.

The cloud costs for each competition naturally depend on the amount of data to process and on the number and type of models you build. Free credit giveaways in Kaggle competitions for both GCP and AWS cloud platforms usually range from US $200 to US $500.

Our suggestion, as you start your journey to climb to the top rankings of Kaggle participants, is therefore to go with the machines provided free by Kaggle, Kaggle Notebooks (previously known as Kaggle Kernels).

Kaggle Notebooks

Kaggle Notebooks are versioned computational environments, based on Docker containers running in cloud machines, that allow you to write and execute both scripts and notebooks in the R and Python languages. Kaggle Notebooks:

  • Are integrated into the Kaggle environment (you can make submissions from them and keep track of which submission refers to which Notebook)
  • Come with most data science packages pre-installed
  • Allow some customization (you can download files and install further packages)

The basic Kaggle Notebook is just CPU-based, but you can have versions boosted by an NVIDIA Tesla P100 or a TPU v3-8. TPUs are hardware accelerators specialized for deep learning tasks.

Though bound by a limit on the number of concurrent sessions and by a weekly time quota, Kaggle Notebooks give you access to the computational workhorse to build your baseline solutions on Kaggle competitions:

Notebook type | CPU cores | Memory | Number of notebooks that can be run at a time | Weekly quota
CPU           | 4         | 16 GB  | 10                                            | Unlimited
GPU           | 2         | 13 GB  | 2                                             | 30 hours
TPU           | 4         | 16 GB  | 2                                             | 30 hours

Besides the total weekly runtime, CPU and GPU notebooks can run for a maximum of 12 hours per session before stopping (TPU notebooks for just 9 hours), meaning you won’t get any results from the run apart from what you have saved on disk. You have a 20 GB disk saving allowance to store your models and results, plus an additional scratchpad disk that can exceed 20 GB for temporary usage during script running.
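Since a stopped session keeps only what has been written to disk, a long-running job can save its progress as it goes and stop on its own before the limit. The following is a minimal sketch of that idea, not a Kaggle-provided utility: `run_with_checkpoints`, `step_fn`, and the JSON checkpoint layout are hypothetical names chosen for illustration. On Kaggle, the checkpoint path would typically point somewhere under the notebook's working directory.

```python
import json
import os
import time


def run_with_checkpoints(n_steps, checkpoint_path, time_budget_s, step_fn):
    """Run step_fn(step_index) for up to n_steps, saving progress after each step.

    Stops early once time_budget_s seconds have elapsed, so the run can end
    cleanly before a session limit. Resumes from checkpoint_path if it exists.
    Each result returned by step_fn must be JSON-serializable.
    """
    state = {"next_step": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    start = time.monotonic()
    while state["next_step"] < n_steps:
        if time.monotonic() - start > time_budget_s:
            break  # out of time: everything completed so far is on disk
        state["results"].append(step_fn(state["next_step"]))
        state["next_step"] += 1
        # Write to a temporary file, then rename it, so an interruption
        # mid-write cannot corrupt the last good checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state
```

A second call with the same `checkpoint_path` picks up at the first step that was not completed, so work can be spread over several sessions.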

In certain cases, the GPU-enhanced machine provided by Kaggle Notebooks may not be enough. For instance, the recent Deepfake Detection Challenge (https://www.kaggle.com/c/deepfake-detection-challenge) required the processing of around 500 GB of videos. That is especially challenging because of the 30-hour weekly usage limit, and because you cannot have more than two machines with GPUs running at the same time. Even if you can double your machine time by changing your code to use TPUs instead of GPUs (you can find some guidance on how to do this at https://www.kaggle.com/docs/tpu), that may still not prove enough for fast experimentation in a data-heavy competition such as the Deepfake Detection Challenge.

For this reason, in Chapter 3, Working and Learning with Kaggle Notebooks, we are going to provide you with tips for successfully coping with these limitations to produce decent results without having to buy a heavy-performing machine. We are also going to show you how to integrate Kaggle Notebooks with GCP or, alternatively, in Chapter 2, Organizing Data with Datasets, how to move all your work into another cloud-based solution, Google Colab.

Teaming and networking

While computational power plays its part, only human expertise and ability can make the real difference in a Kaggle competition. Handling a competition successfully sometimes requires the collaborative efforts of a team of contestants. Apart from Recruitment competitions, where the sponsor may require individual participants for a better evaluation of their abilities, there is typically no restriction against forming teams. Usually, teams can be made up of a maximum of five contestants.

Teaming has its own advantages because it can multiply efforts to find a better solution. A team can spend more time on the problem together and different skills can be of great help; not all data scientists will have the same skills or the same level of skill when it comes to different models and data manipulation.

However, teaming is not all positive. Coordinating different individuals and efforts toward a common goal may prove not so easy, and some suboptimal situations may arise. A common problem is when some of the participants are not involved or are simply idle, but no doubt the worst is when someone infringes the rules of the competition – to the detriment of everyone, since the whole team could be disqualified – or even spies on the team in order to give an advantage to another team, as we mentioned earlier.

In spite of any negatives, teaming in a Kaggle competition is a great opportunity to get to know other data scientists better, to collaborate for a purpose, and to achieve more, since Kaggle rules do reward teams over lone competitors. In fact, for smaller teams you get a percentage of the total that is higher than an equal share. Teaming up is not the only possibility for networking in Kaggle, though it is certainly more profitable and interesting for the participants. You can also network with others through discussions on the forums, or by sharing Datasets and Notebooks during competitions. All these opportunities on the platform can help you get to know other data scientists and be recognized in the community.

There are also many occasions to network with other Kagglers outside of the Kaggle platform itself. First of all, there are a few Slack channels that can be helpful. For instance, KaggleNoobs (https://www.kaggle.com/getting-started/20577) is a channel, opened up in 2016, that features many discussions about Kaggle competitions. They have a supportive community that can help you if you have some specific problem with code or models.

There are quite a few other channels devoted to exchanging opinions about Kaggle competitions and data science-related topics. Some channels are organized on a regional or national basis, for instance, the Japanese channel Kaggler-ja (http://kaggler-ja-wiki.herokuapp.com/) or the Russian community Open Data Science Network (https://ods.ai/), created in 2015, which later also opened to non-Russian-speaking participants. The Open Data Science Network offers not just a Slack channel but also courses on how to win competitions, events, and reporting on active competitions taking place on all known data science platforms (see https://ods.ai/competitions).

Aside from Slack channels, quite a few local meetups themed around Kaggle in general or around specific competitions have sprung up, some just on a temporary basis, others in a more established form. A meetup focused on Kaggle competitions, usually built around a presentation from a competitor who wants to share their experience or suggestions, is the best way to meet other Kagglers in person, to exchange opinions, and to build alliances for participating in data science contests together.

In this context, Kaggle Days (https://kaggledays.com/), built by Maria Parysz and Paweł Jankiewicz, deserves a mention. The Kaggle Days organization arranged a few events in major locations around the world (https://kaggledays.com/about-us/) with the aim of bringing Kaggle experts together at a conference. It also created a network of local meetups in different countries, which are still quite active (https://kaggledays.com/meetups/).

Paweł Jankiewicz

https://www.kaggle.com/paweljankiewicz

We had the opportunity to catch up with Paweł about his experiences with Kaggle. He is a Competitions Grandmaster and a co-founder of LogicAI.

What’s your favourite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?

Code competitions are my favourite type of competition because working in a limited environment forces you to think about different kinds of budgets: time, CPU, memory. Too many times in previous competitions I needed to utilize even up to 3-4 strong virtual machines. I didn’t like that in order to win I had to utilize such resources, because it makes it a very uneven competition.

How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?

I approach every competition a little bit differently. I tend to always build a framework for each competition that allows me to create as many experiments as possible. For example, in one competition where we needed to create a deep learning convolutional neural network, I created a way to configure neural networks by specifying them in the format C4-MP4-C3-MP3 (where each letter stands for a different layer). It was many years ago, so the configuration of neural networks is probably now done by selecting the backbone model. But the rule still applies. You should create a framework that allows you to change the most sensitive parts of the pipeline quickly.
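As an illustration of the kind of compact configuration format Paweł describes, a string such as C4-MP4-C3-MP3 can be parsed into a structured list of layers. The interview does not define what the letters and numbers stand for, so the sketch below makes an assumption: `C` is taken to mean a convolution and `MP` a max-pooling layer, with the number read as the kernel or pool size. The names `LAYER_CODES` and `parse_network_spec` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed meaning of the layer codes; the interview does not define them.
LAYER_CODES: Dict[str, str] = {"C": "conv", "MP": "max_pool"}

# One token is a run of letters (the layer code) followed by digits (the size).
_TOKEN = re.compile(r"([A-Za-z]+)(\d+)")


def parse_network_spec(spec: str) -> List[Tuple[str, int]]:
    """Turn a string such as 'C4-MP4-C3-MP3' into (layer_type, size) pairs."""
    layers = []
    for token in spec.split("-"):
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"Malformed layer token: {token!r}")
        code, size = match.group(1).upper(), int(match.group(2))
        if code not in LAYER_CODES:
            raise ValueError(f"Unknown layer code: {code!r}")
        layers.append((LAYER_CODES[code], size))
    return layers


print(parse_network_spec("C4-MP4-C3-MP3"))
# [('conv', 4), ('max_pool', 4), ('conv', 3), ('max_pool', 3)]
```

With a parser like this, each experiment is a one-line string, which makes it cheap to queue up many architecture variants.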

Day-to-day work has some overlap with Kaggle competitions in terms of modeling approach and proper validation. What Kaggle competitions taught me is the importance of validation, data leakage prevention, etc. For example, if data leaks happen in so many competitions, when people who prepare them are the best in the field, you can ask yourself what percentage of production models have data leaks in training; personally, I think 80%+ of production models are probably not validated correctly, but don’t quote me on that.

Another important difference in day-to-day work is that no one really tells you how to define the modeling problem. For instance:

  1. Should the metric you report or optimize be RMSE, RMSLE, SMAPE, or MAPE?
  2. If the problem is time-based, how can you split the data to evaluate the model as realistically as possible?

And these are not the only important things for the business. You also must be able to communicate your choices and why you made them.
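To make the choice between the metrics mentioned above more concrete, here is a minimal pure-Python sketch of all four. The definitions follow common conventions and are assumptions rather than anything specified in the interview; SMAPE in particular has several variants in the literature, and the one used here divides by the mean of the absolute true and predicted values.

```python
import math
from typing import Sequence


def _mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: penalizes large absolute errors."""
    return math.sqrt(_mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def rmsle(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared log error: penalizes relative errors.

    Assumes non-negative targets and predictions.
    """
    return math.sqrt(
        _mean((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
    )


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Undefined if a true value is 0."""
    return 100.0 * _mean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Symmetric MAPE, in percent (one common variant of the definition)."""
    return 100.0 * _mean(
        2.0 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)
    )


y_true = [100.0, 200.0]
y_pred = [110.0, 180.0]
for metric in (rmse, rmsle, mape, smape):
    print(metric.__name__, round(metric(y_true, y_pred), 4))
```

RMSE weighs the error of 20 on the larger target more heavily than the error of 10 on the smaller one, while the percentage-based metrics treat both as a 10% miss; which behavior is right depends on what the business cares about.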

Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.

The most challenging and interesting was the Mercari Price Prediction Code competition. It was very different from any other competition because it was limited to 1 hour of computation time and only 4 cores with 16 GB of memory. Overcoming these limitations was the most exciting part of the challenge. My takeaway from this competition was to believe more in networks for tabular data. Before merging with my teammate Konstantin Lopukhin (https://www.kaggle.com/lopuhin), I had a bunch of complicated models including neural networks, but also some other boosting algorithms. After merging, it turned out that Konstantin was using only one architecture which was very optimized (number of epochs, learning rate). Another aspect of this competition that was quite unique was that it wasn’t enough to just average solutions from the team. We had to reorganize our workflow so that we had a single coherent solution and not something quickly put together. It took us three weeks to combine our solutions together.

In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?

Software engineering skills are probably underestimated a lot. Every competition and problem is slightly different and needs some framework to streamline the solution (look at https://github.com/bestfitting/instance_level_recognition and how well their code is organized). Good code organization helps you to iterate faster and eventually try more things.

What’s the most important thing someone should keep in mind or do when they’re entering a competition?

The most important thing is to have fun.

Performance tiers and rankings

Apart from monetary prizes and other material items, such as cups, t-shirts, hoodies, and stickers, Kaggle offers many immaterial awards. Kagglers spend a whole lot of time and effort during competitions (not to mention in developing the skills they use to compete that are, in truth, quite rare in the general population). The monetary prizes usually cover the efforts of the top few Kagglers, if not only the one in the top spot, leaving the rest with an astonishing number of hours voluntarily spent with little return. In the long term, participating in competitions with no tangible results may lead to disaffection and disinterest, lowering the competitive intensity.

Hence, Kaggle has found a way to reward competitors with an honor system based on medals and points. The idea is that the more medals and the more points you have, the more relevant your skills are, leaving you open for opportunities in your job search or any other relevant activity based on your reputation.

First, there is a general leaderboard that combines all the leaderboards of the individual competitions (https://www.kaggle.com/rankings). Based on the position they attain in each competition, Kagglers are awarded some number of points that, all summed together, provide their ranking on the general leaderboard. At first glance, the formula for the scoring of the points in a competition (published on the Kaggle progression page, https://www.kaggle.com/progression) may look a bit complex.

Nevertheless, in reality it is simply based on a few ingredients:

  • Your rank in a competition
  • Your team size
  • The popularity of the competition
  • How old the competition is

Intuitively, ranking highly in popular competitions brings many points. Less intuitively, the size of your team matters in a non-linear way. That’s due to the inverse square root part of the formula, since the proportion of points you have to give up grows with the number of people involved.

It is still quite favorable if your team is relatively small (2, max 3 people) due to the advantage in wits and computational power brought about by collaboration.
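The effect of team size can be checked with a short calculation. The sketch below assumes the points formula as publicly documented on Kaggle's progression page at the time of writing, that is, points = (100000 / sqrt(teammates)) × rank^(-0.75) × log10(1 + log10(teams)) × e^(-t/500), with t in days; the constants are taken from that page as recalled here and may have changed, so treat them as assumptions. The function name and parameter names are hypothetical.

```python
import math


def competition_points(rank, n_teammates, n_teams, days_since_deadline=0.0):
    """Points earned by each team member under the assumed formula.

    All constants (100000, -0.75, 500) are assumptions based on Kaggle's
    public documentation and are not guaranteed to be current.
    """
    team_factor = 100000 / math.sqrt(n_teammates)            # inverse square root of team size
    rank_factor = rank ** -0.75                              # better rank, more points
    popularity_factor = math.log10(1 + math.log10(n_teams))  # more teams, more points
    decay_factor = math.exp(-days_since_deadline / 500)      # points fade over time
    return team_factor * rank_factor * popularity_factor * decay_factor


solo = competition_points(rank=10, n_teammates=1, n_teams=1000)
pair = competition_points(rank=10, n_teammates=2, n_teams=1000)
print(f"solo: {solo:.0f}, each member of a pair: {pair:.0f}, ratio: {pair / solo:.3f}")
```

Under these assumptions, each member of a two-person team receives about 71% (1/√2) of what a solo competitor would get at the same rank, rather than the 50% an equal split would give.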

Another point to keep in mind is that points decay with time. The decay is not linear, but as a rule of thumb keep in mind that, after a year, very little is left of the points you gained. Therefore, glory on the general leaderboard of Kaggle is ephemeral unless you keep on participating in competitions with similar results to before. As a consolation, on your profile you’ll always keep the highest rank you ever reach.

Longer-lasting is the medal system that covers all four aspects of competing in Kaggle. You will be awarded medals for Competitions, Notebooks, Discussion, and Datasets based on your results. In Competitions, medals are awarded based on your position on the leaderboard. In the other three areas, medals are awarded based on the upvotes of other competitors (which can actually lead to some sub-optimal situations, since upvotes are a less objective metric and also depend on popularity). The more medals you get, the higher the ranks of Kaggle mastery you can enter. The ranks are Novice, Contributor, Expert, Master, and Grandmaster. The page at https://www.kaggle.com/progression explains everything about how to get medals and how many and what kinds are needed to access the different ranks.

Keep in mind that these ranks and honors are always relative and that they do change over time. A few years ago, in fact, the scoring system and the ranks were quite different. Most probably, the ranks will change again in the future in order to keep the higher ones rarer and more valuable.

Criticism and opportunities

Kaggle has drawn quite a few criticisms since it began. Participation in data science competitions is still a subject of debate today, with many different opinions out there, both positive and negative.

On the side of negative criticism:

  • Kaggle provides a false perception of what machine learning really is since it is just focused on leaderboard dynamics
  • Kaggle is just a game of hyperparameter optimization and ensembling many models just for scraping a little more accuracy (while in reality overfitting the test set)
  • Kaggle is filled with inexperienced enthusiasts who are ready to try anything under the sun in order to get a score and a spotlight in hopes of being spotted by recruiters
  • As a further consequence, competition solutions are too complicated and often too specific to a test set to be implemented

Many perceive Kaggle, like many other data science competition platforms, to be far from what data science is in reality. The point the critics raise is that business problems do not come from nowhere and you seldom already have a well-prepared dataset to start with, since you usually build it along the way based on refining business specifications and the understanding of the problem at hand. Moreover, many critics emphasize that Kagglers don’t learn or excel at creating production-ready models, since a winning solution cannot be constrained by resource limits or considerations about technical debt (though this is not always true for all competitions).

All such criticism is related, in the end, to how Kaggle standings can be compared to other kinds of experience in the eyes of an employer, especially relative to data science education and work experience. One persistent myth is that Kaggle competitions won’t help to get you a job or a better job in data science, and that they do not put you on another plane compared to data scientists that do not participate at all.

Our stance is that it is misleading to believe that Kaggle rankings carry no value beyond the Kaggle community. For instance, in a job search, Kaggle can provide you with some very useful competencies in modeling data and problems and in effective model testing. It can also expose you to many techniques and different data/business problems, beyond your actual experience and comfort zone, but it cannot supply you with everything you need to successfully place yourself as a data scientist in a company.

You can use Kaggle for learning (there is also a section on the website, Courses, devoted to just learning) and for differentiating yourself from other candidates in a job search; however, how this will be considered varies considerably from company to company. Regardless, what you learn on Kaggle will invariably prove useful throughout your career and will provide you a hedge when you have to solve complex and unusual problems with data modeling; by participating in Kaggle competitions, you build up strong competencies in modeling and validating. You also network with other data scientists, which can get you a reference for a job more easily and provide you with another way to handle difficult problems beyond your skills, because you will have access to other people’s competencies and opinions.

Hence, our opinion is that Kaggle helps your career as a data scientist mostly in indirect ways. Of course, sometimes Kaggle will help you to be contacted directly as a job candidate based on your successes, but more often Kaggle will provide you with the intellectual skills and experience you need to succeed, first as a candidate and then as a practitioner.

In fact, after playing with data and models on Kaggle for a while, you’ll have had the chance to see enough different datasets, problems, and ways to deal with them under time pressure that when faced with similar problems in real settings you’ll be skilled in finding solutions quickly and effectively.

This latter opportunity for a skill upgrade is why we were motivated to write this book in the first place, and what this book is actually about. You won’t find a guide purely on how to win or score highly in Kaggle competitions, but you absolutely will find a guide about how to compete better on Kaggle and how to get the most back from your competition experiences.

Use Kaggle and other competition platforms in a smart way. Kaggle is not a passepartout – being first in a competition won’t assure you a highly paid job or glory beyond the Kaggle community. However, consistently participating in competitions is a card to be played smartly to show interest and passion in your data science job search, and to improve some specific skills that can differentiate you as a data scientist and keep you from becoming obsolete in the face of AutoML solutions.

If you follow us through this book, we will show you how.

You have been reading a chapter from
The Kaggle Book
Published in: Apr 2022
Publisher: Packt
ISBN-13: 9781801817479