It all started with Competitions more than 12 years ago. The first competition had just a few participants. As interest in machine learning grew and the community around Kaggle expanded, the complexity of the competitions, the number of participants, and the attention they attracted all increased significantly.
To start a competition, the competition host prepares a dataset, typically split into train and test sets. In the most common form, the train set contains labeled data, while the test set only contains the feature data. The host also provides a description of the data and a presentation of the competition objective, including the background of the problem, as well as the metric used to evaluate the solutions. The terms and conditions of the competition are also specified.
Competitors are allowed to submit a limited number of solutions per day. At the end of the competition, the best two solutions (evaluated on the portion of the test set used to calculate the public score) are selected by default, although competitors also have the option to select two solutions themselves based on their own judgment. These two selected solutions are then evaluated on the reserved subset of test data to generate the private score, which is the final score used to rank the competitors.
There are several types of competitions:
- Featured competitions: The most important are the Featured competitions. Currently, a Featured competition might bring together several thousand teams, with tens or even hundreds of thousands of solutions submitted. Featured competitions are typically hosted by companies, but sometimes also by research organizations or universities, and are usually aimed at solving a difficult problem related to the host's business or a research topic. The organizer turns to the large Kaggle community to draw on its knowledge and skills, and the competitive aspect of the setup accelerates the development of a solution. Usually, a Featured competition will also have a significant prize, which is distributed according to the competition rules to the top competitors. Sometimes, the host will not include a prize but will offer a different incentive, such as recruiting the top competitors to work for them (with high-profile companies, this might be more attractive than a prize), vouchers for using cloud resources, or acceptance of the top solutions for presentation at high-profile conferences. Besides the Featured competitions, there are also Getting Started, Research, Community, Playground, Simulations, and Analytics competitions.
- Getting Started competitions: These are aimed mostly at beginners and tackle easily approachable machine learning problems to help build basic skills. These competitions are restarted periodically and the leaderboard is reset. The most notable ones are Titanic – Machine Learning from Disaster, Digit Recognizer, House Prices – Advanced Regression Techniques, and Natural Language Processing with Disaster Tweets.
- Research competitions: In Research competitions, the theme is to solve a difficult scientific problem in domains such as medicine, genetics, cell biology, or astronomy by applying a machine learning approach. Some of the most popular competitions in recent years came from this category and, with the rising use of machine learning in many fields of fundamental and applied research, we can expect this type of competition to become increasingly frequent and popular.
- Community competitions: These are created by Kagglers and are either open to the public or private, in which case only those invited can take part. For example, you can host a Community competition as a school or university project, where students are invited to join and compete for the best grades.
Kaggle offers the infrastructure, which makes it very simple for you to define and start a new competition. You have to provide the training and test data, but this can be as simple as two files in CSV format. Additionally, you need to add a sample submission file, which gives the expected format for submissions. Participants in the competition have to replace the placeholder predictions in this file with their own predictions, save the file, and then submit it (a minimal sketch of this workflow is shown after this list). Then, you have to choose a metric to assess the performance of a machine learning model (there is no need to define one, as you can pick from a large collection of predefined metrics). At the same time, as the host, you will be required to upload a file with the correct, expected solution to the competition challenge, which will serve as the reference against which all competitors’ submissions are checked. Once this is done, you just need to edit the terms and conditions, choose a start and end date for the competition, write the data description and objectives, and you are good to go. Other options you can choose from are whether participants can team up or not, and whether joining the competition is open to everybody or restricted to people who receive the competition link.
- Playground competitions: Around three years ago, a new section of competitions was launched: Playground competitions. These are generally simple competitions, like the Getting Started ones, but with a shorter lifespan (initially one month, currently from one to four weeks). They are of low or medium difficulty and help participants gain new skills. Such competitions are highly recommended for beginners, but also for competitors with more experience who want to refine their skills in a certain domain.
- Simulation competitions: While the previous types are all supervised machine learning competitions, Simulation competitions are, in general, optimization competitions. The best known are those organized around Christmas and New Year (the Santa competitions), as well as the Lux AI Challenge, which is currently in its third season. Some Simulation competitions are also recurrent and therefore qualify for an additional category, Annual competitions; the Santa competitions are an example of both types.
- Analytics competitions: These differ in both the objective and the way solutions are scored. The objective is to perform a detailed analysis of the competition dataset and extract insights from the data. The score is based, in general, on the judgment of the organizers and, in some cases, on the popularity of the competing solutions; in the latter case, the organizers grant part of the prizes to the most popular notebooks, based on the upvotes of Kagglers. In Chapter 5, we will analyze the data from one of the first Analytics competitions and also provide some insights into how to approach this type of competition.
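As promised above, here is a minimal sketch of the submission workflow from a participant's perspective. The file names and the target column name follow a common Kaggle convention but are assumptions for this illustration, and the constant value stands in for real model predictions.

```python
import pandas as pd

# Read the sample submission file provided by the host
# (the "target" column name is hypothetical; each competition defines its own).
submission = pd.read_csv("sample_submission.csv")

# Replace the placeholder values with your own predictions;
# here a constant baseline stands in for a real model's output.
submission["target"] = 0.5

# Save the file in the expected format, ready to be uploaded or
# submitted directly from a Kaggle notebook.
submission.to_csv("submission.csv", index=False)
```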
For a long time, competitions required participants to prepare a submission file with the predictions for the test set. No constraints were imposed on how the submission was prepared; competitors were supposed to use their own computing resources to train models, validate them, and prepare the submission. Initially, there were no resources available on the platform to prepare a submission. After Kaggle started to provide computational resources, where you could prepare your model using Kaggle Kernels (later renamed Notebooks and now Code), you could submit directly from the platform, but this was optional and no limitation was imposed on how submissions were produced. Typically, the submission file is evaluated on the fly and the result is displayed almost instantly. The result (i.e., the score according to the competition metric) is calculated only for a percentage of the test set. This percentage is announced at the start of the competition and is fixed, as is the subset of test data used during the competition to calculate the displayed score (the public score). After the end of the competition, the final score is calculated on the rest of the test data, and this final score (also known as the private score) is the final score for each competitor. The percentage of the test data used during the competition to evaluate the solutions and provide the public score can be anything from a few percent to more than 50%, although in most competitions it tends to be less than 50%.
The reason Kaggle uses this approach is to prevent an unwanted phenomenon. Rather than improving their models for better generalization, competitors might be inclined to optimize their solutions to predict the test set as accurately as possible, without considering the cross-validation score on their training data. In other words, competitors might be tempted to overfit their solutions to the test set. By splitting this data and only providing the score for part of the test set – the public score – the organizers intend to prevent this.
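As an illustration of this split, the following toy sketch computes a public and a private score from the same set of predictions. The data is synthetic, and the metric (accuracy) and the 30% public fraction are arbitrary choices; Kaggle's actual scoring backend is not public, so this only shows the principle of a fixed public/private partition of the test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Toy ground truth and a submitted set of predictions for a test set
# of 1,000 rows (both arrays are synthetic, for illustration only).
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)

# A fixed mask decides, once, which rows belong to the public subset
# (here 30%); the remaining rows form the private subset.
public_mask = rng.random(1000) < 0.30

# Public score: shown on the leaderboard during the competition.
public_score = accuracy_score(y_true[public_mask], y_pred[public_mask])

# Private score: computed on the held-back rows, revealed only after
# the competition ends, and used for the final ranking.
private_score = accuracy_score(y_true[~public_mask], y_pred[~public_mask])

print(f"Public score:  {public_score:.4f}")
print(f"Private score: {private_score:.4f}")
```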
With more and more complex competitions (sometimes with very large train and test sets), some participants with greater computational resources might gain an advantage, while others with limited resources may struggle to develop advanced models. Especially in Featured competitions, the goal is often to create robust, production-compatible solutions. However, without restrictions on how solutions are obtained, achieving this goal may be difficult, especially if solutions with unrealistic resource use become prevalent. To limit the unwanted consequences of this “arms race” for better and better solutions, a few years ago, Kaggle introduced Code competitions. This kind of competition requires that all solutions be submitted from a notebook running on the Kaggle platform. In this way, the infrastructure used to run the solution becomes fully controllable by Kaggle.
Not only are the computing resources limited in such competitions, but there are also additional constraints: a limit on the duration of the run and restrictions on internet access (to prevent the use of additional computing power through external APIs or other remote computing resources).
Kagglers discovered quite quickly that this limitation applied only to the inference part of the solution, and an adaptation appeared: competitors started to train, offline, large models that would not fit within the limits of computing power and runtime imposed by Code competitions. They then uploaded the offline-trained models (sometimes produced with very large computational resources) as datasets and loaded them in inference code that respected the memory and computation time limits of the Code competition.
In some cases, multiple models trained offline were loaded as datasets, and the inference code combined them to create more accurate solutions, as sketched below. Over time, Code competitions have become more refined. Some of them expose only a few rows of the test set and do not reveal the size of the real test set used for the public or the future private score. Therefore, Kagglers have to resort to clever probing techniques to estimate the constraints they might run into when their code is rerun on the final, private test set, to avoid their code failing by exceeding memory or runtime limits.
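The following sketch illustrates the idea. All paths, dataset names, and column names (my-offline-models, some-competition, id, target) are hypothetical, and the models are assumed to be scikit-learn-compatible estimators serialized with joblib; the point is only that training happens elsewhere, while the notebook performs inference within the Code competition limits.

```python
import glob

import joblib
import numpy as np
import pandas as pd

# Hypothetical paths: models trained offline were uploaded as a Kaggle
# dataset ("my-offline-models") and attached to this inference notebook.
MODEL_DIR = "/kaggle/input/my-offline-models"
TEST_PATH = "/kaggle/input/some-competition/test.csv"

test = pd.read_csv(TEST_PATH)
features = test.drop(columns=["id"])  # the "id" column is an assumption

# Load each serialized model found in the attached dataset and collect
# its predictions; only inference runs here, within the platform limits.
predictions = []
for path in sorted(glob.glob(f"{MODEL_DIR}/*.joblib")):
    model = joblib.load(path)
    predictions.append(model.predict(features))

# Average the predictions of the offline-trained models and write the
# submission file expected by the Code competition.
submission = pd.DataFrame({"id": test["id"], "target": np.mean(predictions, axis=0)})
submission.to_csv("submission.csv", index=False)
```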
Currently, there are also Code competitions that, after the active part of the competition (i.e., the period when competitors are allowed to keep refining their solutions) ends, do not publish the private score immediately, but instead rerun the code on several new sets of test data and reevaluate the two selected solutions against these new, previously unseen datasets. Some of these competitions are about the stock market, cryptocurrency valuation, or credit performance predictions, and they use real data. The evolution of Code competitions ran in parallel with the evolution of the computational resources available on the platform, to provide users with the required computational power.
Some of the competitions (most notably the Featured competitions and the Research competitions) grant ranking points and medals to the participants. Ranking points are used to calculate the relative position of Kagglers in the general leaderboard of the platform. The formula to calculate the ranking points awarded for a competition hasn’t changed since May 2015:
Figure 1.2: Formula for calculating ranking points
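For readers who prefer the formula in text form, the version Kaggle announced in May 2015 (the one reproduced in Figure 1.2) can be written as:

$$\text{Points} = \frac{100000}{\sqrt{N_{\text{teammates}}}} \cdot \text{Rank}^{-0.75} \cdot \log_{10}\!\left(1 + \log_{10} N_{\text{teams}}\right) \cdot e^{-t/500}$$

where $N_{\text{teammates}}$ is the number of members in your team, Rank is the team's final position on the leaderboard, $N_{\text{teams}}$ is the number of teams that entered the competition, and $t$ is the number of days elapsed since the competition deadline.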
The number of points decreases with the square root of the number of teammates in the current competition team. More points are awarded for competitions with a larger number of teams. The number of points also decreases over time, to keep the ranking up to date and competitive.
Medals count toward promotion in the Kaggle progression system for competitions, and they are awarded based on the position at the top of the competition leaderboard. The actual system is a bit more complicated but, generally, the top 10% get a bronze medal, the top 5% a silver medal, and the top 1% a gold medal. The actual number of medals granted increases with the number of participants, but this is the basic principle.
With two bronze medals, you reach the Competition Expert tier. With two silver medals and one gold medal, you reach the Competition Master tier. And with one Solo gold medal (i.e., a medal obtained without teaming up with others) and a total of five gold medals, you reach the highest Kaggle tier: Competition Grandmaster. At the time of preparing this book, among the over 12 million users on Kaggle, there are 280 Competition Grandmasters and 1,936 Competition Masters.
The ranking system grants points depending on the position of users in the competition leaderboard. The points are not permanent and, as we can see from Figure 1.2, there is a rather complex formula governing their decay. If you do not continue to compete and earn new points, your points decrease quite fast, and the only thing that will remind you of your past glory is the maximum rank you reached. However, once you achieve a medal, it remains in your profile, even if your ranking position changes or your points decrease over time.