Learnings from top solutions
In this section we gather techniques from the top solutions that could lift us above the level of the baseline solution. Keep in mind that the leaderboards (both public and private) in this competition were quite tight, which was due to a combination of factors:
- The noisy labels - it was easy to reach 0.89 accuracy by correctly classifying a large part of the data, after which each additional correct prediction moved the score up only marginally
- The metric - accuracy is a discrete, hard-label metric, which makes ensembling tricky (see the sketch after this list)
- The limited size of the dataset
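To see why a hard 0/1 metric like accuracy complicates ensembling, compare hard (majority) voting with soft (probability-averaging) voting: the two can disagree on the very same model outputs, and each disagreement flips an entire sample rather than nudging a continuous score. The probabilities below are made up purely for illustration.

```python
import numpy as np

# Softmax outputs of three hypothetical models for one image (5 cassava classes).
probs = np.array([
    [0.00, 0.00, 0.00, 0.90, 0.10],  # model A: very confident in class 3
    [0.00, 0.00, 0.00, 0.45, 0.55],  # model B: slightly prefers class 4
    [0.00, 0.00, 0.00, 0.45, 0.55],  # model C: slightly prefers class 4
])

# Hard (majority) voting: only each model's argmax counts.
votes = probs.argmax(axis=1)                          # [3, 4, 4]
hard_pred = np.bincount(votes, minlength=5).argmax()  # -> 4

# Soft voting: average the probabilities first, then take the argmax.
soft_pred = probs.mean(axis=0).argmax()               # -> 3

print(hard_pred, soft_pred)  # 4 3 -- same models, different predicted label
```

Because accuracy only rewards the final label, every such flip moves the score in discrete jumps, which contributes to rankings bunching up on a tight leaderboard.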
Pretraining
The first and most obvious remedy to the limited data size was pretraining: simply using more data. The Cassava competition had also been held a year earlier:
https://www.kaggle.com/competitions/cassava-disease/overview
With minimal adjustments, the data from the 2019 edition could be leveraged in the current one (a sketch of such a merge follows the list below). Several competitors addressed the topic:
- A combined 2019 + 2020 dataset in TFRecords format was released in the forum: https...
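As a concrete illustration of the "minimal adjustments" mentioned above, the sketch below merges the two editions into a single training table. It assumes the usual layouts of the two datasets (2020 ships a `train.csv` with numeric labels, 2019 stores images in per-class folders); the paths and the folder-to-label mapping are assumptions to verify against your local copies.

```python
import pandas as pd
from pathlib import Path

# Hypothetical local paths to both competitions' data; adjust as needed.
DATA_2020 = Path("cassava-leaf-disease-classification")
DATA_2019 = Path("cassava-disease/train")

# Assumed mapping from the 2019 per-class folder names onto the 2020
# numeric label scheme; verify against the 2020 label map before training.
LABEL_MAP_2019 = {"cbb": 0, "cbsd": 1, "cgm": 2, "cmd": 3, "healthy": 4}

# 2020 edition: labels come as a CSV of (image_id, label) pairs.
df_2020 = pd.read_csv(DATA_2020 / "train.csv")
df_2020["path"] = [str(DATA_2020 / "train_images" / x) for x in df_2020["image_id"]]

# 2019 edition: labels are encoded by the folder an image sits in.
rows_2019 = [
    {"path": str(p), "label": label}
    for folder, label in LABEL_MAP_2019.items()
    for p in sorted((DATA_2019 / folder).glob("*.jpg"))
]
df_2019 = pd.DataFrame(rows_2019)

# One combined table that any (path, label) image dataset class can consume.
combined = pd.concat([df_2020[["path", "label"]], df_2019], ignore_index=True)
combined.to_csv("train_2019_2020.csv", index=False)
```

One common pattern from here is to train on the combined table first and then fine-tune on the 2020 images alone, so the final epochs match the test distribution.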