If the people at the top of the leaderboard have over a hundred submissions, it can be a clear sign of inconsistency between local validation and leaderboard scores.
On the other hand, if there are people with few submissions at the top, that usually means there is a non-trivial approach to this competition, or one that only a few people have discovered.
If the leaderboard mostly consists of teams with only one participant, you'll probably have a good chance if you gather a strong team.
Research papers
Read scientific articles on the topic of the competition.
This can give you ideas about ML-related things (for example, how to optimize AUC).
It is also a way to get familiar with the problem domain (especially useful for feature generation).
Useful techniques here include binning numerical features and building correlation matrices.
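A quick sketch of both techniques with pandas; the file and column names are placeholders:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Binning: turn a numerical feature into a small number of intervals.
df["price_bin"] = pd.cut(df["price"], bins=10, labels=False)

# Correlation matrix of the numerical features: a quick look at linear relationships.
print(df.corr(numeric_only=True))
```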
Build a simple (or even primitive) baseline:
Often you can find baseline solutions provided by organizers or in kernels.
Start with Random Forest rather than GBMs.
Random Forest works quite fast and requires almost no hyperparameter tuning.
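A minimal baseline sketch along these lines, assuming a binary classification task, numeric features, and a train.csv with a `target` column (all of these are placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
X = train.drop(columns=["target"])
y = train["target"]

# Random Forest: fast, needs little tuning, good enough for a first score.
model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"baseline AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```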
Decide on the correct cross-validation scheme:
People have won just by selecting the right way to validate.
Is time important? Use time-based validation.
New entities in the test set? Use stratified validation.
Otherwise, use a random K-fold strategy.
Check whether validation is stable (i.e. it correlates with the public LB score).
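A minimal sketch of the three schemes with scikit-learn splitters (the number of folds is an assumption):

```python
from sklearn.model_selection import TimeSeriesSplit, StratifiedKFold, KFold

# Time matters: respect chronological order, never validate on the past.
time_cv = TimeSeriesSplit(n_splits=5)

# Preserve class (or entity) proportions in every fold.
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Otherwise a plain random K-fold is enough.
plain_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Usage: for train_idx, valid_idx in strat_cv.split(X, y): ...
```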
Debug the full pipeline:
From loading data to writing a submission file.
After trying the problem individually, explore the public kernels and forums:
Other participants have different approaches resulting in diversity.
Proceed from simple to complex:
Add features in bulk (create many features at once).
Perform hyperparameter tuning:
When tuning parameters, first try to make the model overfit.
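A sketch of the overfit-first check, using synthetic data and a plain gradient boosting model as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in a competition this is your feature matrix.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: give the model plenty of capacity and confirm it can overfit the training set.
big = GradientBoostingClassifier(n_estimators=500, max_depth=8, learning_rate=0.1)
big.fit(X_train, y_train)
print("train AUC:", roc_auc_score(y_train, big.predict_proba(X_train)[:, 1]))
print("valid AUC:", roc_auc_score(y_valid, big.predict_proba(X_valid)[:, 1]))

# Step 2: once a large train/valid gap shows the capacity is there,
# dial it back (shallower trees, lower learning rate, subsampling) and re-tune.
```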
Perform ensembling:
Proceed with ensembling only after feature engineering is done.
For the final submissions, select the best one on the public LB and the best one on local validation (or the most diverse one).
Working with ideas
Organize ideas in some structure:
What things could work here? What approaches might you want to take?
After you're done, read forums and highlight interesting posts and topics.
Sort ideas into priority order: the most important and promising ones need to be implemented first.
Or you may want to organize these ideas into topics:
For example, ideas about feature generation, validation, and metric optimization.
Now pick an idea and implement it.
Try to understand the reasons why something works or not.
Is there some hidden data structure we didn't notice before? The ability to analyze your work and draw conclusions will put you on the right track to revealing hidden data patterns and leaks.
Data loading
Pay attention to optimal usage of computational resources to save a lot of time later.
Running an experiment often requires a lot of kernel restarts, which leads to reloading the data:
Do basic preprocessing and convert csv files into HDF5 (pandas) or npy (NumPy) for faster loading.
Do not forget that by default data is stored in 64-bit format:
Most of the time you can safely downcast it to 32 bits, which results in a 2-fold memory saving.
Large datasets can be processed in chunks with pandas.
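A minimal sketch of these preprocessing tricks (downcasting, caching to HDF5, chunked reading); file names are placeholders and `to_hdf` assumes PyTables is installed:

```python
import numpy as np
import pandas as pd

# Downcast 64-bit columns to 32 bits and cache the result in HDF5 for fast reloads.
train = pd.read_csv("train.csv")
for col in train.columns:
    if train[col].dtype == np.float64:
        train[col] = train[col].astype(np.float32)
    elif train[col].dtype == np.int64:
        train[col] = train[col].astype(np.int32)
train.to_hdf("train.h5", key="train", mode="w")
# Later experiments reload in a fraction of the time:
# train = pd.read_hdf("train.h5", "train")

# Very large files can be read and preprocessed in chunks instead.
for chunk in pd.read_csv("test.csv", chunksize=100_000):
    pass  # preprocess each chunk here
```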
Handling big categories:
Split the dataset by a category and unload each part to a separate file.
This allows performing feature engineering on each category separately.
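A sketch of splitting by one categorical column; the column name `shop_id` and the file names are illustrative placeholders:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Write one file per category so each part can be loaded and processed on its own.
for value, part in train.groupby("shop_id"):
    part.to_csv(f"train_shop_{value}.csv", index=False)
```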
Feature engineering
The type of problem defines the type of feature engineering.
Even for medium-sized datasets like 100,000 rows you can validate your models with a simple holdout strategy.
Switch to CV only when it is really required:
For example, when you've already hit some limit and can move forward only with marginal improvements.
Faster evaluation:
Start with the fastest models, such as LightGBM.
Use early stopping to reduce the run time.
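A sketch of LightGBM with early stopping; synthetic data stands in for the real feature matrix, and the exact parameter values are assumptions:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=30, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
# Stop as soon as the validation metric has not improved for 100 rounds.
model.fit(
    X_tr, y_tr,
    eval_set=[(X_va, y_va)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
print("best iteration:", model.best_iteration_)
```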
Switch to tuning the models and ensembling only when you are satisfied with feature engineering.
Rent a larger server if you're constrained by your computational resources.
Ensembling
Save predictions on internal validation and test sets.
Make sure you save the predictions of all the models trained before.
Sometimes team collaboration is just sending csv files.
There are different ways to combine predictions, from averaging to stacking:
Small data requires simpler ensembling techniques.
Average a few low-correlated predictions with good scores.
The stacking process requires its own feature engineering.
Most of the time, ensembling gives only a marginal score improvement.
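A minimal sketch of both approaches, assuming each model's predictions were saved to a csv with a `prediction` column (file and column names are placeholders):

```python
import numpy as np
import pandas as pd

# Averaging: combine a few good, low-correlated prediction files.
preds = [pd.read_csv(f)["prediction"].values
         for f in ["model_a.csv", "model_b.csv", "model_c.csv"]]
print("pairwise correlation:\n", np.corrcoef(preds))  # sanity-check diversity
blend = np.mean(preds, axis=0)

# Stacking: out-of-fold predictions become features for a simple meta-model,
# e.g. sklearn's LogisticRegression:
# oof_train: shape (n_train, n_models), test_meta: shape (n_test, n_models)
# meta = LogisticRegression().fit(oof_train, y_train)
# stacked = meta.predict_proba(test_meta)[:, 1]
```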
Code organization
Set up a separate environment for each competition.
If your code is hard to read, you will definitely have problems sooner or later:
Keep important code clean.
Use good variable and function names.
Keep your research reproducible:
Fix all random seeds.
Write down exactly how each of the features was generated.
Store the code under a version control system like git.
You may also create a notebook for each submission so they can be compared later.
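A small helper, a sketch of what fixing the random seeds can look like in plain Python/NumPy (add framework-specific seeds, e.g. for PyTorch or TensorFlow, if you use them):

```python
import os
import random

import numpy as np

def fix_seeds(seed: int = 42) -> None:
    """Fix the seeds of the common sources of randomness."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)

fix_seeds(42)
```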
Reuse code as much as possible:
It is especially important to use the same code for the train and test stages. For example, features should be prepared and transformed by the same code in order to guarantee that they're produced in a consistent manner.
Move reusable code into separate functions or even separate modules.
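One way to enforce this is the fit/transform pattern, sketched here with scikit-learn's StandardScaler standing in for any feature transformation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit transformation parameters on train only, then apply the SAME object to test,
# so both stages go through exactly the same code path.
scaler = StandardScaler().fit(X_train)
X_train_prepared = scaler.transform(X_train)
X_test_prepared = scaler.transform(X_test)
```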
We are provided with training and test CSV files (train.csv and test.csv). Split the training set into training and holdout parts, and save them to disk as CSV files with the same structure as the input files (train.csv and valid.csv). Then you only need to switch the paths to either run experiments or produce a submission.
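A minimal sketch of that split; the 80/20 ratio, the random seed, and the output file names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

full = pd.read_csv("train.csv")
train_part, valid_part = train_test_split(full, test_size=0.2, random_state=42)

# Same structure as the original train.csv, so the pipeline code stays unchanged.
train_part.to_csv("train_part.csv", index=False)
valid_part.to_csv("valid.csv", index=False)
```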
Use a custom library with frequent operations already implemented.