I have been trying to sharpen my skills at using ML models to make predictions, and I will share what has been working well for me.
Recently, I have primarily been working on predicting a continuous target, but I often use some categorical features in the model.
In the spring of 2023, I competed against 247 graduate students at USC in the final core course for the Applied Data Science MS program, and I ended up placing 18th. Given that I work full time and was competing against many students who do not, I think this was a respectable result.
Among the top performers who beat me, I think the biggest differentiator was which features they were able to extract and how creatively they augmented them. For business problems, this underscores the importance of knowing the data well and understanding how the relevant business processes work.
In these kinds of competitions, gradient boosting models tend to perform best, especially XGBoost, LightGBM, and CatBoost.
I usually find that CatBoost scores highest when there are any categorical features, while LightGBM wins when all the features are continuous. CatBoost seems to be less popular for some reason, but it consistently performs well for me.
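One reason CatBoost is so convenient with mixed data is that it accepts raw categorical columns directly; you just name them with `cat_features` and skip the encoding step. Here is a minimal sketch of that, with an entirely made-up DataFrame just for illustration:

```python
# Minimal sketch: fitting CatBoost on a mix of continuous and categorical columns.
# The DataFrame and its columns ("price", "region", "channel") are made up for illustration.
import pandas as pd
from catboost import CatBoostRegressor

df = pd.DataFrame({
    "price":   [10.5, 23.1, 7.8, 15.0, 19.4, 8.2],
    "region":  ["west", "east", "west", "south", "east", "west"],
    "channel": ["web", "store", "web", "web", "store", "store"],
    "target":  [1.2, 3.4, 0.9, 2.1, 2.8, 1.0],
})

X, y = df.drop(columns="target"), df["target"]

# CatBoost consumes the string columns as-is; no one-hot or label encoding needed.
model = CatBoostRegressor(iterations=300, learning_rate=0.05, verbose=0)
model.fit(X, y, cat_features=["region", "channel"])
```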
After I have set up all the data processing and created every potentially useful feature I can think of, I run the data through one or more of these models.
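My first pass at this step usually looks something like the sketch below: the same feature matrix scored with each booster under cross-validation so I can see which one is worth carrying forward. The feature matrix `X` and target `y` are assumed to already exist from the processing step, and the settings shown are rough defaults, not tuned values.

```python
# Sketch: score the same features with each booster and compare cross-validated error.
# Assumes X (numeric/encoded features) and y (continuous target) already exist.
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

models = {
    "xgboost":  XGBRegressor(n_estimators=500, learning_rate=0.05),
    "lightgbm": LGBMRegressor(n_estimators=500, learning_rate=0.05),
    "catboost": CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE {-scores.mean():.4f} (+/- {scores.std():.4f})")
```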
Next, I use Recursive Feature Elimination (RFE) to remove features that do not contribute to the model. I think a lot of people skip RFE, and it hurts their performance when they try to test features for elimination manually. This stage can involve a lot of trial and error, so it is best to handle as much as possible in pre-processing so that you do not lose too much time. Pickle the model if it is slow to train; do not waste time by being lazy with the code. This process may also spark ideas for new features to add to the model, so it can be a long cycle.
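For the RFE step I lean on scikit-learn; the sketch below uses the cross-validated variant (RFECV) with LightGBM purely as an example estimator, and again assumes the feature DataFrame `X` and target `y` from the previous step. It also shows the pickling habit mentioned above.

```python
# Sketch: recursive feature elimination with cross-validation on top of a booster.
# X (DataFrame of candidate features) and y (continuous target) are assumed to exist.
import pickle
from sklearn.feature_selection import RFECV
from lightgbm import LGBMRegressor

selector = RFECV(
    estimator=LGBMRegressor(n_estimators=300, learning_rate=0.05),
    step=1,                       # drop one feature per round
    cv=5,
    scoring="neg_root_mean_squared_error",
    min_features_to_select=5,
)
selector.fit(X, y)

print("features kept:", list(X.columns[selector.support_]))

# Pickle the fitted selector so a slow run never has to be repeated.
with open("rfecv_selector.pkl", "wb") as f:
    pickle.dump(selector, f)
```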
Once I am more confident in the selected features, I start adjusting the weights and continually compare the results across different models.
Be very careful about overfitting; even adjusting the weights in the model can cause it to overfit if you go too far.
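To make the overfitting point concrete, here is a rough sketch of how I would check it. I am reading "weights" above as per-row sample weights passed to the booster, which is only one interpretation; whatever you are adjusting, the check is the same: hold out a validation set, watch the gap between training and validation error, and stop early once the validation score stops improving.

```python
# Sketch: adjust per-row weights while watching the train/validation gap for overfitting.
# X and y are assumed to carry over from earlier; the weighting rule below
# (upweighting later rows) is a made-up example, not a recommendation.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical weighting scheme: give later rows slightly more influence.
w_train = np.linspace(0.5, 1.5, num=len(X_train))

model = lgb.LGBMRegressor(n_estimators=2000, learning_rate=0.03)
model.fit(
    X_train, y_train,
    sample_weight=w_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)

# A widening gap between these two numbers is the warning sign.
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"train RMSE {train_rmse:.4f} vs validation RMSE {val_rmse:.4f}")
```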