A contestant’s view on the WARN-D machine learning competition: Pros and worrisome cons
On January 7, 2026, the WARN-D machine learning competition went live on Codabench (link). Researchers from around the world were invited to build prediction models to forecast depression onset in young adults. The data came from the WARN-D trial and included baseline questionnaires, ambulatory assessments, smartwatch measurements, and follow-up questionnaires. I was immediately captivated by the competition and eager to put my machine learning skills to work, so I signed up right away. Now that the first phase of the competition has concluded, and before the results are made public, I want to reflect on my experience as a contestant. I also want to reflect on a broader question: Is the competition format really a good fit for building prediction models in mental health?
First and foremost, credit goes to the organizers for taking this unique initiative. It brought together a large number of researchers and made me feel part of a larger community. It was genuinely fun to feel that everyone was “in it together,” each trying to make the best sense of the data and, naturally, hoping to claim the top spot. We know that researchers often use very different analytic strategies and that these differences can lead to different conclusions (Breznau et al., 2022). A competition can be valuable precisely because it makes that variability visible and offers a way to compare which methods perform best for predicting depression onset. Because contestants only had access to the training data, it also created the possibility of evaluating models on unseen test data. Still, the competition also had a few growing pains.
On the practical side, three issues stood out to me. First, the submission system did not work at first, so contestants could not see how well their models performed. Once that issue was resolved, the evaluation turned out to be faulty, which meant that some prediction models received incorrect scores. Later, when that was corrected, the organizers decided to hide the scores altogether. I think that was the right decision, although making that change partway through may have disadvantaged contestants who joined afterward. Second, the organizers provided a benchmark model as an example for contestants. However, this model did not produce the same kind of predictions that contestants’ models were supposed to generate, so it also had to be corrected. Third, the data themselves posed challenges. Initially, missing outcomes were filled in with 0. Any model trained on those labels would therefore be biased toward predicting no depression onset when a participant simply missed a follow-up assessment. In addition, there was extensive missingness in the predictors because of questionnaire skip logic. The organizers addressed some of these problems by releasing a new version of the data later on, although some missingness due to skip logic still remained. I understand that first-time competitions come with unexpected issues, but the number of mid-course changes was frustrating. I had to rebuild parts of my pipeline several times, and at one point I even considered stopping. I wonder how many contestants actually did stop, and what effect that had on the final results.
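To make the label issue concrete, here is a minimal sketch of the difference between recoding missing outcomes as 0 and simply excluding unlabeled rows from training. The column names and values are hypothetical, not the actual WARN-D variables:

```python
import pandas as pd

# Toy follow-up data; "outcome" is 1 = depression onset, 0 = no onset,
# and None marks a missed follow-up assessment. Names are illustrative.
df = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "outcome": [1.0, 0.0, None, 0.0],
})

# Problematic: a missed follow-up silently becomes a confident "no onset" label.
filled = df.assign(outcome=df["outcome"].fillna(0))

# Preferable: train only on rows where the outcome was actually observed.
observed = df.dropna(subset=["outcome"])

print(len(filled), len(observed))  # 4 labeled rows vs. 3 truly labeled rows
```

With the filled-in version, participant 3 teaches the model that their predictor profile goes with "no onset," even though we never observed their outcome at all.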
Beyond these practical issues, I think there are broader questions about whether the competition format is well suited to a dataset like WARN-D, which is still relatively small. The training data consist of 1,088 participants, 8,508 observations, and 6,229 non-missing labels. The test data contain 466 participants and 3,662 observations, many of which will likely have missing outcomes. This is a very different setting from fields such as computer vision, where competitions have often thrived on much larger datasets. I think this matters for at least two reasons. First, small datasets can make results more sensitive to algorithmic variability (Flint et al., 2021; Schader et al., 2024). Many machine learning methods include randomness in the training process. Neural networks, for example, begin with randomly initialized weights. In smaller datasets, these random differences can have a substantial effect on performance, so some models may appear better simply because they got lucky. In that case, the winning model may reflect a fortunate initialization rather than a genuinely superior pipeline. I therefore hope the organizers will examine how uncertain the top-performing models really are. Second, small datasets can make results more sensitive to between-person variability. For a competition to identify the “best” model in a meaningful way, the test data need to be reasonably representative of future data. That can happen when the test set is large enough to capture the relevant variability, or when there is relatively little variability in the first place. Mental health data, however, often contain substantial heterogeneity across individuals, especially in depression, as the organizers have stated before (Fried & Nesse, 2015). In that context, a model may perform best on a small test set not because it generalizes well, but because it happens to fit that particular group especially well. 
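As a toy illustration of the first point, the sketch below trains the same small neural network ten times on the same simulated data, varying only the random seed, so any spread in test AUC comes from weight initialization and data shuffling alone. The data are synthetic and only loosely mimic a low-base-rate outcome; nothing here uses the actual WARN-D data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# A small, noisy binary problem with a low base rate (illustrative sizes).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.85, 0.15], flip_y=0.05, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

# Identical pipeline, identical data; only the seed changes between runs.
aucs = []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300,
                        random_state=seed)
    clf.fit(X_train, y_train)
    aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

print(f"AUC across seeds: {min(aucs):.3f} to {max(aucs):.3f}")
```

On samples this small, the gap between the luckiest and unluckiest seed can be large enough to reorder a leaderboard, which is exactly why reporting uncertainty around the top scores matters.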
Even without direct leakage from the test set into training, selecting one winner from a large field of competing approaches can still amount to an implicit form of overfitting. I therefore hope the organizers will also evaluate the robustness of the winning models.
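A small simulation makes this point: if many models with identical true skill are scored on a modest test set and the top scorer is declared the winner, the winner's observed score will overstate its true performance. The numbers below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n_models, n_test = 200, 400
true_accuracy = 0.70  # every "model" has exactly the same true skill

# Observed test accuracies are noisy estimates of the same underlying truth.
observed = rng.binomial(n_test, true_accuracy, size=n_models) / n_test

print(f"true accuracy of every model: {true_accuracy:.3f}")
print(f"winner's observed accuracy:   {observed.max():.3f}")
# Selecting the maximum makes the winner look better than any model really is.
```

The winning score exceeds 0.70 not because the winner generalizes better, but because the maximum of many noisy estimates is biased upward, and the bias grows as the test set shrinks or the field of contestants grows.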
Overall, I remain grateful to the organizers for this creative initiative and all the effort they invested in the competition. Though there were practical issues, there is a great deal that future competitions can learn from this experience. At the same time, I think competitions in mental health raise important questions when the available datasets are relatively small. I hope the organizers will be able to address these concerns in their evaluation of the winning models, and that future organizers will continue to develop ways of dealing with them. In the meantime, good luck to all contestants when the results are finally made public!
Nicolas Leenaerts
BAEF Post-doctoral Fellow
Harvard University
References:
Breznau, N., Rinke, E. M., Wuttke, A., Nguyen, H. H. V., Adem, M., Adriaans, J., Alvarez-Benjumea, A., Andersen, H. K., Auer, D., Azevedo, F., Bahnsen, O., Balzer, D., Bauer, G., Bauer, P. C., Baumann, M., Baute, S., Benoit, V., Bernauer, J., Berning, C., … Żółtak, T. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences of the United States of America, 119(44), e2203150119. https://doi.org/10.1073/pnas.2203150119
Flint, C., Cearns, M., Opel, N., Redlich, R., Mehler, D. M. A., Emden, D., Winter, N. R., Leenings, R., Eickhoff, S. B., Kircher, T., Krug, A., Nenadic, I., Arolt, V., Clark, S., Baune, B. T., Jiang, X., Dannlowski, U., & Hahn, T. (2021). Systematic misestimation of machine learning performance in neuroimaging studies of depression. Neuropsychopharmacology, 46(8), 1510–1517. https://doi.org/10.1038/s41386-021-01020-7
Fried, E. I., & Nesse, R. M. (2015). Depression is not a consistent syndrome: An investigation of unique symptom patterns in the STAR*D study. Journal of Affective Disorders, 172, 96–102. https://doi.org/10.1016/j.jad.2014.10.010
Schader, L. M., Song, W., Kempker, R., & Benkeser, D. (2024). Don’t let your analysis go to seed: On the impact of random seed on machine learning-based causal inference. Epidemiology, 35(6), 764–778. https://doi.org/10.1097/EDE.0000000000001782
