Evaluation protocol needs improvement

The competition scores 30 candidate solutions x_i against a single given y, and that y can be chosen arbitrarily from a large set.

Both factors (the small sample of 30 solutions and the arbitrary choice of y) produce a score with high variance, and they can reward models that don't generalize beyond the chosen sample.
Ideally, submissions should be scored against a fixed, large set of y values provided by the organizers.
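To make the variance point concrete, here is a toy simulation. The per-sample score and the per-y "difficulty" below are made up, not the competition metric; the only point is how much the final number moves when it depends on one arbitrarily chosen y and 30 samples, versus an average over a fixed pool:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each possible y gets its own "difficulty" offset, and each
# of the 30 sampled solutions x_i gets a noisy per-sample score around the
# model's true quality. Neither matches the real metric; only the spread matters.
n_y_pool = 1000
y_difficulty = rng.normal(0.0, 0.15, size=n_y_pool)
true_quality = 0.7

def leaderboard_score(y_idx, n_samples=30):
    per_sample = rng.normal(true_quality + y_difficulty[y_idx], 0.2, size=n_samples)
    return per_sample.mean()

# Same model, scored many times against a single arbitrarily chosen y.
single_y_scores = [leaderboard_score(rng.integers(n_y_pool)) for _ in range(500)]
print("std over single-y evaluations :", np.std(single_y_scores).round(3))

# Same model, scored against a fixed set of 200 y values averaged together.
fixed_set = rng.integers(n_y_pool, size=200)
fixed_scores = [np.mean([leaderboard_score(i) for i in fixed_set]) for _ in range(20)]
print("std over fixed-set evaluations:", np.std(fixed_scores).round(3))
```

The spread in the first case comes from both the choice of y and the 30-sample noise; averaging over a fixed pool removes most of it.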

I think that maximizing coverage is not a very useful target if the number of samples that can be taken is unlimited. Maybe the toy dataset and evaluation protocol have abstracted away elements that are essential to the organizers' target use case.
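To illustrate why: under a hypothetical coverage definition (fraction of 100 target bins hit by at least one sample; this is my assumption, not the official metric), even a purely random sampler maxes it out once enough samples are allowed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical coverage metric (an assumption, not the official definition):
# the target space is split into 100 bins and coverage is the fraction of
# bins hit by at least one sampled solution.
n_bins = 100

def coverage(n_samples):
    hits = rng.integers(n_bins, size=n_samples)  # a purely random "model"
    return len(np.unique(hits)) / n_bins

for n in (30, 300, 3000):
    print(f"{n:>5} random samples -> coverage {coverage(n):.2f}")
# With an unbounded sample budget, even uniform random guessing pushes coverage
# toward 1, so coverage alone stops discriminating between models.
```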


Thank you for your observations; they will be taken into account during the decision-making process.

Team Xeek

If by "decision-making process" you mean the final evaluation, then how can the current leaderboard be reliable?


@team, can you give a final clarification on this?

Regarding the leaderboard score (degree of score), is there a perfect score that could be attained, or a maximum score? Does a score of 1 represent that?

I'm asking so that I can build a local validation for the final submission, since the leaderboard validation is not reliable.
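For reference, this is roughly the kind of local validation I have in mind; `competition_metric` below is only a placeholder until the scoring rule is clarified, and the dummy model and y pool are made up for the example:

```python
import numpy as np

def competition_metric(y, xs):
    # Placeholder scoring rule; swap in the organizers' metric once clarified.
    return float(np.mean([np.exp(-abs(x - y)) for x in xs]))

def local_score(model, y_pool, n_samples=30, seed=0):
    """Average the per-y score over a fixed pool of y instead of a single one."""
    rng = np.random.default_rng(seed)
    scores = [competition_metric(y, model(y, n_samples, rng)) for y in y_pool]
    return float(np.mean(scores)), float(np.std(scores))

# Usage with a dummy model that just samples around y.
dummy_model = lambda y, n, rng: rng.normal(y, 1.0, size=n)
y_pool = np.linspace(-5, 5, 200)  # fixed local validation set of y values
mean, std = local_score(dummy_model, y_pool)
print(f"local score: {mean:.3f} +/- {std:.3f}")
```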