The competition evaluates 30 candidate solutions x_i against a single given y, which can be chosen arbitrarily from a large set.
Both factors contribute to a high-variance score, and can reward models that don’t generalize beyond the chosen sample.
Ideally, submissions should be scored against a large, fixed set of target values y provided by the organizers.
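To make the variance concern concrete, here is a minimal sketch (all names and the distance-based score are hypothetical, not the competition's actual metric) comparing the spread of a best-of-30 score under a single arbitrarily chosen y versus averaging over a fixed pool of targets:

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: a model is scored by how close its best candidate
# solution x_i gets to a target y (larger = better, max is 0).
def score(candidates, y):
    return max(-abs(x - y) for x in candidates)

# A fixed pool of possible targets y, standing in for the "large set".
target_pool = [random.uniform(0, 100) for _ in range(1000)]

# One model's 30 candidate solutions.
candidates = [random.uniform(0, 100) for _ in range(30)]

# Protocol A: score against a single arbitrarily chosen y
# (repeated many times just to expose the spread).
single_y_scores = [score(candidates, random.choice(target_pool))
                   for _ in range(200)]

# Protocol B: score against the whole fixed target set.
fixed_set_score = statistics.mean(score(candidates, y) for y in target_pool)

print("single-y score spread (stdev):",
      round(statistics.pstdev(single_y_scores), 2))
print("fixed-set mean score:", round(fixed_set_score, 2))
```

Under protocol A the reported score depends heavily on which y was drawn; protocol B yields a single stable number for the same set of candidates.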
I also think that maximizing coverage is not a very useful objective when the number of samples that can be taken is unlimited. Perhaps the toy dataset and evaluation protocol have abstracted away elements that are necessary for the organizers' target use case.
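The point about unlimited sampling can be illustrated with a toy simulation (the finite target set of 100 items is an assumption for illustration): with enough draws, even a model-free uniform random sampler achieves full coverage, so coverage alone stops discriminating between submissions.

```python
import random

random.seed(1)

# Hypothetical finite set of distinct solutions to cover.
targets = set(range(100))

covered = set()
draws = 0
while covered != targets:
    # Blind uniform sampling, no model at all.
    covered.add(random.randrange(100))
    draws += 1

print(f"full coverage reached after {draws} random draws")
```

By the coupon-collector argument this takes on the order of n·ln(n) draws, which is trivial when sampling is free.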