The Data wizard is a tool that helps you prepare the data files for the challenge. It automatically splits a given dataset into the following parts:
Training part
This part is disclosed to participants, who use it to build their models. It contains the header of the original file and a randomly chosen subset of samples of a specified size. It contains all attributes of the data, including the decision attribute (label).
Currently, the last column is always used as the decision attribute.
Test public part
This dataset is used as the input for participants' algorithms in order to create the solution to the task. It contains all attributes except the labels, which are to be predicted. Like the training part, it also contains the file header.
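As an illustration only, the split into the training and test public parts could be sketched as below. This is not the wizard's actual implementation: the function name `split_train_test`, the `seed` parameter, the comma-separated input with a header row, and the choice to drop the label column name from the test header are all assumptions made for the example.

```python
import csv
import io
import random


def split_train_test(csv_text, train_size, seed=0):
    """Hypothetical sketch: split CSV text into training and test public parts.

    Assumes the first row is the header and the last column is the
    decision attribute (label).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]

    # Randomly choose which samples go to the training part.
    rng = random.Random(seed)
    train_idx = set(rng.sample(range(len(samples)), train_size))

    # Training part: header plus the chosen samples, labels included.
    train = [header] + [s for i, s in enumerate(samples) if i in train_idx]

    # Remaining samples form the test part; labels are split off.
    test_rows = [s for i, s in enumerate(samples) if i not in train_idx]
    # Assumption: the label column name is removed from the header too.
    test_public = [header[:-1]] + [s[:-1] for s in test_rows]
    test_labels = [s[-1] for s in test_rows]
    return train, test_public, test_labels


example = "a,b,label\n1,2,x\n3,4,y\n5,6,x\n7,8,y\n"
train, test_public, test_labels = split_train_test(example, train_size=2)
```

Here `train` keeps all three columns, `test_public` keeps only `a` and `b`, and `test_labels` holds the withheld labels in the same order as the rows of `test_public`.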
Preliminary and final labels
These are the labels that were excluded from the test public part. They are the target ("ground-truth") values that participants' solutions are compared against, and they will not be revealed to participants. The preliminary labels are used during the preliminary tests, while the final labels are used to calculate the final scores. You can choose the size of each part.
The division into preliminary and final parts is random, and participants will not know which part each sample belongs to. If a sample is not selected for a given evaluation file, an empty line is inserted in its place, so by inspecting the files you can find out which test records go where.
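The empty-line convention can be illustrated with a small sketch. It is a guess at the behaviour described above, not the wizard's code: the function name `split_labels`, its parameters, and the one-label-per-line layout are assumptions for the example.

```python
import random


def split_labels(labels, preliminary_size, final_size, seed=0):
    """Hypothetical sketch: divide test labels into preliminary and final lists.

    Each returned list has one entry per test record. An empty string
    marks a record that was not selected for that evaluation file.
    """
    rng = random.Random(seed)
    # Draw distinct record indices at random; the first ones go to the
    # preliminary part, the rest to the final part.
    chosen = rng.sample(range(len(labels)), preliminary_size + final_size)
    prelim_idx = set(chosen[:preliminary_size])
    final_idx = set(chosen[preliminary_size:])

    preliminary = [lab if i in prelim_idx else "" for i, lab in enumerate(labels)]
    final = [lab if i in final_idx else "" for i, lab in enumerate(labels)]
    return preliminary, final


labels = ["x", "y", "x", "y", "x"]
preliminary, final = split_labels(labels, preliminary_size=2, final_size=3)

# One line per test record; unselected records become empty lines.
preliminary_text = "\n".join(preliminary) + "\n"
final_text = "\n".join(final) + "\n"
```

Because line i of each file corresponds to record i of the test public part, a non-empty line in the preliminary file means that record is scored during the preliminary tests, and likewise for the final file.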
Output files' format
The format of the training and test public parts is the same as the original file's format. The files containing the labels are compatible with the standard evaluation procedures.