Sure! Roughly, we selected users with at least five different names and withheld the last two entered names for evaluation (have a look at the description of the offline challenge for a detailed description of the selection process).
If you want to obtain comparable training and test scenarios from the public data yourself, you can use the Perl script from the download page, which was used to split the data into training and test sets.
And if you are still interested, have a look at the following example for a user with id 23, which illustrates some of the nitty-gritty details. First, consider a fictitious full user profile within nameling’s query logs:
userId  activity      name      POSIX_time
23      ADD_FAVORITE  max       1361099013
23      ENTER_SEARCH  carsten   1361099014
23      ENTER_SEARCH  jan       1361099015
23      ENTER_SEARCH  carsten   1361099016
23      ENTER_SEARCH  stephan   1361099017
23      ENTER_SEARCH  andreas   1361099018
23      ENTER_SEARCH  alromano  1361099019
23      LINK_SEARCH   carsten   1361099020
23      ENTER_SEARCH  andreas   1361099021
23      ENTER_SEARCH  robert    1361099022
23      ENTER_SEARCH  max       1361099023
23      LINK_SEARCH   oscar     1361099024
23      NAME_DETAILS  oscar     1361099025
According to the selection of test names from the user’s full profile, the following part is contained in the training data set:
userId  activity      name      POSIX_time
23      ADD_FAVORITE  max       1361099013
23      ENTER_SEARCH  carsten   1361099014
23      ENTER_SEARCH  jan       1361099015
23      ENTER_SEARCH  carsten   1361099016
23      ENTER_SEARCH  stephan   1361099017
The test data set contains:
userId  name_1   name_2
23      andreas  robert
Note that alromano is not contained in nameling’s list of known names. For the evaluation, andreas and robert are selected, while all other activities (after 23 ENTER_SEARCH andreas 1361099021) are discarded.
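If it helps, here is a rough Python sketch that reproduces the split for user 23. It is not the Perl script from the challenge, and the rules in it are only assumptions read off the example: a test name must come from an ENTER_SEARCH activity, must be in the list of known names, and must not have been added as a favourite by the user; the training data ends just before the first occurrence of a test name. The names split_user, LOG and KNOWN_NAMES are made up for illustration, and KNOWN_NAMES is only a tiny stand-in for nameling’s real list.

```python
from typing import List, Set, Tuple

# One log row: (userId, activity, name, POSIX_time)
Activity = Tuple[int, str, str, int]


def split_user(log: List[Activity], known_names: Set[str],
               n_test: int = 2) -> Tuple[List[Activity], List[str]]:
    """Split one user's chronologically sorted log into (training, test_names)."""
    # Assumption: names the user added as a favourite are not eligible.
    favourites = {name for _, act, name, _ in log if act == "ADD_FAVORITE"}

    # Walk backwards and collect the last n_test distinct eligible names.
    test_names: List[str] = []
    for _, act, name, _ in reversed(log):
        eligible = (act == "ENTER_SEARCH" and name in known_names
                    and name not in favourites and name not in test_names)
        if eligible:
            test_names.append(name)
            if len(test_names) == n_test:
                break
    test_names.reverse()  # restore chronological order

    # Assumption: training data stops before the first row that mentions
    # any of the test names, whatever the activity type.
    cut = next((i for i, row in enumerate(log) if row[2] in test_names),
               len(log))
    return log[:cut], test_names


# Hypothetical stand-in for nameling's list of known names (no "alromano").
KNOWN_NAMES = {"max", "carsten", "jan", "stephan", "andreas", "robert", "oscar"}

# The fictitious full profile of user 23 from the example above.
LOG: List[Activity] = [
    (23, "ADD_FAVORITE", "max", 1361099013),
    (23, "ENTER_SEARCH", "carsten", 1361099014),
    (23, "ENTER_SEARCH", "jan", 1361099015),
    (23, "ENTER_SEARCH", "carsten", 1361099016),
    (23, "ENTER_SEARCH", "stephan", 1361099017),
    (23, "ENTER_SEARCH", "andreas", 1361099018),
    (23, "ENTER_SEARCH", "alromano", 1361099019),
    (23, "LINK_SEARCH", "carsten", 1361099020),
    (23, "ENTER_SEARCH", "andreas", 1361099021),
    (23, "ENTER_SEARCH", "robert", 1361099022),
    (23, "ENTER_SEARCH", "max", 1361099023),
    (23, "LINK_SEARCH", "oscar", 1361099024),
    (23, "NAME_DETAILS", "oscar", 1361099025),
]

training, test_names = split_user(LOG, KNOWN_NAMES)
print(test_names)  # ['andreas', 'robert']
for row in training:
    print(*row)
```

With these assumptions the sketch yields the five training rows and the two test names shown above. For anything beyond this toy case, the description of the offline challenge and the original Perl script remain the reference.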