Why use MAP for evaluation during the offline challenge?
In previous publications we have shown that recommending given names is a difficult task: for many users, many recommenders did not place the test names in top positions. Measures like precision@k therefore make it hard to distinguish between results, especially for low cut-off thresholds k. MAP (Mean Average Precision) is a measure that is suitable for (arbitrarily long) ordered lists of recommendations. Like NDCG (Normalized Discounted Cumulative Gain) or AUC (Area Under the Curve), it evaluates a recommendation based on the positions of the left-out items within the list. It yields high scores when the test items appear at the top of the recommendation list and lower scores if they are ranked further down (unlike precision@k, where lower-ranked items are cut off and thus do not contribute to the score).
In the offline challenge, two names have been left out for each test user and thus have to be predicted by the recommender systems. The scores depend on the ranks of those two items. The main difference between the measures is how they incorporate these ranks into the score: while AUC yields a linear combination of the two ranks, MAP takes a linear combination of the reciprocal ranks, and NDCG a linear combination of the (logarithmically) smoothed reciprocal ranks. Among these measures, MAP discriminates most strongly between higher and lower ranks and was therefore chosen for the challenge; the small sketch below illustrates the difference.
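As an illustration of this difference, here is a small Perl sketch (illustrative only, not the official evaluation code) that computes the average precision and an NDCG-style score from the two ranks r1 < r2 of a user's left-out names:

# Illustrative sketch (not the official evaluation code): how the ranks r1 < r2
# of a user's two left-out names enter average precision and an NDCG-style score.
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2); }

sub average_precision {
    my ($r1, $r2) = @_;                      # 1-based ranks of the two test names
    return (1 / $r1 + 2 / $r2) / 2;          # precision@r1 and precision@r2, averaged
}

sub ndcg {
    my ($r1, $r2) = @_;
    my $dcg  = 1 / log2($r1 + 1) + 1 / log2($r2 + 1);
    my $idcg = 1 / log2(2) + 1 / log2(3);    # ideal: test names at ranks 1 and 2
    return $dcg / $idcg;
}

# Moving a hit from rank 10 to rank 100 reduces the NDCG-style score by roughly
# a third, but reduces the average precision by almost 90%:
printf "AP:   %.4f vs. %.4f\n", average_precision(10, 1000), average_precision(100, 1000);
printf "NDCG: %.4f vs. %.4f\n", ndcg(10, 1000), ndcg(100, 1000);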
Which data will be used for the leaderboard?
The leaderboard gives you an overview of the latest results from all participating teams. Once a week we check for new submissions and recalculate the board. However, only each team's latest submission is used.
This rule is important to prevent cheating: otherwise, a team could generate many result lists (even by guessing) and submit them all, trying to reach the top of the board with a brute-force approach.
Why use k = 1,000 for MAP@k evaluation?
Although we have argued against cut-off measures like precision@k (in the recommender literature often precision@5 or precision@10), it is reasonable to cut off lists at some point. Compared to many other areas where recommendations are used (e.g., people, movie, book, or website recommenders), the time needed to evaluate a recommended name is very short (if you like a name, just click it), and the cost, in terms of money or time, of following a recommendation that turns out to be bad is very low. At the same time, finding the perfect name for a child is often a process of months rather than minutes (as for finding the next book to read or deciding which movie to watch) or seconds (deciding which tag to use or which website to visit). Thus it is reasonable to assume that parents-to-be are willing to scroll through lists of names longer than the usual top 10. Additionally, consider that one of the traditional ways of searching for names is to go through first-name dictionaries, where names are listed unpersonalized, in alphabetical order. Such books usually contain far more than 1,000 names that have to be read, and therefore it seems reasonable that readers of such books won't mind studying longer name lists on the web.
When I submit an empty list, the evaluation script returns 0.0014975…
Shouldn’t it return 0?
The script doesn’t return 0 because we adapted the MAP score to account for the restriction of the evaluation to the top 1,000 names.
Firstly, consider a recommender A which recommends a relevant name at position 10 and a second relevant name at position 1000. Its MAP score is (1/10 + 2/1000)/2 = 0.051.
Now consider another recommender B which recommends only one relevant name, at position 10. Without any adaptation, its MAP score would be (1/10)/1 = 0.1.
That is, recommender B would achieve a MAP score nearly twice that of recommender A, although A recommends two relevant names and B only one.
We tackled this by virtually placing the relevant names which a recommender fails to recommend at positions 1000 + i. In the case of the empty list, none of the relevant names was recommended, yielding a MAP score of (1/1001 + 2/1002)/2 = 0.001497504… (see the sketch below).
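The following Perl sketch shows our reading of this adjustment (it is not the official evaluation script): relevant names missing from the submitted top-1000 list are placed at the virtual positions 1000 + i before the average precision is computed.

# Sketch of the adjustment described above (our reading of it, not the official
# evaluation script): relevant names missing from the submitted top-1000 list
# are treated as if they were ranked at the virtual positions 1000 + i.
use strict;
use warnings;

sub adjusted_average_precision {
    my ($recommended, $relevant) = @_;             # array refs of (lower-cased) names
    my %rank;
    $rank{ $recommended->[$_] } //= $_ + 1 for 0 .. $#{$recommended};

    my ($missing, @ranks) = (0);
    for my $name (@$relevant) {
        if (exists $rank{$name} && $rank{$name} <= 1000) {
            push @ranks, $rank{$name};             # recommended within the top 1000
        } else {
            push @ranks, 1000 + ++$missing;        # virtual position 1000 + i
        }
    }
    @ranks = sort { $a <=> $b } @ranks;

    my $ap = 0;
    $ap += ($_ + 1) / $ranks[$_] for 0 .. $#ranks; # precision at each relevant rank
    return $ap / @ranks;
}

# An empty recommendation list places both test names at positions 1001 and 1002:
printf "%.9f\n", adjusted_average_precision([], ['andreas', 'robert']);  # 0.001497504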
Is it a problem to submit the result file with names in lower case?
The case of a name is ignored during evaluation, using the following UTF-8-aware lower-case conversion in Perl:

use Encode qw(decode encode);   # provides decode() and encode() for UTF-8 handling

$enc = "utf-8";
# ...
$inputline = decode($enc, $inputline);        # raw UTF-8 bytes -> character string
$inputline = encode($enc, lc($inputline));    # lower-case and re-encode as UTF-8
To answer the question: it is perfectly fine to submit the result file with names in lower case!
Can you give an example on how the test data is derived from the full usage log?
Sure! Roughly, we selected users with at least five different names and withheld the last two entered names for evaluation (see the offline challenge description for the details of the selection process).
If you want to obtain comparable training and test scenarios from the public data yourself, you can use the Perl script on the download page, which is the script we used to split the data into training and test sets.
And if you are still interested, have a look at the following example for a user with id 23, which illustrates some of the gritty details. Firstly, consider a fictitious full user profile within nameling’s query logs:
userId  activity      name      POSIX_time
23      ADD_FAVORITE  max       1361099013
23      ENTER_SEARCH  carsten   1361099014
23      ENTER_SEARCH  jan       1361099015
23      ENTER_SEARCH  carsten   1361099016
23      ENTER_SEARCH  stephan   1361099017
23      ENTER_SEARCH  andreas   1361099018
23      ENTER_SEARCH  alromano  1361099019
23      LINK_SEARCH   carsten   1361099020
23      ENTER_SEARCH  andreas   1361099021
23      ENTER_SEARCH  robert    1361099022
23      ENTER_SEARCH  max       1361099023
23      LINK_SEARCH   oscar     1361099024
23      NAME_DETAILS  oscar     1361099025
According to the selection of test names from the user’s full profile, the following part is contained in the training data set:
userId  activity      name      POSIX_time
23      ADD_FAVORITE  max       1361099013
23      ENTER_SEARCH  carsten   1361099014
23      ENTER_SEARCH  jan       1361099015
23      ENTER_SEARCH  carsten   1361099016
23      ENTER_SEARCH  stephan   1361099017
The test data set contains:
userId  name_1   name_2
23      andreas  robert
Note that alromano is not contained in nameling’s list of known names. For the evaluation, andreas and robert are selected, while all other activities (everything after 23 ENTER_SEARCH andreas 1361099021) are discarded.
Can you give an example of how the result files should look like?
The tab-separated line for user 23 might look like:
23 john folke andreas stephan jack robert ...
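If you build the file programmatically, the following minimal Perl sketch illustrates the format (the file name result.tsv and the recommended names are just placeholders, not prescribed by the challenge):

use strict;
use warnings;

# One tab-separated line per test user: the user id, followed by the ranked
# recommendations (only the first 1,000 names are taken into account).
my %recommendations = (
    23 => [qw(john folke andreas stephan jack robert)],
);

open my $out, '>', 'result.tsv' or die "Cannot write result.tsv: $!";
for my $user (sort { $a <=> $b } keys %recommendations) {
    print {$out} join("\t", $user, @{ $recommendations{$user} }), "\n";
}
close $out;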
How can I derive my own training and test set from the public challenge data?
Due to several constraints, the choice of evaluation data is a bit complicated; it is described in detail on the offline challenge description page.
But you don’t have to implement this process yourself. Instead, you can use our Perl script from the download page.
The script is not very user-friendly (sorry), but it should do the job, assuming the list of known names is given in /path/to/namelist.txt and the public training data in /path/to/nameling.trainData:
$ cat /path/to/nameling.trainData | ./process_activitylog.pl /path/to/namelist.txt
...writing out name statistics to /tmp/nameling-public.names
...writing out category usage statistics to /tmp/nameling-public.categories
...writing out training data to /tmp/nameling-public.trainData and evaluation data to /tmp/nameling-public.evalData
As indicated, you will find your personal training data in /tmp/nameling-public.trainData and the corresponding evaluation data in /tmp/nameling-public.evalData.