Here we provide details about the evaluation of your results for the offline Tasks 1 and 2.
Evaluation
Test data
The test dataset will consist of a bibtex, a bookmark, and a tas file in the same format
as the training dataset. However, the tas file does not contain tags: instead of a tag
it contains the string null, and there is only one TAS (= line) per post. You can use a
sample file created from the cleaned dump training data to test reading the file.
We will release this dataset 48h before the end of the competition.
We expect from every participant a file which contains one line per prediction: the
content_id of the post (bibtex or bookmark), followed by the list of recommended
tags (the tags are space-separated, and the two columns content_id and tags are
separated by a tab). We consider only the first five tags. Here is an example of the expected format:
content_id	tags
123456778	hello world
We also provide an example result file matching the cleaned dump training data.
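The format above can be produced with a few lines of Java. The following is a minimal sketch; the class and method names (ResultWriter, formatLine) and the output file name are our own choices, not part of the challenge code — only the tab-separated format with at most five space-separated tags is given:

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

// Sketch of writing a result file in the expected format:
// content_id and tags are separated by a tab, tags by spaces.
public class ResultWriter {

    // Build one output line for a post, keeping at most the first five tags.
    public static String formatLine(long contentId, List<String> tags) {
        List<String> firstFive = tags.subList(0, Math.min(5, tags.size()));
        return contentId + "\t" + String.join(" ", firstFive);
    }

    public static void main(String[] args) throws Exception {
        // Write UTF-8, since all files are expected to be UTF-8 encoded.
        try (PrintWriter out = new PrintWriter("tas_result", "UTF-8")) {
            out.println(formatLine(123456778L, Arrays.asList("hello", "world")));
        }
    }
}
```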
Evaluation Criterion
We will use the F1-Measure common in Information Retrieval to evaluate the recommendations. To this end, we first compute precision and recall for each post in the test data by comparing the recommended tags against the tags the user originally assigned to this post. Then we average precision and recall over all posts in the test data and use the resulting precision and recall to compute the F1-Measure as f1m = (2 * precision * recall) / (precision + recall). For details, we refer to the paper Tag Recommendations in Social Bookmarking Systems.
The number of tags one can recommend is not restricted. However, we will regard the first five tags only.
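The evaluation scheme described above can be sketched in Java as follows. This is our own illustration, not the challenge's evaluator: per-post precision and recall are averaged over all posts, and the F1-measure is computed from the averaged values:

```java
import java.util.Set;

// Sketch of the evaluation measures: per-post precision/recall,
// then F1 from the precision and recall averaged over all posts.
public class F1Sketch {

    // Fraction of recommended tags that are among the true tags.
    public static double precision(Set<String> recommended, Set<String> truth) {
        if (recommended.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(truth::contains).count();
        return (double) hits / recommended.size();
    }

    // Fraction of true tags that were recommended.
    public static double recall(Set<String> recommended, Set<String> truth) {
        if (truth.isEmpty()) return 0.0;
        long hits = recommended.stream().filter(truth::contains).count();
        return (double) hits / truth.size();
    }

    // f1m = (2 * precision * recall) / (precision + recall),
    // applied to the averaged precision and recall.
    public static double f1(double avgPrecision, double avgRecall) {
        if (avgPrecision + avgRecall == 0.0) return 0.0;
        return 2 * avgPrecision * avgRecall / (avgPrecision + avgRecall);
    }
}
```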
The comparison of the recommended tags to the true tags of a post will be done according to the following Java function
trueTag.replaceAll("[^0-9\\p{L}]+", "").equalsIgnoreCase( recommendedTag.replaceAll("[^0-9\\p{L}]+", ""));
which means we ignore the case of tags and remove all characters which are neither
numbers nor letters (see also java.util.regex.Pattern).
Since we expect all files to be UTF-8 encoded, the above function will
NOT remove umlauts and other non-Latin characters! We will also apply
Unicode normalization to Normalization Form KC (NFKC).
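Putting the pieces together, the comparison can be reproduced with java.text.Normalizer (available since Java 6) and the replaceAll call quoted above. The class and method names here are our own:

```java
import java.text.Normalizer;

// Sketch of the tag comparison: NFKC-normalize, strip everything that
// is neither a digit nor a letter, then compare case-insensitively.
public class TagMatcher {

    public static String clean(String tag) {
        // Unicode normalization to Normalization Form KC.
        String nfkc = Normalizer.normalize(tag, Normalizer.Form.NFKC);
        // Remove all characters which are neither numbers nor letters;
        // umlauts and other non-Latin letters are kept.
        return nfkc.replaceAll("[^0-9\\p{L}]+", "");
    }

    public static boolean matches(String trueTag, String recommendedTag) {
        return clean(trueTag).equalsIgnoreCase(clean(recommendedTag));
    }
}
```

For example, "Web-2.0" and "web20" match, and "Müller" keeps its umlaut rather than being reduced to "Mller".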
Additionally, the test set will not contain the following tags:
imported, public, system:imported, nn, system:unfiled
Sample Evaluation Program
This JAR file contains a Java program to calculate precision, recall, and F1-measure for given result files. The program is invoked as follows:
where 5 is the maximal number of tags to regard for the recommendation
(the default for the challenge), tas_original is the path to the
original tas file which includes the tags (this is the file you won't get,
of course), and tas_result is the path to the file which contains your
recommendations in the format described above. We provide a (zipped)
version of the tas file from the cleaned dump training data which you can
use to test the evaluator against the training data.
The output is then located in the file tas_result.eval, which looks like this:

1 0.4877296962320823 1.0 0.6556697731682538
2 0.7431330239829573 1.0 0.8526406347175292
3 0.8598491823591262 1.0 0.9246439878188943
4 0.918164381330041 1.0 0.957336493437953
5 0.9497167202873263 1.0 0.9742099561492895

The file contains, for each number of tags (up to 5), a line with the following columns:
- number of regarded tags
- recall
- precision
- F1-measure
Note that due to the Unicode normalization, you need at least Java 6 to run this program. The source code is also available in a source JAR.