ECML PKDD Discovery Challenge 2008

Introduction

This year's discovery challenge presents two tasks in the young area of social bookmarking: one covers spam detection, the other tag recommendation. As hosts of the social bookmark and publication sharing system BibSonomy, we are able to provide a BibSonomy dataset for the challenge. A training dataset for both tasks is available from the beginning of the competition; the test dataset will be released 48 hours before the final deadline. Due to the very tight schedule we cannot grant any deadline extensions. The results will be presented at the ECML/PKDD workshop, where the top teams are invited to present their approaches.

To get started with the tasks, we suggest that you make yourself familiar with BibSonomy. A more formal description of the underlying structure, called a folksonomy, is given in this paper (pdf here), which also describes the BibSonomy components. The next step is to subscribe to the mailing list rsdc08. We will use the list to distribute news about the challenge and other important information, and it can also be used to clarify questions about the dataset and the tasks. As the welcome message on the list contains information about how to access the dataset, subscribing to this list is essential to participate in the challenge. You may take part in either one of the tasks or in both.

Tasks

1. Spam Detection in Social Bookmarking Systems

With the growing popularity of social bookmarking systems, spammers have discovered this kind of service as a playground for their activities. They usually pursue two goals: on the one hand, they place links in the system to attract people to advertising sites; on the other hand, they raise the PageRank of their sites by placing links in as many popular Web 2.0 sites as possible, in order to increase their visibility in Google and other search engines.
Usual counter-measures like CAPTCHAs are not efficient enough to effectively prevent misuse of the system. Over the last year, we were able to collect data on more than 2,000 active users and more than 25,000 spammers by manually labeling spammers and non-spammers. The provided dataset consists of these users and all of their posts. This includes all public information such as the URL, the description, and all tags of each post. The goal of this task is to learn a model which predicts whether a user is a spammer or not. In order to detect spammers as early as possible, the model should already make good predictions when a user submits his or her first post.

Dataset description

A general description of the dataset can be found here. For the spam detection task, all provided files are relevant.

Evaluation

All participants can use the training dataset to fit their models. The training dataset contains flags that identify users as spammers or non-spammers. The test dataset will have the same format as the training dataset and can be downloaded two days before the end of the competition; it will contain users from a future period. All participants must send a sorted file containing one line per user, composed of the user number and a confidence value separated by a tab. The higher the confidence value, the higher the probability that the user is a spammer. The highest confidences should come first:

user	spam
1234	1
1235	0.987
1236	0.765
1239	0

If no prediction is provided for a user, we assume the user is not a spammer. The evaluation criterion is the AUC (area under the ROC curve). We compare the submitted spammer predictions of the participants with the manually assigned labels on a per-user basis.

Script to calculate AUC
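To make the criterion concrete, here is a minimal, illustrative re-implementation of the rank-based (Mann-Whitney) AUC computation; it is a sketch for orientation only, not the official script, and the class name is made up:

```java
import java.util.*;

public class AucSketch {
    // labels[i] in {0, 1} (1 = spammer), scores[i] = submitted confidence.
    // AUC via the rank-sum formulation; tied scores receive average ranks.
    static double auc(int[] labels, double[] scores) {
        int n = labels.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> scores[i]));
        double[] rank = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && scores[idx[j + 1]] == scores[idx[i]]) j++;
            double avg = (i + j) / 2.0 + 1.0; // average 1-based rank of the tie group
            for (int k = i; k <= j; k++) rank[idx[k]] = avg;
            i = j + 1;
        }
        long pos = 0, neg = 0;
        double posRankSum = 0;
        for (int k = 0; k < n; k++) {
            if (labels[k] == 1) { pos++; posRankSum += rank[k]; } else neg++;
        }
        return (posRankSum - pos * (pos + 1) / 2.0) / (pos * (double) neg);
    }

    public static void main(String[] args) {
        int[] labels = {1, 1, 0, 0};
        double[] scores = {0.9, 0.4, 0.6, 0.1};
        System.out.println(auc(labels, scores)); // 3 of 4 spammer/non-spammer pairs ranked correctly
    }
}
```

In other words: a submission scores well if spammers tend to appear above non-spammers in the sorted confidence file, regardless of the absolute confidence values.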
The linked script (updated 2008-07-25) shows how we will calculate the AUC value. It takes two input files: one with the user_id and the true class, and one sorted file with the user_id and the confidence value.

2. Tag Recommendation in Social Bookmarking Systems

To support the user during the tagging process and to facilitate tagging, BibSonomy includes a tag recommender: when a user finds an interesting web page (or publication) and posts it to BibSonomy, the system offers up to ten recommended tags on the posting page. Have a look at: Post in BibSonomy (a BibSonomy account is necessary to test it). The goal is to learn a model which effectively predicts the tags a user will use to describe a web page (or publication).

Dataset description

A general description of the dataset can be found here. For the tag recommendation task, only the tas, bookmark, and bibtex files are relevant.

Evaluation

For this task, only the non-spammer part of the dataset should be used to fit a model. The test dataset will consist of a bibtex, a bookmark, and a tas file (the tas file, however, does not contain tags), as these files contain all information about posts entered into the system. We will release this dataset 48 hours before the end of the competition. We expect from every participant a file which contains one line per prediction, with the content_id of the post (bibtex or bookmark) followed by the list of recommended tags (the tags are space-separated; the two columns content_id and tags are separated by a tab). We consider only the first ten tags. Here is an example of the expected format:

content_id	tags
123456778	hello world
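A line in the expected format can be produced with a few lines of Java; the class and method names below are purely illustrative:

```java
import java.util.*;

public class RecommendationWriter {
    // Formats one prediction line: content_id, a tab, then space-separated tags.
    // Only the first ten tags are kept, since the evaluation ignores the rest.
    static String formatLine(long contentId, List<String> tags) {
        List<String> top = tags.subList(0, Math.min(10, tags.size()));
        return contentId + "\t" + String.join(" ", top);
    }

    public static void main(String[] args) {
        // Reproduces the example line from the task description.
        System.out.println(formatLine(123456778L, Arrays.asList("hello", "world")));
    }
}
```

Truncating to ten tags on your side is a good habit even though the evaluation does it anyway, as it keeps result files small and makes the submission unambiguous.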
The evaluation criterion is the F-measure. We compute it on a per-post basis by comparing the recommended tags with the tags the user originally assigned to the post, and average over all posts of the test dataset. Two tags are considered equal if

trueTag.replaceAll("[^0-9\\p{L}]+", "").equalsIgnoreCase(
    recommendedTag.replaceAll("[^0-9\\p{L}]+", ""));

holds, i.e., we ignore the case of tags and remove all characters which are neither digits nor letters (see also java.util.regex.Pattern). Since we expect all files to be UTF-8 encoded, this function will NOT remove umlauts and other non-Latin characters! We will also apply Unicode normalization to normal form KC. Additionally, the test set will not contain the following tags: imported, public, system:imported, nn, system:unfiled.

Program to calculate the F1-Measure

This JAR file contains a Java program to calculate the precision, recall, and F1-measure for given result files. Usage of the program is as follows:

usage: java -jar kddchallenge2008-0.0.1.jar \
    maxNoOfTags tas_original resultFile1 [resultFile2 ... resultFileN]

The output will be written to resultFile*.eval. Here, maxNoOfTags is the maximal number of tags to consider for recommendation (this is 10 in the challenge), tas_original is the path to the original tas file which includes the tags (this is the file you won't get, of course), and the remaining arguments are paths to files which contain the recommendations in the format described above. Each output file contains, for each number of tags up to maxNoOfTags, a line with the following columns:
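Putting the tag normalization and the per-post F-measure together, a minimal sketch could look like the following; the class name is hypothetical and the official JAR may differ in details:

```java
import java.text.Normalizer;
import java.util.*;

public class F1Sketch {
    // Normalizes a tag as described above: Unicode NFKC normalization,
    // strip everything that is neither a digit nor a letter, ignore case.
    static String normalize(String tag) {
        String nfkc = Normalizer.normalize(tag, Normalizer.Form.NFKC);
        return nfkc.replaceAll("[^0-9\\p{L}]+", "").toLowerCase(Locale.ROOT);
    }

    // Precision, recall, and F1 for a single post, comparing the (at most ten)
    // recommended tags with the tags the user originally assigned.
    static double f1(Collection<String> trueTags, List<String> recommended) {
        Set<String> truth = new HashSet<>();
        for (String t : trueTags) truth.add(normalize(t));
        Set<String> rec = new LinkedHashSet<>();
        for (String t : recommended.subList(0, Math.min(10, recommended.size())))
            rec.add(normalize(t));
        if (truth.isEmpty() || rec.isEmpty()) return 0.0;
        int hits = 0;
        for (String t : rec) if (truth.contains(t)) hits++;
        if (hits == 0) return 0.0;
        double precision = hits / (double) rec.size();
        double recall = hits / (double) truth.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // "Web-2.0" and "web20" normalize to the same string, so one of the
        // two recommendations is a hit: precision 1/2, recall 1/2, F1 = 0.5.
        System.out.println(f1(Arrays.asList("Web-2.0", "folksonomy"),
                              Arrays.asList("web20", "tagging")));
    }
}
```

The overall score would then be the mean of this per-post F1 over all posts of the test dataset.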
Note that, due to the Unicode normalization, you need at least Java 6 to run this program. The source code is also available as a source JAR.

Organization

Important Dates

Google Calendar / iCalendar (iCal/Outlook)
We are pleased to announce that the discovery challenge will be organized in conjunction with the Web 2.0 Mining workshop; the joint workshop and challenge will take place on September 15th.

Dataset

To access the challenge dataset, please subscribe to the rsdc08 mailing list. The welcome message will contain all information needed to access the dataset (dataset description here).

Test Datasets

The test datasets for the challenge are now online:
Results

More than 150 participants registered on the mailing list and thus had a look at the dataset. We received 18 result submissions: 13 for the spam detection task and 5 for the tag recommendation task. Thirteen participants additionally submitted a paper, 11 of which were accepted. We computed the AUC and F1-measure values with the programs described above. Below you can find the results, including the team names of the three best teams of each task.

Spam Detection Task
Tag Recommendation Task
Submission instructions

To submit your result files, use our submission form. Papers must be submitted to the EasyChair submission system in PDF format. Although not required for the initial submission, we recommend following the format guidelines of ECML/PKDD (Springer LNCS LaTeX style file), as this will be the required format for accepted papers. The workshop proceedings will be distributed during the workshop. We plan a post-workshop publication of selected papers in the Springer Lecture Notes series.

Workshop Chairs

To contact us, please send a mail to rsdc08-info@cs.uni-kassel.de.