For research purposes we offer a dataset of the BibSonomy database in the form of an SQL dump to interested people. Before you get access to the dataset, you have to sign our license agreement and send it as a scanned file (in PDF, JPG or PNG format) via email to our office. Alternatively, you may send the document via fax; the number can be found on our contact page.
Additionally, we ask you to subscribe to the BibSonomy-Research mailing list. Upon receipt of your signed license agreement, we will approve the subscription request, and the welcome mail will contain instructions on how to access the dataset.
On this page you can download the dumps as compressed tar archives. A README describing the format of the files is contained in each archive. Please note that the easiest way to work with the dumps is to load them into a MySQL database. Detailed information on the table structure can be found below on this page.
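If you are unsure where to start, the following sketch shows one way to set things up (the archive and database names are hypothetical; adapt them to the dump you downloaded):

tar -xzf bibsonomy-dump.tar.gz                                        # unpack the archive (contains tables.sql and the data files)
mysql -u <username> -p -e "CREATE DATABASE bibsonomy CHARACTER SET utf8;"   # create a UTF-8 database for the dump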
We are very interested in results obtained with the help of this dataset, so please let us know about your publications. To cite the data itself, please use the following reference (adapting the date):
Knowledge and Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, version of June 30th, 2007.
If you want to refer to the system, please use the following publication:
Dominik Benz, Andreas Hotho, Robert Jäschke, Beate Krause, Folke Mitzlaff, Christoph Schmitz, and Gerd Stumme. The Social Bookmark and Publication Management System BibSonomy. The VLDB Journal, 19(6):849-875, Dec. 2010.
The dataset has been created using the mysqldump command of a MySQL database. The CREATE statements for the corresponding tables (each file = one table) can be found in the file tables.sql, together with the LOAD DATA statements which insert the data into the database. For the latter to work, you must adapt the paths to the data files at the end of tables.sql.
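The statements generated by mysqldump may differ in detail, but an adapted LOAD DATA statement would typically look like this (assuming the 'tas' file has been placed in /tmp):

LOAD DATA INFILE '/tmp/tas' INTO TABLE tas;   -- path adapted to where the 'tas' data file actually lives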
The dataset consists of four files, one per table:
tas: Tag Assignments (fact table); who attached which tag to which resource/content
bookmark: dimension table for bookmark data
bibtex: dimension table for BibTeX data
relation: tag-tag relations of users
These are tab-separated files, where each line represents a row and the fields of each row are delimited by a tabulator. Please note that the fields themselves can contain line breaks, which are escaped by MySQL. The best way to load the data into a MySQL database is by using the LOAD DATA statement. The exact columns of each table are given by the CREATE statements in tables.sql and described in the README. If you have problems reading or understanding the data, please have a look at our FAQ.
We also offer a new dataset containing the HTTP requests recorded in our web server logs. For more information about this data, please contact us.
You can use the provided SQL script:
mysql -u <username> -p -D <databasename> < tables.sql
This script assumes that the corresponding data files are located in the '/tmp' directory, are readable by everyone, and that the MySQL user has the required privileges.
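As a sketch, preparing the files could look like this (run from the directory containing the extracted dump; add further data files if your dump contains them):

cp tas bookmark bibtex /tmp/                    # put the data files where tables.sql expects them
chmod a+r /tmp/tas /tmp/bookmark /tmp/bibtex    # make them readable for the MySQL server process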
Ensure that '/tmp/bibtex', '/tmp/bookmark' and '/tmp/tas' are readable by everyone and that the MySQL user has the FILE privilege:
GRANT FILE ON *.* TO '<username>'@'localhost' IDENTIFIED BY '<password>';
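(FILE is a global privilege, so it must be granted ON *.*.) Note that MySQL 8.0 and later no longer accept IDENTIFIED BY inside GRANT; there the equivalent sketch would be:

CREATE USER '<username>'@'localhost' IDENTIFIED BY '<password>';   -- create the user first
GRANT FILE ON *.* TO '<username>'@'localhost';                     -- then grant the global FILE privilege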
Assume you want to get all information for the post with content_id 42. First, get the user, all tags, content_type, and date from the tas table:
SELECT * FROM tas WHERE content_id = 42;
Now, depending on the post's content_type, get further details from the bibtex or the bookmark table. In our case:
SELECT * FROM bookmark WHERE content_id = 42;
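Both steps can also be combined into one query. The following is only a sketch: the tas column names follow the description above, and the url column is an assumption, so check the CREATE statements in tables.sql for the actual names:

SELECT t.user, t.tag, t.date, b.url      -- who tagged the bookmark, with which tags, and when
FROM tas t
JOIN bookmark b ON t.content_id = b.content_id
WHERE t.content_id = 42;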
- I get errors like 'ERROR 1406 (22001): Data too long for column 'annote' at row 43542'
Ensure that the character set of your database, tables, and connection is UTF-8. We recently modified the tables.sql script to use UTF-8 wherever possible. However, it might be necessary to adjust your database server configuration as well.
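As a sketch, the relevant settings can be inspected and changed as follows (adapt the database and table names):

SHOW VARIABLES LIKE 'character_set%';                 -- inspect the current charset configuration
ALTER DATABASE <databasename> CHARACTER SET utf8;     -- switch the database default to UTF-8
ALTER TABLE bibtex CONVERT TO CHARACTER SET utf8;     -- convert an existing table
SET NAMES utf8;                                       -- use UTF-8 for the current connection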
Yes! The content_ids represent posts, and each post belongs to exactly one user. The content_ids do not represent resources; resources are identified by the hashes (url_hash for bookmarks, simhash[0-2] for publication references). So if you need overlap between posts (i.e., want to find posts with the same resource), use the hashes. For publication references there are two relevant hashes: simhash2 (the intra hash), which is unique per user (i.e., each user has at most one post with a given simhash2) and rather strict (changing the journal name changes the hash); and simhash1 (the inter hash), which is rather sloppy and provides overlap between resources (resources with the same title, author, and year have the same simhash1).
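For example, assuming post 42 is a publication post, a sketch of a query that finds all posts describing the same publication via the inter hash:

SELECT b2.content_id
FROM bibtex b1
JOIN bibtex b2 ON b1.simhash1 = b2.simhash1   -- same inter hash = same title, author, year
WHERE b1.content_id = 42;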