Dataset for Learning Analytics

The AFEL Dataset for Learning Analytics is itself a collection of datasets useful for performing analytics in online and social learning contexts. It is distilled from the content of the AFEL Data Catalogue: it includes some user-centered data generated and anonymised by the AFEL project, and excludes datasets that are not freely redistributable.

The datasets in this collection can be downloaded individually as dumps in RDF format. This page links to each of the corresponding snapshots as of November 2018. The collection aggregates nearly 49 billion distinct RDF triples, obtained both by refactoring existing linked datasets made available by members of the AFEL project and by re-engineering third-party datasets that were not originally in RDF (e.g. Coursera, OU Analyse, Outline Maps).

All dumps are provided as BZipped N-Quads or N-Triples (RDF serialisation formats with and without named graph indications, respectively), except where otherwise noted.
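Because N-Triples and N-Quads are line-based formats (one statement per line, with `#` introducing a comment line), a dump can be streamed without unpacking it to disk first. The following is a minimal sketch using only the Python standard library; the function name `count_statements` and the file name in the usage note are illustrative assumptions, not part of the AFEL tooling.

```python
import bz2


def count_statements(dump_path):
    """Count RDF statements in a BZipped N-Triples or N-Quads dump.

    Assumes one statement per line; blank lines and comment lines
    (starting with '#') are skipped. Duplicate statements, if any,
    are counted each time they appear.
    """
    count = 0
    # bz2.open in text mode decompresses on the fly, line by line.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

For example, `count_statements("some-dump.nq.bz2")` (a hypothetical file name) returns the number of statement lines in that dump. For full parsing into triples or quads, an RDF library would be the more robust choice.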

AFEL App Evaluation Results

Results of the initial evaluation of the AFEL App and anonymised feedback from its early adopters. The evaluation came in two forms:

  • An initial online questionnaire assessing the attractiveness of the AFEL application to potential users.
  • A lab-based evaluation of the mobile application and recommender services, with over 70 users conducting specific learning tasks with Didactalia and exploring the corresponding learning activity.

When citing the dataset please use the following reference:

López-Sola, S., Holtz, P., Yenikent, S., Thalmann, S., Fessl, A., Veas, E., Gadiraju, U., and d’Aquin, M. First evaluation of the adoption and benefit of analytics in social environments. AFEL project deliverable 5.9, August 2018.

Source: AFEL, Didactalia | 171,078 triples | dump (388 KB) | license: CC BY-NC-SA

Coursera MOOC Discussion Thread

Anonymised versions of the discussion threads from the forums of 60 Coursera Massive Open Online Courses (MOOCs), about 100,000 threads in total.

When citing the dataset please use the following reference:

Rossi, L.A. and Gnawali, O. Language independent analysis and classification of discussion threads in Coursera MOOC forums. IEEE International Conference on Information Reuse and Integration (IRI), August 2014.

Source: Data repository on GitHub | 4,927,697 triples | dump (38 MB) | alignments | license

DBLP – Computer Science Bibliography

Linked Data export of open bibliographic information on major journals and proceedings in computer science.

Source: L3S, DBLP | 199,824,967 triples | dump (765 MB) | license

Didactalia anonymised user data

Generated activity data for between 1,000 and 3,000 anonymous users of the Didactalia learning platform. The users are extracted from real activity data covering time periods of 12 and 24 weeks. The RDF dataset also includes basic metadata extracted from the AFEL index of Didactalia resources, including the title and tags of each resource.

Source: AFEL, Didactalia | 255,124 triples | dump (3 MB) | license: CC BY-NC-SA

LAK Dataset

The LAK Dataset makes publicly available machine-readable versions of research sources from the Learning Analytics and Educational Data Mining communities.

Source: Linked Data for Learning Analytics community | 90,968 triples | dump (6 MB) | alignments | license: other (open)

LRMI Resource metadata

A collection of online learning resources annotated in accordance with the Learning Resource Metadata Initiative (LRMI) and collected between 2013 and 2015.
Datasets are provided as one set of N-Quads per year.

Source: ITD-CNR | 115,113,763 triples | dump (1.2 GB)

Open University courses

Online courses, material and learning opportunities provided by The Open University.
When using or redistributing the dataset, please attribute it to The Open University.

Source: The Open University | 1,110,249 triples | dump (4.5 MB) | license: CC BY 3.0

OU Analyse

Anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students and their interactions with the Virtual Learning Environment (VLE) for seven selected courses held at The Open University.

When citing the dataset please use the following reference:

Kuzilek, J., Hlosta, M., Herrmannova, D., Zdrahal, Z. and Wolff, A. OU Analyse: Analysing At-Risk Students at The Open University. Learning Analytics Review, no. LAK15-1, March 2015, ISSN: 2057-7494.

Source: The Open University | 54,584,125 triples | dump (302 MB) | alignments | license: CC BY 4.0

Outline Maps (Slepé mapy)

Data used for modelling quizzes for adaptive learning of geography, as published by Slepé mapy ("Outline maps" in English). The data are taken from a snapshot as of May 2015.

Source: Adaptive Learning, Masaryk University | 70,711,263 triples | dump (385 MB) | alignments | license: ODBL


TweetsKB

An RDF corpus of anonymised data for a large collection of annotated tweets spanning more than four years (January 2013 – September 2017). Metadata about the tweets, as well as extracted entities, sentiments, hashtags and user mentions, are rendered in RDF. Twitter usernames are encrypted and the actual content of the tweets is not included, but it can be retrieved from Twitter itself through the tweet IDs.

Source: TweetsKB | 48,207,277,042 triples | data homepage (with Zenodo links) | license: CC BY 4.0

Web of Know How

A Linked Data framework for human tasks and procedures, with data re-engineered from WikiHow and SnapGuide.
When using or redistributing the dataset, please attribute it to WikiHow, SnapGuide and the Web of KnowHow project.

Source: Web of Know-How | 23,073,020 triples | dump (641 MB) | license: CC BY-NC 4.0
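
The overall figure of nearly 49 billion triples quoted in the introduction can be cross-checked by summing the per-dataset counts listed on this page. The short sketch below does that arithmetic; the dictionary keys are just labels for the sections above, and the sum is only an upper bound on the number of distinct triples if any statements are shared between datasets.

```python
# Triple counts as listed in the "Source:" line of each dataset on this page.
TRIPLE_COUNTS = {
    "AFEL App Evaluation Results": 171_078,
    "Coursera MOOC Discussion Thread": 4_927_697,
    "DBLP": 199_824_967,
    "Didactalia anonymised user data": 255_124,
    "LAK Dataset": 90_968,
    "LRMI Resource metadata": 115_113_763,
    "Open University courses": 1_110_249,
    "OU Analyse": 54_584_125,
    "Outline Maps": 70_711_263,
    "TweetsKB": 48_207_277_042,
    "Web of Know How": 23_073_020,
}

# Sum of all listed counts.
total = sum(TRIPLE_COUNTS.values())
print(f"{total:,} triples in total")  # 48,677,139,296 triples in total

# Identify the largest dataset and its share of the total.
largest = max(TRIPLE_COUNTS, key=TRIPLE_COUNTS.get)
share = TRIPLE_COUNTS[largest] / total
print(f"{largest} accounts for {share:.1%} of the collection")
```

The total is dominated by a single dataset, so the remaining ten together contribute well under one billion triples.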