Datasets Registry

The datarec/registry folder is the source of truth for built-in datasets. It defines what datasets exist, which versions are available, where to download resources, and how to interpret them. This registry powers the dataset builder and the reproducible pipeline system.

Folder structure

datarec/registry/
├─ datasets/        # Dataset-level metadata (name, description, versions, citation)
├─ versions/        # Version-specific sources/resources and schemas
└─ metrics/         # Precomputed dataset characteristics (generated)

datasets/

Each file describes a dataset at a high level:

  • description and source
  • citation (used in docs and references)
  • versions list
  • latest_version

These files are used to validate dataset names and versions.

versions/

Each file defines a specific version of a dataset. It contains:

  • sources: how to download data (URLs, archives, checksums)
  • resources: what to extract and how to parse it
  • optional schema definitions for interactions or content

The schema drives how RawData/DataRec columns are interpreted.

metrics/

YAML files generated by DataRec that record dataset characteristics (e.g., number of users and items, density, sparsity). They are produced by the code in datarec/registry/utils.py.
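
To make the characteristics concrete, here is a minimal sketch of how density and sparsity are commonly derived from a list of user-item interactions. The function name dataset_characteristics and the returned keys are hypothetical and chosen for illustration; this is not the implementation in datarec/registry/utils.py, and the generated YAML files may use different names or additional statistics.

```python
def dataset_characteristics(interactions):
    """Compute basic statistics from an iterable of (user, item) pairs.

    Hypothetical helper for illustration; not part of DataRec.
    """
    pairs = set(interactions)  # count each user-item pair once
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}

    n_users, n_items = len(users), len(items)
    # Density is the share of the user-item matrix that is filled.
    density = len(pairs) / (n_users * n_items) if pairs else 0.0

    return {
        "n_users": n_users,
        "n_items": n_items,
        "n_interactions": len(pairs),
        "density": density,
        "sparsity": 1.0 - density,
    }


stats = dataset_characteristics([("u1", "i1"), ("u1", "i2"), ("u2", "i1")])
print(stats)
```

In this toy input, three of the four possible user-item cells are filled, so density is 0.75 and sparsity is 0.25.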

How DataRec uses the registry

  1. User requests a dataset name/version.
  2. Registry metadata validates the request.
  3. Version file provides sources and resources.
  4. Dataset builder prepares and loads the data into DataRec.
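
The first three steps can be sketched as a small lookup. This is an illustration under stated assumptions, not DataRec's actual code: the function resolve_version_file is hypothetical, the metadata is passed in as an already-parsed dict, and falling back to latest_version when no version is requested is an assumption. Only the versions and latest_version fields and the <name>_<version>.yml naming come from this page.

```python
from pathlib import Path

# Registry location as described on this page.
REGISTRY_ROOT = Path("datarec/registry")


def resolve_version_file(name, metadata, version=None):
    """Validate a (name, version) request and return the version file path.

    Hypothetical helper: `metadata` stands for the parsed content of
    datasets/<name>.yml.
    """
    known = [str(v) for v in metadata["versions"]]
    if version is None:
        # Assumption: an omitted version means "use latest_version".
        version = metadata["latest_version"]
    version = str(version)
    if version not in known:
        raise ValueError(
            f"Unknown version {version!r} for dataset {name!r}; "
            f"available: {known}"
        )
    return REGISTRY_ROOT / "versions" / f"{name}_{version}.yml"


toy_metadata = {"versions": ["v1", "v2"], "latest_version": "v2"}
print(resolve_version_file("toy", toy_metadata).as_posix())
```

The returned path is where the sources and resources for the requested version would be read from before the dataset builder takes over.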

Add a new dataset

  1. Create metadata in datarec/registry/datasets/<name>.yml.
  2. Create a version file in datarec/registry/versions/<name>_<version>.yml.
  3. (Optional) compute metrics with compute_dataset_characteristics in datarec/registry/utils.py.
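
Before committing a new dataset, it can help to check that the metadata and version files agree. The sketch below is a hypothetical pre-commit check, not a DataRec utility: check_new_dataset and its arguments are made up for illustration, and the metadata is assumed to be already parsed into a dict with the versions and latest_version fields described above.

```python
def check_new_dataset(name, metadata, existing_files):
    """Return a list of consistency problems (empty if none are found).

    Hypothetical helper: `existing_files` is the set of file names
    present in datarec/registry/versions/.
    """
    problems = []
    versions = [str(v) for v in metadata.get("versions", [])]
    latest = metadata.get("latest_version")

    if not versions:
        problems.append("no versions listed")
    if latest is None or str(latest) not in versions:
        problems.append("latest_version is not one of the listed versions")

    # Every listed version needs a matching <name>_<version>.yml file.
    for version in versions:
        expected = f"{name}_{version}.yml"
        if expected not in existing_files:
            problems.append(f"missing version file: {expected}")
    return problems


toy_metadata = {"versions": ["v1", "v2"], "latest_version": "v2"}
print(check_new_dataset("toy", toy_metadata, {"toy_v1.yml"}))
```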

Minimal resource example

resources:
  interactions:
    type: interactions
    format: csv
    required: true
    source_name: main_source
    filename: ratings.csv
    schema:
      user: user_id
      item: item_id
      rating: rating
      timestamp: timestamp
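
To illustrate what the schema block expresses, the following sketch reads a CSV and relabels its columns according to the mapping. It assumes that each schema entry maps a canonical role (user, item, rating, timestamp) to the column name found in the file; that reading of the example is an assumption. The function read_interactions is hypothetical and uses only the standard library, whereas DataRec's own loader may behave differently.

```python
import csv
import io

# Mirrors the schema block of the resource example above.
SCHEMA = {
    "user": "user_id",
    "item": "item_id",
    "rating": "rating",
    "timestamp": "timestamp",
}


def read_interactions(text, schema):
    """Parse CSV text and rename columns from file names to schema roles.

    Hypothetical helper for illustration; values are kept as strings.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in schema.values() if col not in header]
    if missing:
        raise KeyError(f"columns missing from file: {missing}")
    return [{role: row[col] for role, col in schema.items()} for row in reader]


sample = "user_id,item_id,rating,timestamp\n1,10,4.0,1000\n2,11,3.5,1001\n"
for record in read_interactions(sample, SCHEMA):
    print(record)
```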

Tips

  • Keep latest_version aligned with the newest version.
  • Use checksums for reproducibility.
  • If the schema changes, introduce a new version file.
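
As an illustration of the checksum tip, a downloaded file can be verified against an expected digest before it is parsed. The sketch assumes SHA-256; the hash algorithm and the checksum field actually used in the version files are not stated on this page, and the helper names sha256_of and verify_checksum are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_hex):
    """Return True if the file's digest matches the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A mismatch usually means the upstream file changed or the download was corrupted, which is a good signal to stop rather than build a dataset from unexpected bytes.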