Datasets Registry
The datarec/registry folder is the source of truth for built-in datasets. It defines
what datasets exist, which versions are available, where to download resources, and
how to interpret them. This registry powers the dataset builder and the reproducible
pipeline system.
Folder structure
datarec/registry/
├─ datasets/ # Dataset-level metadata (name, description, versions, citation)
├─ versions/ # Version-specific sources/resources and schemas
└─ metrics/ # Precomputed dataset characteristics (generated)
datasets/
Each file describes a dataset at a high level:
- description and source
- citation (used in docs and references)
- versions list
- latest_version
These files are used to validate dataset names and versions.
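The validation step can be sketched roughly as follows. This is a minimal illustration, not DataRec's actual code: the `resolve_version` helper and the in-memory `example_metadata` dict are hypothetical, and only the `versions` and `latest_version` keys are taken from the list above.

```python
def resolve_version(metadata, requested=None):
    """Return a valid version string for a dataset metadata mapping.

    `metadata` is a hypothetical in-memory form of a datasets/<name>.yml file.
    """
    versions = [str(v) for v in metadata["versions"]]
    if requested is None:
        # No version given: fall back to the declared latest version.
        requested = metadata["latest_version"]
    requested = str(requested)
    if requested not in versions:
        raise ValueError(
            f"Unknown version {requested!r}; available: {', '.join(versions)}"
        )
    return requested


# Hypothetical metadata, for illustration only.
example_metadata = {
    "description": "Example ratings dataset",
    "versions": ["100k", "1m"],
    "latest_version": "1m",
}

print(resolve_version(example_metadata))          # falls back to latest_version
print(resolve_version(example_metadata, "100k"))  # explicit, valid version
```

Rejecting unknown versions early keeps download and parsing errors from surfacing later in the pipeline.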
versions/
Each file defines a specific version of a dataset. It contains:
- sources: how to download data (URLs, archives, checksums)
- resources: what to extract and how to parse it
- optional schema definitions for interactions or content
The schema drives how RawData/DataRec columns are interpreted.
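To illustrate how a schema could drive column interpretation, the sketch below renames file columns to canonical roles. It assumes the schema maps a role (user, item, rating, timestamp) to the column name found in the file; the `read_interactions` helper, the sample CSV, and its column names are hypothetical and do not reflect DataRec's internal loader.

```python
import csv
import io


def read_interactions(text, schema):
    """Parse CSV text and rename columns to canonical roles using `schema`.

    `schema` maps a role name to the column name used in the file
    (assumed direction of the mapping).
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        # Keep only the columns named in the schema, keyed by their role.
        rows.append({role: record[column] for role, column in schema.items()})
    return rows


# Hypothetical file content and schema, for illustration only.
sample_csv = "uid,iid,stars,ts\n1,10,4.0,1700000000\n2,10,5.0,1700000100\n"
sample_schema = {"user": "uid", "item": "iid", "rating": "stars", "timestamp": "ts"}

rows = read_interactions(sample_csv, sample_schema)
print(rows[0])
```

Values stay as strings here; a real loader would also cast ratings and timestamps to numeric types.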
metrics/
YAML files generated by DataRec with dataset characteristics (e.g., sparsity,
users/items, density). They are produced by datarec/registry/utils.py.
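As a rough idea of what such characteristics look like, the sketch below computes user and item counts, density, and sparsity using the standard definitions (density = interactions / (users × items), sparsity = 1 − density). The `basic_characteristics` function is hypothetical; the metrics and formulas actually used by datarec/registry/utils.py may differ.

```python
def basic_characteristics(interactions):
    """Compute simple characteristics from an iterable of (user, item) pairs."""
    pairs = set(interactions)  # count each user-item pair once
    users = {user for user, _ in pairs}
    items = {item for _, item in pairs}
    n_users, n_items, n_inter = len(users), len(items), len(pairs)
    # Density is the fraction of the user-item matrix that is filled.
    density = n_inter / (n_users * n_items) if n_users and n_items else 0.0
    return {
        "n_users": n_users,
        "n_items": n_items,
        "n_interactions": n_inter,
        "density": density,
        "sparsity": 1.0 - density,
    }


# Toy interactions, for illustration only.
toy = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i3")]
stats = basic_characteristics(toy)
print(stats)
```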
How DataRec uses the registry
- User requests a dataset name/version.
- Registry metadata validates the request.
- Version file provides sources and resources.
- Dataset builder prepares and loads the data into DataRec.
Add a new dataset
- Create metadata in datarec/registry/datasets/<name>.yml.
- Create a version file in datarec/registry/versions/<name>_<version>.yml.
- (Optional) compute metrics with compute_dataset_characteristics in datarec/registry/utils.py.
Minimal resource example
resources:
  interactions:
    type: interactions
    format: csv
    required: true
    source_name: main_source
    filename: ratings.csv
    schema:
      user: user_id
      item: item_id
      rating: rating
      timestamp: timestamp
Tips
- Keep latest_version aligned with the newest version.
- Use checksums for reproducibility.
- If the schema changes, introduce a new version file.
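The checksum tip can be illustrated with a small verification helper. This sketch assumes SHA-256 digests; the registry may use a different algorithm, and `sha256_of_file` and `verify_checksum` are hypothetical names rather than DataRec functions.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Return True if the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected.strip().lower()
```

Checking the digest right after download means a changed or corrupted upstream file is caught before it is parsed.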