DataRec Data Formats
This document describes the dataset formats supported by DataRec, with a short spec for each structure and the matching I/O helpers. The Read/Write entries refer to the DataRec functions that load a file into RawData (Read) or export RawData back to that format (Write).
Field conventions
Common fields across datasets:
- user: user identifier (string or integer)
- item: item identifier (string or integer)
- rating: preference value (integer or float)
- timestamp: temporal signal (string or integer)
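As an illustration of these conventions, the four fields can be modelled as a small typed record. This is only a sketch: the Event class and the Identifier alias are hypothetical names introduced here and are not part of DataRec, and treating rating and timestamp as optional is an assumption based on the interaction-only examples in this catalog.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical alias: identifiers may be strings or integers.
Identifier = Union[str, int]


@dataclass(frozen=True)
class Event:
    """One user-item event using the common field names."""
    user: Identifier
    item: Identifier
    # Assumed optional: interaction-only data carries neither field.
    rating: Optional[Union[int, float]] = None
    timestamp: Optional[Union[str, int]] = None


interaction = Event(user=0, item=1)
rated = Event(user="u1", item="i9", rating=4.5, timestamp="001")
```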
Sequences
Sequence datasets store per-user ordered lists.
Tabular (inline)
- One row per user
- item contains a delimiter-separated sequence (semicolon)
- Optional timestamp column for aligned sequences
- Read: read_sequence_tabular_inline
- Write: write_sequence_tabular_inline
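The inline layout can be sketched with the standard library alone. The function below is not DataRec's reader; parse_inline_rows is a hypothetical helper, and it assumes a comma-separated file with a header row naming the user, item, and timestamp columns.

```python
import csv
import io


def parse_inline_rows(text, sep=",", seq_sep=";"):
    """Parse inline-sequence rows into {user: [(item, timestamp or None), ...]}."""
    sequences = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=sep):
        items = row["item"].split(seq_sep)
        # The timestamp column is optional; when present it must align with items.
        if row.get("timestamp"):
            stamps = row["timestamp"].split(seq_sep)
            if len(stamps) != len(items):
                raise ValueError(f"misaligned sequences for user {row['user']}")
        else:
            stamps = [None] * len(items)
        sequences[row["user"]] = list(zip(items, stamps))
    return sequences


sample = "user,item,timestamp\n0,1;2,001;022\n"
print(parse_inline_rows(sample))
```

All values come back as strings because the tabular file carries no type information; any casting is left to the caller.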
Example (interactions):
Example (timestamp):
Tabular (wide)
- One row per user
- Each item is a separate column
- Read: read_sequence_tabular_wide
- Write: write_sequence_tabular_wide
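A minimal sketch of reading the wide layout follows. It is an illustration rather than DataRec's implementation: parse_wide_rows is a hypothetical helper, the column names item_1, item_2, item_3 in the sample are invented, and it assumes that shorter sequences are padded with empty cells.

```python
import csv
import io


def parse_wide_rows(text, sep=","):
    """Parse wide rows into {user: [item, ...]}, skipping empty padding cells."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    next(reader, None)  # assumed header: first column is the user, the rest are item slots
    sequences = {}
    for row in reader:
        if not row:
            continue
        user, *cells = row
        sequences[user] = [cell for cell in cells if cell != ""]
    return sequences


sample = "user,item_1,item_2,item_3\n0,1,2,\n1,5,,\n"
print(parse_wide_rows(sample))
```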
Example:
Tabular (implicit)
- First value is user
- Remaining columns are items
- Optional headerless variant for tabular data
- Read: read_sequence_tabular_implicit
- Write: write_sequence_tabular_implicit
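The implicit layout has variable-length rows, so it is easier to split lines directly than to rely on named columns. The sketch below is not DataRec's reader; parse_implicit_lines is a hypothetical helper, and the whitespace default for the separator is an assumption.

```python
def parse_implicit_lines(lines, sep=None, header=False):
    """Yield (user, [items]) from lines whose first value is the user.

    sep=None splits on any whitespace; pass "," or "\t" for delimited files.
    """
    iterator = iter(lines)
    if header:
        next(iterator, None)  # discard the header row
    for line in iterator:
        fields = line.strip().split(sep)
        if not fields or fields == [""]:
            continue  # skip blank lines
        user, *items = fields
        yield user, items


for user, items in parse_implicit_lines(["0 1 2 3", "1 4"]):
    print(user, items)
```

Because the function is a generator, it can be fed an open file object and will process one line at a time.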
Example:
JSON (mapping)
- Top-level object keyed by user
- Value is an ordered list of events
- Read: read_sequences_json
- Write: write_sequences_json
Example (interactions):
Example (ratings):
Example (timestamp):
{
  "0": [
    { "item": 1, "rating": 1, "timestamp": "001" },
    { "item": 2, "rating": 1, "timestamp": "022" }
  ]
}
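To show how the mapping relates to the one-event-per-row view, the sketch below flattens it with the standard json module. mapping_to_transactions is a hypothetical helper, not a DataRec function. Note that JSON object keys are always strings, so the user id comes back as "0" even when the item ids are integers.

```python
import json


def mapping_to_transactions(payload):
    """Flatten {user: [event, ...]} into one dict per event, preserving order."""
    rows = []
    for user, events in payload.items():
        for event in events:
            rows.append({"user": user, **event})
    return rows


doc = json.loads(
    '{"0": [{"item": 1, "rating": 1, "timestamp": "001"},'
    ' {"item": 2, "rating": 1, "timestamp": "022"}]}'
)
rows = mapping_to_transactions(doc)
for row in rows:
    print(row)
```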
JSON (mapping, item-only)
- Top-level object keyed by user
- Value is an ordered list of item ids (scalars only)
- Read: read_sequences_json_items
- Write: write_sequences_json_items
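The item-only variant can be derived from an event mapping by keeping only the item field. The two helpers below are hypothetical and only illustrate the "scalars only" constraint; they are not DataRec functions.

```python
def to_item_only(payload):
    """Reduce {user: [event, ...]} to {user: [item, ...]}."""
    return {
        user: [event["item"] for event in events]
        for user, events in payload.items()
    }


def is_item_only(payload):
    """Return True when every sequence holds only scalar ids (str or int)."""
    return all(
        isinstance(item, (str, int)) and not isinstance(item, bool)
        for items in payload.values()
        for item in items
    )


events = {"0": [{"item": 1, "rating": 1}, {"item": 2, "rating": 1}]}
items_only = to_item_only(events)
print(items_only, is_item_only(items_only))
```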
Example (interactions):
JSON (array)
- Top-level array
- Each entry contains user and sequence
- Read: read_sequences_json_array
- Write: write_sequences_json_array
Example (interactions):
Example (timestamp):
[
  {
    "user": "0",
    "sequence": [
      { "item": 1, "rating": 1, "timestamp": "001" },
      { "item": 2, "rating": 1, "timestamp": "022" }
    ]
  }
]
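The array form carries the same information as the mapping form, with the user moved from the object key into each entry. A hedged sketch of the conversion, using a hypothetical array_to_mapping helper that is not part of DataRec:

```python
def array_to_mapping(entries):
    """Convert [{"user": ..., "sequence": [...]}, ...] to {user: [...]}.

    Users are stringified because JSON object keys must be strings.
    Repeated users are merged in order of appearance (an assumption).
    """
    mapping = {}
    for entry in entries:
        mapping.setdefault(str(entry["user"]), []).extend(entry["sequence"])
    return mapping


entries = [
    {
        "user": "0",
        "sequence": [
            {"item": 1, "rating": 1, "timestamp": "001"},
            {"item": 2, "rating": 1, "timestamp": "022"},
        ],
    }
]
mapping = array_to_mapping(entries)
print(mapping)
```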
Transactions
Transaction datasets store one event per row/object.
Tabular
- One row per event
- Optional headerless variant for tabular data
- Read: read_transactions_tabular
- Write: write_transactions_tabular
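The difference between the header and headerless variants can be sketched with csv.DictReader, which accepts explicit field names when the file has no header row. read_events and its default column order are assumptions made for illustration, not DataRec's API.

```python
import csv
import io


def read_events(text, sep=",", header=True,
                columns=("user", "item", "rating", "timestamp")):
    """Return a list of event dicts from delimited text, one row per event."""
    stream = io.StringIO(text)
    if header:
        reader = csv.DictReader(stream, delimiter=sep)
    else:
        # Headerless variant: column names are supplied by the caller.
        reader = csv.DictReader(stream, fieldnames=list(columns), delimiter=sep)
    return [dict(row) for row in reader]


with_header = read_events("user,item,rating\n0,1,1\n0,2,1\n")
headerless = read_events("0,1,1,001\n", header=False)
print(with_header)
print(headerless)
```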
Example (ratings):
Example (timestamp):
JSON
- Top-level array
- One object per event
- Read: read_transactions_json
- Write: write_transactions_json
Example (ratings):
JSONL
- One JSON object per line
- Read: read_transactions_jsonl
- Write: write_transactions_jsonl
Example (interactions):
Example (timestamp):
{"user": 0, "item": 1, "rating": 1, "timestamp": "001"}
{"user": 0, "item": 2, "rating": 1, "timestamp": "022"}
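Because each line is an independent JSON document, JSONL can be parsed lazily without loading the whole file. The generator below is a minimal sketch using the standard json module; iter_jsonl is a hypothetical name and not a DataRec function.

```python
import json


def iter_jsonl(lines):
    """Yield one decoded object per non-empty line."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {number}: {exc}") from exc


sample = [
    '{"user": 0, "item": 1, "rating": 1, "timestamp": "001"}',
    '{"user": 0, "item": 2, "rating": 1, "timestamp": "022"}',
]
events = list(iter_jsonl(sample))
print(events)
```

Passing an open file object instead of a list works the same way, since iterating over a text file yields its lines.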
Blocks (text)
- Block format with an explicit block id header
- Modes:
  - Item-wise blocks: <ITEM_ID>: then events
  - User-wise blocks: <USER_ID>: then events
- Event layouts:
  - id
  - id,rating
  - id,rating,timestamp
- Date is kept as string; reader is streaming
- Read: read_transactions_blocks
- Write: write_transactions_blocks
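A streaming reader for the block format can be sketched as a generator that remembers the current block id. This is an illustration under stated assumptions, not DataRec's implementation: iter_block_events is a hypothetical helper, a header is taken to be any line ending in ":", and event fields are assumed to be comma-separated. In item-wise mode the block id is the item and the event id is presumably the user; in user-wise mode the roles are swapped.

```python
def iter_block_events(lines, sep=","):
    """Yield (block_id, event_id, rating, timestamp) tuples from block lines.

    Missing rating or timestamp fields are returned as None, and the
    timestamp is left as a string.
    """
    block_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":") and sep not in line:
            block_id = line[:-1]  # header line opens a new block
            continue
        if block_id is None:
            raise ValueError("event line found before any block header")
        fields = line.split(sep)
        if len(fields) > 3:
            raise ValueError(f"too many fields in event line: {line!r}")
        # Pad so the id and id,rating layouts fit the same tuple shape.
        fields += [None] * (3 - len(fields))
        event_id, rating, timestamp = fields
        yield block_id, event_id, rating, timestamp


sample = ["1:", "10,5,001", "11,3,022", "2:", "12"]
events = list(iter_block_events(sample))
print(events)
```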
Example (item-wise, id):
Example (item-wise, id,rating):
Example (item-wise, id,rating,timestamp):
Example (user-wise, id):
Example (user-wise, id,rating):
Example (user-wise, id,rating,timestamp):
Extending this catalog
To add a new dataset format:
- Define the new format structure (fields and serialization).
- Provide a minimal example that exercises the loader.
- Add a new subsection under the appropriate data type.