DataRec Data Formats

This document describes the dataset formats supported by DataRec, with a short spec for each structure and the matching I/O helpers. The Read/Write entries refer to the DataRec functions that load a file into RawData (Read) or export RawData back to that format (Write).

Field conventions

Common fields across datasets:

  • user: user identifier (string or integer)
  • item: item identifier (string or integer)
  • rating: preference value (integer/float)
  • timestamp: temporal signal (string or integer)
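As a rough illustration of these conventions, a single event could be modeled as below. This is only a sketch: the `Interaction` class and the `Identifier` alias are hypothetical names introduced here, not part of DataRec, and treating `rating` and `timestamp` as optional is an assumption based on the examples in this document that omit them.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical alias: identifiers may be strings or integers.
Identifier = Union[str, int]


@dataclass(frozen=True)
class Interaction:
    """One user-item event following the field conventions above."""
    user: Identifier
    item: Identifier
    rating: Optional[float] = None                 # assumed optional
    timestamp: Optional[Union[str, int]] = None    # assumed optional


full_event = Interaction(user=0, item=1, rating=1, timestamp="001")
bare_event = Interaction(user="u7", item="i3")
```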

Sequences

Sequence datasets store one ordered list of interactions per user.

Tabular (inline)

Example (interactions):

user    item
0   1;2;3
1   1;2;4

Example (timestamp):

user    item    timestamp
0   1;2;3   000
1   1;2;4   010
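The inline layout can be read with a few lines of plain Python. The sketch below is not DataRec's reader: `parse_inline_sequences` is a hypothetical helper, and it assumes columns are separated by whitespace (tabs or spaces), that the first row is a header, and that items within a sequence are separated by semicolons as in the examples above.

```python
from typing import Dict, List


def parse_inline_sequences(text: str, item_sep: str = ";") -> Dict[str, List[str]]:
    """Parse 'user <ws> item1;item2;...' rows into {user: [items]}."""
    sequences: Dict[str, List[str]] = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        user, items = fields[0], fields[1]
        sequences[user] = items.split(item_sep)
    return sequences


inline_sample = "user    item\n0   1;2;3\n1   1;2;4\n"
inline_result = parse_inline_sequences(inline_sample)
```

All values come back as strings; casting identifiers to integers is left to the caller.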

Tabular (wide)

Example:

user    item
0   1   2   3
1   1   2   4
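In the wide layout every column after the first holds one item, so rows may have different lengths. A minimal sketch, assuming whitespace-separated columns and a header row (`parse_wide_sequences` is a hypothetical name, not a DataRec function):

```python
from typing import Dict, List


def parse_wide_sequences(text: str) -> Dict[str, List[str]]:
    """Parse 'user item item item ...' rows into {user: [items]}."""
    sequences: Dict[str, List[str]] = {}
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    for fields in rows[1:]:  # skip the header row
        # First field is the user; everything after it is the sequence.
        sequences[fields[0]] = fields[1:]
    return sequences


wide_sample = "user    item\n0   1   2   3\n1   1   2   4\n"
wide_result = parse_wide_sequences(wide_sample)
```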

Tabular (implicit)

Example:

user    item
5   1   2   3   6   7
3   1   2   4

JSON (mapping)

Example (interactions):

{
  "0": [
    { "item": 1 },
    { "item": 2 }
  ]
}

Example (ratings):

{
  "0": [
    { "item": 1, "rating": 1 },
    { "item": 2, "rating": 1 }
  ]
}

Example (timestamp):

{
  "0": [
    { "item": 1, "rating": 1, "timestamp": "001" },
    { "item": 2, "rating": 1, "timestamp": "022" }
  ]
}
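To show how the mapping layout relates to flat events, the sketch below flattens it into one dictionary per interaction. `mapping_to_events` is a hypothetical helper written for illustration; it assumes `rating` and `timestamp` may be absent and fills them with `None`.

```python
import json
from typing import Any, Dict, List


def mapping_to_events(payload: str) -> List[Dict[str, Any]]:
    """Flatten {user: [{item, rating?, timestamp?}, ...]} into event dicts."""
    data = json.loads(payload)
    events: List[Dict[str, Any]] = []
    for user, sequence in data.items():
        for entry in sequence:
            events.append({
                "user": user,  # JSON object keys are always strings
                "item": entry["item"],
                "rating": entry.get("rating"),
                "timestamp": entry.get("timestamp"),
            })
    return events


mapping_sample = """
{
  "0": [
    { "item": 1, "rating": 1, "timestamp": "001" },
    { "item": 2, "rating": 1, "timestamp": "022" }
  ]
}
"""
mapping_events = mapping_to_events(mapping_sample)
```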

JSON (mapping, item-only)

Example (interactions):

{
  "0": [1, 2, 3],
  "1": [4]
}
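The item-only variant carries the same information as an interactions-only mapping. A small sketch of the conversion, using a hypothetical `expand_item_only` helper that is not part of DataRec:

```python
import json
from typing import Any, Dict, List


def expand_item_only(payload: str) -> Dict[str, List[Dict[str, Any]]]:
    """Turn {user: [item, ...]} into {user: [{"item": item}, ...]}."""
    data = json.loads(payload)
    return {
        user: [{"item": item} for item in items]
        for user, items in data.items()
    }


expanded = expand_item_only('{"0": [1, 2, 3], "1": [4]}')
```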

JSON (array)

Example (interactions):

[
  {
    "user": "0",
    "sequence": [
      { "item": 1 },
      { "item": 2 }
    ]
  }
]

Example (timestamp):

[
  {
    "user": "0",
    "sequence": [
      { "item": 1, "rating": 1, "timestamp": "001" },
      { "item": 2, "rating": 1, "timestamp": "022" }
    ]
  }
]
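The array layout differs from the mapping layout only in where the user id lives. The sketch below converts one into the other. `array_to_mapping` is a hypothetical name, and merging repeated users into a single list is an assumption made here, not documented behavior.

```python
import json
from typing import Any, Dict, List


def array_to_mapping(payload: str) -> Dict[str, List[Dict[str, Any]]]:
    """Turn [{"user": u, "sequence": [...]}, ...] into {u: [...]}."""
    records = json.loads(payload)
    mapping: Dict[str, List[Dict[str, Any]]] = {}
    for record in records:
        # setdefault + extend tolerates a user appearing in several records.
        mapping.setdefault(record["user"], []).extend(record["sequence"])
    return mapping


array_sample = """
[
  {
    "user": "0",
    "sequence": [
      { "item": 1, "rating": 1, "timestamp": "001" },
      { "item": 2, "rating": 1, "timestamp": "022" }
    ]
  }
]
"""
array_result = array_to_mapping(array_sample)
```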

Transactions

Transaction datasets store one event per row/object.

Tabular

Example (ratings):

user    item    rating
0   1   1
0   2   1

Example (timestamp):

user    item    rating  timestamp
0   1   1   001
0   2   1   022
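Because each row is one event, a tabular transaction file can be regrouped into per-user sequences by reading it in order. The following is a sketch under stated assumptions: whitespace-separated columns, a header row that contains a `user` column, and rows already in the desired order. `group_transactions` is a hypothetical helper, not a DataRec function.

```python
from collections import defaultdict
from typing import Dict, List


def group_transactions(text: str) -> Dict[str, List[Dict[str, str]]]:
    """Group one-event-per-row data into {user: [event, ...]} in file order."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    sequences: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))  # column name -> raw string value
        user = row.pop("user")
        sequences[user].append(row)
    return dict(sequences)


rows_sample = "user item rating timestamp\n0 1 1 001\n0 2 1 022\n"
grouped = group_transactions(rows_sample)
```

Values stay as strings, so a timestamp such as `001` keeps its leading zeros.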

JSON

Example (ratings):

[
  { "user": 0, "item": 1, "rating": 1 },
  { "user": 0, "item": 2, "rating": 1 }
]

JSONL

Example (interactions):

{"user": 0, "item": 1}
{"user": 0, "item": 2}

Example (timestamp):

{"user": 0, "item": 1, "rating": 1, "timestamp": "001"}
{"user": 0, "item": 2, "rating": 1, "timestamp": "022"}
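Since JSONL holds one JSON object per line, it can be consumed lazily. A minimal sketch using only the standard library (`iter_jsonl` is a hypothetical helper, and `io.StringIO` stands in for an open file):

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(handle: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one decoded event per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines
            yield json.loads(line)


jsonl_sample = io.StringIO(
    '{"user": 0, "item": 1, "rating": 1, "timestamp": "001"}\n'
    '{"user": 0, "item": 2, "rating": 1, "timestamp": "022"}\n'
)
jsonl_events = list(iter_jsonl(jsonl_sample))
```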

Blocks (text)

  • Block format with an explicit block id header
  • Modes:
      • Item-wise blocks: <ITEM_ID>: then events (one per line; each event id is a user id)
      • User-wise blocks: <USER_ID>: then events (one per line; each event id is an item id)
  • Event layouts:
      • id
      • id,rating
      • id,rating,timestamp
  • Dates are kept as strings (they are not parsed); the reader is streaming
  • Read: read_transactions_blocks
  • Write: write_transactions_blocks

Example (item-wise, id):

1:
10
20

Example (item-wise, id,rating):

1:
10,4
20,3

Example (item-wise, id,rating,timestamp):

1:
10,4,2005-01-01
20,3,2005-01-02

Example (user-wise, id):

10:
1
2

Example (user-wise, id,rating):

10:
1,4
2,3

Example (user-wise, id,rating,timestamp):

10:
1,4,2005-01-01
2,3,2005-01-02
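The block layout shown in these examples can be parsed one line at a time. The sketch below is an independent illustration and not the implementation of `read_transactions_blocks`, whose signature is not described here. `iter_block_events` is a hypothetical name, and it assumes a header is any line ending in a colon. Whether the block id is an item or a user depends on the mode, so the parser returns it without interpretation.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple

# (block_id, event_id, rating, timestamp); all values are kept as strings.
Event = Tuple[str, str, Optional[str], Optional[str]]


def iter_block_events(handle: TextIO) -> Iterator[Event]:
    """Yield events lazily from a block-formatted text stream."""
    block_id: Optional[str] = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):  # block header such as "1:"
            block_id = line[:-1]
            continue
        if block_id is None:
            raise ValueError("event line found before any block header")
        parts = line.split(",", 2)  # id[,rating[,timestamp]]
        rating = parts[1] if len(parts) > 1 else None
        timestamp = parts[2] if len(parts) > 2 else None
        yield block_id, parts[0], rating, timestamp


block_sample = io.StringIO("1:\n10,4,2005-01-01\n20,3,2005-01-02\n")
block_events = list(iter_block_events(block_sample))
```

Because the function is a generator, only one line is held in memory at a time, and the date column is passed through as a string.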

Extending this catalog

To add a new dataset format:

  1. Define the new format structure (fields and serialization).
  2. Provide a minimal example that exercises the loader.
  3. Add a new subsection under the appropriate data type.