The VoxTube Dataset

The VoxTube dataset is distributed as YouTube URLs plus per-video meta information listing the filtered segments that contain human speech.

Update (02.2024): a HuggingFace datasets implementation of VoxTube is available.

Meta file example and description

Meta information is stored per channel in resources/meta/*.json files:

{
    "video_id_0": [
        [segment1_start, segment1_end],
        [segment2_start, segment2_end],
        ...,
        [segmentN_start, segmentN_end]
    ],
    ...
    "video_id_N": [
        [segment1_start, segment1_end],
        ...,
        [segmentN_start, segmentN_end]
    ]
}

Here the name of each .json file is the id of a YouTube channel, the JSON keys are ids of YouTube videos, and each segmentX_start and segmentX_end pair gives the start and end timestamps of a segment in seconds. For example:

# cat VoxTube/resources/meta/UC__gC1TbqcY5j_owWKKUEUQ.json
{
    "LYdLsl4zJj0": [
        [114.0, 118.0],
        [78.0, 82.0],
        [172.0, 176.0],
        [302.0, 306.0],
        [372.0, 376.0],
        ...,
        [204.0, 208.0]
    ],
    "4arwR9j58BY": [
        [114.0, 118.0],
        [220.0, 224.0],
        [154.0, 158.0],
        ...,
        [342.0, 346.0]
    ],
    ...
}
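
As a minimal sketch of reading such a file, the snippet below flattens one channel's meta information into (channel id, video id, start, end) records. The helper name `iter_segments` is hypothetical and not part of the VoxTube repo; it assumes only the JSON layout described above.

```python
import json
from pathlib import Path


def iter_segments(meta_path):
    """Yield (channel_id, video_id, start, end) for every segment in a meta file.

    Hypothetical helper: assumes the file name is the channel id and each
    JSON key maps to a list of [start, end] pairs given in seconds.
    """
    meta_path = Path(meta_path)
    channel_id = meta_path.stem  # file name without the .json extension
    with meta_path.open(encoding="utf-8") as f:
        meta = json.load(f)
    for video_id, segments in meta.items():
        for start, end in segments:
            yield channel_id, video_id, float(start), float(end)
```

For instance, `list(iter_segments("VoxTube/resources/meta/UC__gC1TbqcY5j_owWKKUEUQ.json"))` would return one tuple per segment of that channel.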

Segment examples

The following examples show dataset samples obtained with the provided metadata.

spk_id	video_id	timestamps
UC--EryqEbhW-VtG80N21TdA	0GSmioPWEQo	[138, 142]
UC--EryqEbhW-VtG80N21TdA	0GSmioPWEQo	[324, 328]
UC–EryqEbhW-VtG80N21TdA	a_CZzxUqKrY	[272, 276]
UCzy4jKI1KXgv8NpYzP2Ezaw	4K03k8nVgp4	[476, 480]
UCzy4jKI1KXgv8NpYzP2Ezaw	4K03k8nVgp4	[108, 112]
UCzy4jKI1KXgv8NpYzP2Ezaw	K4zDtpU435c	[218, 222]
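
To check a row like the ones above by ear, a video id and a start timestamp can be turned into a YouTube link that opens at that position. This is only an illustrative sketch: `segment_url` is a hypothetical helper, not something provided by the repo, and it relies on the standard `watch?v=...&t=...s` URL form.

```python
from urllib.parse import urlencode


def segment_url(video_id, start_seconds):
    """Build a YouTube watch URL that starts playback at `start_seconds`.

    Hypothetical helper for manual inspection; the start time is truncated
    to whole seconds because the `t` parameter is given in seconds.
    """
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


print(segment_url("0GSmioPWEQo", 138))
```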

Dataset downloading

The following snippets show how to download the VoxTube data using the meta .json files.

Prerequisites

Install ffmpeg and libsndfile1:

sudo apt-get update && sudo apt-get upgrade
sudo apt-get install ffmpeg libsndfile1

Download required .json files by cloning the VoxTube repo:
```
git clone https://github.com/IDRnD/VoxTube.git
```

Install the yt-dlp Python library:

cd VoxTube/examples
python3 -m pip install -r requirements.txt

Example usage

Note that the default example script converts each audio track to a .wav file with a 16 kHz sampling rate and splits it into 4-second segments.

cd VoxTube/examples

# download the data of one speaker using its meta .json file
python3 load_example.py ../resources/meta/UC-9GWCoQoMr_ey6AMhClStQ.json <DATASET_ROOT>

# download the whole dataset in N parallel jobs
# WARNING: too many parallel jobs can trigger HTTP Error 429 (too many requests);
# decrease the -j parameter in this case
python3 load_all_examples.py -r <DATASET_ROOT> -j N
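
For readers who want to cut segments themselves, the sketch below builds an ffmpeg command that extracts one [start, end] segment from an already downloaded audio file and resamples it to 16 kHz. It is not the code used by load_example.py: the function name `ffmpeg_cut_command` and the file names are hypothetical, and only the standard ffmpeg options `-ss`, `-t`, `-i`, `-ar`, and `-y` are assumed.

```python
def ffmpeg_cut_command(src, dst, start, end, sample_rate=16000):
    """Return an ffmpeg argument list that cuts [start, end] from src into dst.

    Hypothetical helper: timestamps are in seconds, as in the meta files.
    """
    if end <= start:
        raise ValueError("segment end must be greater than segment start")
    duration = end - start
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-ss", str(start),        # seek to the segment start in the input
        "-t", str(duration),      # read only the segment duration
        "-i", str(src),
        "-ar", str(sample_rate),  # resample the output to 16 kHz by default
        str(dst),
    ]


cmd = ffmpeg_cut_command("LYdLsl4zJj0.m4a", "LYdLsl4zJj0_114_118.wav", 114.0, 118.0)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is installed; building the command separately keeps the sketch easy to inspect before anything is executed.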
