The VoxTube dataset is distributed as YouTube URLs together with per-video meta information that lists the filtered segments containing human speech.

Updated 02.2024: a HuggingFace datasets implementation of VoxTube is available here

Meta file example and description

Meta information is stored per channel in resources/meta/*.json files:

{
    "video_id_0": [
        [segment1_start, segment1_end],
        [segment2_start, segment2_end],
        ...,
        [segmentN_start, segmentN_end]
    ],
    ...
    "video_id_N": [
        [segment1_start, segment1_end],
        ...,
        [segmentN_start, segmentN_end]
    ]
}

Here the name of each .json file is the id of a YouTube channel, the JSON keys are ids of YouTube videos from that channel, and each segmentX_start and segmentX_end pair gives a segment's start and end timestamps in seconds. For example:

# cat VoxTube/resources/meta/UC__gC1TbqcY5j_owWKKUEUQ.json
{
    "LYdLsl4zJj0": [
        [114.0, 118.0],
        [78.0, 82.0],
        [172.0, 176.0],
        [302.0, 306.0],
        [372.0, 376.0],
        ...,
        [204.0, 208.0]
    ],
    "4arwR9j58BY": [
        [114.0, 118.0],
        [220.0, 224.0],
        [154.0, 158.0],
        ...,
        [342.0, 346.0]
    ],
    ...
}
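
Reading one of these files takes only a few lines of Python. The following is a minimal sketch that assumes the layout described above; the helper name `iter_segments` and the tuple it yields are illustrative and are not part of the VoxTube repository.

```python
import json
from pathlib import Path


def iter_segments(meta_path):
    """Yield (channel_id, video_id, start, end) for every segment in one meta file.

    Hypothetical helper: assumes the file name (without ".json") is the
    YouTube channel id and that each value is a list of [start, end] pairs.
    """
    meta_path = Path(meta_path)
    channel_id = meta_path.stem  # file name without the ".json" suffix
    with meta_path.open(encoding="utf-8") as f:
        meta = json.load(f)
    for video_id, segments in meta.items():
        for start, end in segments:
            # Timestamps are in seconds; normalise them to float.
            yield channel_id, video_id, float(start), float(end)
```

Calling `list(iter_segments("VoxTube/resources/meta/UC__gC1TbqcY5j_owWKKUEUQ.json"))` would then give one tuple per segment, with the channel id doubling as the speaker id.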

Segment examples

The table below shows dataset samples obtained using the provided metadata.

spk_id video_id timestamps audio
UC--EryqEbhW-VtG80N21TdA 0GSmioPWEQo [138, 142] Speaker UC--EryqEbhW-VtG80N21TdA, example 1
UC--EryqEbhW-VtG80N21TdA 0GSmioPWEQo [324, 328] Speaker UC--EryqEbhW-VtG80N21TdA, example 2
UC--EryqEbhW-VtG80N21TdA a_CZzxUqKrY [272, 276] Speaker UC--EryqEbhW-VtG80N21TdA, example 3
UCzy4jKI1KXgv8NpYzP2Ezaw 4K03k8nVgp4 [476, 480] Speaker UCzy4jKI1KXgv8NpYzP2Ezaw, example 1
UCzy4jKI1KXgv8NpYzP2Ezaw 4K03k8nVgp4 [108, 112] Speaker UCzy4jKI1KXgv8NpYzP2Ezaw, example 2
UCzy4jKI1KXgv8NpYzP2Ezaw K4zDtpU435c [218, 222] Speaker UCzy4jKI1KXgv8NpYzP2Ezaw, example 3

Dataset downloading

The following snippets show how to download the VoxTube data using the meta .json files.

Pre-requisites

Example usage

Note that the default example script converts each audio file to a .wav file with a 16 kHz sampling frequency and splits it into 4-second segments.
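
To illustrate what such a conversion step can look like, here is a minimal sketch that builds an ffmpeg command for cutting one segment and resampling it to 16 kHz. It assumes ffmpeg is installed and that the source audio has already been downloaded; whether the bundled example scripts use ffmpeg in this way is not stated here, and the function name `ffmpeg_cut_command` is hypothetical.

```python
def ffmpeg_cut_command(src, dst, start, end, sample_rate=16000):
    """Build an ffmpeg argument list that cuts [start, end] seconds of src into dst.

    Hypothetical helper, not part of the VoxTube scripts. dst should end in
    ".wav" so that ffmpeg picks the WAV container from the file extension.
    """
    duration = end - start
    if duration <= 0:
        raise ValueError("segment end must be greater than segment start")
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-ss", str(start),        # seek to the segment start (seconds)
        "-t", str(duration),      # read only `duration` seconds
        "-i", str(src),
        "-ar", str(sample_rate),  # resample the output to 16 kHz by default
        str(dst),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For the meta entry `[114.0, 118.0]` the command cuts a 4-second clip, matching the segment length mentioned above.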

cd VoxTube/examples

# example: download a single speaker using its meta .json file
python3 load_example.py ../resources/meta/UC-9GWCoQoMr_ey6AMhClStQ.json <DATASET_ROOT>

# example: download the whole dataset using N parallel jobs
# WARNING: too many parallel jobs (requests) may trigger HTTP Error 429;
# decrease the -j parameter in this case
python3 load_all_examples.py -r <DATASET_ROOT> -j N
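
Besides lowering `-j`, a custom download loop can back off and retry when the server answers with HTTP 429. The sketch below is an assumption-laden illustration, not the behaviour of the provided scripts: `fetch` stands for any hypothetical zero-argument download callable, and it is assumed to raise `urllib.error.HTTPError` on failure (other download libraries raise their own exception types).

```python
import time
import urllib.error


def retry_on_429(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 wait base_delay * 2**attempt seconds and retry.

    Hypothetical helper: any other error, or a 429 on the final attempt,
    is re-raised to the caller unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Exponential backoff: 1 s, 2 s, 4 s, ... with the defaults.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` is fine.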
