The VoxTube dataset is distributed as YouTube URLs plus per-video meta information that lists the filtered segments containing human speech.
Update (02.2024): a HuggingFace datasets implementation of VoxTube is available here
Meta file example and description
Meta information is stored per channel in resources/meta/*.json:
{
    "video_id_0": [
        [segment1_start, segment1_end],
        [segment2_start, segment2_end],
        ...
        [segmentN_start, segmentN_end]
    ],
    ...
    "video_id_N": [
        [segment1_start, segment1_end],
        ...
        [segmentN_start, segmentN_end]
    ]
}
where the name of each .json file is the ID of a YouTube channel, the JSON keys are YouTube video IDs, and each segmentX_start and segmentX_end is a timestamp in seconds. For example:
# cat VoxTube/resources/meta/UC__gC1TbqcY5j_owWKKUEUQ.json
{
    "LYdLsl4zJj0": [
        [114.0, 118.0],
        [78.0, 82.0],
        [172.0, 176.0],
        [302.0, 306.0],
        [372.0, 376.0],
        [204.0, 208.0]
    ],
    "4arwR9j58BY": [
        [114.0, 118.0],
        [220.0, 224.0],
        [154.0, 158.0],
        [342.0, 346.0]
    ]
}
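The layout above maps directly onto nested Python lists. The following is a minimal sketch for reading one meta file, assuming the top-level JSON value is an object keyed by video ID. The helper names `load_meta`, `iter_segments` and `total_duration` are illustrative and are not part of the VoxTube repo; the watch URL pattern is the standard YouTube one.

```python
import json
from pathlib import Path


def load_meta(path):
    """Return (channel_id, {video_id: [[start, end], ...]}) for one meta file."""
    path = Path(path)
    with path.open(encoding="utf-8") as f:
        meta = json.load(f)
    # The file name without the .json suffix is the YouTube channel ID.
    return path.stem, meta


def iter_segments(meta):
    """Yield (video_url, start_sec, end_sec) for every segment in a meta dict."""
    for video_id, segments in meta.items():
        url = f"https://www.youtube.com/watch?v={video_id}"
        for start, end in segments:
            yield url, float(start), float(end)


def total_duration(meta):
    """Total segment duration in seconds across all videos of one channel."""
    return sum(end - start for _, start, end in iter_segments(meta))
```

For instance, `load_meta("VoxTube/resources/meta/UC__gC1TbqcY5j_owWKKUEUQ.json")` would return the channel ID taken from the file name together with the parsed dictionary, which can then be passed to `iter_segments`.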
Segment examples
Below are examples of dataset samples obtained using the provided metadata.
Dataset downloading
The following snippets show how to download the VoxTube data using the meta .json files.
- Install ffmpeg and libsndfile1:
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install ffmpeg libsndfile1
- Download required .json files by cloning the VoxTube repo:
git clone
- Install the Python yt-dlp library:
cd VoxTube/examples
python3 -m pip install -r requirements.txt
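Before starting a long download, it can help to confirm that the tools from the steps above are visible to Python. This is a hedged sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not something shipped with VoxTube. It relies on yt-dlp being importable as `yt_dlp` and on libsndfile being discoverable under the name `sndfile`.

```python
import ctypes.util
import importlib.util
import shutil


def check_prerequisites():
    """Report which of the tools installed above are visible to Python."""
    return {
        # ffmpeg must be on PATH to be callable as a subprocess.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # libsndfile1 is a shared library; find_library returns None if it is missing.
        "libsndfile": ctypes.util.find_library("sndfile") is not None,
        # The yt-dlp package is imported in Python as `yt_dlp`.
        "yt_dlp": importlib.util.find_spec("yt_dlp") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```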
Example usage
Note that the default example script converts each audio track to a .wav file with a 16 kHz sampling frequency and splits it into 4-second segments.
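The conversion described in the note can be approximated with one ffmpeg call per segment. The sketch below is not the repo's script. It assumes the full audio of a video has already been downloaded to a local file, and the mono output (`-ac 1`) and the function names are assumptions made for illustration.

```python
import subprocess


def ffmpeg_segment_cmd(src, dst, start, end, sample_rate=16000):
    """Build an ffmpeg command that cuts start..end (seconds) from src into a WAV file."""
    if end <= start:
        raise ValueError("segment end must be greater than start")
    return [
        "ffmpeg", "-y",              # overwrite dst if it already exists
        "-ss", str(start),           # seek to the segment start, in seconds
        "-t", str(end - start),      # segment duration, in seconds
        "-i", str(src),
        "-ar", str(sample_rate),     # resample to 16 kHz by default
        "-ac", "1",                  # assumption: mono output
        str(dst),
    ]


def extract_segment(src, dst, start, end):
    """Run the command; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(ffmpeg_segment_cmd(src, dst, start, end), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to inspect or log the exact ffmpeg invocation without touching any audio.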
cd VoxTube/examples
# example: download a single speaker using its meta .json file
python3 ../resources/meta/UC-9GWCoQoMr_ey6AMhClStQ.json <DATASET_ROOT>
# example: download the whole dataset in N parallel jobs
# WARNING: too many parallel jobs may trigger HTTP Error 429 (too many requests);
# decrease the -j parameter in that case
python3 -r <DATASET_ROOT> -j N
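The HTTP 429 warning above suggests two mitigations that are easy to combine: cap the number of parallel jobs and wait progressively longer between retries. The following is a generic standard-library sketch of that idea, not the logic of the repo's download script. `func` stands for any hypothetical per-item download callable, and catching every `Exception` is a simplification; a real script would likely retry only on rate-limit errors.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_backoff(func, *args, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)


def run_jobs(func, items, jobs=2):
    """Apply func to every item using at most `jobs` worker threads."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda item: with_backoff(func, item), items))
```

With the defaults, a failing call is retried after 2, 4, 8 and 16 seconds before the error is re-raised. Lowering `jobs` plays the same role as decreasing `-j`.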