A large, high-quality video dataset of URL links to approximately 650,000 YouTube video clips that cover 700 human action classes.
45,000 videos from 3,011 participants, intended for assessing the performance of already-trained models in computer vision and audio applications
Contains 8,732 urban sounds from 10 classes, such as air conditioner, dog bark, drilling, siren, and street music.
An audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube
11,827 videos related to 180 different tasks, all collected from YouTube
A large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities
A Large-Scale Video Benchmark for Human Activity Understanding
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity.
The largest collection of poses, focused on very challenging and realistic human-centric analysis tasks in various crowded and complex events, including getting on and off the subway, collisions, fighting, and earthquake escape
The YFCC100M is the largest publicly and freely usable multimedia collection, containing around 99.2 million photos and 0.8 million videos from Flickr, all of which were shared under one of the various Creative Commons licenses
UMDFaces is a face dataset divided into two parts: still images (367,888 face annotations for 8,277 subjects) and video frames (over 3.7 million annotated video frames from over 22,000 videos of 3,100 subjects).
A large-scale video dataset, featuring clips from movies with detailed captions.
AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noise
A Creative Commons speech dataset targeting acoustically challenging and reverberant environments, with robust labels and truth data for transcription, denoising, and speaker identification.
A simple audio/speech dataset consisting of recordings of spoken English digits
A large-scale dataset of short videos with textual descriptions sourced from the web