5 Large-Scale Datasets for AI Research
These Diverse and Challenging Sets of Data Will Help AI Systems Learn and Grow
SA-1B (image)
SA-1B consists of 11M diverse, high-resolution, and privacy-protecting images collected and licensed from a third-party photo company. The images are photos taken from a camera, i.e. not artwork. The images vary in subject matter. Common themes of the images include locations, objects, and scenes. The dataset includes 1.1B high-quality segmentation masks collected with the Segment Anything Data Engine.
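As a quick sanity check on scale, the two figures quoted above imply an average number of masks per image. The sketch below only divides the numbers given in the text; the constant and function names are illustrative, and the result is an average, not a statement about any individual image.

```python
# Figures as quoted in the paragraph above (approximate, rounded values).
NUM_IMAGES = 11_000_000        # "11M" images
NUM_MASKS = 1_100_000_000      # "1.1B" segmentation masks


def average_masks_per_image(num_masks: int, num_images: int) -> float:
    """Return the mean number of masks per image (hypothetical helper)."""
    return num_masks / num_images


avg_masks = average_masks_per_image(NUM_MASKS, NUM_IMAGES)
print(f"Average masks per image: {avg_masks:.0f}")
```

With the rounded figures from the text, this works out to roughly 100 masks per image on average.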
OIG-moderation (text)
OIG-moderation is a diverse dataset of dialogue that may be related to NSFW subject matter, abuse-eliciting text, instructions that elicit privacy violations, depression or related content, hate speech, and other similar topics. The dataset consists of the [prosocial], [anthropic redteam], and subsets of [English Wikipedia] datasets, along with other public datasets and data created or contributed by volunteers. To regularize the dataset, there are also "regular" OIG instructions, which include Q/A instructions, coding instructions, and similar types of queries.
OIG-43M (text)
The Open Instruction Generalist (OIG) dataset is a large open-source instruction dataset that currently contains ~43M instructions. OIG is one of many chatbot datasets released by LAION together with its volunteers, Ontocord, Together, and other members of the open-source community, and it is intended to create equal access to chatbot technology.
Flan Collection (text)
The Flan Collection, also referred to as "Flan 2022", combines prior collections from FLAN, P3/T0, and Natural Instructions with new dialog, program synthesis, and complex reasoning tasks.
LAION-5B (text and image)
LAION-5B is a dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M, previously the biggest openly accessible image-text dataset in the world.
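The "14x" figure can be checked with simple arithmetic. The sketch below assumes LAION-400M holds about 400 million pairs, a round number taken from the dataset's name and not stated in the text; the names used are illustrative.

```python
# Pair count quoted in the paragraph above.
LAION_5B_PAIRS = 5_850_000_000

# Assumption: roughly 400 million pairs, inferred from the name "LAION-400M".
LAION_400M_PAIRS = 400_000_000


def size_ratio(larger: int, smaller: int) -> float:
    """Return how many times bigger `larger` is than `smaller` (hypothetical helper)."""
    return larger / smaller


ratio = size_ratio(LAION_5B_PAIRS, LAION_400M_PAIRS)
print(f"LAION-5B is about {ratio:.3f}x the size of LAION-400M")
```

Under that assumption the ratio is a little over 14, which is consistent with the "14x bigger" description.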
Every day we post helpful lists and bite-sized explanations on our Twitter. Please join us there!
5 open-source datasets used to train LLMs
1. SA-1B (image)
2. OIG-moderation (text)
3. OIG-43M (text)
4. Flan Collection (text)
5. LAION-5B (text and image)
Links 🧵
TuringPost (@TheTuringPost), 5:00 PM, May 15, 2023