LAION
The Large-scale Artificial Intelligence Open Network (LAION) is a German non-profit with a stated goal "to make large-scale machine learning models, datasets and related code available to the general public".[1] It is best known for releasing large datasets of images and captions scraped from the web, which have been used to train several high-profile text-to-image models, including Stable Diffusion and Imagen.[2][3]
| Type | Non-profit |
|---|---|
| Industry | Artificial intelligence |
| Website | laion |
In February 2023, LAION was named in the Getty Images lawsuit against Stable Diffusion as a non-party.[4] In April 2023, LAION was directly sued by a German photographer who wanted to have his images removed from the training set.[5]
On April 15, 2023, LAION and contributors publicly released OpenAssistant, an open-source AI assistant chatbot.
Image datasets
LAION has publicly released a number of large datasets of image-caption pairs which have been widely used by AI researchers. The data is derived from the Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They used CLIP to identify and discard images whose content did not appear to match their captions.[6] LAION does not host the content of scraped images themselves; rather, the dataset contains URLs pointing to images, which researchers must download themselves.[7]
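The caption-extraction step described above can be illustrated with a minimal sketch. This is not LAION's actual code: it assumes only Python's standard `html.parser` module, and the class name `ImgAltExtractor`, the helper `extract_pairs`, and the sample HTML are hypothetical. The CLIP filtering step is not shown.

```python
from html.parser import HTMLParser


class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text.

    Hypothetical helper for illustration; not part of any LAION tooling.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing <img/> also lands here.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # An alt attribute without a value is reported as None.
        alt = (attr_map.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return a list of (image URL, caption) candidates found in an HTML string."""
    parser = ImgAltExtractor()
    parser.feed(html)
    return parser.pairs


# Made-up sample page: the second image has no alt text and is skipped.
sample = (
    '<p><img src="https://example.com/a.jpg" alt="A red bicycle">'
    '<img src="https://example.com/b.jpg"></p>'
)
print(extract_pairs(sample))
```

Only the URL and caption are kept in this sketch, mirroring the point that the dataset stores links rather than image content.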
The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.[8] It was an attempt to recreate the process used by OpenAI to collect the 400 million image-caption pairs used to train the CLIP model; OpenAI had open-sourced the model's code and weights, but not its training dataset.[6] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.[9]
A successor containing more than 5 billion pairs, LAION-5B, was released in March 2022.[10] At the time of its release, it was the largest freely available dataset of image-caption pairs in existence.[6] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the AI company behind the funding of the Stable Diffusion text-to-image model, which was trained on it.[11]
Example entry
Below is an example of the metadata associated with one entry in the LAION-5B dataset. The image content itself is not stored in the dataset, but is only linked to via the URL field:[12]
- URL: https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg
- Text: Ammodorcas clarkei The book of antelopes (1894).jpg
- Width: 275 (measured in pixels)
- Height: 311 (measured in pixels)
- Similarity: 0.34972 (cosine similarity between the image and caption, as measured using CLIP; pairs with a similarity below 0.3 were discarded from the dataset)
- Pwatermark: 0.30022 (estimated probability that this image bears a watermark, as determined by an AI model)
- Punsafe: 0.0000001688 (estimated probability that this image is "not safe for work", as determined by an AI model)
- Aesthetic: 6.02298 (estimated score that a human rater would assign to the aesthetics of this image, on a scale from 1 to 10)
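As an illustration of how such metadata might be used, the sketch below represents the entry above as a Python dictionary and applies simple threshold filters. The dictionary key names, the function `passes_filters`, and every threshold except the 0.3 similarity cutoff mentioned above are assumptions chosen for illustration, not values or names taken from LAION's tooling.

```python
# The example entry, with hypothetical lowercase key names.
entry = {
    "url": (
        "https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/"
        "Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/"
        "275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg"
    ),
    "text": "Ammodorcas clarkei The book of antelopes (1894).jpg",
    "width": 275,
    "height": 311,
    "similarity": 0.34972,
    "pwatermark": 0.30022,
    "punsafe": 0.0000001688,
    "aesthetic": 6.02298,
}


def passes_filters(
    entry,
    min_similarity=0.3,   # cutoff stated in the description above
    max_pwatermark=0.5,   # assumed threshold, for illustration only
    max_punsafe=0.1,      # assumed threshold, for illustration only
    min_aesthetic=6.0,    # assumed threshold, for illustration only
):
    """Return True if the entry satisfies every threshold."""
    return (
        entry["similarity"] >= min_similarity
        and entry["pwatermark"] <= max_pwatermark
        and entry["punsafe"] <= max_punsafe
        and entry["aesthetic"] >= min_aesthetic
    )


print(passes_filters(entry))                      # default thresholds
print(passes_filters(entry, max_pwatermark=0.2))  # stricter watermark cutoff
```

Because the dataset holds only URLs and scores, a filter like this can be run over the metadata before any image is downloaded.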
References
- "About". LAION.ai. Retrieved 26 September 2022.
- Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica.
- Newman, Marissa; Cantrill, Aggi (24 April 2023). "The Future of AI Relies on a High School Teacher's Free Database". Bloomberg News. Retrieved 24 April 2023.
- "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". CourtListener. Retrieved 8 February 2023.
- "A Photographer Tried to Get His Photos Removed from an AI Dataset. He Got an Invoice Instead". Vice. Retrieved 4 May 2023.
- Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ.
- Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica.
- Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. Retrieved 26 September 2022.
- Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487.
- Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog.
- Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch.
- "image 17024". LAION Aesthetic 6+ dataset explorer. Retrieved 26 September 2022.