LAION

Type	Non-profit
Industry	Artificial intelligence
Founders	Christoph Schuhmann; Jenia Jitsev; Richard Vencu; Robert Kaczmarczyk; Theo Coombes; Mehdi Cherti; Aarush Katta; Jan Ebert
Website	laion.ai

The Large-scale Artificial Intelligence Open Network (LAION) is a German non-profit with the stated goal "to make large-scale machine learning models, datasets and related code available to the general public".[1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models, including Stable Diffusion and Imagen.[2][3]

In February 2023, LAION was named as a non-party in the Getty Images lawsuit against Stability AI over Stable Diffusion.[4] In April 2023, LAION was sued directly by a German photographer who wanted his images removed from the training set.[5]

On April 15, 2023, LAION and contributors publicly released OpenAssistant, an open-source AI assistant chatbot.

Image datasets

LAION has publicly released several large datasets of image-caption pairs, which have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages. The developers searched the crawled HTML for <img> tags and treated their alt attributes as captions. They then used CLIP to identify and discard images whose content did not appear to match their captions.[6] LAION does not host the scraped images themselves; rather, each dataset contains URLs pointing to the images, which researchers must download themselves.[7]
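The two steps described above, collecting alt-text captions and then filtering by image-caption similarity, can be illustrated with a minimal Python sketch. This is not LAION's actual pipeline: the names ImgAltCollector, extract_pairs and keep_pair are hypothetical, and the embeddings are placeholder vectors standing in for the output of CLIP's image and text encoders. The 0.3 threshold is the cutoff this article reports for LAION-5B.

```python
import math
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Keep only images that have both a URL and a usable caption.
        if src and alt:
            self.pairs.append((src, alt))


def extract_pairs(html):
    """Return the list of (image URL, caption) candidates found in an HTML page."""
    collector = ImgAltCollector()
    collector.feed(html)
    return collector.pairs


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_pair(image_embedding, text_embedding, threshold=0.3):
    """Keep a pair only if its embeddings are similar enough.

    Assumption: in a real pipeline the embeddings would come from CLIP;
    here they are plain lists of floats supplied by the caller.
    """
    return cosine_similarity(image_embedding, text_embedding) >= threshold
```

Only URLs and captions are collected at this stage, which matches the way the released datasets store links rather than image content.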

The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.[8] It was an attempt to recreate the process OpenAI had used to collect the 400 million image-caption pairs on which it trained the CLIP model; the company had chosen to open-source the model's code and weights, but not its training dataset.[6] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.[9]

A successor of more than 5 billion pairs, LAION-5B, was released in March 2022.[10] At the time of its release, it was the largest freely available dataset of image-caption pairs in existence.[6] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the AI company that funded the Stable Diffusion text-to-image model, which was trained on LAION-5B.[11]

Example entry

An example of one of the billions of images in the LAION-5B dataset

Below is an example of the metadata associated with one entry in the LAION-5B dataset. The image content itself is not stored in the dataset; it is only linked to via the URL field:[12]

URL: https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg
Text: Ammodorcas clarkei The book of antelopes (1894).jpg
Width: 275 (measured in pixels)
Height: 311
Similarity: 0.34972 (cosine similarity between the image and caption, as measured using CLIP. Any pairs having similarity values less than 0.3 were discarded from the dataset.)
Pwatermark: 0.30022 (estimated probability that this image bears a watermark, as determined by an AI model)
Punsafe: 0.0000001688 (estimated probability that this image is "not safe for work", as determined by an AI model)
Aesthetic: 6.02298 (estimated score that a human rater would assign the aesthetics of this image, on a scale from 1 to 10)
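As a rough illustration of how such a record might be used, the sketch below loads the example entry into a Python dataclass and applies threshold filters to it. The class name Entry, the function passes_filters and the lowercase field names are hypothetical, chosen to mirror the fields listed above. Apart from the 0.3 similarity cutoff stated above, the default thresholds are illustrative assumptions, not values taken from LAION.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One metadata record; field names mirror the example above (hypothetical schema)."""
    url: str
    text: str
    width: int
    height: int
    similarity: float
    pwatermark: float
    punsafe: float
    aesthetic: float


def passes_filters(entry, min_similarity=0.3, max_pwatermark=0.5,
                   max_punsafe=0.1, min_aesthetic=6.0):
    """Return True if the entry clears every threshold.

    Only min_similarity=0.3 comes from the article; the other defaults
    are assumptions chosen for illustration.
    """
    return (
        entry.similarity >= min_similarity
        and entry.pwatermark <= max_pwatermark
        and entry.punsafe <= max_punsafe
        and entry.aesthetic >= min_aesthetic
    )


example = Entry(
    url=("https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/"
         "Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg/"
         "275px-Ammodorcas_clarkei_The_book_of_antelopes_%281894%29.jpg"),
    text="Ammodorcas clarkei The book of antelopes (1894).jpg",
    width=275,
    height=311,
    similarity=0.34972,
    pwatermark=0.30022,
    punsafe=0.0000001688,
    aesthetic=6.02298,
)

print(passes_filters(example))
```

Because the record holds only a URL, any consumer of the dataset would still need to download the image separately before training on it.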

References

[1] "About". LAION.ai. Retrieved 26 September 2022.
[2] Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica.
[3] Newman, Marissa; Cantrill, Aggi (24 April 2023). "The Future of AI Relies on a High School Teacher's Free Database". Bloomberg News. Retrieved 24 April 2023.
[4] "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". CourtListener. Retrieved 8 February 2023.
[5] "A Photographer Tried to Get His Photos Removed from an AI Dataset. He Got an Invoice Instead". Vice. Retrieved 4 May 2023.
[6] Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ.
[7] Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica.
[8] Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. Retrieved 26 September 2022.
[9] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Karagol Ayan, Burcu; Mahdavi, S. Sara; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487.
[10] Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog.
[11] Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch.
[12] "image 17024". LAION Aesthetic 6+ dataset explorer. Retrieved 26 September 2022.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.