OpenAI, the influential artificial intelligence research lab behind groundbreaking tools like ChatGPT and Sora, has found itself in hot water following a recent interview with its Chief Technology Officer, Mira Murati.
The interview, conducted by Wall Street Journal reporter Joanna Stern, focused on Sora, OpenAI’s latest video generation system.
Concerns center around the potential misuse of copyrighted work to train AI models and the lack of transparency from OpenAI regarding its data practices.
Sora’s training data is in question
At the heart of the controversy lies the issue of training data, the massive datasets used to train AI models.
When asked about the sources of data used for Sora, Murati provided the standard response: the model had been trained on “publicly available and licensed data.”
However, further probing revealed hesitation and uncertainty on Murati’s part about the specific details of this dataset.
This response has raised red flags among artists, photographers, and intellectual property experts. AI image generation systems depend heavily on ingesting vast quantities of images, many of which may be protected by copyright. The lack of clarity around Sora’s training data raises questions about whether OpenAI has adequately safeguarded the rights of content creators.
Shutterstock usage admitted later on
Adding fuel to the fire was Murati’s initial refusal to address whether Shutterstock images were a component of Sora’s training dataset. Only after the interview, in a footnote added by the Wall Street Journal, did Murati confirm the use of Shutterstock’s image library.
The delayed confirmation sits uneasily with OpenAI’s public-facing stance of “publicly available and licensed data” and, to critics, suggests an attempt to avoid scrutiny of potentially problematic sourcing practices.
Shutterstock and OpenAI formed a partnership granting OpenAI rights to use Shutterstock’s image library in training image generation models like DALL-E 2 and potentially Sora.
In return, Shutterstock contributors (the photographers and artists whose images are on the platform) receive compensation when their work is used in the development of these AI models.
A PR nightmare unfolds
It’s safe to say that most public relations folks would not consider this interview to be a PR masterpiece.
Murati’s lack of clarity comes at a sensitive time for OpenAI, which is already facing major copyright lawsuits, including a significant one filed by the New York Times.
The public is scrutinizing practices like OpenAI’s alleged secret use of YouTube videos for model training, as previously reported by The Information. With stakeholders ranging from artists to politicians demanding accountability, Murati’s avoidance only fuels the fire.
OpenAI’s opaque approach is backfiring spectacularly, transforming the Sora interview into a PR disaster.
OpenAI CTO Mira Murati says Sora was trained on publicly available and licensed data pic.twitter.com/rf7pZ0ZX00
— Tsarathustra (@tsarnick) March 13, 2024
Transparency is a much-discussed topic for good reason
This incident underscores a critical point: transparency is paramount in the world of AI. OpenAI’s stumbling responses have undermined public trust and intensified questions about its ethical practices. The Sora controversy highlights the growing chorus demanding greater accountability within the AI industry.
Murati’s reluctance to disclose the specifics of Sora’s training data breeds mistrust and sets a dangerous precedent.
Without the clarity artists, creators, and the public are demanding, ethical debates and the potential for legal action will only intensify.
There are no angels in this land
While much of the current scrutiny falls squarely on OpenAI, it’s crucial to remember that it is not the only player in the game.
Meta’s LLaMA model and Google’s Gemini have also faced allegations of problematic training data sources.
This isn’t surprising, as Business Insider reports that Meta has already admitted to using Instagram and Facebook posts to train its AI models. Additionally, Google’s control over vast swaths of the internet gives it unparalleled access to potential training data, raising similar ethical concerns about consent and copyright.
The situation with OpenAI’s Sora is just one piece of a larger puzzle. The entire AI development field is facing scrutiny regarding its data practices and the potential ethical implications.
Featured image credit: Freepik.