Search the Database of Pirated Books AI Trained On While Trump Kills Your Local Library

  • 21.03.2025 18:00
  • gizmodo.com
  • Keywords: AI, Books, Meta, OpenAI

Meta used pirated books from LibGen to train its AI, sparking concerns among authors and scientists. Meanwhile, Trump's administration cuts funding for public libraries, limiting access to literature.

Estimated market influence

Meta

Sentiment: Negative
Analyst rating: Strong buy

Meta has used pirated books, including millions of works from LibGen, to train its AI. This has implications for copyright law and for Meta's relationships with authors.

OpenAI

Sentiment: Neutral
Analyst rating: N/A

OpenAI initially used datasets from LibGen but moved away from these sources as of 2023. The company states that its current models were not trained on LibGen datasets.

Context

Analysis: Business Insights and Market Implications

Overview

The text highlights the ethical and legal concerns surrounding AI development by major tech companies like Meta, which have reportedly used pirated books from platforms like LibGen to train their AI models. Additionally, it draws attention to the U.S. government's potential reduction in funding for public libraries under President Trump, further threatening access to literature.


Key Facts and Data Points

  • LibGen Database:

    • Contains nearly 7.5 million books and 81 million academic papers.
    • Used by scientists and researchers to bypass expensive publisher fees.
    • Described as a "shadow library" due to its illicit nature but recognized for its benefits in scientific progress.
  • Meta's AI Training:

    • Meta trained its AI on material from LibGen, some of which it downloaded via torrents.
    • A senior researcher at Meta, Melanie Kambadur, emphasized the importance of books over web data for training AI.
    • Internal documents reveal discussions about licensing vs. pirating content.
  • OpenAI's Response:

    • OpenAI stated that its current models (e.g., ChatGPT) were not trained using LibGen datasets.
    • The company has faced legal challenges over its use of copyrighted material and has claimed "fair use" in response.
  • Public Libraries Funding:

    • Trump's administration issued an executive order targeting the Institute of Museum and Library Services (IMLS), a key funding source for public libraries.
    • Federal grants support digital services like Libby and Hoopla, which provide e-books and audiobooks to users.
  • Author Reactions:

    • Authors like Michael Chabon, Michael Livingston, and Aliette de Bodard expressed dissatisfaction over their works being used without permission.
    • This raises ethical concerns about the devaluation of creative content.

Market Trends and Business Impact

  • AI Development Costs:

    • Companies like Meta are cutting costs by training on pirated content instead of licensing books.
    • This approach risks long-term damage to relationships with authors, publishers, and the broader creative community.
  • Public Library Funding Cuts:

    • Reduced federal funding could lead to scaling back or elimination of digital services provided by libraries.
    • Users may face longer wait times for e-books and limited access to certain titles.
  • Shift in Content Access:

    • The reliance on pirated content creates a paradox: tech companies exploit free access to literature while the public institutions that support literacy are undermined.

Competitive Dynamics

  • Tech vs. Traditional Publishing:

    • Big tech firms like Meta and OpenAI are competing with traditional publishers over control of intellectual property.
    • The use of pirated content gives these companies a competitive edge but raises ethical concerns.
  • Regulatory Uncertainty:

    • The legality of AI training on copyrighted material remains unclear, creating regulatory risks for companies involved.

Long-Term Effects and Regulatory Implications

  • Erosion of Literary Ecosystem:

    • Piracy and reduced library funding could harm the publishing industry and limit public access to books.
    • This may lead to a decline in cultural and educational opportunities for communities relying on libraries.
  • Potential Legal Repercussions:

    • Authors and publishers may pursue legal action against companies using their works without authorization.
    • Regulatory bodies may impose stricter guidelines on AI training data sourcing.

Ethical Considerations

  • Value of Intellectual Property:

    • The use of pirated content challenges the value system around intellectual property, potentially devaluing creative work.
    • This could discourage authors and publishers from investing in new content.
  • Equity in Access:

    • While AI companies benefit from free access to books, public libraries face funding cuts, creating a disparity in opportunities for marginalized communities.

Strategic Considerations

  • Public Relations Risks:

    • Companies like Meta and OpenAI risk damaging their brand reputation due to ethical concerns over pirated content use.
    • Public backlash could lead to scrutiny from regulators and consumers.
  • Potential Alternatives:

    • Licensing agreements or partnerships with publishers could provide a more sustainable and ethical alternative for AI training data.
    • Collaborations with libraries might help preserve public access to literature while supporting content creators.