Why AI Can’t Tell Whose Content It Is Stealing
Aug 13, 2024
Discover why AI struggles to identify content ownership and the implications for creators and publishers in our latest article. Learn how this issue impacts content integrity and what can be done to protect your work.
The Fight for Attribution
Real-world problems with generative AI highlight the critical challenges of attribution. For instance, artist Greg Rutkowski has seen his distinct style replicated by AI models without permission, leading to lost professional recognition. Similarly, AI-generated news summaries by Google faced backlash for lacking proper source attribution, undermining user trust. Additionally, Getty Images sued Stability AI for allegedly using millions of its images without permission, emphasizing the need for fair compensation.
Consider a hypothetical scenario where an AI model generates a comprehensive report using medical research data. If the AI cannot attribute specific findings to their original studies, it could lead to issues of credibility and trust in the medical community. Another example could be an AI-generated book that draws heavily on existing literature without proper attribution, potentially leading to legal disputes and ethical concerns over intellectual property rights.
Attribution in Gen AI is fundamentally limited and practically challenging.
Fundamental Limitations
These issues underline the fundamental limitations of attributing generated content to specific training data.
Intertwined Learning and Representation: Large language models learn representations of words, phrases, and contexts through intertwined layers of neurons. These layers process inputs in a distributed manner, meaning each layer's output depends on many previous layers' activations. Each layer's neurons contribute to the final output in a highly interdependent way. This complexity obscures direct relationships between specific training data and generated content. The model encodes information about language patterns, not specific instances, making it difficult to attribute parts of generated text to exact sources (representation learning). A single generated sentence might draw from millions of data points, blending learned patterns rather than replicating specific texts.
Diffused Learning Patterns: Generative AI models diffuse learning across their entire structure, meaning that no single piece of training data can be isolated as the source of a specific output. The model learns language patterns in a distributed fashion, where each pattern influences many outputs and each output is influenced by many patterns (pattern diffusion). This diffusion blurs the lines of attribution, as outputs are amalgamations of countless learned patterns rather than direct copies of training data (attention blur). For example, a product review generated by an AI might incorporate language styles and facts from thousands of reviews, making direct attribution to specific sources impossible.
Practical Challenges
Massive Scale of Parameters: Generative AI models like GPT-3 have an enormous number of parameters—175 billion in GPT-3’s case. Each parameter plays a role in transforming input data into meaningful output, and this scale presents a significant challenge for attribution. During training, the model adjusts billions of weights to minimize the error in its predictions, and each data point influences many parameters in complex ways. When generating text, those same billions of weights are activated together, and this collective activation makes it nearly impossible to trace specific outputs back to particular inputs. Even if only around 1 billion weights were meaningfully adjusted by a single piece of text, identifying which weights were adjusted by which training data points would be a computationally prohibitive task.
Computational Prohibitiveness: Attempting to attribute generated content to specific training data points requires substantial computational resources. The vast number of parameters and the intricate ways they interact make this task nearly infeasible. Tracing a response back to its training data involves examining billions of parameters and their interactions, demanding significant computational power and time. Even with advanced computational resources, the time and cost associated with detailed attribution are prohibitive for practical applications. For a model like GPT-3, reverse-engineering parameter influences across 175 billion parameters is beyond current computational capabilities for most real-world scenarios.
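To make that concrete, here is a rough back-of-envelope sketch of what even the cheapest gradient-comparison style of attribution would involve at GPT-3 scale. The 175 billion parameter count comes from above; the roughly 300 billion training tokens, 2,048-token examples, and fp16 gradient storage are approximations used only for illustration.

```python
# Rough back-of-envelope estimate of what tracing one output back to its
# training data would cost for a GPT-3 scale model. The figures below are
# approximations for illustration, not measurements.

PARAMS = 175e9              # GPT-3 parameter count
TRAIN_TOKENS = 300e9        # approximate size of the training corpus (tokens)
EXAMPLE_LEN = 2048          # tokens per training example (context window)
BYTES_PER_GRAD = 2          # assume fp16 storage per gradient value

n_examples = TRAIN_TOKENS / EXAMPLE_LEN

# Influence-style attribution needs, per training example, a gradient the
# same size as the model itself that can be compared against the output's
# gradient.
grad_bytes_per_example = PARAMS * BYTES_PER_GRAD
total_grad_storage = grad_bytes_per_example * n_examples

# A single dot product between two 175B-element vectors costs roughly
# 2 * PARAMS operations; attribution needs one per training example.
flops = 2 * PARAMS * n_examples

print(f"training examples:       {n_examples:,.0f}")
print(f"gradient storage needed: {total_grad_storage / 1e18:,.1f} exabytes")
print(f"comparison operations:   {flops:.2e}")
```

Even this simplified accounting lands in the tens of exabytes of stored gradients and on the order of 10^19 operations per attributed output, before any of the real bookkeeping begins.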
Even Fractional Attribution Is Not Feasible: Attributing even a fraction of generated content to specific training data points runs into the same issues of scale and complexity. The intertwined nature of parameter adjustments means that partial attribution would still require extensive computational resources and detailed analysis of model behavior, and diffuse learning means that even small outputs are influenced by vast swathes of data. Tracing even a small part of a generated sentence back to specific training data is therefore computationally prohibitive and lacks the precision needed for meaningful attribution.
Give Control to Creators: A Path Forward
Giving content creators control over how their content is used by AI companies can help address these challenges:
Explicit Permissions and Logging: When creators give explicit permission for their content to be used, it becomes feasible to log and track this data at a granular level. Each piece of content can be tagged with metadata detailing its source and permissions, creating a transparent record. This helps mitigate the issue of computational prohibitiveness by narrowing down the specific sources used in training. For example, a journalist who opts their articles in for AI training would have each article carry metadata such as their name and usage terms, with access conditional on that metadata, allowing the use of each piece to be tracked and attributed.
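A minimal sketch of what that tagging could look like is below: each opted-in article carries its creator's name and usage terms as metadata and is only released to a training pipeline if the opt-in flag is set, with every release logged. The ContentRecord structure and its field names are illustrative, not a reference to any real platform's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContentRecord:
    """One piece of opted-in content plus the metadata that travels with it."""
    content_id: str
    creator: str
    text: str
    licensed_for_training: bool = False
    usage_terms: str = ""
    access_log: list = field(default_factory=list)

    def release_for_training(self, requester: str) -> str:
        """Return the text only if the creator opted in, and log the access."""
        if not self.licensed_for_training:
            raise PermissionError(f"{self.content_id} is not licensed for AI training")
        self.access_log.append({
            "requester": requester,
            "accessed_at": datetime.now(timezone.utc).isoformat(),
            "terms": self.usage_terms,
        })
        return self.text

# Example: a journalist opts one article in, with attribution terms attached.
article = ContentRecord(
    content_id="article-2024-08-001",
    creator="Jane Doe",
    text="Full article text ...",
    licensed_for_training=True,
    usage_terms="attribution required; agreed usage terms apply",
)
article.release_for_training(requester="example-model-trainer")
print(article.access_log)
```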
Granular Access Control: With creators specifying which parts of their content can be used, platforms can maintain detailed logs of what data was accessed and when. This reduces the complexity of tracing responses back to their origins, addressing the challenge of the vast number of weights involved. For example, a photographer grants permission for only certain images to be used. Each access to these images by an AI training pipeline is logged and enriched with metadata, making it straightforward to trace any AI-generated work back to the specific images and their creator.
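Below is a sketch of that kind of per-item control: a photographer's catalogue in which only explicitly allowed image IDs can be fetched for training, and every fetch is appended to a shared access log. The catalogue and log formats are hypothetical and chosen only to illustrate the idea.

```python
from datetime import datetime, timezone

# Hypothetical catalogue: the photographer marks, per image, whether AI
# training use is allowed.
catalogue = {
    "img-001": {"creator": "A. Photographer", "ai_training_allowed": True},
    "img-002": {"creator": "A. Photographer", "ai_training_allowed": False},
    "img-003": {"creator": "A. Photographer", "ai_training_allowed": True},
}

access_log = []  # shared, append-only record of every training access

def fetch_for_training(image_id: str, crawler: str):
    """Release an image for training only if its owner allowed it, and log it."""
    entry = catalogue[image_id]
    if not entry["ai_training_allowed"]:
        return None  # denied: the creator did not grant permission
    access_log.append({
        "image_id": image_id,
        "creator": entry["creator"],
        "crawler": crawler,
        "accessed_at": datetime.now(timezone.utc).isoformat(),
    })
    return f"<binary data for {image_id}>"

for image_id in catalogue:
    fetch_for_training(image_id, crawler="example-crawler/1.0")

# The log now shows exactly which images entered the training set and when.
print(access_log)
```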
Automated Tracking and Auditing: By implementing advanced monitoring and auditing systems, platforms can automatically track the usage of content in AI models. These systems can generate reports that attribute AI-generated outputs to the specific data sources used in training, bridging the gap between a generated response and the training data behind it. For example, a music streaming service logs every instance of its songs being accessed by AI for training. Automated reports can then show which songs contributed to the generated music, ensuring proper attribution and compensation.
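Given access logs like the ones sketched above, an automated audit report reduces to a simple aggregation. The snippet below rolls accesses up by creator and by item; the log fields are the same illustrative ones used earlier, not a real reporting schema.

```python
from collections import Counter

# Example access-log entries in the same illustrative shape as above.
access_log = [
    {"item_id": "song-101", "creator": "Band A", "crawler": "example-crawler/1.0"},
    {"item_id": "song-101", "creator": "Band A", "crawler": "example-crawler/1.0"},
    {"item_id": "song-202", "creator": "Band B", "crawler": "other-crawler/2.3"},
]

def usage_report(log):
    """Summarise how often each creator's items were accessed for training."""
    per_creator = Counter(entry["creator"] for entry in log)
    per_item = Counter((entry["creator"], entry["item_id"]) for entry in log)
    return {
        "accesses_by_creator": dict(per_creator),
        "accesses_by_item": {f"{c} / {i}": n for (c, i), n in per_item.items()},
    }

print(usage_report(access_log))
```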
Tracking AI Crawlers: The First Step
For content owners to achieve meaningful attribution, the first step is to track AI crawler activity on their content. Understanding which AI models are accessing and using their data is crucial:
Monitoring Access: Implementing systems to monitor and log AI crawler activity helps content owners know who is using their data. For example, a news website could use sophisticated tracking software to log every instance of an AI crawler accessing its articles. This data can then be used to manage permissions and ensure proper attribution.
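As a minimal illustration, the sketch below scans standard web-server access-log lines for a few publicly documented AI crawler user agents (GPTBot, ClaudeBot, CCBot, PerplexityBot). It assumes the common combined log format, and the signature list is illustrative rather than exhaustive.

```python
import re
from collections import Counter

# User-agent substrings of some publicly documented AI crawlers. This list is
# illustrative only; operators publish and change these strings over time.
AI_CRAWLER_SIGNATURES = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"]

# Matches the request path and user-agent fields of a combined-log-format line.
LOG_LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+) [^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

def ai_crawler_hits(log_lines):
    """Count requests per (crawler, path) for known AI crawler user agents."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent")
        for signature in AI_CRAWLER_SIGNATURES:
            if signature in agent:
                hits[(signature, match.group("path"))] += 1
    return hits

sample = [
    '203.0.113.7 - - [13/Aug/2024:10:00:00 +0000] "GET /articles/attribution HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
]
print(ai_crawler_hits(sample))
```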
Control Mechanisms: By controlling and monitoring access, content owners can establish a basis for attribution and ensure their data is used ethically and transparently. They need easy-to-use control mechanisms that can be embedded in their content without disruption. A content management platform could deploy APIs to automatically approve or deny AI crawler requests based on predefined criteria, ensuring that only authorized crawlers have access to specific data. For example, a tag added to the web pages that serve content could selectively enforce access, admitting AI crawlers only after their attribution or pricing conditions are met.
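The snippet below sketches one such mechanism under stated assumptions: a gate in front of content pages that serves regular visitors normally but returns a 403 to known AI crawlers unless they present a licence token agreed in advance. The header name, token scheme, and embedded meta tag are hypothetical placeholders for whatever terms a publisher actually negotiates.

```python
# Hypothetical gate a site could place in front of content pages. Known AI
# crawlers are served only if they present a previously agreed licence token;
# the header name and token values here are made up for illustration.

AI_CRAWLER_SIGNATURES = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"]
LICENSED_TOKENS = {"token-issued-after-signing-terms"}  # issued once terms are met

def gate_request(headers: dict) -> tuple[int, str]:
    """Return an HTTP status and body for an incoming request."""
    agent = headers.get("User-Agent", "")
    is_ai_crawler = any(sig in agent for sig in AI_CRAWLER_SIGNATURES)

    if not is_ai_crawler:
        return 200, "<html>... full article for regular readers ...</html>"

    if headers.get("X-Content-Licence") in LICENSED_TOKENS:
        # Licensed crawler: serve the page with attribution metadata embedded.
        return 200, '<html><meta name="content-licence" content="attribution-required">...</html>'

    # Unlicensed AI crawler: refuse until attribution or pricing terms are accepted.
    return 403, "AI training access requires an agreed licence token."

print(gate_request({"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)"}))
print(gate_request({"User-Agent": "Mozilla/5.0 (compatible; GPTBot/1.0)",
                    "X-Content-Licence": "token-issued-after-signing-terms"}))
```

In practice such a gate would more likely live in the web server or CDN than in application code, but the decision logic stays the same: identify the crawler, check for an agreed licence, then deny or serve accordingly.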
Giving content creators control over how their content is used offers a path forward, ensuring fair use and proper compensation while maintaining the quality of AI training data. By first implementing systems to track AI crawler activity, content owners can begin to address these limitations and foster a more ethical, transparent, and sustainable AI ecosystem. Contact us to learn how we can help.