Researchers have identified a phenomenon they call "ghost personas": large language models systematically invent fictional authors and experts, and these invented names are now appearing en masse online and in scientific repositories.

What Happened

The study found that models such as Claude, Gemini, and GPT each have characteristic, correlated sets of names (name priors) that act as a kind of digital fingerprint. For example, the name pair Elena Vasquez and Marcus Chen is characteristic of Claude. The result is large-scale pollution of scientific platforms: more than 1,600 fake entries were identified in the Zenodo repository. These entries were assigned real DOIs, which makes them extremely difficult to filter automatically.
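To illustrate what a "correlated set of names" means in practice, here is a minimal Python sketch that counts how often pairs of names appear together in the same record. It is not the researchers' method, which this article does not describe. The function name `pair_counts` and the input shape (one list of author names per record) are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(author_lists):
    """Count how often each unordered pair of names shares a record.

    `author_lists` is assumed to be an iterable of name lists,
    one list per repository entry (a hypothetical input format).
    """
    counts = Counter()
    for authors in author_lists:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(authors))
        counts.update(combinations(unique, 2))
    return counts


if __name__ == "__main__":
    sample = [
        ["Elena Vasquez", "Marcus Chen"],
        ["Marcus Chen", "Elena Vasquez"],
        ["Elena Vasquez"],
    ]
    # Pairs that recur far more often than chance would suggest
    # are candidates for closer manual review.
    for pair, n in pair_counts(sample).most_common(5):
        print(pair, n)
```

A high count alone proves nothing, since real co-authors also publish together repeatedly. The sketch only shows how a recurring pair could be surfaced for inspection.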

Context

The problem is compounded because AI-generated content with valid metadata (such as DOIs) is automatically indexed by scientific aggregators. This creates a feedback loop in which synthetic data enters the training sets of new models and the knowledge bases of RAG systems, poisoning the overall information environment.

Why It Matters for the Industry

For the AI industry, this represents a systemic threat to data verification. Developers need to implement additional layers of observability and specialized mechanisms to check for model "name fingerprints" in data preparation pipelines to avoid model quality degradation caused by training on synthetic junk.
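A check of this kind in a data preparation pipeline might look like the following Python sketch. Everything here is an illustrative assumption: the `NAME_FINGERPRINTS` table contains only the single pair named in this article, and the record fields `authors` and `body` are hypothetical. The sketch flags records for review and does not delete them, because real people can carry these names.

```python
import re
from typing import Iterable

# Hypothetical fingerprint table. Only the pair mentioned in this
# article is listed; a real list would have to come from the research.
NAME_FINGERPRINTS: dict[str, list[frozenset[str]]] = {
    "Claude": [frozenset({"Elena Vasquez", "Marcus Chen"})],
}


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case for simple substring matching."""
    return re.sub(r"\s+", " ", text).casefold()


def matched_fingerprints(text: str) -> list[str]:
    """Return model labels for which every name in some set occurs in the text."""
    haystack = _normalize(text)
    hits = []
    for model, name_sets in NAME_FINGERPRINTS.items():
        for names in name_sets:
            if all(_normalize(name) in haystack for name in names):
                hits.append(model)
                break  # one matching set is enough for this model
    return hits


def flag_records(records: Iterable[dict]) -> list[dict]:
    """Return copies of suspicious records, annotated with suspected models."""
    flagged = []
    for record in records:
        # "authors" and "body" are assumed field names for this example.
        text = " ".join([record.get("authors", ""), record.get("body", "")])
        hits = matched_fingerprints(text)
        if hits:
            flagged.append({**record, "suspected_models": hits})
    return flagged
```

Requiring the whole name set to co-occur, and not just a single name, is a deliberate choice: one common name on its own is weak evidence, while the article describes the correlated pair as the fingerprint.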

Why It Matters for Users

For readers and researchers, the presence of specific correlated names in articles or posts becomes a new signal for detecting fully synthetic content. Traditional authenticity checks, such as the presence of a DOI, can no longer guarantee that a real person stands behind the material.

What Is Not Yet Known / Limitations

The problem spans both engineering risks (training-data poisoning) and legal issues, and further research into content protection methods is required.

Author

Look at AI, Editorial Team