
I think you misunderstood my proposal

LLMs are good at producing embeddings, which are latent representations of the content of the text. For research papers, that content includes things like authorship, research directions, and citations to other papers.

When you fine-tune a model that generates such embeddings on a labeled fraud dataset (consisting of, say, thousands of samples), the resulting model will produce different embeddings, which can then be clustered.

The clusterings will differ between the model fine-tuned with the fraud labels and the one without them.

Now, using this embedding-generation model, you (may) have a way to discern what truly significant research looks like, versus research that has been tainted by excess regurgitation of untrustworthy data.
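To make the comparison step concrete, here is a minimal sketch of how you might measure how much the fine-tuning reshaped the clusters. The embeddings here are random stand-ins (a real pipeline would get them from the base and fine-tuned LLMs); the clustering and comparison use scikit-learn's KMeans and adjusted Rand index.

```python
# Hypothetical sketch: cluster the same papers' embeddings from two models
# (base vs. fine-tuned on fraud labels) and compare the clusterings.
# Random vectors stand in for real LLM embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n_papers, dim = 500, 64

# Stand-ins for the two models' embeddings of the same papers.
base_embeddings = rng.normal(size=(n_papers, dim))
tuned_embeddings = rng.normal(size=(n_papers, dim))

base_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(base_embeddings)
tuned_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(tuned_embeddings)

# A low score means fine-tuning moved papers between clusters; those
# papers are the candidates worth inspecting for fraud-like signal.
score = adjusted_rand_score(base_labels, tuned_labels)
print(score)
```

Papers whose cluster assignment changes most between the two models would be the ones flagged for a closer look.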


