
1. A Survey on LLM-Generated Text Detection (2024)

Motivation

Why should we learn how to detect an LLM generated text?

  • It can spread erroneous knowledge.
  • It carries a risk of malicious exploitation (fraud, spam production…).
  • MAD (Model Autophagy Disorder; Alemohammad et al.): training on LLM-generated text can reduce the quality and diversity of subsequent models.

Also, current detection, whether by algorithms or by humans, is unreliable.

Problem Statements and Characteristics

\[D(x) = \begin{cases} 1 & \text{if } x \text{ is generated by LLMs} \\ 0 & \text{if } x \text{ is written by human} \end{cases}\]

Basically, it is a binary classification problem.


Disparities between human and LLM generated text

  • Length & Vocabulary: LLM outputs are typically ~2x longer, but with more limited vocabulary.
  • Stylistic Features:
    • More grammatical simplicity
    • Frequent use of passive voice and narrative style
  • Structural Traits:
    • Stronger organization, logic, formality, and objectivity
    • More comprehensive and detailed responses
  • Content Quality:
    • Less biased or harmful content
    • Occasionally illogical or factually incorrect details (“hallucinations”)
  • Linguistic Patterns:
    • Higher frequency of nouns, verbs, determiners, adjectives, auxiliaries, conjunctions, particles
    • Lower frequency of adverbs and punctuation
  • Emotional Tone:
    • Lower emotional intensity
    • Clearer, more neutral, often with a positive bias

ex)

Human : I think ~ , I believe ~

AI : That is known as~, This is the fact that ~


LLMs generate text by predicting subsequent tokens one at a time, not all at once.

The quality of decoded text is intrinsically tied to the decoding strategy.
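To make the role of the decoding strategy concrete, here is a minimal sketch that compares greedy decoding with temperature sampling over one toy next-token distribution. The vocabulary, the scores, and the function names are made up for illustration and do not come from the survey.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(vocab, logits):
    """Greedy decoding: always take the highest-scoring token."""
    return vocab[max(range(len(logits)), key=lambda i: logits[i])]

def sample_pick(vocab, logits, temperature, rng):
    """Sampling: draw a token in proportion to its temperature-scaled probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Toy next-token candidates and scores (assumed values).
vocab = ["mat", "sofa", "roof", "moon"]
logits = [3.0, 2.0, 1.0, 0.1]

rng = random.Random(0)
greedy_token = greedy_pick(vocab, logits)
sampled_tokens = [sample_pick(vocab, logits, 1.5, rng) for _ in range(10)]
print(greedy_token, sampled_tokens)
```

Greedy decoding returns the same token every time, while sampling produces varied continuations, which is one reason the same model can yield text with quite different statistical fingerprints.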


Related works and investigation so far

Tang, Chuang, and Hu (2023) present another survey, categorizing detection methods into black-box detection and white-box detection.

  1. Black-box Detection
    • Detects text without access to the model’s internal structure or parameters.
    • Works by analyzing text content and style alone.
    • +: Can be applied in real-world settings without model access.
    • -: Accuracy may drop as LLMs produce more human-like text.
  2. White-box Detection
    • Uses internal model information, such as probabilities, parameters, or training data.
    • +: Higher accuracy and more precise detection.
    • -: Requires access to the model, which is often limited.

Corpus

High-quality labeled datasets are pivotal for advancing research in LLM-generated text detection.

However, procuring such high-quality labeled data demands a lot of resources.

Detection datasets

Potential datasets

Detection datasets (corpora) are built specifically for LLM-generated text detection.

Potential datasets, on the other hand, were built for regular tasks (Q&A, machine translation, story generation…) but turned out to be usable for LLM-generated text detection research as well.

Because they were designed for regular tasks, they contain only human-written text, so we have to add LLM-generated text for the same tasks ourselves.


Remaining challenges

  • Multiple types of attack
  • Multi domains
  • Multiple LLMs & outdated LLMs
  • Multilingual

A noticeable trend among researchers is the tendency to utilize datasets originally designed for other tasks as human-written texts, and produce LLM-generated texts based on them for training detectors. This approach arises from the limitations of existing datasets or benchmarks in comprehensively addressing diverse research perspectives.

  1. Problem: High-quality datasets for LLM detection are scarce and difficult to create from scratch.
  2. Workaround: Researchers repurpose existing datasets that were originally made for other tasks (e.g., Q&A, summarization).
  3. New Issue: Since this recycled data is tailored to a specific purpose, detectors develop a “narrow perspective” and fail to perform well in diverse, real-world scenarios.

Advances in Research

We will look at four types of detectors:

  1. Watermarking methods
  2. Statistics-based methods
  3. Neural-based methods
  4. Human-assisted methods

Watermarking Technology

The fundamental idea behind watermarking is a “Lock and Key” system. The method used to embed the watermark (the lock) is always paired with a specific algorithm to detect it (the key). This key is secret and is almost always held exclusively by the LLM provider (e.g., Google, OpenAI). You cannot detect a watermark without the corresponding key.


Data-Driven Watermarking (The “Backdoor” Method)

These methods typically rely on backdoor insertion, where a small number of watermarked samples are added to the dataset, allowing the model to implicitly learn a secret function set by the defender

The watermark is embedded into the model itself during the initial training phase. The model is taught to produce a hidden signal in response to a secret trigger, effectively creating a backdoor

  • How to Detect: The provider, who knows the secret trigger, can test a suspect LLM. They input the trigger phrase. If the model outputs the expected watermarked pattern, it is confirmed to be their model. This isn’t for checking random text but for proving a model was stolen.

Model-Driven Watermarking (The “Real-time” Method)

Model-Driven methods embed watermarks directly into the LLMs by manipulating the logits output distribution or token sampling during the inference process.

The watermark is applied in real time as the text is being generated (during the inference process). It works by influencing how the model chooses its next word at each step.

It manipulates the word selection process without changing the model’s core parameters. For example, using a secret key, the algorithm generates a “green list” of statistically preferred words for the model to choose from. The model is gently nudged to pick words from this list.
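The green-list idea can be sketched with a toy example. Everything here is an illustrative assumption: the tiny vocabulary, the SHA-256 seeding scheme, the "model" that picks words at random, and the hard restriction to green words (a real scheme would only nudge the logits). Detection counts how many tokens fall on the green list and compares that with chance using a z-score.

```python
import hashlib
import math
import random

VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug", "quietly", "fast"]
GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def green_list(prev_token, secret_key):
    """Derive a pseudo-random green list from the secret key and the previous token."""
    digest = hashlib.sha256((secret_key + "|" + prev_token).encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    shuffled = sorted(VOCAB)  # fixed starting order so the split is reproducible
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * GAMMA)])

def generate_watermarked(length, secret_key, seed=0):
    """Toy 'model': picks random words, but only from the current green list."""
    rng = random.Random(seed)
    tokens = ["the"]
    for _ in range(length):
        candidates = sorted(green_list(tokens[-1], secret_key))
        tokens.append(rng.choice(candidates))
    return tokens

def z_score(tokens, secret_key):
    """Compare the number of green tokens with what chance alone would give."""
    n = len(tokens) - 1
    hits = sum(tok in green_list(prev, secret_key) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

marked = generate_watermarked(50, secret_key="provider-secret")
z_right_key = z_score(marked, "provider-secret")  # large: every token is green
z_wrong_key = z_score(marked, "some-other-key")   # typically close to chance level
print(round(z_right_key, 2), round(z_wrong_key, 2))
```

Without the right key, the green lists cannot be reconstructed, so the text looks statistically ordinary to anyone else.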


Post-Processing Watermarking (The “Steganography” Method)

The watermark is added to the text after it has been fully generated by the LLM. This is done by a separate module, similar to how steganography hides data in an image.

It modifies the finished text in ways that are invisible or very subtle to a human reader.

  • Replacing ordinary space characters with visually similar Unicode whitespace characters.
  • Replacing certain words with synonyms according to a secret, predefined rule (e.g., ‘happy’ to ‘joyful’).
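The whitespace variant can be sketched as follows. This is a minimal illustration, not a scheme from the survey: the choice of THIN SPACE (U+2009) as the look-alike character, the one-bit-per-space encoding, and the function names are all assumptions.

```python
REGULAR_SPACE = " "
MARK_SPACE = "\u2009"  # THIN SPACE, chosen arbitrarily for this sketch

def embed(text, bits):
    """Encode one bit per space: '0' keeps the normal space, '1' swaps in the look-alike."""
    out, i = [], 0
    for ch in text:
        if ch == REGULAR_SPACE and i < len(bits):
            out.append(MARK_SPACE if bits[i] == "1" else REGULAR_SPACE)
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def extract(text, n_bits):
    """Read the hidden bits back from the first n_bits space characters."""
    bits = ["1" if ch == MARK_SPACE else "0"
            for ch in text if ch in (REGULAR_SPACE, MARK_SPACE)]
    return "".join(bits[:n_bits])

original = "The quick brown fox jumps over the lazy dog"
secret_bits = "10110"
marked = embed(original, secret_bits)
recovered = extract(marked, len(secret_bits))
print(recovered)
```

The marked text renders almost identically to the original, but this kind of watermark is fragile: normalizing whitespace removes it.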

Real world Process

  1. Situation: A university professor suspects a student’s essay was written by Gemini.
  2. Process: The university sends the text to Google’s official “Watermark Verification Service.”
  3. Verification: Google uses its internal, private detector and the secret key for Gemini to analyze the text.
  4. Result: Google returns a simple result to the university, such as: “This text has a 99.8% probability of being generated by our model.” The university never sees or uses the secret key itself.

Only the LLM provider (Google, OpenAI, etc.) holds the secret key needed for detection.


Statistics based Method


Linguistics Features Statistics

  • Repetition of specific words/phrases
  • Limited vocabulary
  • Low lexical diversity
  • Predictable, overly perfect grammar structure
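A couple of these signals are easy to compute directly. The sketch below measures lexical diversity as a type-token ratio and repetition as the share of bigrams that occur more than once. The specific statistics, the tokenization rule, and the sample sentence are illustrative choices, not the survey's definitions.

```python
import re
from collections import Counter

def lexical_stats(text):
    """Compute simple vocabulary and repetition statistics for a text."""
    words = re.findall(r"[a-z']+", text.lower())
    bigrams = list(zip(words, words[1:]))
    bigram_counts = Counter(bigrams)
    repeated = sum(c for c in bigram_counts.values() if c > 1)
    return {
        "n_words": len(words),
        # unique words / total words: lower means a more limited vocabulary
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        # share of bigram occurrences that belong to a repeated bigram
        "repeated_bigram_share": repeated / len(bigrams) if bigrams else 0.0,
    }

sample = "the model is good and the model is fast and the model is safe"
stats = lexical_stats(sample)
print(stats)
```

On their own these numbers prove nothing; they become useful when compared against typical values for human-written text in the same domain.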

White-Box Statistics

White box: direct access to the source model. These methods are zero-shot.

Logits: the raw scores a model outputs before the softmax function is applied. Softmax turns them into probabilities in $[0,1]$.
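For reference, the standard softmax that maps logits to probabilities can be written as follows, where $z_i$ is the logit of the $i$-th vocabulary token and $|V|$ is the vocabulary size (the symbols are introduced here for illustration):

\[p_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}, \qquad 0 < p_i < 1, \qquad \sum_{i=1}^{|V|} p_i = 1\]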

  1. Logits-based: if most tokens in the text are among the model’s top-probability candidates, the text was very likely generated by an LLM.
  2. Perturbed-based: take a sentence, alter it slightly (perturb it), and observe the change in its log-probability score. LLM-generated text tends to sit at a “perfect and optimal” point for the model, so even a slight change makes a significant difference.
  3. Intrinsic dimension: a mathematical property; AI-generated text occupies a simpler, lower-dimensional space than human-written text.
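The perturbation idea can be sketched with a toy scoring model. In this hedged illustration a made-up unigram probability table stands in for a real white-box LLM, and a perturbation swaps one random token for a random vocabulary word. The score is the original log-probability minus the average log-probability of the perturbed versions.

```python
import math
import random

# Toy unigram "model": a stand-in for a real LLM's token probabilities (made-up numbers).
TOY_PROBS = {"the": 0.30, "cat": 0.20, "sat": 0.20, "on": 0.15,
             "mat": 0.10, "zebra": 0.03, "juggled": 0.02}

def log_prob(tokens):
    """Log-probability of a token sequence under the toy model."""
    return sum(math.log(TOY_PROBS[t]) for t in tokens)

def perturb(tokens, rng):
    """Swap one randomly chosen token for a random vocabulary word."""
    out = list(tokens)
    out[rng.randrange(len(out))] = rng.choice(sorted(TOY_PROBS))
    return out

def perturbation_discrepancy(tokens, n_perturbations=200, seed=0):
    """Original log-probability minus the mean log-probability after perturbation."""
    rng = random.Random(seed)
    perturbed = [log_prob(perturb(tokens, rng)) for _ in range(n_perturbations)]
    return log_prob(tokens) - sum(perturbed) / len(perturbed)

likely_text = ["the", "cat", "sat", "on", "the", "mat"]            # what the toy model prefers
odd_text = ["zebra", "juggled", "the", "zebra", "juggled", "mat"]  # what it finds unlikely
d_likely = perturbation_discrepancy(likely_text)
d_odd = perturbation_discrepancy(odd_text)
print(round(d_likely, 3), round(d_odd, 3))
```

Text the model already considers near-optimal loses probability when perturbed, giving a large positive score, while unlikely text does not. A real detector would use an actual LLM for scoring and a paraphrasing model for perturbation.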

Black-Box Statistics

This method uses an external AI as a “detection tool” because the internal information of the suspect AI is inaccessible

  • Rewrite/Edit task
  • Continuation task

When the external AI is asked to rewrite the text: if it was written by an AI, there should be little to fix.

But if it was written by a human, there should be substantial changes.
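A minimal sketch of the rewrite check, assuming the external LLM's rewritten output is already available as a string. The threshold, the function names, and the example texts are hypothetical. The amount of change is measured with Python's standard `difflib.SequenceMatcher` at the word level.

```python
from difflib import SequenceMatcher

def change_ratio(original, rewritten):
    """Share of the text that changed: 0.0 means identical, 1.0 means nothing in common."""
    matcher = SequenceMatcher(None, original.split(), rewritten.split())
    return 1.0 - matcher.ratio()

def looks_llm_generated(original, rewritten, threshold=0.2):
    """Few edits after rewriting suggests the original was already LLM-like (assumed threshold)."""
    return change_ratio(original, rewritten) < threshold

# Hypothetical outputs of an external LLM that was asked to revise each text.
ai_like = "The results indicate a significant improvement in overall accuracy."
ai_like_rewritten = "The results indicate a significant improvement in overall accuracy."
human_like = "so yeah we tried it and honestly it kinda worked better i guess"
human_like_rewritten = "We tested the approach, and it performed noticeably better."

print(looks_llm_generated(ai_like, ai_like_rewritten))
print(looks_llm_generated(human_like, human_like_rewritten))
```

In practice the threshold would have to be calibrated on labeled data, since the amount of rewriting also depends on the prompt and on the rewriting model.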


Neural based Methods

Linguistic Features Based Classifiers

It follows the classic machine learning techniques.

  1. Extract features and turn them into real-valued vectors (number of stopwords, length, etc.).
  2. Train a classification model.

This method does not need an LLM at all, yet it still shows high accuracy.

(A recent study achieved 97% accuracy by extracting 21 textual features; Aich et al.)
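The two steps can be sketched in plain Python. This is a toy illustration only: the two features (stopword share and average word length), the tiny stopword list, the four training sentences, and the nearest-centroid classifier are all assumptions made for brevity, not the 21-feature setup of the cited study.

```python
import re

STOPWORDS = {"i", "we", "the", "is", "it", "this", "as", "a", "my", "do", "can"}

def features(text):
    """Step 1: map a text to a real-valued vector [stopword share, average word length]."""
    words = re.findall(r"[a-z]+", text.lower())
    stop_share = sum(w in STOPWORDS for w in words) / len(words)
    avg_len = sum(len(w) for w in words) / len(words)
    return [stop_share, avg_len]

def train_centroids(samples):
    """Step 2: 'training' here just averages the feature vectors of each class."""
    sums, counts = {}, {}
    for text, label in samples:
        vec = features(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, text):
    """Assign the label of the closest class centroid (squared Euclidean distance)."""
    vec = features(text)
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Made-up training data for illustration.
train = [
    ("I think this is fine", "human"),
    ("I believe we can do it", "human"),
    ("This phenomenon is commonly known as overfitting", "llm"),
    ("The method is generally considered effective", "llm"),
]
centroids = train_centroids(train)
print(predict(centroids, "I guess my plan is ok"))
print(predict(centroids, "The approach is widely regarded as reliable"))
```

A realistic pipeline would use many more features, feature scaling, and a stronger classifier, but the structure (features in, label out, no LLM required) stays the same.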


Model Features Based classifiers

(You need a white-box LLM)

These classifiers are not only capable of detecting texts generated by LLMs but can also be employed for text provenance tracing.

  1. Put the target text into the white-box LLM.
  2. Extract all of its internal logits.
  3. Using this information, calculate log likelihood* and perplexity**.
  4. Train the classifier on these numbers.

It is a very powerful and effective method. You can even identify which specific LLM created the text.

However, you need a complete white-box LLM, so you can’t use this method on closed-source models like GPT.

*Log Likelihood

It’s the sum of the log probabilities of each word appearing in the text.

The higher the value, the more natural and predictable the model considers the sentence.

  1. “The cat sat on the mat”
  2. $P(\text{“The”}) = 0.01$, $P(\text{“cat”} \mid \text{“The”}) = 0.05$, … and so on
  3. The final log likelihood is the sum of the logs of these probabilities: $\log(0.01) + \log(0.05) + \dots$

AI-written text tends to have a high log likelihood, and human-written text a low one.
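As a small worked example, the sum can be computed for the two probabilities given above. The remaining token probabilities are elided in the example, so only the first two terms are included here. The natural logarithm is an assumption, and the function name is illustrative.

```python
import math

def log_likelihood(token_probs):
    """Sum of log probabilities (natural log) over the tokens of a text."""
    return sum(math.log(p) for p in token_probs)

# P("The") and P("cat" | "The") from the example; later tokens are not given.
partial_probs = [0.01, 0.05]
partial_ll = log_likelihood(partial_probs)
print(round(partial_ll, 4))
```

Because every probability is below 1, each term is negative, so "higher" log likelihood means closer to zero.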

** Perplexity

A measure of the model’s uncertainty or confusion.

High log likelihood = easy to predict = low PPL

Low log likelihood = hard to predict = high PPL
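The inverse relationship follows from the standard definition of perplexity, which is the exponential of the negative average log likelihood per token. Here $N$ is the number of tokens and $x_{<i}$ denotes the tokens before position $i$ (notation introduced for illustration):

\[\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right)\]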


Pre-training classifiers

According to Qiu et al. (2020), pre-trained LMs have proven to be powerful in natural language understanding, which is crucial for enhancing various tasks in NLP, with text categorization being particularly noteworthy. Esteemed pre-trained models, such as BERT (Devlin et al. 2019a), RoBERTa (Liu et al. 2019), and XLNet (Yang et al. 2019), have exhibited superior performance relative to their counterparts in traditional statistical machine learning and deep learning when applied to the text categorization tasks within the GLUE benchmark (Wang et al. 2019).

Idea: take pre-trained models that are good at natural language understanding, such as BERT and RoBERTa, and fine-tune them on a specific task: LLM-generated text detection.

In-domain fine tuning is all you need

Fine-tune the model on labeled training data (human-written and AI-written).

+: It achieves high accuracy (over $95\%$).

-: Suffers from overfitting.

Methods to solve the overfitting problem

  1. Contrastive learning : Effective for low-data scenarios. It teaches the model to pull similar samples (e.g., human-human) closer together and push dissimilar samples (human-AI) further apart in its internal representation, creating a clearer decision boundary.
  2. Adversarial learning : Intentionally training the detector on sophisticated, “attack” samples (e.g., paraphrased AI text). This makes the detector more familiar with difficult cases and improves its robustness.
  3. Features-Enhanced : A hybrid approach. It provides a neural network detector (like RoBERTa) with additional linguistic features (e.g., stylistic scores, sentiment) as extra hints, improving overall performance and robustness by covering areas the neural network might miss.

LLMs as detectors

A recent reevaluation by Bhattacharjee and Liu (2023) on modern LLMs like ChatGPT and GPT-4 found that neither could reliably identify text generated by various LLMs.

Solution : ICL(In-context Learning)

Through ICL, existing LLMs can adeptly handle different tasks without needing additional fine-tuning.

Although an LLM’s raw detection performance is poor, it improves dramatically when given “hints.” Instead of just asking, “Is this AI-written?”, you provide illustrative examples directly within the prompt.

Just asking “Is this AI-written?” (X)

Instead, we can give it a hint: “This is a human-written article and this is an AI-written one. Based on these, is this text AI-written?” (O)
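Building such a prompt is mostly string assembly. The sketch below shows one possible few-shot layout. The instruction wording, the `Text:`/`Label:` format, and the example texts are assumptions, and no particular LLM API is implied.

```python
def build_icl_prompt(examples, target_text):
    """Assemble a few-shot prompt: labeled examples first, then the text to classify."""
    lines = ["Decide whether each text was written by a human or generated by an AI.", ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {target_text}")
    lines.append("Label:")  # left open for the LLM to complete
    return "\n".join(lines)

# Hypothetical labeled examples to place in the context.
examples = [
    ("honestly i think the movie was kinda boring", "human"),
    ("The film is widely regarded as a landmark of modern cinema.", "AI"),
]
prompt = build_icl_prompt(examples, "This approach is known as transfer learning.")
print(prompt)
```

The resulting string would be sent to the LLM as-is, and the completion after the final `Label:` is read as the prediction.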


Human-assisted method

Mixed detection is ideal:

While algorithms focus on statistical patterns, humans are better at spotting intuitive clues

AI: a visualization tool like GLTR can give humans a clear hint.

H: focus mainly on what humans are good at, such as spotting logical errors and a lack of detail.


Evaluation Metrics

(Image: confusion matrix, from Spot Intelligence)

A confusion matrix can help evaluate classification performance effectively.

  • TP, TN, FP, FN
  • Accuracy
  • Precision
  • Recall
  • Human Recall
  • LLM Recall
  • Avg Recall
  • FPR
  • TNR
  • FNR
  • F1
  • AUROC

Important Issues

Out of distribution

  • Cross-domain
  • Cross-lingual
  • Cross-LLM

Potential Attacks

  • Paraphrase attack
  • prompt attack
  • Training threat models

Real world data issues

Real-world data is quite messy. Many texts are not purely generated by LLMs.

Such texts are very hard to distinguish.

Model size on detectors

  • The bigger the generating model is, the more likely its text is to seem human-written.
  • The bigger the detector model is, the more its generalization ability may suffer.

Lack of Effective evaluation framework

  • Researchers all use different evaluation standards.

Future research direction

  • Building detectors that are robust to attacks
  • Enhancing the efficiency of zero-shot detectors (black-box zero-shot > white-box zero-shot)
  • Optimizing detectors for low-resource environments
