Watermarking LLMs
Written by Lenox Herman
Motivation
As large language models (LLMs) become more widely used, the need to determine where a piece of text originated becomes more apparent. For instance, a teacher might want to know whether a student used an LLM to complete an assignment. On the flip side, organizations like OpenAI might need a way to verify whether a given output can genuinely be attributed to their models. Watermarking addresses this by embedding a detectable statistical signal in generated text, so that one can later determine whether a block of text was human- or machine-generated.
Let’s look at the example prompt “Turmeric lemon cookies are…”. To generate the next word, an LLM embeds the context and produces a probability distribution over its vocabulary, from which the next word is selected.
Red/Green Listing
This is a watermarking method in which the vocabulary is partitioned in advance into “red” and “green” words. A small constant is added to the logits of all green words before the softmax is used to compute the probability distribution. The result is that green words are slightly more likely to be selected when generating text.
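As a concrete illustration, here is a minimal Python sketch of the idea (the function and variable names are illustrative, not taken from any particular library): a mask marks the green-listed tokens, and a small constant delta is added to their logits before the softmax.

    import numpy as np

    def green_biased_probs(logits, green_mask, delta=2.0):
        """Add delta to the logits of green-listed tokens, then apply softmax."""
        biased = logits + delta * green_mask   # boost green tokens
        biased -= biased.max()                 # for numerical stability
        exp = np.exp(biased)
        return exp / exp.sum()

    # Toy example: a 10-token vocabulary in which roughly half the tokens are green.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=10)
    green_mask = (rng.random(10) < 0.5).astype(float)
    probs = green_biased_probs(logits, green_mask)
    next_token = rng.choice(10, p=probs)       # green tokens are now slightly favored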
To detect red/green watermarking in LLM-generated text, you would run a hypothesis test on the count of green words. Consider the example:
“Watermarking is a very fun topic to cover especially when Teal is teaching it.”
Here we assume that the probability of a green word appearing in natural (human-written) text is 50%, i.e., \(P(\text{Green}) = 0.5\). Under this null hypothesis, we can examine the observed number of green words among the \(m\) words of the text to determine whether it is likely watermarked.
The probability of observing exactly \(K\) green words in \(m\) trials follows a binomial distribution:
\[ P(K \text{ successes in } m \text{ trials}) = \binom{m}{K} p^K (1 - p)^{m-K}, \]
where \(\binom{m}{K}\) is the binomial coefficient and \(p = P(\text{Green})\).
By comparing the observed frequency of green words to the expected frequency under the null hypothesis (50/50 distribution), we can infer whether the text is watermarked.
The p-value of the test is the probability of observing at least \(k\) green words under the null hypothesis:
\[ P(\text{at least } k \text{ successes in } m \text{ trials}) = \sum_{l=k}^{m} \binom{m}{l} p^l (1 - p)^{m-l}, \]
where \(p\) is the probability of success (e.g., selecting a green word). A small p-value indicates that the text is likely watermarked.
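A small Python sketch of this test, assuming the green-word count has already been computed (the function name is hypothetical): it evaluates the one-sided binomial p-value \(P(K \geq k)\) under the null hypothesis \(p = 0.5\).

    from math import comb

    def green_list_p_value(k, m, p=0.5):
        """One-sided binomial p-value: probability of at least k green words out of m."""
        return sum(comb(m, l) * p**l * (1 - p)**(m - l) for l in range(k, m + 1))

    # Example: 70 green words out of 100 is very unlikely under p = 0.5
    # (p-value on the order of 4e-5), which suggests the text is watermarked.
    print(green_list_p_value(70, 100))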
Modifying the probability distribution for watermarking slightly degrades output quality. Selecting a word that is not the highest-probability choice increases the risk of hallucinations, since LLMs are autoregressive and build text sequentially, so an off word choice early on can compound.
Additionally, if someone gains access to the model, they could potentially reverse-engineer the watermarking process by distinguishing red words from green words. For example, the word “delve” could be identified as a green word favored by ChatGPT, providing evidence that watermarking is being used.
Distortion-Free Watermarking
To address the issue of modifying the probability distribution, distortion-free watermarking provides a way to embed a watermark without altering the model’s output distribution. For example, consider the sentence:
“Vermont kale in the winter! ___”
In this approach:
- The tokens within a designated window, say the last three words (“in”, “the”, “winter” in this example), are converted into numerical values and summed to create a seed.
- This seed drives a pseudo-random number generator, producing a random yet replicable sample from the probability distribution of the next word without introducing distortion.
To detect this watermark, one would:
- Access the prompt or the preceding context (e.g., “Vermont kale in the winter!”).
- Extract the last three tokens, regenerate the seed, and sample from the given probability distribution.
- Compare the generated probability distribution with the observed text.
This approach ensures the watermark is detectable without modifying the core probability distribution of the language model; however, it is still impractical because the detector needs access to the prompt (or preceding context) as well.
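A minimal sketch of the seeded-sampling idea, assuming a toy setup where the context has already been tokenized into integer IDs (the helper name seeded_sample and the sum-of-IDs seed construction are illustrative assumptions):

    import numpy as np

    def seeded_sample(next_token_probs, context_ids, window=3):
        """Sample the next token reproducibly, seeding the RNG from the last few tokens."""
        seed = int(sum(context_ids[-window:]))     # seed from the last `window` token IDs
        rng = np.random.default_rng(seed)          # replicable randomness
        return int(rng.choice(len(next_token_probs), p=next_token_probs))

    # A detector with the same context and the same next-token distribution can
    # rerun seeded_sample and check whether it reproduces the observed token.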
Exponential Minimum Sampling
Exponential Minimum Sampling addresses both of the core challenges of watermarking: it does not require access to the prompt, and it avoids altering the probability distribution. Here, a pseudo-random vector \(X\) is generated using a seed derived from a sliding window of the given text. This vector has continuous values between 0 and 1, with length equal to the vocabulary size \(|v|\).
The next token is selected by minimizing the cost function
\(\text{Cost}_i = -\frac{\log(x_i)}{p_i},\)
where \(x_i\) is the \(i\)-th entry of the \(X\) vector and \(p_i\) is the model’s probability for token \(i\). This rule ensures that the probability of selecting a specific token \(i^*\) corresponds exactly to its true likelihood \(P(i^*)\) under the original probability distribution, as derived below.
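A short sketch of this selection rule in Python (the sum-of-IDs seed construction is an assumption, and every token is assumed to have nonzero probability):

    import numpy as np

    def exp_min_sample(probs, prev_ids, window=3):
        """Pick the next token by minimizing -log(x_i) / p_i."""
        seed = int(sum(prev_ids[-window:]))                   # seed from the sliding window
        x = np.random.default_rng(seed).random(len(probs))    # X ~ Uniform(0, 1)^|v|
        cost = -np.log(x) / probs                             # exponential cost per token
        return int(np.argmin(cost))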
Pseudo-code for Detection:
    cost = 0
    for token_id in token_ids:
        seed = calculate_seed(token_id, window)
        x_vector = generate_x_vector(seed)
        cost += -log(x_vector[token_id]) / len(token_ids)
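A runnable version of this loop, under the assumption that the seed is a simple sum of the window of token IDs preceding each position (calculate_seed and generate_x_vector are filled in here with one plausible implementation). For unwatermarked text the average of \(-\log x\) is about 1, while text generated with this watermark favors tokens with large \(x_i\) and should therefore yield a noticeably smaller cost.

    import numpy as np

    def calculate_seed(token_ids, pos, window=3):
        """Seed from the sum of the `window` token IDs preceding position `pos` (assumed scheme)."""
        return int(sum(token_ids[max(0, pos - window):pos]))

    def generate_x_vector(seed, vocab_size):
        """Reproduce the pseudo-random vector X ~ Uniform(0, 1)^|v| from the seed."""
        return np.random.default_rng(seed).random(vocab_size)

    def detection_cost(token_ids, vocab_size, window=3):
        cost = 0.0
        for pos, token_id in enumerate(token_ids):
            seed = calculate_seed(token_ids, pos, window)
            x_vector = generate_x_vector(seed, vocab_size)
            cost += -np.log(x_vector[token_id]) / len(token_ids)
        return cost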
The selection of the next word can be expressed as:
\(\text{next} = \arg\min_i \frac{-\log x_i}{p_i},\) where \(x \sim \text{Uniform}(0, 1)^{|v|}\).
Additionally, we can show that the probability of selecting the next word \(i\) under this rule is just \(P(i) = p_i\), because:
\(P\left(-\frac{\log x_i}{p_i} \geq t\right) = P\left(x_i \leq e^{-p_i t}\right) = e^{-p_i t}.\)
Thus, the probability density of \(-\frac{\log x_i}{p_i}\) is: \(P\left(-\frac{\log x_i}{p_i} = u\right) = p_i \cdot e^{-p_i u}.\)
The probability that the selected word is \(i\), i.e., that token \(i\) attains the minimum, is obtained by integrating over all possible values \(u\) of its cost while requiring every other token’s cost to exceed \(u\): \(P\left(\text{next} = i\right) = \int_{0}^{\infty} P\left(-\frac{\log x_i}{p_i} = u\right) \prod_{j \neq i} P\left(-\frac{\log x_j}{p_j} > u\right) \, du.\)
Substituting the density and the tail probabilities: \(P\left(\text{next} = i\right) = p_i \int_{0}^{\infty} e^{-p_i u} \prod_{j \neq i} e^{-p_j u} \, du.\)
Combining the exponents:
\(P\left(\text{next} = i\right) = p_i \int_{0}^{\infty} e^{-\left(\sum_{j} p_j\right) u} \, du.\)
Solving the integral:
\(P\left(\text{next} = i\right) = p_i \cdot \left[-\frac{1}{\sum_{j} p_j} e^{-\left(\sum_{j} p_j\right) u}\right]_{0}^{\infty} = \frac{p_i}{\sum_{j} p_j},\)
and since the probabilities \(p_j\) sum to 1, this simplifies to:
\(P\left(\text{next} = i\right) = p_i.\)
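A quick Monte Carlo check of this result (the toy distribution below is arbitrary): sampling many uniform vectors and tallying which index attains the minimum cost recovers the original distribution \(p\).

    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.5, 0.3, 0.15, 0.05])       # toy next-token distribution
    counts = np.zeros(len(p))
    for _ in range(100_000):
        x = rng.random(len(p))                 # X ~ Uniform(0, 1)^|v|
        counts[np.argmin(-np.log(x) / p)] += 1
    print(counts / counts.sum())               # approximately [0.5, 0.3, 0.15, 0.05]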
The seed for this process is derived from the sliding window of observed tokens, so a detector can regenerate \(X\) and check for the watermark without access to the prompt.