
I'm looking for references (articles) about how LLM-generated content is moderated. From a technical point of view, what makes the difference between a so-called "uncensored" LLM such as Pygmalion 7B and a "censored" one? Does an LLM always generate text that gets moderated afterwards, or is it pre-trained / fine-tuned to generate moderated content by default?

  • Is an LLM like ChatGPT able to generate content that needs to be moderated? I am thinking about this from a training point of view: is it possible for an LLM to generate sensitive or illegal content even if the training set does not contain such content? – harism Jul 21 '23 at 21:52

2 Answers


TL;DR: No model is inherently censored. Large models are trained on very large datasets that cannot practically be scrubbed of all explicit content, and we would not want to remove it entirely anyway, since the network also learns useful abstractions from that content. Instead, commercially available products such as OpenAI's models add multiple layers of causal reasoning (rule-based conditioning) on top of the base model.

That's an interesting one. Let's first differentiate between the terms being used. A foundational model is a model that is pre-trained on a large set of training data and can be fine-tuned to perform downstream tasks, whereas a fine-tuned model is one that has been further trained to perform specific tasks such as summarization, topic extraction, NER, etc.

Now, let's talk about the GPT-3 family for a bit, which houses both the Davinci model and the ChatGPT (GPT-3.5) model. If you read about it a little, OpenAI lets you fine-tune Davinci but not ChatGPT. Davinci was fine-tuned to produce better generations/completions. ChatGPT, on the other hand, was fine-tuned to hold human-like chat conversations and consequently started outputting more uncensored content (because, of course, human text tends to get witty).

Then comes the concept of causal reasoning. You can understand it most easily as a conditional layer added on top of the language model's output: it forces the model to take certain actions in certain scenarios. One example might be switching to a different selection strategy (becoming greedy at temperature=0) if the next generated token would lead to a vulgar word or slang. Another might be instructions prepended to the model input telling it to return "I can't respond to this query as it pertains to copyrighted information." It is basically an elaborate if-else statement that works with language models.
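As a rough illustration, here is a minimal sketch of such a conditional layer in Python. The blocklist, the refusal messages, and the `generate` function are hypothetical placeholders for this answer, not OpenAI's actual implementation.

```python
# Minimal sketch of a rule-based "causal reasoning" layer over an LLM.
# `generate` stands in for any text-generation call; the blocklist and
# refusal messages are hypothetical examples, not a real product's rules.

BLOCKLIST = {"vulgarword1", "vulgarword2"}  # placeholder terms
REFUSAL = "I can't respond to this query."


def generate(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for the underlying language model call."""
    return "some model output"


def moderated_generate(prompt: str) -> str:
    # Pre-generation rule: refuse certain prompts outright.
    if "copyright" in prompt.lower():
        return "I can't respond to this query as it pertains to copyrighted information."

    output = generate(prompt)

    # Post-generation rule: if a blocked word appears, retry deterministically
    # (greedy decoding at temperature=0), then fall back to a refusal.
    if any(word in output.lower() for word in BLOCKLIST):
        output = generate(prompt, temperature=0.0)
        if any(word in output.lower() for word in BLOCKLIST):
            return REFUSAL
    return output


if __name__ == "__main__":
    print(moderated_generate("Tell me a joke"))
```

The point of the sketch is only to show the if-else character of the layer: the rules sit outside the model and decide what the user ultimately sees.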

Coming back to our example of OpenAI's models: while not completely uncensored, Davinci is more prone to producing uncensored content than ChatGPT. This is evident in OpenAI's own guidance, which asks users to add their own layer of causal reasoning to best meet their market requirements and which states that OpenAI will not be held liable for any explicit content generated.

UPDATE (21 July 2023) - According to this UC Berkeley study, this causal reasoning might actually be reducing the effectiveness of ChatGPT due to the increasing limitations on the content it can produce.

Chinmay
  • Are you saying that, because of the sheer amount of data, it is not feasible to prevent sensitive training data from entering the learning process? – harism Jul 21 '23 at 23:58
  • I think it's more about practicality than the amount of data. Of course, you can engineer methods to reduce bias (which is of course necessary), but it shouldn't come at the cost of weakened associations, which might in itself be a bias. Ultimately, sensitive material, from vulgar words to socially unacceptable ideas, should be included in the training data, and the model should later be fine-tuned to behave a certain way. We want a foundational model to be able to understand as much as it can. – Chinmay Jul 22 '23 at 00:23
  • It is a practicality for sure, and I can only imagine how much training data is available and used when it comes to larger LLMs. Still, I wonder whether social media companies like Facebook have means to recognise offensive language in the stream of text they keep receiving. And if that is doable, why not filter offensive language out of LLM training data in a similar way, just before it enters the training loop? – harism Jul 22 '23 at 00:38

The better formulation of this question is “How do LLMs get aligned with human values?”. Which humans and which values is an interesting societal question. Let’s say, for the purpose of this conversation, it’s a set of humans who don’t like to hear curse words.

There are 4 main forms of achieving alignment. The layer at which each is implemented is shown in [brackets].

  1. [model] Pre-train your model on the data set you want it to learn from. If you don’t want curse words, strip the training data of curse-word-laden content, and the model will be better aligned with your “no curse words” values.
  2. [model] Fine-tuning & reinforcement learning with human feedback (RLHF). If you have a base model that performs well but needs alignment, you can take a smaller set of data than what was used to pre-train and do a post-training step. This fine-tuning data is often very carefully curated to achieve specific outcomes or to match a specific data distribution. Fine-tuning on that data often “re-wires” the neurons (by updating their weights) to reduce the curse words. Reinforcement learning can further improve alignment through a more involved process of training a reward model against which the base model is then tuned.
  3. [application layer] At the prompt layer, few-shot learning and system prompts can be used to prevent the model from using certain words, by way of giving examples. This is also called “in-context learning” and can be used for many outcomes; alignment is one of them.
  4. [application layer] After generation, the final result can be passed through another model or an ensemble of models. These are typically classifier models looking for specific things. You could put any kind of “model” in this step, from something as naive as keyword filtering to more sophisticated detectors for hate speech, violence, or sexually explicit material. Best practice is to pass prompt inputs (from users) through the moderation system, but the outputs from the LLM can also be checked. A minimal sketch of this step is shown after this list.
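Here is a minimal sketch of the post-generation check from step 4, on both the user's prompt and the model's output. The blocked-word list and the `call_llm` function are illustrative assumptions, not any particular vendor's API; a production system would replace the keyword filter with trained classifiers.

```python
# Minimal sketch of a post-generation moderation pass (step 4 above).
# The blocked-word list and the `call_llm` placeholder are illustrative
# assumptions; a real system would use classifier models (hate speech,
# violence, sexually explicit material) instead of keyword matching.

from typing import List

BLOCKED_WORDS: List[str] = ["curseword1", "curseword2"]  # placeholder terms


def call_llm(prompt: str) -> str:
    """Placeholder for whatever API or local model produces the completion."""
    return "model completion text"


def is_flagged(text: str) -> bool:
    """Naive keyword filter; swap in classifier models for real use."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


def moderated_completion(user_prompt: str) -> str:
    # Best practice per step 4: check the user's input first...
    if is_flagged(user_prompt):
        return "Your request was blocked by the moderation filter."
    # ...then check the model's output before returning it.
    completion = call_llm(user_prompt)
    if is_flagged(completion):
        return "The generated response was withheld by the moderation filter."
    return completion


if __name__ == "__main__":
    print(moderated_completion("Tell me about alignment."))
```

Note that this layer is entirely outside the model: it changes nothing about the weights, only about what is allowed to reach the user.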
sghael