6

For some types of chatbots, such as a customized chatbot, might it be better to fine-tune the LM on the domain-specific data and the types of questions instead of relying on the prompt, since prompts have to be processed by the LM every time?

Another concern: how can the system add the correct prompt for each question, not to mention that there are many new questions for which no existing prompt applies? So I feel prompting is good, but not practical. Do you agree, or am I missing some key points?

Frank
  • 105
  • 5

3 Answers

7

This is still very much an open question, but from how the research looks now, it appears that a combination of prompting, output handling, and possibly fine-tuning will be necessary to achieve consistent behavior from LLMs. As an example of why, the AMA paper found that within prompt ensembles, accuracy varied by up to 10%. This has wide-reaching implications, as 10% is a HUGE level of variation in behavior, and by its very nature any system is going to be processing prompt ensembles -- your prompt, varied by the input.
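
As a concrete illustration of what a prompt ensemble looks like in code, here is a minimal sketch (not taken from the AMA paper itself): the same question is phrased several ways and the answers are aggregated by a simple majority vote, which is a simplification of AMA's weak-supervision-based aggregation. The ask_llm function and the template strings are hypothetical placeholders.

    from collections import Counter

    # Hypothetical prompt variants; in practice these would be tuned per task.
    PROMPT_VARIANTS = [
        "Answer yes or no: {question}",
        "Question: {question}\nAnswer with 'yes' or 'no':",
        "{question}\nIs the answer yes or no?",
    ]

    def ask_llm(prompt: str) -> str:
        """Hypothetical stand-in for whatever completion API you use."""
        raise NotImplementedError

    def ensemble_answer(question: str) -> str:
        # The same input phrased different ways can shift accuracy noticeably;
        # aggregating over the variants smooths out some of that variance.
        answers = [ask_llm(p.format(question=question)).strip().lower()
                   for p in PROMPT_VARIANTS]
        return Counter(answers).most_common(1)[0][0]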

Another issue is that it appears that failure modes of LLMs are... more robust than we'd like, meaning that there are some wrong answers that LLMs will reliably return and small variations in the prompt have no meaningful positive effect.

Again, though, this is still an open area. Completely different, interesting approaches like sequential Monte Carlo steering of LLM output may offer better results overall, and there are definitely countless undiscovered techniques.

Andy
  • 650
  • 4
  • 13
5

There are many factors to take into account to answer that question. For instance, you may not have the engineering resources to set up a hosted open-source LLM. In this case, you would be forced to go with a third-party API and prompting.

If you don't have any constraints, the answer can only be obtained by testing. Open-source LLMs are inferior to ChatGPT (GPT-3.5 and GPT-4), so they may or may not be enough for your task.
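
If you do end up testing, a minimal evaluation sketch might look like the following: run both setups over a held-out set of domain questions and compare. Here answer_with_prompting and answer_with_finetuned_model are hypothetical stand-ins for the two systems, and exact match is just the simplest possible metric.

    def exact_match(prediction: str, reference: str) -> float:
        return float(prediction.strip().lower() == reference.strip().lower())

    def evaluate(answer_fn, eval_set):
        # eval_set is a list of (question, reference_answer) pairs
        # drawn from your own domain.
        scores = [exact_match(answer_fn(q), ref) for q, ref in eval_set]
        return sum(scores) / len(scores)

    # eval_set = [("What is our refund window?", "30 days"), ...]
    # print("prompting:", evaluate(answer_with_prompting, eval_set))
    # print("fine-tuned:", evaluate(answer_with_finetuned_model, eval_set))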

The "art" of designing a good prompt for an LLM is called "prompt engineering". It takes a lot of trial and error. There are some useful tricks, like adding "Be concise" if you want short answers. You may even need to design more generic "prompt templates" to use in different cases by just setting some placeholders to specific values.

There are many reasons prompting is not practical:

  • Not foolproof.
  • Consumes tokens from the input, hence increasing the API call price and reducing the number of output tokens the LLM can generate (a rough cost sketch follows this list).
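
As a rough sketch of the cost point: every token of fixed prompt overhead is billed on every call and also eats into the context window. The price constant below is an illustrative placeholder, not a real quote; check your provider's current pricing.

    import tiktoken  # tokenizer used by OpenAI models; pip install tiktoken

    PRICE_PER_1K_INPUT_TOKENS = 0.0015  # illustrative placeholder only

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    template_overhead = ("You are a helpful support assistant. Be concise. "
                         "Use only the provided context to answer.")
    user_question = "Can I get my money back after two weeks?"

    n_overhead = len(enc.encode(template_overhead))
    n_total = n_overhead + len(enc.encode(user_question))
    print(f"prompt tokens this call: {n_total} ({n_overhead} fixed overhead)")
    # Cost of the fixed overhead alone, repeated over 1,000 calls:
    print(f"overhead cost per 1,000 calls: "
          f"${n_overhead * PRICE_PER_1K_INPUT_TOKENS:.2f}")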

There are also pros:

  • You do not need to retrain a model for it to perform some task.

Fine-tuning is very inconvenient (see the sketch after this list for a sense of the machinery involved):

  • Not foolproof either.
  • Not available (for now) for proprietary models (GPT-*, Claude).
  • Needs a lot of resources.
  • Needs more engineering skills.
  • May lead to catastrophic forgetting.
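
Here is a heavily simplified fine-tuning sketch using Hugging Face transformers. The model name, data file, and hyperparameters are illustrative only; a real run also needs GPUs, evaluation, and care to limit catastrophic forgetting (e.g. a low learning rate, mixing in general-domain data, or adapter/LoRA methods).

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "gpt2"  # stand-in for whichever open-source LLM you choose
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Illustrative: a text file of domain-specific Q&A pairs, one per line.
    dataset = load_dataset("text", data_files={"train": "domain_qa.txt"})
    tokenized = dataset["train"].map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned-model",
                               num_train_epochs=1,
                               per_device_train_batch_size=2,
                               learning_rate=5e-5),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                      mlm=False),
    )
    trainer.train()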

So, to answer your question: It depends.

noe
  • 26,410
  • 1
  • 46
  • 76
  • What I am missing about the use of prompts for in-context learning is how it works in a real-life chat, where the user can change the type of task, so the framework needs to add the appropriate type of prompt for each question. Is prompting currently used in ChatGPT? How is the prompt selected correctly? – Frank Jun 21 '23 at 08:54
  • In ChatGPT there is probably a single "system" prompt instructing the model to comply with the user request and establishing some safeguards against undesired outputs. Apart from that, there is no other "prompt", just chat. – noe Jun 21 '23 at 08:55
  • Another very silly question on using OpenAI models: when we call the model via the API, is the processing done by OpenAI's machines or by my machine? I assume it's OpenAI's machines. This is different from the way we use other libraries, where we install the code directly and run it on our own CPU/GPU. – Frank Jun 21 '23 at 08:59
  • The processing is done by OpenAI's machines. – noe Jun 21 '23 at 09:22
  • OpenAI removed the ability to fine-tune GPT-* models? What happened to all of their existing customers hitting custom endpoints? – Andy Jun 21 '23 at 16:50
  • 2
    Sorry, my statement was not precise, as I was referring specifically to GPT-3.5 and GPT-4 models. Amended statement: models gpt-3.5-turbo* (i.e. GPT-3.5) and gpt-4* (i.e. GPT-4) were never fine-tuneable; older GPT models like davinci are fine-tuneable. Here is the list of fine-tuneable models. – noe Jun 21 '23 at 17:02
  • 1
    I'd just add one bit that text-davinci-003 is considered GPT-3.5 (continued training on code-davinci-002), so perhaps >=3.5-turbo – Andy Jun 21 '23 at 17:20
  • @Andy yes, you're totally right. – noe Jun 21 '23 at 18:25
  • @noe circling back, no I was wrong! davinci-003 is not fine-tuneable, just davinci. – Andy Jun 29 '23 at 14:54
  • @Andy thanks for checking that! – noe Jun 29 '23 at 15:13
0

For some types of chatbots, such as a customized chatbot, might it be better to fine-tune the LM on the domain-specific data and the types of questions instead of relying on the prompt, since prompts have to be processed by the LM every time?

Prompt, because according to the LIMA: Less Is More for Alignment paper:

These results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

Franck Dernoncourt
  • 5,690
  • 10
  • 40
  • 76