
Aligning Language Models with User Intent: The Power of Fine-Tuning with Human Feedback

SC Hughes
2 min read · Jan 15, 2023

Large language models (LMs) have become an essential tool in natural language processing (NLP) for tasks such as text generation, language translation, and question answering. However, making LMs bigger does not by itself make them better at understanding and following a user’s intent. In fact, large LMs can generate outputs that are untruthful, toxic, or simply unhelpful. In other words, these models are not aligned with their users, and misaligned behavior can cause real harm once the models are widely deployed. In this blog, we will explore an avenue for aligning LMs with user intent on a wide range of tasks: fine-tuning with human feedback.

The process of fine-tuning LMs with human feedback starts with a set of labeler-written prompts and prompts submitted through the OpenAI API, which are used to collect a dataset of labeler demonstrations of the desired model behavior. This dataset is used to fine-tune GPT-3 with supervised learning. Next, we collect a dataset of human rankings of model outputs, train a reward model on those rankings, and use it as the reward signal to further fine-tune the supervised model with reinforcement learning from human feedback (RLHF). The resulting models are called InstructGPT.
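To make the ranking step concrete, here is a minimal PyTorch sketch of the pairwise loss used to train a reward model from labeler comparisons. The RewardModel class and the fixed-size encodings are hypothetical stand-ins for a real transformer that scores a (prompt, response) pair; only the loss form, -log σ(r(x, y_chosen) - r(x, y_rejected)), follows the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward model: any network mapping a (prompt, response)
# encoding to a scalar score. Stubbed here as a single linear layer
# over a fixed-size embedding, purely for illustration.
class RewardModel(torch.nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = torch.nn.Linear(embed_dim, 1)

    def forward(self, encoding: torch.Tensor) -> torch.Tensor:
        # encoding: (batch, embed_dim) -> one scalar reward per example
        return self.score(encoding).squeeze(-1)

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # For each labeler comparison, push the reward of the preferred
    # ("chosen") response above the dispreferred ("rejected") one:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random encodings stand in for real (prompt, response)
# representations produced by a language model backbone.
model = RewardModel()
chosen = torch.randn(4, 768)    # encodings of preferred responses
rejected = torch.randn(4, 768)  # encodings of rejected responses
loss = pairwise_ranking_loss(model(chosen), model(rejected))
loss.backward()
```

Once trained, the reward model provides the scalar reward signal that the RL step optimizes the supervised model against.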

In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B parameter GPT-3, even though InstructGPT has over 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation, with minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning LMs with human intent.

In conclusion, fine-tuning LMs with human feedback is a powerful approach for aligning them with user intent. By collecting and incorporating human feedback, we can significantly improve how well LMs follow instructions across a wide range of tasks while reducing the risk of untruthful, toxic, or unhelpful outputs. This is an important step toward LMs that can be trusted in everyday use.
