There's a good article on HuggingFace about this subject, but tl;dr:
By interacting with the AI and fine-tuning it towards a particular use-case, you can build good models. If you can build good models, you can be rewarded by the Bittensor Network. Part of the reason it's so competitive here is because the information has long been understood by the people with long-standing, but they're busy. Anybody who works within this ecosystem can tell you two things:
- Things move quickly.
- Things break and require human input.
Reinforcement Learning from Human Feedback (RLHF) is a challenging concept that involves multiple-model training processes and different stages of deployment. In this summary, we break down the training process into three core steps:
- Pretraining a language model (LM)
- Gathering data and training a reward model
- Fine-tuning the LM with reinforcement learning
As a starting point, RLHF uses a language model that has already been pretrained with classical pretraining objectives. Examples of such models include OpenAI's GPT-3, Anthropic's transformer models, and DeepMind's 280 billion parameter model Gopher. This initial model can be fine-tuned on additional text or conditions, but it's not strictly necessary.
There is no clear answer on the "best" model for the starting point of RLHF, as the design space of options in RLHF training is not thoroughly explored.
To train a reward model, you need to gather or generate data that represents human preferences. The process typically involves the following steps:
Collecting demonstrations: Obtain samples of human-generated responses to various input prompts. These demonstrations provide a baseline for what the AI system should be able to achieve.
Creating comparison data: Generate multiple possible responses for a given input using the pretrained language model. Have human raters evaluate and rank these responses based on their quality, relevance, and adherence to other desired criteria.
Aggregating rankings: Aggregate the rankings provided by human raters to create a more robust and reliable dataset for training the reward model.
Training the reward model: Train a reward model using the collected demonstrations and comparison data. This model will be used to evaluate the quality of the AI system's responses during the fine-tuning stage.
Here's a more detailed breakdown of each step:
- Create a set of input prompts that represent a wide range of tasks and situations.
- Have human experts provide responses to these prompts, following guidelines that emphasize the desired qualities in the AI system's responses.
- For each input prompt, generate multiple potential responses using the pretrained language model.
- Present these responses to human raters, who will rank them based on quality, relevance, and other criteria specified in the guidelines.
- Collect rankings from multiple human raters to create a more reliable dataset.
- Apply statistical techniques, such as the Bradley-Terry model, to aggregate individual rankings into a single, consensus-based ranking.
- Use the collected demonstrations and aggregated rankings to train a reward model that can evaluate the quality of the AI system's responses.
- The reward model will be used during the reinforcement learning stage to fine-tune the AI system's responses based on human preferences.
By following these steps, you can gather and generate data that accurately represents human preferences and use it to train a reward model for RLHF.
After gathering data and training the reward model, the next step is to fine-tune the pretrained language model using reinforcement learning. This process involves:
- Sampling responses: Generate multiple candidate responses for a given input prompt using the pretrained language model.
- Evaluating responses: Use the trained reward model to rank and score the candidate responses based on their quality and adherence to human preferences.
- Updating the language model: Apply reinforcement learning algorithms, such as Proximal Policy Optimization (PPO), to update the language model's parameters based on the reward model's evaluations.
Here's a more detailed breakdown of each step:
- For a given input prompt, use the pretrained language model to generate a set of candidate responses.
- Apply techniques such as temperature scaling, top-k sampling, or nucleus sampling to control the diversity and quality of the generated responses.
- Use the trained reward model to evaluate and score each candidate response based on its quality, relevance, and alignment with human preferences.
- These scores will serve as the basis for updating the language model's parameters through reinforcement learning.
- Apply reinforcement learning algorithms, such as Proximal Policy Optimization (PPO), to update the language model's parameters based on the reward model's evaluations.
- This process involves optimizing the language model's policy to maximize the expected reward, which in this case is the score assigned by the reward model.
By following these steps, you can fine-tune the pretrained language model to better align with human preferences and improve its overall performance. This process can be iterated multiple times to further refine the AI system's behavior.