Reinforcement learning from human feedback

In machine learning, reinforcement learning from human feedback (RLHF) or reinforcement learning from human preferences is a technique that trains a "reward model" directly from human feedback and uses that model as a reward function to optimize an agent's policy with reinforcement learning (RL), through an optimization algorithm such as Proximal Policy Optimization.[1][2] The reward model is trained in advance of the policy being optimized, to predict whether a given output is good (high reward) or bad (low reward). RLHF can improve the robustness and exploration of RL agents, especially when the reward function is sparse or noisy.[3]
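The reward model is typically fit to pairwise comparisons: given two candidate outputs for the same prompt, it should assign a higher score to the one humans preferred. The following is a minimal sketch of that idea in PyTorch; the RewardModel class, the feature dimension, and the random tensors standing in for encoded (prompt, response) pairs are illustrative assumptions, not the implementation of any cited system.

```python
# Minimal sketch: training a reward model from pairwise human preferences.
# Assumes PyTorch; random feature vectors stand in for encoded text.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        # A single linear head mapping features to a scalar reward.
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)

def preference_loss(model, chosen_feats, rejected_feats):
    """Pairwise (Bradley-Terry style) loss: the human-preferred response
    should receive a higher reward than the rejected one."""
    r_chosen = model(chosen_feats)
    r_rejected = model(rejected_feats)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One illustrative optimization step on a batch of 8 comparisons.
model = RewardModel(feature_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 128), torch.randn(8, 128)
loss = preference_loss(model, chosen, rejected)
loss.backward()
optimizer.step()
```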

Human feedback is collected by asking humans to rank instances of the agent's behavior.[4][5][6] These rankings can then be used to score outputs, for example with the Elo rating system.[2]
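As a concrete illustration, pairwise rankings can be converted into numeric scores with standard Elo updates. The function below is a generic Elo update written for this example, not code from any cited system; the starting rating of 1000 and the K-factor of 32 are conventional defaults chosen for illustration.

```python
# Illustrative Elo-style scoring of two outputs from pairwise human rankings.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    # Expected score of A under the Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two outputs start at the same rating; a human prefers output A twice.
ratings = {"output_a": 1000.0, "output_b": 1000.0}
for _ in range(2):
    ratings["output_a"], ratings["output_b"] = elo_update(
        ratings["output_a"], ratings["output_b"], a_wins=True
    )
print(ratings)  # output_a's rating rises above output_b's
```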

RLHF has been applied to various domains of natural language processing, such as conversational agents, text summarization, and natural language understanding.[7] Ordinary reinforcement learning, where agents learn from their own actions based on a "reward function", is difficult to apply to natural language processing tasks because the rewards are often not easy to define or measure, especially when dealing with complex tasks that involve human values or preferences. RLHF can enable language models to provide answers that align with these complex values, to generate more verbose responses, and to reject questions that are either inappropriate or outside the knowledge space of the model.[8] Some examples of RLHF-trained language models are OpenAI's ChatGPT and its predecessor InstructGPT,[5][9] as well as DeepMind's Sparrow.[10]
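When fine-tuning a language model, the learned reward is commonly combined with a penalty on the Kullback-Leibler divergence between the fine-tuned policy and the original pretrained model, as in the fine-tuning approach of Ziegler et al.,[1] which discourages the policy from drifting into degenerate text that merely exploits the reward model. The sketch below illustrates that combined objective; the function name, the coefficient beta, and the dummy tensors are assumptions made for the example.

```python
# Sketch of a KL-penalized RLHF reward: the reward model's score minus a
# penalty for diverging from the pretrained (reference) model.
import torch

def rlhf_reward(reward_model_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                reference_logprobs: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    # Per-sequence KL estimate from summed per-token log-probabilities.
    kl = (policy_logprobs - reference_logprobs).sum(dim=-1)
    return reward_model_score - beta * kl

# Dummy values: a batch of 4 responses with 16 tokens each.
scores = torch.randn(4)
policy_lp = torch.randn(4, 16)
reference_lp = torch.randn(4, 16)
print(rlhf_reward(scores, policy_lp, reference_lp))
```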

RLHF has also been applied to other areas, such as the development of video game bots. For example, OpenAI and DeepMind trained agents to play Atari games based on human preferences.[11][12] The agents achieved strong performance in many of the environments tested, often surpassing human performance.[13]

Challenges and limitations

One major challenge of RLHF is the scalability and cost of human feedback, which can be slow and expensive to collect compared to unsupervised learning. The quality and consistency of human feedback can also vary with the task, the interface, and the individual preferences of the annotators. Even when human feedback is feasible, RLHF models may still exhibit undesirable behaviors that are not captured by the feedback, or exploit loopholes in the reward model, highlighting the broader challenges of alignment and robustness.[14]

References

  1. Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey (2019). "Fine-Tuning Language Models from Human Preferences". arXiv:1909.08593.
  2. Lambert, Nathan; Castricato, Louis; von Werra, Leandro; Havrilla, Alex. "Illustrating Reinforcement Learning from Human Feedback (RLHF)". huggingface.co. Retrieved 4 March 2023.
  3. MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
    • Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
    • Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862.
  4. Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Gray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155.
  5. Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
  6. Gupta, Abhishek (5 February 2023). "Getting stakeholder engagement right in responsible AI". VentureBeat. Retrieved 4 March 2023.
  7. Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155.
    • Stiennon, Nisan; Ouyang, Long; Wu, Jeffrey; Ziegler, Daniel; Lowe, Ryan; Voss, Chelsea; Radford, Alec; Amodei, Dario; Christiano, Paul F. (2020). "Learning to summarize with human feedback". Advances in Neural Information Processing Systems. 33.
  8. Wiggers, Kyle (24 February 2023). "Can AI really be protected from text-based attacks?". TechCrunch. Retrieved 4 March 2023.
  9. Farseev, Aleks. "Council Post: Is Bigger Better? Why The ChatGPT Vs. GPT-3 Vs. GPT-4 'Battle' Is Just A Family Chat". Forbes. Retrieved 4 March 2023.
  10. Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". arXiv:2209.14375.
  11. "Learning from human preferences". openai.com. Retrieved 4 March 2023.
  12. "Learning through human feedback". www.deepmind.com. Retrieved 4 March 2023.
  13. Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems. Curran Associates, Inc. 30. Retrieved 4 March 2023.
  14. Christiano, Paul. "Thoughts on the impact of RLHF research". Retrieved 4 March 2023.