Train a GPT-like model to “understand language” via supervised fine-tuning on a dataset of prompts and expected responses.
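This supervised step just minimizes next-token cross-entropy on the expected responses. A minimal sketch of that loss in plain Python (the probability tables are toy data, not a real model):

```python
import math

def cross_entropy(pred_probs, target_ids):
    """Average next-token cross-entropy over a target sequence.

    pred_probs[t] is the model's predicted distribution at step t
    (here a dict of token_id -> probability); target_ids[t] is the
    gold token the fine-tuning data says should come next.
    """
    losses = [-math.log(pred_probs[t][tok]) for t, tok in enumerate(target_ids)]
    return sum(losses) / len(losses)

# Toy example: over two steps the model assigns 0.9 and 0.8
# probability to the gold tokens, so the loss is small but nonzero.
probs = [{1: 0.9, 2: 0.1}, {1: 0.2, 2: 0.8}]
loss = cross_entropy(probs, [1, 2])
```

In a real implementation the distributions come from the transformer's softmax over the vocabulary; the loss itself is the same.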
Sample several outputs from the model for a given prompt and have human labelers rank them. Train another transformer-based model (the “reward model”) to predict this rank/“goodness” of an answer from the human-labeled rankings.
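The human rankings reduce to pairwise comparisons, and the reward model is trained so the preferred answer in each pair scores higher. A sketch of that pairwise objective (the scores here are illustrative scalars standing in for RM outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Loss for one human-labeled comparison: -log sigmoid(r_w - r_l).

    Minimizing it pushes the reward model to score the
    human-preferred answer above the rejected one.
    """
    return -math.log(sigmoid(score_preferred - score_rejected))

# Loss is small when the RM already agrees with the labeler...
low = pairwise_ranking_loss(2.0, -1.0)
# ...and large when its ranking is inverted.
high = pairwise_ranking_loss(-1.0, 2.0)
```

When the two scores are equal the loss is log 2, i.e. the RM is maximally unsure which answer the labeler preferred.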
Stack the reward model (RM) on top of GPT and use its score to build the loss function. This is then used to fine-tune GPT with reinforcement learning while keeping the RM frozen. And thus you get ChatGPT.
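In this step the policy is trained to maximize the frozen RM's score, minus a KL-style penalty that keeps it from drifting too far from the supervised model. A sketch of that per-sample reward (the log-probabilities and the `beta` coefficient are illustrative values, not tuned numbers):

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Reward optimized during RL fine-tuning.

    rm_score:    frozen reward model's score for the sampled answer
    logp_policy: log-prob of the answer under the model being tuned
    logp_ref:    log-prob under the frozen supervised baseline
    beta:        strength of the KL-style penalty (assumed value)
    """
    return rm_score - beta * (logp_policy - logp_ref)

# An answer the tuned model likes far more than the baseline does
# gets penalized, even if the RM score is the same.
close = rlhf_reward(1.0, logp_policy=-5.0, logp_ref=-5.1)
drifted = rlhf_reward(1.0, logp_policy=-2.0, logp_ref=-5.1)
```

The actual optimization in the linked paper is done with PPO; the penalty term is what stops the policy from gaming the reward model with degenerate outputs.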
“Learning to summarize from human feedback”: https://arxiv.org/pdf/2009.01325.pdf