Train a GPT-like model to “understand language” via supervised fine-tuning on a dataset of prompts and expected responses.
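This supervised step just minimizes next-token cross-entropy on the expected responses. A minimal sketch of that loss in plain Python (the probability tables are toy data, not a real model):

```python
import math

def cross_entropy(pred_probs, target_ids):
    """Average next-token cross-entropy over a target sequence.

    pred_probs[t] is the model's predicted distribution at step t
    (here a dict of token_id -> probability); target_ids[t] is the
    gold token the fine-tuning data says should come next.
    """
    losses = [-math.log(pred_probs[t][tok]) for t, tok in enumerate(target_ids)]
    return sum(losses) / len(losses)

# Toy example: over two steps the model assigns 0.9 and 0.8
# probability to the gold tokens, so the loss is small but nonzero.
probs = [{1: 0.9, 2: 0.1}, {1: 0.2, 2: 0.8}]
loss = cross_entropy(probs, [1, 2])
```

In a real implementation the distributions come from the transformer's softmax over the vocabulary; the loss itself is the same.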
Sample several outputs from the model for a given prompt and have human labelers rank them. Train another transformer-based model (the “reward model”) to predict this rank/“goodness” of an answer from the human-labeled rankings.
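The human rankings reduce to pairwise comparisons, and the reward model is trained so the preferred answer in each pair scores higher. A sketch of that pairwise objective (the scores here are illustrative scalars standing in for RM outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Loss for one human-labeled comparison: -log sigmoid(r_w - r_l).

    Minimizing it pushes the reward model to score the
    human-preferred answer above the rejected one.
    """
    return -math.log(sigmoid(score_preferred - score_rejected))

# Loss is small when the RM already agrees with the labeler...
low = pairwise_ranking_loss(2.0, -1.0)
# ...and large when its ranking is inverted.
high = pairwise_ranking_loss(-1.0, 2.0)
```

When the two scores are equal the loss is log 2, i.e. the RM is maximally unsure which answer the labeler preferred.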
Stack the reward model (RM) on top of GPT and use its score to build the loss function. This is then used to fine-tune GPT with reinforcement learning while keeping the RM frozen. And thus you get ChatGPT.
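In this step the policy is trained to maximize the frozen RM's score, minus a KL-style penalty that keeps it from drifting too far from the supervised model. A sketch of that per-sample reward (the log-probabilities and the `beta` coefficient are illustrative values, not tuned numbers):

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Reward optimized during RL fine-tuning.

    rm_score:    frozen reward model's score for the sampled answer
    logp_policy: log-prob of the answer under the model being tuned
    logp_ref:    log-prob under the frozen supervised baseline
    beta:        strength of the KL-style penalty (assumed value)
    """
    return rm_score - beta * (logp_policy - logp_ref)

# An answer the tuned model likes far more than the baseline does
# gets penalized, even if the RM score is the same.
close = rlhf_reward(1.0, logp_policy=-5.0, logp_ref=-5.1)
drifted = rlhf_reward(1.0, logp_policy=-2.0, logp_ref=-5.1)
```

The actual optimization in the linked paper is done with PPO; the penalty term is what stops the policy from gaming the reward model with degenerate outputs.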
“Learning to summarize from human feedback”: https://arxiv.org/pdf/2009.01325.pdf