
Support decision tasks by providing reward wrapper for Gym-like RL environment #7347


Open · wants to merge 2 commits into main

Conversation


@MA-Wenhui MA-Wenhui commented Mar 17, 2025

What does this PR do?

Supports decision tasks by providing an environment reward wrapper for Gymnasium-like RL environments.

When training a PPO model for a decision task, instead of using a reward model, one can use one's own environment code to generate rewards by specifying the following parameters in the YAML config:

reward_model: your_env_package.reward_model_wrapper.SampleRewardModel
reward_model_type: env
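One way a trainer could resolve such a dotted path into a class is via `importlib`. This is a hypothetical sketch of that mechanism, not the PR's actual loading code; `load_reward_model` is an invented helper name.

```python
import importlib

def load_reward_model(dotted_path):
    """Resolve a dotted 'package.module.ClassName' path to the class object.

    Hypothetical helper showing how a trainer could instantiate the
    user-supplied reward wrapper named in the YAML config.
    """
    module_path, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Demonstration with a standard-library class standing in for the
# user's wrapper (your_env_package is not importable here):
cls = load_reward_model("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```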

The SampleRewardModel lives in the user's environment package; it:

  1. parses the query and response
  2. executes the action and interacts with the environment
  3. returns the reward

An example SampleRewardModel:

class SampleRewardModel:
    def __init__(self):
        self.env = Environment()  # the user's own environment

    def __call__(self, query_texts, response_texts):
        # execute environment actions and return one reward per response
        return self.get_rewards(query_texts, response_texts)
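A runnable toy version of this contract, to make the call shape concrete: the wrapper takes batches of query and response texts and returns one scalar reward per response. `ToyEnvironment` and the keyword-based reward below are invented for illustration; a real environment would parse the response into actions and step its own dynamics.

```python
class ToyEnvironment:
    """Invented stand-in for the user's environment."""

    def step(self, action_text):
        # Reward 1.0 when the "action" mentions "pick", else 0.0.
        return 1.0 if "pick" in action_text else 0.0

class SampleRewardModel:
    def __init__(self):
        self.env = ToyEnvironment()

    def __call__(self, query_texts, response_texts):
        return self.get_rewards(query_texts, response_texts)

    def get_rewards(self, query_texts, response_texts):
        # One reward per (query, response) pair in the batch.
        return [self.env.step(r) for r in response_texts]

rewards = SampleRewardModel()(["q1", "q2"], ["pick up the key", "do nothing"])
print(rewards)  # [1.0, 0.0]
```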

Before submitting

  • [ yes ] Did you read the contributor guideline?
  • [ yes ] Did you write any new necessary tests?
    A new PPO example YAML is included.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 26, 2025