
Support decision tasks by providing reward wrapper for Gym-like RL environment #7347


Open · wants to merge 2 commits into main

Conversation


@MA-Wenhui MA-Wenhui commented Mar 17, 2025

What does this PR do?

Supports decision tasks by providing an environment reward wrapper for Gymnasium-like RL environments.

When training a PPO model for a decision task, instead of using a reward model, one can use one's own environment code to generate rewards by specifying the following parameters in the YAML config:

reward_model: your_env_package.reward_model_wrapper.SampleRewardModel
reward_model_type: env
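One way a trainer could resolve such a dotted path into a class is via `importlib`. This is a hypothetical sketch of that mechanism, not the PR's actual loading code; `load_reward_model` is an invented helper name.

```python
import importlib

def load_reward_model(dotted_path):
    """Resolve a dotted 'package.module.ClassName' path to the class object.

    Hypothetical helper showing how a trainer could instantiate the
    user-supplied reward wrapper named in the YAML config.
    """
    module_path, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Demonstration with a standard-library class standing in for the
# user's wrapper (your_env_package is not importable here):
cls = load_reward_model("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```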

The SampleRewardModel lives in the user's environment package; it:

  1. parses the query and response
  2. executes the action and interacts with the environment
  3. returns the reward

An example SampleRewardModel:

class SampleRewardModel:
    def __init__(self):
        self.env = Environment()  # the user's own environment

    def __call__(self, query_texts, response_texts):
        # execute environment actions and return one reward per response
        return self.get_rewards(query_texts, response_texts)
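A runnable toy version of this contract, to make the call shape concrete: the wrapper takes batches of query and response texts and returns one scalar reward per response. `ToyEnvironment` and the keyword-based reward below are invented for illustration; a real environment would parse the response into actions and step its own dynamics.

```python
class ToyEnvironment:
    """Invented stand-in for the user's environment."""

    def step(self, action_text):
        # Reward 1.0 when the "action" mentions "pick", else 0.0.
        return 1.0 if "pick" in action_text else 0.0

class SampleRewardModel:
    def __init__(self):
        self.env = ToyEnvironment()

    def __call__(self, query_texts, response_texts):
        return self.get_rewards(query_texts, response_texts)

    def get_rewards(self, query_texts, response_texts):
        # One reward per (query, response) pair in the batch.
        return [self.env.step(r) for r in response_texts]

rewards = SampleRewardModel()(["q1", "q2"], ["pick up the key", "do nothing"])
print(rewards)  # [1.0, 0.0]
```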

Before submitting

  • [ yes ] Did you read the contributor guideline?
  • [ yes ] Did you write any new necessary tests?
    A new PPO example YAML is included.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 26, 2025