Desired reward behavior: differentiate between no reward and zero reward #120

Closed
katja-hofmann opened this issue Jun 28, 2016 · 1 comment · Fixed by #137

@katja-hofmann
Member

Current behavior: the reward is reported as zero even if no rewards were produced:

"{"reward":0.0}"

wanted:

"{"reward":[]} # no rewards were produced
"{"reward":[0, 3.4, 0, 5]} # rewards in the order in which they were generated

or:

{1: [], 2: [3], 3: [0, 0, 100]}

Here, in the XML mission spec, the user would define an id for each reward handler. So for example, 1 could be RewardForTouchingBlockType and 2 could be some other reward handler.
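
For illustration, a sketch of how such ids might look in the mission spec. The id attribute is hypothetical (it is not part of the current schema); the reward handler elements themselves are Malmo's existing ones:

<AgentHandlers>
  <!-- id="1": hypothetical per-handler key proposed above -->
  <RewardForTouchingBlockType id="1">
    <Block type="lapis_block" reward="100" behaviour="onceOnly"/>
  </RewardForTouchingBlockType>
  <!-- id="2": some other reward handler -->
  <RewardForSendingCommand id="2" reward="-1"/>
</AgentHandlers>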

@timhutton timhutton added this to the Bison milestone Jun 30, 2016
@timhutton timhutton added the P2 label Jun 30, 2016
@timhutton timhutton self-assigned this Jul 1, 2016
@timhutton
Contributor

timhutton commented Jul 1, 2016

Suggestion:

Each RewardProducer has a new optional int attribute dimension, default=0.

TimestampedFloat is renamed to TimestampedFloats and contains a map of dimension:float.

A reward message is sent only if one of the rewards has been triggered.

The existing parameter reward.value returns the reward for dimension 0, if there is one.

This way, most of the sample code and XML files are unchanged. If the user wants multi-dimensional rewards (which several people have asked for) then we support it. If the user wants to separate rewards by their RewardProducer then we support that too. Discrete agents with single-dimension rewards become simpler, since they only need to check whether a reward has been received, not whether it is non-zero. It also allows a reward of zero for taking a step, which is currently problematic in tabular_q_learning.py.
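
For illustration, a minimal sketch (using Malmo's Python API) of the agent loop this would enable. world_state.rewards and reward.value follow the proposal and the existing samples; the commented per-dimension accessor is an assumption, not confirmed by this issue:

import time
import MalmoPython

agent_host = MalmoPython.AgentHost()
# ... mission setup and startMission() elided ...

total_reward = 0.0
world_state = agent_host.getWorldState()
while world_state.is_mission_running:
    time.sleep(0.1)
    world_state = agent_host.getWorldState()
    # Under this proposal a message arrives only when a reward was
    # actually triggered, so an empty rewards list means "no reward",
    # now distinct from "reward of zero".
    for reward in world_state.rewards:
        total_reward += reward.value  # dimension 0, as before
        # Hypothetical accessor for other dimensions (an assumption):
        # touch_reward = reward.getValue(1)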
