Desired reward behavior: differentiate between no reward and zero reward #120

Closed
katja-hofmann opened this issue Jun 28, 2016 · 1 comment · Fixed by #137

@katja-hofmann
Member

Current behavior: the reward is reported as zero even if no rewards were produced:

"{"reward":0.0}"

wanted:

"{"reward":[]} # no rewards were produced
"{"reward":[0, 3.4, 0, 5]} # rewards in the order in which they were generated

or:

{1: [], 2: [3], 3: [0, 0, 100]}

Here, in the XML mission spec, the user would define an id for each reward handler. So for example, 1 could be RewardForTouchingBlockType and 2 could be some other reward handler.
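
For illustration, a sketch of how such ids might look in the mission spec. The id attribute is hypothetical (it is not part of the current schema); the reward handler elements themselves are Malmo's existing ones:

<AgentHandlers>
  <!-- id="1": hypothetical per-handler key proposed above -->
  <RewardForTouchingBlockType id="1">
    <Block type="lapis_block" reward="100" behaviour="onceOnly"/>
  </RewardForTouchingBlockType>
  <!-- id="2": some other reward handler -->
  <RewardForSendingCommand id="2" reward="-1"/>
</AgentHandlers>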

@timhutton timhutton added this to the Bison milestone Jun 30, 2016
@timhutton timhutton added the P2 label Jun 30, 2016
@timhutton timhutton self-assigned this Jul 1, 2016
@timhutton
Contributor

timhutton commented Jul 1, 2016

Suggestion:

Each RewardProducer has a new optional int attribute dimension, default=0.

TimestampedFloat is renamed to TimestampedFloats and contains a map of dimension:float.

A reward message is sent only if one of the rewards has been triggered.

The existing parameter reward.value returns the reward for dimension 0, if there is one.

This way, most of the sample code and XML files are unchanged. If the user wants multi-dimensional rewards (which several people have asked for) then we support it. If the user wants to separate rewards by their RewardProducer then we support that too. Discrete agents with single-dimension rewards become simpler, since they only need to check whether a reward has been received, not whether it is non-zero. It also allows a reward of zero for taking a step, which is currently problematic in tabular_q_learning.py.
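
For illustration, a minimal sketch (using Malmo's Python API) of the agent loop this would enable. world_state.rewards and reward.value follow the proposal and the existing samples; the commented per-dimension accessor is an assumption, not confirmed by this issue:

import time
import MalmoPython

agent_host = MalmoPython.AgentHost()
# ... mission setup and startMission() elided ...

total_reward = 0.0
world_state = agent_host.getWorldState()
while world_state.is_mission_running:
    time.sleep(0.1)
    world_state = agent_host.getWorldState()
    # Under this proposal a message arrives only when a reward was
    # actually triggered, so an empty rewards list means "no reward",
    # now distinct from "reward of zero".
    for reward in world_state.rewards:
        total_reward += reward.value  # dimension 0, as before
        # Hypothetical accessor for other dimensions (an assumption):
        # touch_reward = reward.getValue(1)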
