Replies: 3 comments 2 replies
-
Same issue applies to Chapter 5.
-
The reward is intentionally set to be the profit/loss. Why do you think this is an issue (other than not being strictly Markovian)?
-
I think there are two options for designing the reward: your current setting (cumulative profit/loss since the start of the episode), or the per-step change in account value (self.account_value - previous_account_value). A minimal sketch contrasting the two is below.
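To make the two options concrete, here is a minimal sketch, not the book's actual code; the RewardOptions class, initial_balance, and account_value are stand-ins for whatever CryptoTradingEnv tracks internally:

```python
# Minimal sketch of the two reward designs, assuming the env tracks the
# initial balance and the account value from the previous step.
class RewardOptions:
    def __init__(self, initial_balance=10_000.0):
        self.initial_balance = initial_balance
        self.account_value = initial_balance

    def step_reward(self, new_account_value):
        # Option 1 (current setting): cumulative profit/loss since episode start.
        cumulative = new_account_value - self.initial_balance
        # Option 2: change in account value caused by this step alone.
        incremental = new_account_value - self.account_value
        self.account_value = new_account_value
        return cumulative, incremental


if __name__ == "__main__":
    r = RewardOptions()
    for value in (10_100.0, 10_050.0, 10_200.0):
        print(r.step_reward(value))  # (100, 100), (50, -50), (200, 150)
```

With option 1 the agent is rewarded at every step for profit it may have earned many steps ago, while option 2 credits each step only with the value change it caused.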
-
Hi, @praveen-palanisamy
In CryptoTradingEnv, the reward is computed based on ALL action steps, instead of the current (single) action:
Tensorflow-2-Reinforcement-Learning-Cookbook/Chapter04/crypto_trading_env.py, line 92 in 31f8376
Do you think computing the reward as self.account_value - previous_account_value might be more appropriate?
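A rough sketch of how the proposed delta reward could be wired into a Gym-style step(); everything here (DeltaRewardTradingEnv, _execute_trade, the all-in/all-out trade logic) is a hypothetical stand-in rather than the actual crypto_trading_env.py implementation:

```python
# Sketch of a per-step (delta) reward in a Gym-style trading env.
# This is an illustrative toy env, not the book's CryptoTradingEnv.
import numpy as np


class DeltaRewardTradingEnv:
    def __init__(self, prices, initial_balance=10_000.0):
        self.prices = np.asarray(prices, dtype=np.float64)
        self.initial_balance = initial_balance
        self.reset()

    def reset(self):
        self.t = 0
        self.cash = self.initial_balance
        self.units = 0.0
        self.account_value = self.initial_balance
        return self._observation()

    def _observation(self):
        return np.array([self.prices[self.t], self.cash, self.units])

    def _execute_trade(self, action):
        # Hypothetical all-in/all-out trade logic: 0 = hold, 1 = buy, 2 = sell.
        price = self.prices[self.t]
        if action == 1 and self.cash > 0:
            self.units += self.cash / price
            self.cash = 0.0
        elif action == 2 and self.units > 0:
            self.cash += self.units * price
            self.units = 0.0

    def step(self, action):
        previous_account_value = self.account_value
        self._execute_trade(action)
        self.t += 1
        self.account_value = self.cash + self.units * self.prices[self.t]
        # Per-step reward: the change in account value caused by this step.
        reward = self.account_value - previous_account_value
        done = self.t >= len(self.prices) - 1
        return self._observation(), reward, done, {}


if __name__ == "__main__":
    env = DeltaRewardTradingEnv(prices=[100.0, 105.0, 102.0, 110.0])
    env.reset()
    print(env.step(1))  # buy at 100 -> reward reflects the move to 105 (+500)
    print(env.step(0))  # hold       -> reward reflects the move to 102 (-300)
```

One side effect worth noting: with the delta reward, the per-episode return (sum of rewards) telescopes to the total profit/loss, so the episode-level objective stays the same while each step's credit assignment becomes local.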