Added optional double down action to blackjack #1529
Conversation
Extends the game from just hit/stick to the 3rd action of doubling down, which means doubling the bet and getting only 1 additional card
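The mechanic this PR adds can be sketched independently of the Gym codebase. The helpers below mirror the ones in Gym's `blackjack.py` (`draw_card`, `sum_hand`, `score`, `cmp`); `double_down_step` is a hypothetical illustration of how the third action resolves (one card, then the dealer plays out, with the usual ±1 outcome doubled), not the PR's actual diff.

```python
import random

# Infinite deck, as in Gym's Blackjack-v0 (ace = 1, face cards = 10)
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

def draw_card():
    return random.choice(DECK)

def usable_ace(hand):
    # An ace that can count as 11 without busting
    return 1 in hand and sum(hand) + 10 <= 21

def sum_hand(hand):
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)

def is_bust(hand):
    return sum_hand(hand) > 21

def score(hand):
    return 0 if is_bust(hand) else sum_hand(hand)

def cmp(a, b):
    return float(a > b) - float(a < b)

def double_down_step(player, dealer):
    """Hypothetical resolution of the double-down action: the player
    receives exactly one more card, then the dealer draws to 17, and
    the normal +/-1 outcome is doubled (0 stays a push)."""
    player.append(draw_card())
    if is_bust(player):
        return -2.0
    while sum_hand(dealer) < 17:
        dealer.append(draw_card())
    return 2.0 * cmp(score(player), score(dealer))
```

This also makes the reward invariant discussed below easy to check: every double-down outcome lands in {-2.0, 0.0, 2.0}.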
Overall, looks good. Requesting a couple of minor changes (in the comments), and also maybe a simple test (say, initialize with double down, take action 2, and ensure the reward is in the set {-2, 0, 2}). Thanks!
1. Moved comments to the docstring and updated the docstring
2. Removed `if self.double_down`, since it is not needed with the assert
3. Made rewards always floats
4. Made a natural blackjack result in an instant win when `natural=True` (the standard casino rule), instead of allowing the dealer to reach 21 for a draw. It is now only a draw if the dealer also has a natural blackjack.
5. Test file, with output below, showing all rewards are in {-2, 0, 2}:

```python
import gym
from gym.envs.registration import register

ENV_NAME = "BlackjackMax-v0"
DOUBLE = 2
HIT = 1
STICK = 0

register(id='BlackjackMax-v0', entry_point='blackjack1:BlackjackEnv1')


class Player:
    def __init__(self):
        self.env = gym.make(ENV_NAME, natural=True, double_down=True)
        self.state = self.env.reset()

    def play_action(self, blackjack_state):
        return DOUBLE


if __name__ == "__main__":
    agent = Player()
    new_state = agent.state
    for i in range(100):
        while True:
            print('state', new_state)
            action = agent.play_action(new_state)
            new_state, reward, done, _ = agent.env.step(action)
            print('new_state', new_state)
            print('reward', reward)
            if done:
                new_state = agent.env.reset()
                print('===New hand===')
                break
```

Output:

```
state (13, 10, False)
new_state (18, 10, False)
reward 2.0
===New hand===
state (13, 8, False)
new_state (15, 8, False)
reward -2.0
===New hand===
state (13, 7, False)
new_state (23, 7, False)
reward -2.0
===New hand===
state (12, 10, False)
new_state (22, 10, False)
reward -2.0
===New hand===
state (13, 4, False)
new_state (14, 4, False)
reward 2.0
===New hand===
state (17, 1, True)
new_state (21, 1, True)
reward 2.0
===New hand===
state (13, 5, False)
new_state (20, 5, False)
reward 2.0
===New hand===
state (6, 9, False)
new_state (15, 9, False)
reward -2.0
===New hand===
[output truncated: the remaining ~92 hands all have reward in {-2.0, 0.0, 2.0}, or 1.5 on a natural blackjack, e.g. state (21, 1, True) -> reward 1.5]
```
@pzhokhov I made those changes, plus a couple of other small ones, in the new commit. I included a test of 100 double_down actions showing the reward is always in {-2, 0, 2} (except for a natural blackjack when natural=True, which gives an immediate reward of 1.5).
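Change 4 above (instant win on a natural) can be sketched as follows. `is_natural` and `settle_naturals` are hypothetical names for illustration, not the PR's actual code; the case of a dealer-only natural is left to the normal play-out for brevity.

```python
def is_natural(hand):
    """A natural blackjack: exactly an ace and a ten-value card."""
    return sorted(hand) == [1, 10]

def settle_naturals(player, dealer, natural=True):
    """Settle a player natural before any hit/stick/double decision.
    Returns the immediate reward, or None if play continues.
    Standard casino rule: a player natural wins at once (paying 3:2
    when the `natural` bonus flag is set) unless the dealer also has
    a natural, in which case the hand is a push."""
    if not is_natural(player):
        return None  # no player natural: play continues normally
    if is_natural(dealer):
        return 0.0   # both naturals: draw
    return 1.5 if natural else 1.0
```

Under this rule the dealer can no longer draw out to 21 and turn a player natural into a draw, which matches the behavior the test output shows (immediate reward 1.5 on a two-card 21).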
Do we really want to extend these built-in environments instead of having people make their own environments outside the gym repo, @pzhokhov?
Closing per #2259
That was a typo, sorry. I'm closing this after discussion with @cpnota, because this environment is intended to match the simplified blackjack game in Sutton and Barto.