Tree search refactor #18

sidnarayanan · 2024-09-09T22:23:50Z

Our old TreeSearchRollout implicitly managed a tree as a list of trajectories, using string IDs to infer edges. That was brittle and hard to work with.

This PR adds a TransitionTree object (essentially an nx.DiGraph wrapper) that stores edges explicitly and makes trees easier to manipulate.
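To illustrate why inferring edges from string IDs is fragile, here is a minimal sketch. It assumes a colon-separated ID scheme in which a child's ID is its parent's ID plus one more segment (suggested by test IDs such as f"{root_id}:0:0", but not confirmed by this PR). The helper names parent_from_id and build_children are hypothetical, and a plain dict stands in for the nx.DiGraph that the real TransitionTree wraps.

```python
from typing import Dict, List, Optional


def parent_from_id(step_id: str) -> Optional[str]:
    """Infer a parent ID by dropping the last ':'-separated segment.

    This relies entirely on the (assumed) ID naming scheme, which is
    what makes the implicit approach brittle.
    """
    head, sep, _ = step_id.rpartition(":")
    return head if sep else None


def build_children(step_ids: List[str]) -> Dict[str, List[str]]:
    """Build an explicit parent -> children mapping once, up front."""
    children: Dict[str, List[str]] = {}
    for step_id in step_ids:
        children.setdefault(step_id, [])
        parent = parent_from_id(step_id)
        if parent is not None:
            children.setdefault(parent, []).append(step_id)
    return children


ids = ["root", "root:0", "root:1", "root:0:0"]
children = build_children(ids)
print(children["root"])
```

Once edges are stored explicitly, lookups such as "children of this step" no longer depend on how IDs are spelled, which is roughly the benefit a graph-backed structure provides.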

ldp/data_structures.py

tests/test_rollouts.py

jamesbraza · 2024-09-09T23:07:24Z

tests/test_rollouts.py

+    # Then go up the tree
+    assert tree.get_transition(f"{root_id}:0:0").value == pytest.approx(
+        1.9, rel=0.001
+    )  # 1 + 0.9 * avg(0, 1, 2)


What is avg(0, 1, 2)?

Also, instead of 1.9, why not just put pytest.approx(1 + 0.9)?

sidnarayanan · Sep 9, 2024

I mean the average. At this node:

- The current reward is 1.
- The discount factor is 0.9.
- The expected future return is the mean of 0, 1, 2 (the values checked on the preceding lines).

So the value estimate should be 1 + 0.9 * average(0, 1, 2). I didn't want to write this out for every node I check in the test, since I left a comment here:

ldp/ldp/data_structures.py

Lines 259 to 271 in c85ec23

    if children := list(self.tree.successors(step_id)):
        # V_{t+1}(s') = sum_{a'} p(a'|s') * Q_{t+1}(s', a')
        # Here we assume p(a'|s') is uniform.
        # TODO: don't make that assumption where a logprob is available
        v_tp1 = sum(
            self.get_transition(child_id).value for child_id in children
        ) / len(children)
    else:
        v_tp1 = 0.0

    # Q_t(s_t, a_t) = r_{t+1} + gamma * V_{t+1}(s_{t+1})
    # (we are assuming the environment is deterministic)
    step.value = step.reward + discount_factor * v_tp1
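As a worked illustration of the backup rule in the snippet above, the sketch below reproduces the 1.9 from the test on a toy tree. It is not the ldp implementation: the Step dataclass, the backup_values function, the node IDs, and the dict-based child mapping are all assumptions made to keep the example dependency-free. It uses the same uniform-policy average and deterministic-environment assumption as the quoted comments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    reward: float
    value: float = 0.0


def backup_values(
    steps: Dict[str, Step],
    children: Dict[str, List[str]],
    root: str,
    discount_factor: float,
) -> None:
    """Assign values bottom-up: children are visited before their parent."""

    def visit(step_id: str) -> None:
        kids = children.get(step_id, [])
        for kid in kids:
            visit(kid)
        if kids:
            # V_{t+1}: uniform average over the child values
            v_tp1 = sum(steps[kid].value for kid in kids) / len(kids)
        else:
            # Leaf: no future return
            v_tp1 = 0.0
        # Q_t = r_{t+1} + gamma * V_{t+1}
        steps[step_id].value = steps[step_id].reward + discount_factor * v_tp1

    visit(root)


# One node with reward 1 whose three leaf children have rewards 0, 1, 2.
steps = {"n": Step(1.0), "n:0": Step(0.0), "n:1": Step(1.0), "n:2": Step(2.0)}
children = {"n": ["n:0", "n:1", "n:2"]}
backup_values(steps, children, root="n", discount_factor=0.9)
print(steps["n"].value)  # 1 + 0.9 * (0 + 1 + 2) / 3, approximately 1.9
```

The leaves keep their own rewards as values (0, 1, 2), so the parent's value is its reward plus the discounted mean of those three numbers.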

jamesbraza · Sep 9, 2024

Ah okay. IMO, a formula is sometimes more understandable in tests than a bare value. You could write 1 + 0.9 * (0 + 1 + 2) / 3, but feel free to ignore.

sidnarayanan · Sep 10, 2024

Good idea, will implement before merging.

sidnarayanan requested review from whitead, kwanUm, jamesbraza, Ryan-Rhys and albertbou92 September 9, 2024 22:23

jamesbraza reviewed Sep 9, 2024

ldp/data_structures.py (outdated, resolved)

ldp/data_structures.py (outdated, resolved)

ldp/data_structures.py (outdated, resolved)

tests/test_rollouts.py (resolved)

jamesbraza reviewed Sep 9, 2024

jamesbraza approved these changes Sep 9, 2024

sidnarayanan added 3 commits September 10, 2024 09:23

adding proper tree data structure for tree search

3e6f1f1

pr comments

7a93dca

make test value calculations more explicit

ff1f648

sidnarayanan force-pushed the tree-search branch from c85ec23 to ff1f648 September 10, 2024 16:23

sidnarayanan merged commit 2bd6745 into main Sep 10, 2024
5 of 6 checks passed

sidnarayanan deleted the tree-search branch September 10, 2024 16:27
