
Hi-Phy: A Benchmark for Hierarchical Physical Reasoning

Cheng Xue*, Vimukthini Pinto*, Chathura Gamage*, Peng Zhang, Jochen Renz
School of Computing
The Australian National University
Canberra, Australia
{cheng.xue, vimukthini.inguruwattage, chathura.gamage}@anu.edu.au
{p.zhang, jochen.renz}@anu.edu.au

Reasoning about the behaviour of physical objects is a key capability of agents operating in physical worlds. Humans are very experienced in physical reasoning while it remains a major challenge for AI. To facilitate research addressing this problem, several benchmarks have been proposed recently. However, these benchmarks do not enable us to measure an agent's granular physical reasoning capabilities when solving a complex reasoning task. In this paper, we propose a new benchmark for physical reasoning that allows us to test individual physical reasoning capabilities. Inspired by how humans acquire these capabilities, we propose a general hierarchy of physical reasoning capabilities with increasing complexity. Our benchmark tests capabilities according to this hierarchy through generated physical reasoning tasks in the video game Angry Birds. This benchmark enables us to conduct a comprehensive agent evaluation by measuring the agent's granular physical reasoning capabilities. We conduct an evaluation with human players, learning agents, and heuristic agents and determine their capabilities. Our evaluation shows that learning agents, with good local generalization ability, still struggle to learn the underlying physical reasoning capabilities and perform worse than current state-of-the-art heuristic agents and humans. We believe that this benchmark will encourage researchers to develop intelligent agents with advanced, human-like physical reasoning capabilities.

* equal contribution

Link to the paper: https://arxiv.org/abs/2106.09692


Table of contents

  1. Hierarchy
  2. Hi-Phy in Angry Birds
  3. Task Generator
  4. Tasks generated for the baseline analysis
  5. Baseline Agents
    1. How to run heuristic agents
    2. How to run DQN Baseline
    3. How to develop your own agent
    4. Outline of the Agent Code
  6. Framework
    1. The Game Environment
    2. Symbolic Representation Data Structure
    3. Communication Protocols
  7. Human Player Data


1. Hierarchy

Humans and AI approaches learn much better when examples are presented in a meaningful order of increasing complexity than when they are presented randomly. We therefore propose a hierarchy for physical reasoning that lets an agent start simple and progress to more complex capabilities, facilitating the training and evaluation of agents intended to work in the real physical world.

Our hierarchy consists of three levels and fifteen capabilities:

Level 1: Understanding the instant effect of the first force applied to objects in an environment as a result of an agent's action.

Level 1 capabilities:
1.1: Understanding the instant effect on objects in an environment when an agent applies a single force.
1.2: Understanding the instant effect on objects in an environment when an agent applies multiple forces.

Level 2: Understanding objects' movement in the environment after a force is applied.

Level 2 capabilities:
2.1: Understanding that objects in the environment may roll.
2.2: Understanding that objects in the environment may fall.
2.3: Understanding that objects in the environment may slide.
2.4: Understanding that objects in the environment may bounce.

Level 3: Performing tasks that require capabilities that 1) humans develop in infancy, 2) are required in robotics to develop agents that work alongside people, and 3) current reinforcement learning agents lack.

Level 3 capabilities:
3.1: Understanding relative weight of objects.
3.2: Understanding relative height of objects.
3.3: Understanding relative width of objects.
3.4: Understanding shape difference of objects.
3.5: Understanding how to perform non-greedy actions.
3.6: Understanding structural weak points/stability.
3.7: Understanding how to clear a path towards the goal.
3.8: Understanding how to perform actions with adequate timing.
3.9: Understanding how to use tools.

Please refer to the paper for more details on how and why we attributed the capabilities in this way.

2. Hi-Phy in Angry Birds

Based on the proposed hierarchy, we develop the Hi-Phy benchmark in Angry Birds. Hi-Phy contains tasks from 65 task templates belonging to the fifteen capabilities. The goal of an agent is to destroy all the pigs (green-coloured objects) in a task by shooting a given number of birds from the slingshot. Shown below are fifteen example tasks in Hi-Phy representing the fifteen capabilities, together with the solutions to those tasks.

Task descriptions:
1.1: Understanding the instant effect on objects in an environment when an agent applies a single force. A force needs to be applied to destroy the pig.
1.2: Understanding the instant effect on objects in an environment when an agent applies multiple forces. Multiple forces need to be applied to destroy the pig.
2.1: Understanding that objects in the environment may roll. The circular object needs to be rolled onto the pig, which is unreachable by the bird from the slingshot, destroying the pig.
2.2: Understanding that objects in the environment may fall. The circular object needs to be made to fall onto the pig, destroying the pig.
2.3: Understanding that objects in the environment may slide. The square object needs to be slid to push the pig, which is unreachable by the bird from the slingshot, destroying the pig.
2.4: Understanding that objects in the environment may bounce. The bird needs to be bounced off the platform (dark-brown object) to hit and destroy the pig.
3.1: Understanding relative weight of objects. The small circular block is lighter than the big circular block. Of the two blocks, only the small circular block can be rolled to reach and destroy the pig.
3.2: Understanding relative height of objects. The square block on top of the taller rectangular block will not fall through the gap, due to the height of the rectangular block. Hence the square block on top of the shorter rectangular block needs to be toppled to fall through the gap and destroy the pig.
3.3: Understanding relative width of objects. The bird cannot go through the lower entrance, which has a narrow opening. Hence the bird needs to be shot through the upper entrance to reach and destroy the pig.
3.4: Understanding shape difference of objects. The circular block resting on two triangular blocks can be rolled down by breaking one triangular block, whereas the circular block resting on two square blocks cannot be rolled down by breaking one square block. Hence, with the single bird given, a triangular block needs to be destroyed to roll the circle and destroy the pig.
3.5: Understanding how to perform non-greedy actions. The greedy action is to destroy the highest number of pigs with a single shot. If the two pigs resting on the circular block are destroyed first, the circle will roll down and block the entrance to the pig below. Hence, the pig below needs to be destroyed first, followed by the upper two pigs.
3.6: Understanding structural weak points/stability. The bird needs to be shot at the weak point of the structure to break its stability and destroy the pigs. Shooting elsewhere does not destroy the pigs with a single bird.
3.7: Understanding how to clear a path towards the goal. First, the rectangular block needs to be positioned correctly to open a path for the circular block to reach the pig. Then the circular block needs to be rolled to destroy the pig.
3.8: Understanding how to perform actions with adequate timing. First, the two circular objects need to be rolled to the ramp. Then, after the first circle passes the prop and before the second circle reaches it, the prop needs to be destroyed so that the second circle falls onto the pig below.
3.9: Understanding how to use tools. The blue bird (considered a tool) splits into three birds when tapped in flight, unlike the red bird, which has no such ability. The blue bird needs to be tapped at the correct position to reach the two separated pigs that cannot be destroyed with a single bird.

Screenshots of the 65 task templates are shown below. x.y.z denotes the zth task template of the yth capability of the xth hierarchy level.

[Screenshots of the 65 task templates: 1.1.1-1.1.3, 1.2.1-1.2.5, 2.1.1-2.1.5, 2.2.1-2.2.5, 2.3.1-2.3.4, 2.4.1-2.4.3, 3.1.1-3.1.5, 3.2.1-3.2.4, 3.3.1-3.3.4, 3.4.1-3.4.4, 3.5.1-3.5.5, 3.6.1-3.6.5, 3.7.1-3.7.5, 3.8.1-3.8.2, 3.9.1-3.9.6]

3. Task Generator

We develop a task generator that can generate tasks for the task templates we designed.

  1. To run the task generator:
    1. Go to tasks/task_generator
    2. Copy the task templates that you want to generate tasks from into the input folder (the task templates can be found in tasks/task_templates)
    3. Run the task generator, providing the number of tasks as an argument
       python generate_tasks.py <number of tasks to generate>
    4. Generated tasks will be available in the output folder

4. Tasks generated for the baseline analysis

We generated 100 tasks from each of the 65 task templates for the baseline analysis. The generated tasks can be found in tasks/generated_tasks.zip. After extracting this file, the generated tasks are located in the following folder structure:
    generated_tasks/
        -- index of the hierarchy level/
            -- index of the capability/
                -- index of the template/
                    -- task files named as hierarchyLevelIndex_capabilityIndex_templateIndex_taskIndex.xml
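
For example, a minimal Python sketch (assuming the zip was extracted to generated_tasks/ as above; count_tasks_per_template is a hypothetical helper, not part of the repository) that walks this structure and counts the tasks generated per template:

import os
from collections import Counter

# Hypothetical helper: walk the extracted generated_tasks/ folder and count
# task files per template, using the file naming scheme described above.
def count_tasks_per_template(root="generated_tasks"):
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".xml"):
                # hierarchyLevelIndex_capabilityIndex_templateIndex_taskIndex.xml
                level, capability, template, task = name[:-4].split("_")
                counts["%s.%s.%s" % (level, capability, template)] += 1
    return counts

for template_id, n in sorted(count_tasks_per_template().items()):
    print(template_id, n)  # expect 100 tasks per template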

5. Baseline Agents

Tested environments:

  • Ubuntu: 18.04/20.04
  • Python: 3.9
  • Numpy: 1.20
  • torch: 1.8.1
  • torchvision: 0.9.1
  • lxml: 4.6.3
  • tensorboard: 2.5.0
  • Java: 13.0.2/13.0.7

Before running agents, please:

  1. Go to buildgame and unzip Linux.zip
  2. Go to task/generated_tasks and unzip generated_tasks.zip

5.1 How to run heuristic agents

  1. Run the Java heuristic agents, Datalab and Eagle Wings:

    1. Go to Utils and in a terminal run
      python PrepareTestConfig.py

    2. Go to buildgame/Linux and in a terminal run
      java -jar game_playing_interface.jar
    3. Go to Agents/HeuristicAgents/ and in a terminal run Datalab
      java -jar datalab_037_v4_java12.jar 1
      or Eagle Wings
      java -jar eaglewings_037_v3_java12.jar 1
  2. Run the Random Agent and Pig Shooter:

    1. Go to Agents/
    2. In a terminal, after granting execution permission, run the Random Agent
      ./TestPythonHeuristicAgent.sh RandomAgent
      or Pig Shooter
      ./TestPythonHeuristicAgent.sh PigShooter

5.2 How to run DQN Baseline

  1. Go to Agents/
  2. In a terminal, after granting execution permission, train the agent for within-capability training
    ./TrainLearningAgent.sh within_capability
    or for within-template training
    ./TrainLearningAgent.sh within_template
  3. Models will be saved to Agents/LearningAgents/saved_model
  4. To test the learning agents, go to the folder Agents:
    1. To test within-template performance, run
    python TestAgentOfflineWithinTemplate.py

    2. To test within-capability performance, run
    python TestAgentOfflineWithinCapability.py
    

5.3 How to develop your own agent

We provide a gym-like environment. A simple demo, which can also be found in demo.py, is shown below:

from SBAgent import SBAgent
from SBEnvironment.SBEnvironmentWrapper import SBEnvironmentWrapper

# use score as the reward and run the game 50 times faster
env = SBEnvironmentWrapper(reward_type="score", speed=50)
level_list = [1, 2, 3]  # level list for the agent to play
dummy_agent = SBAgent(env=env, level_list=level_list)  # initialise the agent
dummy_agent.state_representation_type = 'image'  # use the image representation as the state
env.make(agent=dummy_agent, start_level=dummy_agent.level_list[0],
         state_representation_type=dummy_agent.state_representation_type)  # initialise the environment

s, r, is_done, info = env.reset()  # get ready for running
for level_idx in level_list:
    is_done = False
    while not is_done:
        s, r, is_done, info = env.step([-100, -100])  # the agent always shoots at (-100, -100) relative to the slingshot

    env.current_level = level_idx + 1  # advance to the next level once the current one is finished
    if env.current_level > level_list[-1]:  # end the game when all levels in the level list are played
        break
    s, r, is_done, info = env.reload_current_level()  # load the updated current level, i.e. the next one
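
For instance, to turn the demo into a simple random agent, the fixed shot inside the while loop can be replaced with a sampled one. This is a sketch only: the sampling range mirrors the provided RandomAgent described in Section 5.4, and everything else stays as in the demo above.

from random import uniform

# Replace the fixed shot in the demo's while loop with a random release point
# relative to the slingshot, using the same range as the provided RandomAgent:
# x in (-100, -10), y in (-100, 100).
action = [uniform(-100, -10), uniform(-100, 100)]
s, r, is_done, info = env.step(action)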

5.4 Outline of the Agent Code

The ./Agents folder contains all the relevant source code of our agents. Below is an outline of the code (a brief description; detailed documentation is in progress):

  1. Client:
    1. agent_client.py: Includes all communication protocols.
  2. final_run: Place to store TensorBoard results.
  3. HeuristicAgents
    1. datalab_037_v4_java12.jar: State-of-the-art Java agent for Angry Birds.
    2. eaglewings_037_v3_java12.jar: State-of-the-art Java agent for Angry Birds.
    3. PigShooter.py: Python agent that shoots at the pigs only.
    4. RandomAgent.py: Random agent that samples shots from $x \in (-100,-10)$ and $y \in (-100,100)$.
    5. HeuristicAgentThread.py: A thread wrapper to run multiple instances of heuristic agents.
  4. LearningAgents
    1. RLNetwork: Folder that includes all DQN structures that can be used as an input to DQNDiscreteAgent.py.
    2. saved_model: Place to save trained models.
    3. LearningAgent.py: Inherits from the SBAgent class; a base class for implementing learning agents.
    4. DQNDiscreteAgent.py: Inherits from LearningAgent; a DQN agent with a discrete action space.
    5. LearningAgentThread.py: A thread wrapper to run multiple instances of learning agents.
    6. Memory.py: A script that includes different types of memory. Currently we have normal memory, PrioritizedReplayMemory, and PrioritizedReplayMemory with balanced samples.
  5. SBEnvironment
    1. SBEnvironmentWrapper.py: A wrapper class that provides the gym-like environment.
  6. StateReader: Folder that contains files to convert the symbolic state representation into inputs for the agents.
  7. Utils:
    1. Config.py: Config class used to pass parameters to agents.
    2. GenerateCapabilityName.py: Generates the list of capability names for agents to train on.
    3. GenerateTemplateName.py: Generates the list of template names for agents to train on.
    4. LevelSelection.py: Class that includes different strategies to select levels; for example, an agent may choose to go to the next level if it passes the current one, or only after playing the current level a predefined number of times.
    5. NDSparseMatrix.py: Class that stores the converted symbolic representation in a sparse matrix to reduce memory usage.
    6. Parameters.py: Training/testing parameters passed to the agent.
    7. PrepareTestConfig.py: Script to generate the config file used by the game console when testing agents.
    8. trajectory_planner.py: Calculates two possible trajectories given a directly reachable target point; returns None if the target is not reachable by the bird.
  8. demo.py: A demo to showcase how to use the framework.
  9. SBAgent.py: Base class for all agents.
  10. MultiAgentTestOnly.py: Tests Python heuristic agents by running multiple instances on one particular template.
  11. TestAgentOfflineWithinCapability.py: Uses the saved models in LearningAgents/saved_model to test the agent's within-capability performance on the test set.
  12. TestAgentOfflineWithinTemplate.py: Uses the saved models in LearningAgents/saved_model to test the agent's within-template performance on the test set.
  13. TrainLearningAgent.py: Script to train learning agents on a particular template with a defined mode.
  14. TestPythonHeuristicAgent.sh: Bash script to test a heuristic agent's performance on all templates.
  15. TrainLearningAgent.sh: Bash script to train learning agents on all templates/capabilities.

6. Framework

6.1 The Game Environment

  1. The coordinate system
    • In the Science Birds game, the origin (0,0) is the bottom-left corner, and the Y coordinate increases upwards.
    • Coordinates range from (0,0) to (640,480).

6.2 Symbolic Representation Data Structure

  1. Symbolic representation data of game objects is stored in a JSON object. The JSON object describes an array where each element describes a game object. Game object categories and their properties are described below:

    • Ground: the lowest unbreakable flat support surface

      • property: id = 'object [i]'
      • property: type = 'Ground'
      • property: yindex = [the y coordinate of the ground line]
    • Platform: Unbreakable obstacles

      • property: id = 'object [i]'
      • property: type = 'Object'
      • property: vertices = [a list of ordered 2d points that represents the polygon shape of the object]
      • property: colormap = [a list of compressed 8-bit (RRRGGGBB) colour and their percentage in the object]
    • Trajectory: the dots that represent the trajectories of the birds

      • property: id = 'object [i]'
      • property: type = 'Trajectory'
      • property: location = [a list of 2d points that represents the trajectory dots]
    • Slingshot: Unbreakable slingshot for shooting the bird

      • property: id = 'object [i]'
      • property: type = 'Slingshot'
      • property: vertices = [a list of ordered 2d points that represents the polygon shape of the object]
      • property: colormap = [a list of compressed 8-bit (RRRGGGBB) colour and their percentage in the object]
    • Red Bird:

      • property: id = 'object [i]'
      • property: type = 'Object'
      • property: vertices = [a list of ordered 2d points that represents the polygon shape of the object]
      • property: colormap = [a list of compressed 8-bit (RRRGGGBB) colour and their percentage in the object]
    • All objects below have the same representation as the red bird:

    • Blue Bird:

    • Yellow Bird:

    • White Bird:

    • Black Bird:

    • Small Pig:

    • Medium Pig:

    • Big Pig:

    • TNT: an explosive block

    • Wood Block: Breakable wooden blocks

    • Ice Block: Breakable ice blocks

    • Stone Block: Breakable stone blocks

  2. Round objects are also represented as polygons with a list of vertices

  3. Symbolic representation with noise

    • If a noisy symbolic representation is requested, the noise is applied to each point in the vertices of the game objects, except the ground, all birds, and the slingshot.
    • The noise for 'vertices' is applied to all vertices by the same amount, within 5 pixels.
    • The colour map has a noise of +/- 2%.
    • The colour map compresses 24-bit RGB colour into 8 bits: 3 bits for red, 3 bits for green, and 2 bits for blue. Each colour is followed by the percentage of the object it accounts for; for example, (127, 0.5) means that 50% of the pixels in the object have colour 127.
    • The noise is uniformly distributed.
    • We will later offer more sophisticated and adjustable noise.
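
To make the structure concrete, here is a small Python sketch. It is an illustration only: the field names follow the description above, and decode_colour is a hypothetical helper for the 8-bit RRRGGGBB encoding, not part of the repository.

import json

# Hypothetical helper: expand an 8-bit RRRGGGBB colour code (as used in the
# colormap entries above) back to an approximate 24-bit RGB triple.
def decode_colour(code):
    r = (code >> 5) & 0b111  # top 3 bits: red
    g = (code >> 2) & 0b111  # middle 3 bits: green
    b = code & 0b11          # bottom 2 bits: blue
    return (r * 255 // 7, g * 255 // 7, b * 255 // 3)

# Illustrative parsing of the symbolic representation described above,
# assuming `raw` holds the JSON bytes returned by, e.g., message ID 62.
def summarise_objects(raw):
    for obj in json.loads(raw):
        if obj["type"] == "Ground":
            print(obj["id"], "ground line at y =", obj["yindex"])
        elif obj["type"] == "Trajectory":
            print(obj["id"], len(obj["location"]), "trajectory dots")
        else:  # platforms, slingshot, birds, pigs, blocks, TNT, ...
            code, share = max(obj["colormap"], key=lambda c: c[1])
            print(obj["id"], obj["type"], len(obj["vertices"]), "vertices,",
                  "dominant colour", decode_colour(code), "share", share)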

6.3 Communication Protocols

Each message is a byte array. The request and return formats (byte[]) for each message ID are listed below.

Messages 1-10: Configuration Messages

Message 1: Configure team ID and running mode
  Request: [1][ID][Mode]
    ID: 4 bytes; Mode: 1 byte (COMPETITION = 0, TRAINING = 1)
  Return: a four-byte array [round info][time limit][available levels]
    The first byte indicates the round, the second specifies the time limit in minutes, and the third specifies the number of available levels.
    Note: in training mode, the return will be [0][0][0]. As the round info is not used in training, the time limit will be 600 hours, and the number of levels needs to be requested via message ID 15.

Message 2: Set simulation speed, speed $\in$ [0.0, 50.0]
  Request: [2][speed]
    speed: 4 bytes
  Return: OK/ERR [1]/[0]
  Note: this command can be sent at any time during play to change the simulation speed.

Messages 11-30: Query Messages

Message 11: Do screenshot
  Request: [11]
  Return: [width][height][image bytes]
    width, height: 4 bytes each
  Note: this command only returns a screenshot, without the symbolic representation.

Message 12: Get game state
  Request: [12]
  Return: one byte indicating the ordinal of the state
    [0]: UNKNOWN, [1]: MAIN_MENU, [2]: EPISODE_MENU, [3]: LEVEL_SELECTION, [4]: LOADING, [5]: PLAYING, [6]: WON, [7]: LOST

Message 14: Get the current level
  Request: [14]
  Return: [level index], a four-byte array indicating the index of the current level

Message 15: Get the number of levels
  Request: [15]
  Return: [number of levels], a four-byte array indicating the number of available levels

Message 23: Get my score
  Request: [23]
  Return: [number_of_levels][score_level_1]...[score_level_n]
    A 4-byte array indicating the number of levels, followed by (number_of_levels * 4) bytes in which every four bytes hold the best score for the corresponding level.
  Note: this should be used carefully in training mode, because a large number of levels may be used in training. Instead, when the agent is in the winning state, use message ID 65 to get the score of the single level just won.

Messages 31-50: In-Game Action Messages

Message 31: Shoot using Cartesian coordinates [Safe mode*]
  Request: [31][fx][fy][dx][dy][t1][t2], each parameter 4 bytes
    fx, fy: the x and y coordinates of the focus point
    dx, dy: the x and y coordinates of the release point minus fx and fy
    t1: the release time; t2: the gap between the release time and the tap time
    If t1 is set to 0, the server executes the shot immediately.
  Return: OK/ERR [1]/[0]

Message 32: Shoot using polar coordinates [Safe mode*]
  Request: [32][fx][fy][theta][r][t1][t2], each parameter 4 bytes
    theta: the release angle; r: the radial coordinate
  Return: OK/ERR [1]/[0]

Message 33: Sequence of shots [Safe mode*]
  Request: [33][shots length][shot message ID][Params]...[shot message ID][Params]
    Maximum sequence length: 16 shots
  Return: an array in which each slot indicates a good/bad shot. Bad shots are those rejected by the server. For example, if the server received 5 shots and the third was not executed for some reason, the server returns [1][1][0][1][1].

Message 41: Shoot using Cartesian coordinates [Fast mode**]
  Request: [41][fx][fy][dx][dy][t1][t2], each parameter 4 bytes
  Return: OK/ERR [1]/[0]

Message 42: Shoot using polar coordinates [Fast mode**]
  Request: [42][fx][fy][theta][r][t1][t2], each parameter 4 bytes
  Return: OK/ERR [1]/[0]

Message 43: Sequence of shots [Fast mode**]
  Request: [43][shots length][shot message ID][Params]...[shot message ID][Params]
    Maximum sequence length: 16 shots
  Return: as for message 33.

Message 34: Fully zoom out
  Request: [34]
  Return: OK/ERR [1]/[0]

Message 35: Fully zoom in
  Request: [35]
  Return: OK/ERR [1]/[0]

Messages 51-60: Level Selection Messages

Message 51: Load a level
  Request: [51][Level]
    Level: 4 bytes
  Return: OK/ERR [1]/[0]

Message 52: Restart a level
  Request: [52]
  Return: OK/ERR [1]/[0]

Messages 61-70: Science Birds Specific Messages

Message 61: Get symbolic representation with screenshot
  Request: [61]
  Return: [symbolic representation byte array length][Symbolic Representation bytes][image width][image height][image bytes]
    symbolic representation byte array length: 4 bytes; image width, image height: 4 bytes each

Message 62: Get symbolic representation without screenshot
  Request: [62]
  Return: [symbolic representation byte array length][Symbolic Representation bytes]

Message 63: Get noisy symbolic representation with screenshot
  Request: [63]
  Return: as for message 61, with the noisy symbolic representation

Message 64: Get noisy symbolic representation without screenshot
  Request: [64]
  Return: as for message 62, with the noisy symbolic representation

Message 65: Get current level score
  Request: [65]
  Return: [score]
    score: 4 bytes
  Note: this score can be requested at any time in the Playing/Won/Lost state. It is intended for agents that make use of intermediate scores during training/reasoning. To get the winning score, make sure to execute this command when the game state is "WON".

* Safe mode: the server waits until the state is static after making a shot.
** Fast mode: the server sends back a confirmation once a shot is made; it does not check for the appearance of the "won" page.
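
For illustration, a minimal Python sketch of the byte-level protocol. The host, port, team ID, and big-endian byte order below are assumptions for illustration only; the provided agent_client.py already implements these protocols.

import socket
import struct

HOST, PORT = "127.0.0.1", 2004  # hypothetical server address

with socket.create_connection((HOST, PORT)) as sock:
    # Message 1: configure team ID (4 bytes) and running mode (1 byte, TRAINING = 1).
    sock.sendall(struct.pack(">BiB", 1, 28888, 1))
    round_info = sock.recv(4)  # [round info][time limit][available levels]; [0][0][0] in training

    # Message 12: get game state; the server returns one byte (e.g. 5 = PLAYING).
    sock.sendall(struct.pack(">B", 12))
    state = sock.recv(1)[0]

    # Message 15: get the number of available levels (four-byte return).
    sock.sendall(struct.pack(">B", 15))
    n_levels = struct.unpack(">i", sock.recv(4))[0]
    print("state:", state, "levels:", n_levels)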

7. Human Player Data

The human player data for Hi-Phy is given in human_player_data.zip, which includes summarized data for 20 players. Each .csv file corresponds to one player, with the following columns:

  1. levelIndex: The index assigned to the task
  2. levelName: The name of the task
  3. attempts: Number of attempts taken to solve the task
  4. total_thinking_time: Total thinking time taken to solve the task
  5. time_breakdown: Thinking time taken for each attempt (e.g. {1: 27, 2: 14}: the player took two attempts to solve the task; the first attempt took 27 seconds and the second took 14 seconds)
  6. hierachy_level: The level of the hierarchy
  7. capability: The index of the capability
  8. h_c: The index of the hierarchy and the capability (e.g. 2_3: hierarchy level 2, capability 3)
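
A small Python sketch for loading and aggregating these files. This is illustrative only: the extraction folder name human_player_data/ is an assumption, while the column names are those listed above.

import glob

import pandas as pd

# Load all per-player CSVs (assumed extracted from human_player_data.zip)
# and compute the mean number of attempts per capability across players.
frames = [pd.read_csv(path) for path in glob.glob("human_player_data/*.csv")]
players = pd.concat(frames, ignore_index=True)

# 'h_c' combines hierarchy level and capability, e.g. "2_3".
mean_attempts = players.groupby("h_c")["attempts"].mean()
print(mean_attempts)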
