Add docs for qlib.rl #1322

Merged: 45 commits, merged on Nov 10, 2022

Changes from 8 commits

Commits (45)
ddbbcc6
Add docs for qlib.rl
lwwang1995 Oct 19, 2022
9dd860c
Update docs for qlib.rl
lwwang1995 Oct 19, 2022
c7b68b1
Add homepage introduct to RL framework
you-n-g Oct 20, 2022
d13262f
Update index Link
you-n-g Oct 20, 2022
8cacc96
Fix Icon
you-n-g Oct 20, 2022
c3577e1
typo
you-n-g Oct 20, 2022
b139e7d
Merge remote-tracking branch 'origin/main' into HEAD
you-n-g Oct 20, 2022
a0d3621
Update catelog
you-n-g Oct 20, 2022
1a62af9
Update docs for qlib.rl
lwwang1995 Oct 20, 2022
c9b2198
Update docs for qlib.rl
lwwang1995 Oct 21, 2022
5a75c9d
Update figure
lwwang1995 Oct 21, 2022
77fbb16
Update docs for qlib.rl
lwwang1995 Oct 21, 2022
5d2f21f
Update setup.py
you-n-g Oct 22, 2022
160f951
Merge remote-tracking branch 'origin/main' into HEAD
you-n-g Oct 22, 2022
4b705a9
FIx setup.py
you-n-g Oct 22, 2022
145414b
Update docs and fix some typos
lwwang1995 Oct 24, 2022
f215418
Fix the reference to RL docs
lwwang1995 Oct 24, 2022
b688db7
Update framework.svg
you-n-g Oct 24, 2022
5035af1
Update framework.svg
you-n-g Oct 24, 2022
3b182e1
Update framework.svg
you-n-g Oct 24, 2022
834c4f4
Update docs for qlibrl.
lwwang1995 Oct 27, 2022
0b17397
Update docs for qlibrl.
lwwang1995 Oct 27, 2022
7bfc937
Update docs for Qlibrl.
lwwang1995 Oct 27, 2022
21b765d
Update docs for qlibrl.
lwwang1995 Oct 27, 2022
1703492
Update docs for qlibrl.
lwwang1995 Oct 27, 2022
4d73676
Update docs for qlibrl.
lwwang1995 Oct 27, 2022
f7713e2
Add new framework
you-n-g Oct 27, 2022
a59f844
Update jpg
you-n-g Oct 27, 2022
47667a7
Update framework.svg
you-n-g Oct 28, 2022
129c1a8
Update framework.svg
you-n-g Oct 28, 2022
db543fc
Update Qlib framework and description
you-n-g Oct 28, 2022
34e2bc4
Update grammar
you-n-g Oct 28, 2022
8d7df20
Update README.md
you-n-g Oct 28, 2022
b3eec1c
Update README.md
you-n-g Oct 28, 2022
946177d
Update docs/component/rl.rst
lwwang1995 Oct 28, 2022
04a9b8f
Update docs/component/rl.rst
lwwang1995 Oct 28, 2022
e248066
Update docs for qlib.rl
lwwang1995 Nov 2, 2022
6020b86
Change theme for docs.
lwwang1995 Nov 3, 2022
5db5ea9
Update docs for qlib.rl
lwwang1995 Nov 4, 2022
7b84f49
Update docs for qlib.rl
lwwang1995 Nov 7, 2022
c47a460
Update docs for qlib.rl
lwwang1995 Nov 7, 2022
cf3642d
Update docs for qlib.rl.
lwwang1995 Nov 7, 2022
d723685
Update docs for qlib.rl
lwwang1995 Nov 8, 2022
6484cfa
Update docs for qlib.rl
lwwang1995 Nov 8, 2022
0db199c
Update docs for qlib.rl
lwwang1995 Nov 8, 2022
Binary file modified docs/_static/img/qlib_rl_highlevel.png
6 changes: 4 additions & 2 deletions docs/component/highfreq.rst
@@ -15,15 +15,17 @@ In order to support the joint backtest strategies in multiple levels, a correspo

Besides backtesting, the optimization of strategies at different levels is not standalone: the strategies can affect each other.
For example, the best portfolio management strategy may change with the performance of order execution (e.g. a portfolio with higher turnover may become a better choice when we improve the order execution strategies).
To achieve the overall good performance , it is necessary to consider the interaction of strategies in different level.
To achieve the overall good performance , it is necessary to consider the interaction of strategies in different level.
Review comment (Collaborator): Please remove the extra useless blank.


Therefore, building a new framework for trading at multiple levels becomes necessary to solve the various problems mentioned above, for which we designed a nested decision execution framework that considers the interaction of strategies.

.. image:: ../_static/img/framework.svg

The design of the framework is shown in the yellow part in the middle of the figure above. Each level consists of a ``Trading Agent`` and an ``Execution Env``. The ``Trading Agent`` has its own data processing module (``Information Extractor``), forecasting module (``Forecast Model``) and decision generator (``Decision Generator``). The trading algorithm generates decisions via the ``Decision Generator`` based on the forecast signals output by the ``Forecast Model``, and the decisions generated by the trading algorithm are passed to the ``Execution Env``, which returns the execution results.

The frequency of trading algorithm, decision content and execution environment can be customized by users (e.g. intraday trading, daily-frequency trading, weekly-frequency trading), and the execution environment can be nested with finer-grained trading algorithm and execution environment inside (i.e. sub-workflow in the figure, e.g. daily-frequency orders can be turned into finer-grained decisions by splitting orders within the day). The flexibility of nested decision execution framework makes it easy for users to explore the effects of combining different levels of trading strategies and break down the optimization barriers between different levels of trading algorithm.
The frequency of trading algorithm, decision content and execution environment can be customized by users (e.g. intraday trading, daily-frequency trading, weekly-frequency trading), and the execution environment can be nested with finer-grained trading algorithm and execution environment inside (i.e. sub-workflow in the figure, e.g. daily-frequency orders can be turned into finer-grained decisions by splitting orders within the day). The flexibility of nested decision execution framework makes it easy for users to explore the effects of combining different levels of trading strategies and break down the optimization barriers between different levels of trading algorithm.
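
To make the nesting above concrete, the sketch below shows how a two-level setup (a daily strategy whose orders are split by an intraday strategy) might be configured. The class paths and keyword arguments are assumptions drawn from the ``examples/nested_decision_execution`` workflow and may differ from the exact API; treat it as illustrative only.

.. code-block:: python

    # Hypothetical sketch of a nested executor configuration: a daily-level
    # decision is handed to an inner 30-minute executor driven by a simple
    # order-splitting strategy. Class paths and kwargs are assumptions based
    # on examples/nested_decision_execution, not a verified API reference.
    executor_config = {
        "class": "NestedExecutor",
        "module_path": "qlib.backtest.executor",
        "kwargs": {
            "time_per_step": "day",                  # outer (daily) decision level
            "inner_strategy": {
                "class": "TWAPStrategy",             # splits daily orders within the day
                "module_path": "qlib.contrib.strategy.rule_strategy",
            },
            "inner_executor": {
                "class": "SimulatorExecutor",
                "module_path": "qlib.backtest.executor",
                "kwargs": {"time_per_step": "30min", "verbose": False},
            },
        },
    }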

The optimization of the nested decision execution framework can be implemented with an RL-based method, which is supported by `qlib.rl <https://github.com/microsoft/qlib/tree/main/examples/rl>`_.
Review comment (Collaborator): I think the reference to the docs will be better than an example.
Reply: I think keeping the example will also be helpful.


Example
=======
36 changes: 28 additions & 8 deletions docs/component/rl.rst
@@ -6,13 +6,13 @@ Reinforcement Learning in Quantitative Trading

Review comment (Collaborator): Suggest adding a summary upfront to describe what kind of problem we intend to solve.

Introduction
============
The Qlib Reinforcement Learning toolkit (QlibRL) is the RL platform for quantitative investment. It contains a full set of components that cover the entire lifecycle of an RL pipeline, including building the simulator of the market, shaping states & actions, training policies (strategies), and backtesting strategies in the simulated environment.
The Qlib Reinforcement Learning toolkit (QlibRL) is an RL platform for quantitative investment. It contains a full set of components that cover the entire lifecycle of an RL pipeline, including building the simulator of the market, shaping states & actions, training policies (strategies), and backtesting strategies in the simulated environment.

QlibRL is basically implemented within the frameworks of Tianshou and Gym. The high-level structure of QlibRL is demonstrated below:
QlibRL is basically implemented with the support of Tianshou and Gym frameworks. The high-level structure of QlibRL is demonstrated below:

.. image:: ../_static/img/qlib_rl_highlevel.png

Here, we briefly introduce each of the components in the figure.
Here, we briefly introduce each component in the figure.

Base Modules
============
@@ -24,7 +24,7 @@ EnvWrapper is the complete capsulation of the simulated environment. It receives
In QlibRL, EnvWrapper is a subclass of gym.Env, so it implements all necessary interfaces of gym.Env. Any classes or pipelines that accept gym.Env should also accept EnvWrapper. Developers do not need to implement their own EnvWrapper to build their own environment. Instead, they only need to implement 4 components of the EnvWrapper:

- `Simulator`
Review comment (Collaborator): Link to class reference with :class:`~qlib.rl.Simulator`.

The simulator is the core component responsible for the environment simulation. Developers could implement all the logic that is directly related to the environment simulation in the Simulator in any way they like. In QlibRL, there are already two implementations of Simulator: 1) ``SingleAssetOrderExecution``, which is built based on Qlib's backtest toolkits. 2) ``SimpleSingleAssetOrderExecution``, which is built based on naive simulation logic.
The simulator is the core component responsible for the environment simulation. Developers could implement all the logic that is directly related to the environment simulation in the Simulator in any way they like. In QlibRL, there are already two implementations of Simulator for single-asset trading: 1) ``SingleAssetOrderExecution``, which is built on Qlib's backtest toolkits and hence considers a lot of practical trading details but is slow; 2) ``SimpleSingleAssetOrderExecution``, which is built on a simplified trading simulator that ignores a lot of details (e.g. trading limitations, rounding) but is quite fast.
- `State interpreter`
The state interpreter is responsible for "interpreting" states in the original format (the format provided by the simulator) into states in a format that the policy could understand, for example, transforming unstructured raw features into numerical tensors.
- `Action interpreter`
@@ -60,15 +60,35 @@ Order Execution
------------
As a fundamental problem in algorithmic trading, order execution aims at fulfilling a specific trading order, either liquidation or acquirement, for a given instrument. Essentially, the goal of order execution is twofold: it not only requires fulfilling the whole order but also targets a more economical execution that maximizes profit gain (or minimizes capital loss). The order execution with only one order of liquidation or acquirement is called single-asset order execution.

Considering stock investment always aim to pursue long-term maximized profits, is usually behaved in the form of a sequential process of continuously adjusting the asset portfolio, execution for multiple orders, including order of liquidation and acquirement, brings more constraints and making the sequence of execution for different orders should be considered, e.g. before executing an order to buy some stocks, we have to sell at least one stock. The order execution with multiple assets is called multi-asset order execution.
Since stock investment always aims to pursue long-term maximized profits, it usually manifests as a sequential process of continuously adjusting the asset portfolios. Executing multiple orders, including both liquidation and acquirement orders, brings more constraints, and the sequence in which different orders are executed has to be considered, e.g. before executing an order to buy some stocks, we have to sell at least one stock first. The order execution with multiple assets is called multi-asset order execution.

Since order execution is essentially a sequential decision-making problem, an RL-based solution can be applied to solve it. With an RL-based solution, an agent optimizes its execution strategy through interacting with the market environment.

With QlibRL, the RL algorithm in the above scenarios can be easily implemented.
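
As a purely illustrative sketch (independent of QlibRL's actual ``Simulator`` and ``Interpreter`` classes), single-asset order execution can be framed as a sequential decision problem: the state tracks the remaining quantity and time, the action is the amount to trade in the current step, and the reward compares the execution price against a benchmark such as TWAP. All names below are hypothetical.

.. code-block:: python

    # Illustrative-only framing of single-asset liquidation as an MDP.
    # Names and logic here are hypothetical, not QlibRL interfaces.
    from dataclasses import dataclass

    @dataclass
    class ExecState:
        remaining: float   # shares still to be liquidated
        step: int          # current intraday step
        total_steps: int

    def step(state: ExecState, action_ratio: float, price: float, twap_price: float):
        """Sell `action_ratio` of the remaining shares at `price`.

        The reward is the price advantage over the TWAP benchmark, so the
        agent is encouraged to trade more when prices are favorable.
        """
        ratio = min(max(action_ratio, 0.0), 1.0)
        traded = state.remaining * ratio
        reward = traded * (price - twap_price) / twap_price
        next_state = ExecState(state.remaining - traded, state.step + 1, state.total_steps)
        done = next_state.step >= next_state.total_steps or next_state.remaining <= 1e-6
        return next_state, reward, done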

Review comment (Collaborator): I think we can add an extra section for nested Portfolio Construction & Order Execution and emphasize the difference from traditional methods.

Nested Portfolio Construction and Order Execution
------------
QlibRL makes it possible to jointly optimize different levels of strategies/models/agents. Taking the `Nested Decision Execution Framework <https://github.com/microsoft/qlib/blob/main/examples/nested_decision_execution>`_ as an example, the optimization of the order execution strategy and the portfolio management strategy can interact with each other to maximize returns.

Base Class & Interface
============
``Qlib`` provides a set of APIs for developers to further simplify their development, including base classes for Interpreter, Simulator and Reward.

.. autoclass:: qlib.rl.interpreter.Interpreter
:members:

.. autoclass:: qlib.rl.simulator.Simulator
:members:

.. autoclass:: qlib.rl.reward.Reward
:members:
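
As a rough illustration of how these base classes fit together, the sketch below subclasses each of them for a toy liquidation task. The method names follow the base classes documented above, but the exact signatures and the gym space declarations are assumptions; consult the API reference before relying on them.

.. code-block:: python

    # Minimal, hypothetical subclasses of the documented base classes.
    # Signatures are assumptions; see the autogenerated API docs above.
    import numpy as np
    from gym import spaces

    from qlib.rl.simulator import Simulator
    from qlib.rl.interpreter import StateInterpreter, ActionInterpreter
    from qlib.rl.reward import Reward


    class ToyLiquidationSimulator(Simulator):
        """Liquidate `initial` shares within a fixed number of steps."""

        def __init__(self, initial: float, total_steps: int = 10):
            super().__init__(initial)
            self.remaining = initial
            self.steps_left = total_steps

        def step(self, amount: float) -> None:
            self.remaining = max(self.remaining - amount, 0.0)
            self.steps_left -= 1

        def get_state(self) -> dict:
            return {"remaining": self.remaining, "steps_left": self.steps_left}

        def done(self) -> bool:
            return self.remaining <= 0.0 or self.steps_left <= 0


    class ToyStateInterpreter(StateInterpreter):
        observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2,), dtype=np.float32)

        def interpret(self, simulator_state: dict) -> np.ndarray:
            return np.array(
                [simulator_state["remaining"], simulator_state["steps_left"]],
                dtype=np.float32,
            )


    class ToyActionInterpreter(ActionInterpreter):
        action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

        def interpret(self, simulator_state: dict, action) -> float:
            # Read the policy output as a fraction of the remaining quantity.
            return float(action) * simulator_state["remaining"]


    class ToyReward(Reward):
        def reward(self, simulator_state: dict) -> float:
            # Penalize any quantity left unexecuted when time runs out.
            return -simulator_state["remaining"] if simulator_state["steps_left"] <= 0 else 0.0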


Example
============
QlibRL provides a set of APIs for developers to further simplify their development. For example, if developers have already defined their simulator / interpreters / reward function / policy, they could launch the training pipeline by simply running:
``Qlib`` provides an example based on order execution: specifically, `qlib.rl.order_execution.simulator_qlib.SingleAssetOrderExecution <https://github.com/microsoft/qlib/blob/main/qlib/rl/order_execution/simulator_qlib.py>`_ and `qlib.rl.order_execution.simulator_simple.SingleAssetOrderExecutionSimple <https://github.com/microsoft/qlib/blob/main/qlib/rl/order_execution/simulator_simple.py>`_ as example simulators, `StateInterpreter <https://github.com/microsoft/qlib/blob/main/qlib/rl/order_execution/interpreter.py>`_ and `ActionInterpreter <https://github.com/microsoft/qlib/blob/main/qlib/rl/order_execution/interpreter.py>`_ as example interpreters, and `qlib.rl.order_execution.reward.PAPenaltyReward <https://github.com/microsoft/qlib/blob/main/qlib/rl/order_execution/reward.py>`_ as an example reward.

If developers have already defined their simulator / interpreters / reward function / policy, they could launch the training pipeline by simply running:

.. code-block:: python
train(
@@ -79,7 +99,7 @@ QlibRL provides a set of APIs for developers to further simplify their developme
policy=policy,
reward=PAPenaltyReward(),
vessel_kwargs={
"episode_per_iter": 100,
"episode_per_iter": 100, 6
Review comment (Collaborator): What does the 6 mean here?

"update_kwargs": {
"batch_size": 64,
"repeat": 5,
@@ -91,4 +111,4 @@ QlibRL provides a set of APIs for developers to further simplify their developme
},
)

We demonstrate an example of an implementation of a single asset order execution task based on QlibRL, the details about the example can be found `here <../../examples/rl/README.md>`_.
We demonstrate an example implementation of a single-asset order execution task based on QlibRL; the details about the example can be found `here <../../examples/rl/README.md>`_. RL-based portfolio construction learning will be released in the future.
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -33,7 +33,7 @@ Document Structure

.. toctree::
:maxdepth: 3
:caption: COMPONENTS:
:caption: MAIN COMPONENTS:

Workflow: Workflow Management <component/workflow.rst>
Data Layer: Data Framework & Usage <component/data.rst>
@@ -48,7 +48,7 @@ Document Structure

.. toctree::
:maxdepth: 3
:caption: ADVANCED TOPICS:
:caption: OTHER COMPONENTS/FEATURES/TOPICS:

Building Formulaic Alphas <advanced/alpha.rst>
Online & Offline mode <advanced/server.rst>
13 changes: 13 additions & 0 deletions docs/reference/api.rst
@@ -256,3 +256,16 @@ Serializable

.. automodule:: qlib.utils.serial.Serializable
:members:

RL
==============
``Qlib`` provides a series of base classes for Interpreter, Simulator and Reward.

.. autoclass:: qlib.rl.interpreter.Interpreter
:members:

.. autoclass:: qlib.rl.simulator.Simulator
:members:

.. autoclass:: qlib.rl.reward.Reward
:members:
11 changes: 4 additions & 7 deletions setup.py
@@ -1,6 +1,5 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
import io
import os
import numpy

@@ -26,9 +25,6 @@ def get_version(rel_path: str) -> str:
DESCRIPTION = "A Quantitative-research Platform"
REQUIRES_PYTHON = ">=3.5.0"

from pathlib import Path
from shutil import copyfile

VERSION = get_version("qlib/__init__.py")

# Detect Cython
@@ -148,15 +144,16 @@ def get_version(rel_path: str) -> str:
# References: https://github.com/python/typeshed/issues/8799
"mypy<0.981",
"flake8",
# The 5.0.0 version of importlib-metadata removed the deprecated endpoint,
# which prevented flake8 from working properly, so we restricted the version of importlib-metadata.
# To help ensure the dependencies of flake8 https://github.com/python/importlib_metadata/issues/406
"importlib-metadata<5.0.0",
"readthedocs_sphinx_ext",
"cmake",
"lxml",
"baostock",
"yahooquery",
"beautifulsoup4",
# The 5.0.0 version of importlib-metadata removed the deprecated endpoint,
# which prevented flake8 from working properly, so we restricted the version of importlib-metadata.
"importlib-metadata<5.0.0",
"tianshou",
"gym>=0.24", # If you do not put gym at the end, gym will degrade causing pytest results to fail.
],