feat: Kaggle loop update (Feature & Model) #241

Merged 70 commits on Sep 11, 2024
49541c6
Init todo
you-n-g Jul 17, 2024
aa4c7e5
Evaluation & dataset
taozhiwang Jul 23, 2024
c51a6f0
Generate new data
taozhiwang Jul 23, 2024
90bd7e3
dataset generation
taozhiwang Jul 24, 2024
864f5a0
add the result
taozhiwang Jul 24, 2024
f9b57b9
Analysis
taozhiwang Jul 24, 2024
db82b67
Factor update
taozhiwang Jul 24, 2024
52dc938
Updates
taozhiwang Jul 25, 2024
702c830
Reformat analysis.py
taozhiwang Jul 25, 2024
ac80c93
CI fix
taozhiwang Jul 25, 2024
9357628
Merge branch 'main' into benchmark
you-n-g Jul 25, 2024
3088525
Merge pull request #112 from microsoft/benchmark
taozhiwang Jul 25, 2024
ab552ff
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Jul 26, 2024
fc01635
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Jul 29, 2024
681af16
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Jul 30, 2024
a4643de
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Jul 31, 2024
cc48faa
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Aug 6, 2024
3ff2406
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Aug 26, 2024
96a5e9e
Merge branch 'main' of https://github.com/microsoft/RD-Agent
xisen-w Aug 28, 2024
fa25aaf
Revised Preprocessing & Supported Random Forest
xisen-w Aug 28, 2024
5537ff0
Revised to support three models with feature
xisen-w Aug 30, 2024
818bf0b
Further revised prompts
xisen-w Aug 30, 2024
d3f91cb
Slight Revision
xisen-w Aug 30, 2024
79bdf4c
docs: update contributors (#230)
Hytn Aug 28, 2024
94a22cb
Revised to support three models with feature
xisen-w Aug 30, 2024
e8294a6
Further revised prompts
xisen-w Aug 30, 2024
a8e8dd9
Slight Revision
xisen-w Aug 30, 2024
f218b93
Merge branch 'model-loop-update' of https://github.com/microsoft/RD-A…
xisen-w Aug 30, 2024
ce8eeed
feat: kaggle model and feature (#238)
peteryang1 Sep 2, 2024
a8b2df9
feat: continue kaggle feature and model coder (#239)
peteryang1 Sep 2, 2024
c718143
finish the first round of runner (#240)
peteryang1 Sep 3, 2024
b2b7572
Optimized the factor scenario and added the front-end.
WinstonLiyt Sep 3, 2024
fed9a69
fix a small bug
WinstonLiyt Sep 4, 2024
05db6f1
fix a typo
WinstonLiyt Sep 4, 2024
88047f1
update the kaggle scenario
WinstonLiyt Sep 4, 2024
33b7b69
delete model_template folder
peteryang1 Sep 4, 2024
913ce10
use experiment to run data preprocess script
peteryang1 Sep 4, 2024
0135d21
add source data to scenarios
peteryang1 Sep 4, 2024
0de82d2
minor fix
peteryang1 Sep 4, 2024
ecbef88
minor bug fix
peteryang1 Sep 4, 2024
e8981ef
train.py debug
taozhiwang Sep 8, 2024
4da3957
fixed a bug in train.py and added some TODOs
WinstonLiyt Sep 8, 2024
b902a9e
For Debugging
xisen-w Sep 9, 2024
e6f95f5
fix two small bugs in based_exp
WinstonLiyt Sep 9, 2024
0ac70c0
fix some bugs
WinstonLiyt Sep 9, 2024
a6b603a
update preprocess
WinstonLiyt Sep 9, 2024
b4dd339
fix a bug in preprocess
WinstonLiyt Sep 9, 2024
fcd0f20
fix a bug in train.py
WinstonLiyt Sep 9, 2024
1e90441
reformat
WinstonLiyt Sep 9, 2024
121f5e0
Follow-up
xisen-w Sep 9, 2024
f915bc0
Merge branch 'model-loop-model-debug' into model-loop-update
xisen-w Sep 9, 2024
efb18a5
fix a bug in train.py
WinstonLiyt Sep 9, 2024
91ca2e9
fix a bug in workspace
WinstonLiyt Sep 9, 2024
966da07
fix a bug in feature duplication
WinstonLiyt Sep 9, 2024
d084e75
fix a bug in feedback
WinstonLiyt Sep 9, 2024
993b39e
fix a bug in preprocessed data
WinstonLiyt Sep 10, 2024
84e2447
Merge branch 'model-loop-update' of https://github.com/microsoft/RD-A…
xisen-w Sep 10, 2024
7effdf0
fix a bug om feature engineering
WinstonLiyt Sep 10, 2024
25a64c6
Merge branch 'main' into model-loop-update
WinstonLiyt Sep 10, 2024
275e526
fix a ci error
WinstonLiyt Sep 10, 2024
d1fa409
Debugged & Connected
xisen-w Sep 10, 2024
2aab52d
Merge branch 'model-loop-update' of https://github.com/microsoft/RD-A…
xisen-w Sep 10, 2024
c2dfbb8
Fixed error on feedback & added other fixes
xisen-w Sep 11, 2024
ffb2ff5
fix CI errors
WinstonLiyt Sep 11, 2024
3695a7b
fix a CI bug
WinstonLiyt Sep 11, 2024
1de9557
fix: fix_dotenv_error (#257)
SunsetWolf Sep 10, 2024
09812dc
chore(main): release 0.2.1 (#249)
you-n-g Sep 10, 2024
e0bc856
init a scenario for kaggle feature engineering
WinstonLiyt Aug 26, 2024
097d9f3
delete error codes
WinstonLiyt Sep 11, 2024
4ab8b96
Delete rdagent/app/kaggle_feature/conf.py
WinstonLiyt Sep 11, 2024
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,14 @@
# Changelog

## [0.2.1](https://github.com/microsoft/RD-Agent/compare/v0.2.0...v0.2.1) (2024-09-10)


### Bug Fixes

* default model value in config ([#256](https://github.com/microsoft/RD-Agent/issues/256)) ([c097585](https://github.com/microsoft/RD-Agent/commit/c097585f631f401c2c0966f6ad4c17286924f011))
* fix_dotenv_error ([#257](https://github.com/microsoft/RD-Agent/issues/257)) ([923063c](https://github.com/microsoft/RD-Agent/commit/923063c1fd957c4ed42e97272c72b5e9545451dc))
* readme ([#248](https://github.com/microsoft/RD-Agent/issues/248)) ([8cede22](https://github.com/microsoft/RD-Agent/commit/8cede2209922876490148459e1134da828e1fda0))

## [0.2.0](https://github.com/microsoft/RD-Agent/compare/v0.1.0...v0.2.0) (2024-09-07)


9 changes: 6 additions & 3 deletions rdagent/app/cli.py
@@ -5,11 +5,16 @@
- make rdagent a nice entry and
- automatically load dotenv
"""
from dotenv import load_dotenv

load_dotenv(".env")
# 1) Make sure it is at the beginning of the script so that it will load dotenv before initializing BaseSettings.
# 2) The ".env" argument is necessary to make sure it loads `.env` from the current directory.

import subprocess
from importlib.resources import path as rpath

import fire
from dotenv import load_dotenv

from rdagent.app.data_mining.model import main as med_model
from rdagent.app.general_model.general_model import (
@@ -20,8 +25,6 @@
from rdagent.app.qlib_rd_loop.model import main as fin_model
from rdagent.app.utils.info import collect_info

load_dotenv()


def ui(port=80, log_dir="", debug=False):
"""
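The ordering comment in `cli.py` above can be illustrated with a minimal, standard-library-only sketch. `Settings` is a hypothetical stand-in for a `BaseSettings`-style class that reads the environment once at construction, `load_env` is a hypothetical stand-in for `load_dotenv(".env")`, and `DEMO_CHAT_MODEL` is an invented variable name; none of these are part of RD-Agent or python-dotenv.

```python
import os


class Settings:
    """Hypothetical stand-in for a BaseSettings-style class.

    It reads the environment once, when the object is constructed.
    """

    def __init__(self) -> None:
        self.chat_model = os.environ.get("DEMO_CHAT_MODEL", "default-model")


def load_env(values: dict) -> None:
    """Hypothetical stand-in for load_dotenv(".env").

    Copies values into os.environ without overriding existing entries.
    """
    for key, value in values.items():
        os.environ.setdefault(key, value)


# Start from a clean state so the sketch is deterministic.
os.environ.pop("DEMO_CHAT_MODEL", None)

# Settings built before the environment is loaded only sees the default.
before_load = Settings()

load_env({"DEMO_CHAT_MODEL": "from-dotenv"})

# Settings built after loading picks up the value from the "file".
after_load = Settings()

print(before_load.chat_model, after_load.chat_model)
```

Under these assumptions, any settings object created at import time by a module imported before the load call would keep the default, which is presumably why the call was moved above the other imports.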
7 changes: 1 addition & 6 deletions rdagent/app/general_model/general_model.py
@@ -1,9 +1,3 @@
from dotenv import load_dotenv

from rdagent.scenarios.general_model.scenario import GeneralModelScenario

load_dotenv(override=True)

import fire

from rdagent.components.coder.model_coder.task_loader import (
@@ -13,6 +7,7 @@
extract_first_page_screenshot_from_pdf,
)
from rdagent.log import rdagent_logger as logger
from rdagent.scenarios.general_model.scenario import GeneralModelScenario
from rdagent.scenarios.qlib.developer.model_coder import QlibModelCoSTEER


24 changes: 14 additions & 10 deletions rdagent/app/kaggle/conf.py
@@ -13,29 +13,33 @@ class Config:
"""Add 'model_' to the protected namespaces"""

# 1) overriding the default
scen: str = "rdagent.scenarios.kaggle.experiment.model_experiment.KGModelScenario"
scen: str = "rdagent.scenarios.kaggle.experiment.scenario.KGScenario"
"""Scenario class for data mining model"""

hypothesis_gen: str = "rdagent.scenarios.kaggle.proposal.model_proposal.KGModelHypothesisGen"
hypothesis_gen: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesisGen"
"""Hypothesis generation class"""

hypothesis2experiment: str = "rdagent.scenarios.kaggle.proposal.model_proposal.KGModelHypothesis2Experiment"
hypothesis2experiment: str = "rdagent.scenarios.kaggle.proposal.proposal.KGHypothesis2Experiment"
"""Hypothesis to experiment class"""

coder: str = "rdagent.scenarios.kaggle.developer.model_coder.KGModelCoSTEER"
"""Coder class"""
feature_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGFactorCoSTEER"
"""Feature Coder class"""

runner: str = "rdagent.scenarios.kaggle.developer.model_runner.KGModelRunner"
"""Runner class"""
model_coder: str = "rdagent.scenarios.kaggle.developer.coder.KGModelCoSTEER"
"""Model Coder class"""

summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGModelHypothesisExperiment2Feedback"
feature_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGFactorRunner"
"""Feature Runner class"""

model_runner: str = "rdagent.scenarios.kaggle.developer.runner.KGModelRunner"
"""Model Runner class"""

summarizer: str = "rdagent.scenarios.kaggle.developer.feedback.KGHypothesisExperiment2Feedback"
"""Summarizer class"""

evolving_n: int = 10
"""Number of evolutions"""

evolving_n: int = 10

competition: str = ""


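The settings in `conf.py` above are dotted class paths stored as strings, which the loop later resolves with `import_class`. A minimal sketch of how such a path could be resolved is shown below; `import_class_sketch` is a hypothetical helper written for illustration, not RD-Agent's actual `import_class`, and `collections.OrderedDict` is used only because it is always importable.

```python
import importlib


def import_class_sketch(class_path: str) -> type:
    """Resolve a dotted path such as 'package.module.ClassName' to the class.

    Hypothetical helper for illustration; the real import_class may differ.
    """
    # Split "package.module.ClassName" into the module path and the class name.
    module_path, class_name = class_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# A standard-library class stands in for a scenario or coder class here.
ordered_dict_cls = import_class_sketch("collections.OrderedDict")
instance = ordered_dict_cls(a=1)
print(ordered_dict_cls.__name__, instance["a"])
```

Keeping the class names as strings means a different scenario, coder, or runner can be swapped in through configuration without touching the loop code.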
98 changes: 98 additions & 0 deletions rdagent/app/kaggle/loop.py
@@ -0,0 +1,98 @@
from collections import defaultdict
from typing import Any

import fire

from rdagent.app.kaggle.conf import PROP_SETTING
from rdagent.components.workflow.conf import BasePropSetting
from rdagent.components.workflow.rd_loop import RDLoop
from rdagent.core.developer import Developer
from rdagent.core.exception import ModelEmptyError
from rdagent.core.proposal import (
Hypothesis2Experiment,
HypothesisExperiment2Feedback,
HypothesisGen,
Trace,
)
from rdagent.core.scenario import Scenario
from rdagent.core.utils import import_class
from rdagent.log import rdagent_logger as logger
from rdagent.scenarios.kaggle.proposal.proposal import (
KG_ACTION_FEATURE_ENGINEERING,
KG_ACTION_FEATURE_PROCESSING,
)


class ModelRDLoop(RDLoop):
def __init__(self, PROP_SETTING: BasePropSetting):
with logger.tag("init"):
scen: Scenario = import_class(PROP_SETTING.scen)(PROP_SETTING.competition)
logger.log_object(scen, tag="scenario")

self.hypothesis_gen: HypothesisGen = import_class(PROP_SETTING.hypothesis_gen)(scen)
logger.log_object(self.hypothesis_gen, tag="hypothesis generator")

self.hypothesis2experiment: Hypothesis2Experiment = import_class(PROP_SETTING.hypothesis2experiment)()
logger.log_object(self.hypothesis2experiment, tag="hypothesis2experiment")

self.feature_coder: Developer = import_class(PROP_SETTING.feature_coder)(scen)
logger.log_object(self.feature_coder, tag="feature coder")
self.model_coder: Developer = import_class(PROP_SETTING.model_coder)(scen)
logger.log_object(self.model_coder, tag="model coder")

self.feature_runner: Developer = import_class(PROP_SETTING.feature_runner)(scen)
logger.log_object(self.feature_runner, tag="feature runner")
self.model_runner: Developer = import_class(PROP_SETTING.model_runner)(scen)
logger.log_object(self.model_runner, tag="model runner")

self.summarizer: HypothesisExperiment2Feedback = import_class(PROP_SETTING.summarizer)(scen)
logger.log_object(self.summarizer, tag="summarizer")
self.trace = Trace(scen=scen)
super(RDLoop, self).__init__()

def coding(self, prev_out: dict[str, Any]):
with logger.tag("d"): # develop
if prev_out["propose"].action in [KG_ACTION_FEATURE_ENGINEERING, KG_ACTION_FEATURE_PROCESSING]:
exp = self.feature_coder.develop(prev_out["exp_gen"])
else:
exp = self.model_coder.develop(prev_out["exp_gen"])
logger.log_object(exp.sub_workspace_list, tag="coder result")
return exp

def running(self, prev_out: dict[str, Any]):
with logger.tag("ef"): # evaluate and feedback
if prev_out["propose"].action in [KG_ACTION_FEATURE_ENGINEERING, KG_ACTION_FEATURE_PROCESSING]:
exp = self.feature_runner.develop(prev_out["coding"])
else:
exp = self.model_runner.develop(prev_out["coding"])
logger.log_object(exp, tag="runner result")
return exp

skip_loop_error = (ModelEmptyError,)


def main(path=None, step_n=None, competition=None):
"""
Auto R&D Evolving loop for models in a kaggle{} scenario.

You can continue running a session with

.. code-block:: python

dotenv run -- python rdagent/app/kaggle/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose --step_n 1 # `step_n` is an optional parameter

"""
if competition:
PROP_SETTING.competition = competition
if path is None:
model_loop = ModelRDLoop(PROP_SETTING)
else:
model_loop = ModelRDLoop.load(path)
model_loop.run(step_n=step_n)


if __name__ == "__main__":
from dotenv import load_dotenv

load_dotenv(override=True)
fire.Fire(main)
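The `coding` and `running` steps in `loop.py` above both choose between a feature developer and a model developer based on the proposed action. A minimal sketch of that dispatch follows. The action labels in `FEATURE_ACTIONS`, the `Hypothesis` dataclass, and `EchoDeveloper` are all hypothetical stand-ins; the real constants are `KG_ACTION_FEATURE_ENGINEERING` and `KG_ACTION_FEATURE_PROCESSING`, whose values are not shown in this diff.

```python
from dataclasses import dataclass

# Hypothetical action labels; the real values are defined in
# rdagent.scenarios.kaggle.proposal.proposal and are not shown in this diff.
FEATURE_ACTIONS = {"feature engineering", "feature processing"}


@dataclass
class Hypothesis:
    """Hypothetical stand-in for the proposal object carrying an action."""

    action: str


class EchoDeveloper:
    """Hypothetical stand-in for a Developer: tags the experiment with its name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def develop(self, exp: str) -> str:
        return f"{self.name}:{exp}"


def pick_developer(hypothesis: Hypothesis, feature_dev: EchoDeveloper, model_dev: EchoDeveloper) -> EchoDeveloper:
    """Route feature-related actions to the feature developer, everything else to the model developer."""
    return feature_dev if hypothesis.action in FEATURE_ACTIONS else model_dev


feature_coder = EchoDeveloper("feature_coder")
model_coder = EchoDeveloper("model_coder")

feature_result = pick_developer(Hypothesis("feature engineering"), feature_coder, model_coder).develop("exp")
model_result = pick_developer(Hypothesis("model tuning"), feature_coder, model_coder).develop("exp")
print(feature_result, model_result)
```

Factoring the choice into one function, as in this sketch, would avoid repeating the same membership test in both steps; the diff itself keeps the check inline.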
File renamed without changes.
5 changes: 0 additions & 5 deletions rdagent/components/benchmark/conf.py
@@ -2,13 +2,8 @@
from pathlib import Path
from typing import Optional

from dotenv import load_dotenv
from pydantic_settings import BaseSettings

# Load environment variables
load_dotenv(verbose=True, override=True)


DIRNAME = Path("./")


28 changes: 17 additions & 11 deletions rdagent/components/coder/factor_coder/CoSTEER/evaluators.py
@@ -161,7 +161,7 @@ def evaluate(
)
buffer = io.StringIO()
gen_df.info(buf=buffer)
gen_df_info_str = buffer.getvalue()
gen_df_info_str = f"The user is currently working on a feature-related task.\nThe output dataframe info is:\n{buffer.getvalue()}"
system_prompt = (
Environment(undefined=StrictUndefined)
.from_string(
@@ -378,6 +378,7 @@ def evaluate(
self,
implementation: Workspace,
gt_implementation: Workspace,
version: int = 1, # 1 for qlib factors and 2 for kaggle factors
**kwargs,
) -> Tuple:
conclusions = []
@@ -389,18 +390,21 @@ def evaluate(
equal_value_ratio_result = 0
high_correlation_result = False

# Check if both dataframe has only one columns
feedback_str, _ = FactorSingleColumnEvaluator(self.scen).evaluate(implementation, gt_implementation)
conclusions.append(feedback_str)
# Check if both dataframes have only one column. Muted for version 2, since a factor task might now generate more than one column.
if version == 1:
feedback_str, _ = FactorSingleColumnEvaluator(self.scen).evaluate(implementation, gt_implementation)
conclusions.append(feedback_str)

# Check if the index of the dataframe is ("datetime", "instrument")
feedback_str, _ = FactorOutputFormatEvaluator(self.scen).evaluate(implementation, gt_implementation)
conclusions.append(feedback_str)

feedback_str, daily_check_result = FactorDatetimeDailyEvaluator(self.scen).evaluate(
implementation, gt_implementation
)
conclusions.append(feedback_str)
if version == 1:
feedback_str, daily_check_result = FactorDatetimeDailyEvaluator(self.scen).evaluate(
implementation, gt_implementation
)
conclusions.append(feedback_str)
else:
daily_check_result = None

# Check if both dataframe have the same rows count
if gt_implementation is not None:
@@ -627,7 +631,9 @@ def evaluate(
(
factor_feedback.factor_value_feedback,
decision_from_value_check,
) = self.value_evaluator.evaluate(implementation=implementation, gt_implementation=gt_implementation)
) = self.value_evaluator.evaluate(
implementation=implementation, gt_implementation=gt_implementation, version=target_task.version
)

factor_feedback.final_decision_based_on_gt = gt_implementation is not None

@@ -647,7 +653,7 @@ def evaluate(
target_task=target_task,
implementation=implementation,
execution_feedback=factor_feedback.execution_feedback,
value_feedback=factor_feedback.factor_value_feedback,
factor_value_feedback=factor_feedback.factor_value_feedback,
gt_implementation=gt_implementation,
)
(
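The value evaluator change above gates two of its checks on a new `version` argument (1 for qlib factors, 2 for kaggle factors). The sketch below shows only which checks would run for each version; the string labels and `select_checks` are hypothetical names standing in for the evaluator classes, not part of the codebase.

```python
def select_checks(version: int = 1) -> list[str]:
    """Return the names of the checks that run for a given task version.

    Labels are hypothetical stand-ins for the evaluator classes in the diff.
    """
    checks = []
    if version == 1:
        # Single-column check: skipped for version 2 because a feature task
        # may produce more than one column.
        checks.append("single_column")
    # The output-format check runs for every version.
    checks.append("output_format")
    if version == 1:
        # Daily-datetime check: only applied to version 1 factors.
        checks.append("datetime_daily")
    return checks


qlib_checks = select_checks(version=1)
kaggle_checks = select_checks(version=2)
print(qlib_checks, kaggle_checks)
```

Note that in the diff, `daily_check_result` is set to `None` when the daily check is skipped, so downstream code has to tolerate a missing result for version 2.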
41 changes: 30 additions & 11 deletions rdagent/components/coder/factor_coder/factor.py
@@ -24,16 +24,19 @@ def __init__(
factor_name,
factor_description,
factor_formulation,
*args,
variables: dict = {},
resource: str = None,
factor_implementation: bool = False,
**kwargs,
) -> None:
self.factor_name = factor_name
self.factor_description = factor_description
self.factor_formulation = factor_formulation
self.variables = variables
self.factor_resources = resource
self.factor_implementation = factor_implementation
super().__init__(*args, **kwargs)

def get_task_information(self):
return f"""factor_name: {self.factor_name}
@@ -75,8 +78,8 @@ class FactorFBWorkspace(FBWorkspace):
def __init__(
self,
*args,
executed_factor_value_dataframe=None,
raise_exception=False,
executed_factor_value_dataframe: pd.DataFrame = None,
raise_exception: bool = False,
**kwargs,
) -> None:
super().__init__(*args, **kwargs)
@@ -102,7 +105,10 @@ def execute(self, store_result: bool = False, data_type: str = "Debug") -> Tuple
1. make the directory in workspace path
2. write the code to the file in the workspace path
3. link all the source data to the workspace path folder
4. execute the code
if call_factor_py is True:
4. execute the code
else:
4. generate a script from the template that imports factor.py and dumps the factor value to result.h5
5. read the factor value from the output file in the workspace path folder
returns the execution feedback as a string and the factor value as a pandas dataframe

@@ -130,15 +136,21 @@ def execute(self, store_result: bool = False, data_type: str = "Debug") -> Tuple
if self.executed_factor_value_dataframe is not None:
return self.FB_FROM_CACHE, self.executed_factor_value_dataframe

source_data_path = (
Path(
FACTOR_IMPLEMENT_SETTINGS.data_folder_debug,
if self.target_task.version == 1:
source_data_path = (
Path(
FACTOR_IMPLEMENT_SETTINGS.data_folder_debug,
)
if data_type == "Debug"
else Path(
FACTOR_IMPLEMENT_SETTINGS.data_folder,
)
)
if data_type == "Debug"
else Path(
elif self.target_task.version == 2:
# TODO you can change the name of the data folder for a better understanding
source_data_path = Path(
FACTOR_IMPLEMENT_SETTINGS.data_folder,
)
)

source_data_path.mkdir(exist_ok=True, parents=True)
code_path = self.workspace_path / f"factor.py"
@@ -147,9 +159,16 @@ def execute(self, store_result: bool = False, data_type: str = "Debug") -> Tuple

execution_feedback = self.FB_EXECUTION_SUCCEEDED
execution_success = False

if self.target_task.version == 1:
execution_code_path = code_path
elif self.target_task.version == 2:
execution_code_path = self.workspace_path / f"{uuid.uuid4()}.py"
execution_code_path.write_text((Path(__file__).parent / "factor_execution_template.txt").read_text())

try:
subprocess.check_output(
f"{FACTOR_IMPLEMENT_SETTINGS.python_bin} {code_path}",
f"{FACTOR_IMPLEMENT_SETTINGS.python_bin} {execution_code_path}",
shell=True,
cwd=self.workspace_path,
stderr=subprocess.STDOUT,
@@ -161,7 +180,7 @@ def execute(self, store_result: bool = False, data_type: str = "Debug") -> Tuple

execution_feedback = (
e.output.decode()
.replace(str(code_path.parent.absolute()), r"/path/to")
.replace(str(execution_code_path.parent.absolute()), r"/path/to")
.replace(str(site.getsitepackages()[0]), r"/path/to/site-packages")
)
if len(execution_feedback) > 2000:
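For version 2 tasks, the `execute` change above no longer runs `factor.py` directly. It writes a uniquely named runner script from a template into the workspace and executes that instead. The sketch below reproduces the pattern with the standard library only. It is a simplification under stated assumptions: `RUNNER_TEMPLATE` and `FACTOR_CODE` are invented stand-ins, the data is an inline list rather than `valid.pkl`, the result is written as JSON rather than `result.h5`, and `sys.executable` replaces the configured `python_bin`.

```python
import json
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path

# Hypothetical runner template: imports feat_eng from factor.py and dumps its output.
RUNNER_TEMPLATE = """\
import json
from factor import feat_eng

rows = [{"x": 1}, {"x": 2}]
with open("result.json", "w") as f:
    json.dump(feat_eng(rows), f)
"""

# Hypothetical generated factor code exposing the expected feat_eng entry point.
FACTOR_CODE = """\
def feat_eng(rows):
    return [{"x_squared": r["x"] ** 2} for r in rows]
"""


def run_factor(workspace: Path) -> list:
    """Write factor.py and a uuid-named runner into the workspace, run it, and read the result."""
    (workspace / "factor.py").write_text(FACTOR_CODE)
    runner = workspace / f"{uuid.uuid4()}.py"
    runner.write_text(RUNNER_TEMPLATE)
    # The script's own directory is first on sys.path, so `from factor import ...` resolves.
    subprocess.check_output(
        [sys.executable, str(runner)],
        cwd=workspace,
        stderr=subprocess.STDOUT,
    )
    return json.loads((workspace / "result.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    result = run_factor(Path(tmp))

print(result)
```

This sketch passes the command as an argument list rather than a `shell=True` string, which sidesteps quoting problems when paths contain spaces; the diff itself keeps the shell-string form.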
@@ -0,0 +1,13 @@
import os

import numpy as np
import pandas as pd
from factor import feat_eng

if os.path.exists("valid.pkl"):
valid_df = pd.read_pickle("valid.pkl")
else:
raise FileNotFoundError("No valid data found.")

new_feat = feat_eng(valid_df)
new_feat.to_hdf("result.h5", key="data", mode="w")