332 executable smartnoise #341

Merged
merged 19 commits into from
Mar 11, 2024
Conversation

@ghost ghost commented Mar 5, 2024

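Originally, to bypass smartnoise's data preprocessing, we used the IdentityTransformer it provides, but this turned out to cause other problems. We now accept smartnoise's own preprocessing instead: for categorical data its preprocessing only applies a LabelEncoder conversion, which needs no epsilon. To fix this bug, the following changes were made:

Changed the outputs of both encoder and discretizing to pd.Categorical. 5f93456 987447e

Accepted smartnoise's preprocessing. 8994fc6

Added and adjusted demo files. 1801d9b af94b5e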

@ghost ghost added the bug Something isn't working label Mar 5, 2024
@ghost ghost added this to the 20240314, User Story beta testing milestone Mar 5, 2024
@ghost ghost requested a review from matheme-justyn March 5, 2024 08:19
@ghost ghost self-assigned this Mar 5, 2024
@ghost ghost linked an issue Mar 5, 2024 that may be closed by this pull request
@ghost
Author

ghost commented Mar 5, 2024

Updated the corresponding README to remove the mwem method. 2be8c03

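@ghost
Author

ghost commented Mar 7, 2024

The smartnoise GAN-family synthesizers are now usable (dpctgan, pategan)

Integrated the smartnoise GAN methods into the package. cb0ce89

Adjusted output types to match the integration above. 279524d

Added an internal example document for using smartnoise within the package. b426ed3

Updated the README accordingly. b79a33a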

@ghost ghost added the enhancement New feature or request label Mar 7, 2024
@matheme-justyn
Contributor

Due to the changes in #350, renamed the documentation file to 2024-01-11-Synthesizer.md - 08eb8da

@matheme-justyn matheme-justyn left a comment

On branch 332-executable-smartnoise, I used executable_smartnoise.ipynb to run issue332() on adult-income [1.] and got almost the same results as in #332:

  • smartnoise-aim: ValueError: Synthesizer aim not found
  • smartnoise-mwem: MemoryError: Unable to allocate 1.87 TiB for an array with shape (256842399744,) and data type int64
  • smartnoise-mst: ModuleNotFoundError: No module named 'disjoint_set'
  • smartnoise-pacsynth: ValueError: Input contains NaN.
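
The size in the mwem MemoryError can be checked by hand: an int64 element takes 8 bytes, so the reported shape alone determines the allocation. The short sketch below only reproduces that arithmetic; `required_tib` is an illustrative helper, not part of the package or of smartnoise.

```python
# Reproduce the size reported in the mwem MemoryError above.
ELEMENTS = 256_842_399_744   # array shape taken from the error message
BYTES_PER_INT64 = 8          # an int64 element occupies 8 bytes


def required_tib(n_elements: int, itemsize: int = BYTES_PER_INT64) -> float:
    """Return the allocation size in TiB (1 TiB = 2**40 bytes)."""
    return n_elements * itemsize / 2**40


print(f"{required_tib(ELEMENTS):.2f} TiB")  # prints "1.87 TiB"
```

An allocation of this size cannot be met on a typical workstation, so no package upgrade on our side would change the outcome for this dataset.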

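I could not reproduce your fix. I suggest:

  1. First, in the environment where you can run it successfully, list the versions of the relevant packages so we can check whether they match requirement.txt
  2. If they do not match, e.g. 'disjoint_set', can the package be added to pyproject.toml?
  3. If there are dependency conflicts, try manually forcing the required upgrades in pyproject.toml and exporting requirement.txt; then install from requirement.txt in a brand-new environment and rerun these smartnoise checks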
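
Step 1 can be partly automated. The following is a minimal sketch, assuming the requirements file uses plain `name==version` pins; `parse_pins` and `find_mismatches` are hypothetical helpers written for this comment, and the sample pins are the Faker and torch versions that appear later in this thread.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Collect `name==version` pins; skip comments, blanks, and URL lines."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line or "://" in line:
            continue
        name, version = line.split("==", 1)
        # Drop any environment marker that follows the version.
        pins[name.strip().lower()] = version.split(";", 1)[0].strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map package -> (pinned, installed) for every pin that does not match."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    sample = "Faker==15.3.4\ntorch==2.2.1\n"
    for name, (pinned, installed) in find_mismatches(parse_pins(sample)).items():
        print(f"{name}: pinned {pinned}, installed {installed or 'not installed'}")
```

Running it in both the working and the failing environment gives two lists that can be compared directly.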

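I also see that you have issue332_gan(). My understanding is that setting scaler_inhibit = True should resolve the problem of a Scaler being configured when it should not be, but I get the same errors whether it is on or off: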

  • smartnoise-dpctgan: RuntimeError: all elements of input should be between 0 and 1
  • smartnoise-patectgan: RuntimeError: all elements of input should be between 0 and 1

  1. If you pull dev into this branch, you need to switch to benchmark://adult-income; the file is then named adult-income.csv, but it is the same adult dataset we used before

docs/_posts/2024-01-11-Synthesizer.md (review thread resolved)
@matheme-justyn
Contributor

I first ran poetry lock to try to bump versions. These are the packages that were upgraded:

  • boto3: 1.34.42 -> 1.34.58
  • botocore: 1.34.42 -> 1.34.58
  • importlib-metadata: 7.0.1 -> 7.0.2
  • importlib-resources: 6.1.1 -> 6.1.3
  • nvidia-nvjitlink-cu12: 12.3.101 -> 12.4.99
  • pyparsing: 3.1.1 -> 3.1.2
  • pytest: 8.0.0 -> 8.0.2
  • sqlalchemy: 1.4.51 -> 1.4.52

None of these looks directly related to this issue, but let's work through it step by step. I committed poetry.lock in 2e8ddd8

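@ghost
Author

ghost commented Mar 8, 2024

1 and 3: I had updated packages and installed the extra packages that the smartnoise docs say to install.
2: This is not our problem and cannot be fixed on our side, but the method has already been removed from the docs.
4: I have not run into this one before; a version update might resolve it.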

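@ghost
Author

ghost commented Mar 8, 2024

I just rechecked the fourth issue. According to the official smartnoise documentation: "... To achieve this dimensional fidelity, the pac-synth synthesizer will sometimes generate rows with missing values." The synthetic data therefore contains NA values by design, which makes inverse_transform fail. This pacsynth behavior does not appear to be controllable, so I suggest either checking whether I can make some adjustments on the Processor side, or simply not using this method.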
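
To make the failure mode concrete, here is a small pure-Python sketch. `TinyLabelEncoder` and `drop_incomplete_rows` are hypothetical stand-ins, not the package's Processor or any smartnoise API; they only illustrate why a NaN code cannot be mapped back to a category and what one possible Processor-side adjustment (dropping incomplete rows) would look like.

```python
import math


def _is_nan(value) -> bool:
    """True only for float NaN; integer codes are never NaN."""
    return isinstance(value, float) and math.isnan(value)


class TinyLabelEncoder:
    """Hypothetical label encoder: integer code <-> category."""

    def __init__(self, categories):
        self.categories = list(categories)

    def inverse_transform(self, codes):
        decoded = []
        for code in codes:
            if _is_nan(code):
                # A missing value has no category to map back to.
                raise ValueError("Input contains NaN.")
            decoded.append(self.categories[int(code)])
        return decoded


def drop_incomplete_rows(rows):
    """Keep only the rows in which no cell is NaN."""
    return [row for row in rows if not any(_is_nan(cell) for cell in row)]


synthetic = [(0, 1), (float("nan"), 0), (1, 1)]  # second row has a missing value
clean = drop_incomplete_rows(synthetic)
encoder = TinyLabelEncoder(["Private", "Self-emp"])
print(encoder.inverse_transform([row[0] for row in clean]))
```

Dropping rows changes the number of synthetic records returned, so whether this trade-off is acceptable would still need to be decided.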

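@matheme-justyn
Contributor

matheme-justyn commented Mar 8, 2024

Updated the instructions in pyproject.toml for setting up the environment with Poetry - 6817af1

Because Poetry's dependency resolution is too strict (see poetry/issues/697),
we will lean towards recommending a deployment method built on what pip install can install, as the greatest common denominator.
Installation via poetry or conda may later be moved elsewhere for reference (e.g. the About section of the manual).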

@ghost
Author

ghost commented Mar 8, 2024

conda create -n re python=3.10
conda activate re
pip install poetry
poetry install
pip install ipykernel
pip install pyyaml
pip install boto3
pip install sdv
pip install smartnoise-synth # Error can be ignored
pip install anonymeter
pip install git+https://github.com/ryan112358/private-pgm.git
pip install --upgrade torch # Error can be ignored

@matheme-justyn
Contributor

matheme-justyn commented Mar 8, 2024

conda create -n re python=3.10
conda activate re
pip install poetry
poetry install
pip install ipykernel
pip install pyyaml
pip install boto3
pip install sdv
pip install smartnoise-synth # Error can be ignored
pip install anonymeter
pip install git+https://github.com/ryan112358/private-pgm.git
pip install --upgrade torch # Error can be ignored

We also need to add:
pip install requests

For the record, these are the errors you will run into:

> pip install smartnoise-synth # Error can be ignored
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
rdt 1.9.2 requires Faker<20,>=17, but you have faker 15.3.4 which is incompatible.

> pip install --upgrade torch # Error can be ignored
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
smartnoise-synth 1.0.3 requires torch<2.0.0, but you have torch 2.2.1 which is incompatible.

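Both of these problems are already known: smartnoise's Faker pin conflicts with rdt, and its torch requirement is too old. For now we force-install smartnoise-synth 1.0.3 with pip. Next I will run the tests, then regenerate pyproject.toml and requirement.txt and see how it goes.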
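
The two conflicts above can be restated as simple range checks. This is a deliberately simplified sketch: `satisfies` is a hypothetical helper that only understands dotted numeric versions and the operators shown, not the full PEP 440 rules; for real use, `packaging.specifiers.SpecifierSet` is the more appropriate tool.

```python
import operator

# Two-character operators are listed first so ">=" is not read as ">".
_OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def _key(version: str) -> tuple:
    """Turn '15.3.4' into (15, 3, 4) for tuple comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies(version: str, spec: str) -> bool:
    """Check a numeric version against a comma-separated spec like '>=17,<20'."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                if not compare(_key(version), _key(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# rdt 1.9.2 requires Faker>=17,<20, but faker 15.3.4 is installed.
print(satisfies("15.3.4", ">=17,<20"))  # False
# smartnoise-synth 1.0.3 requires torch<2.0.0, but torch 2.2.1 is installed.
print(satisfies("2.2.1", "<2.0.0"))     # False
```

Both checks fail, which matches the two pip messages: neither pin can be satisfied without overriding one side of the conflict.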

CC @mileschangmoda

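@matheme-justyn
Contributor

matheme-justyn commented Mar 8, 2024

I finished testing locally and made the following changes - b536c87

  • Tests
    • smartnoise-aim: passed
    • smartnoise-mst: passed
    • smartnoise-mwem: MemoryError, a known issue; will not be used going forward
    • smartnoise-pacsynth: NaN, a known issue; will not be used going forward
    • smartnoise-dpctgan: passed
    • smartnoise-patectgan: passed
  • Changes
    • requirement.txt: exported with pip freeze
    • requirement-dev.txt: deleted
    • poetry.lock: deleted
    • pyproject.toml: removed the Poetry-related sections

From now on requirement.txt has to be the source of truth; for the time being, versions will not be managed through pyproject.toml.
As the next step, I will test installing requirement.txt from scratch on SageMaker.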

@matheme-justyn
Contributor

SageMaker reported errors - a6d4b2a

> !pip install -r ../../requirements.txt
ERROR: Ignored the following versions that require a different python version: 0.1.3 Requires-Python >=3.6,<3.9; 0.1.3.dev0 Requires-Python >=3.6,<3.9; 0.1.3.dev1 Requires-Python >=3.6,<3.9; 0.1.4 Requires-Python >=3.6,<3.9; 0.1.4.dev0 Requires-Python >=3.6,<3.9; 0.2.0 Requires-Python >=3.6,<3.9; 0.2.0.dev0 Requires-Python >=3.6,<3.9; 0.2.1 Requires-Python >=3.6,<3.9; 0.2.1.dev0 Requires-Python >=3.6,<3.9; 0.2.2 Requires-Python >=3.6,<3.9; 0.2.2.dev0 Requires-Python >=3.6,<3.9; 0.2.2.dev1 Requires-Python >=3.5,<3.9; 0.2.2.dev2 Requires-Python >=3.6,<3.9; 0.2.2.dev3 Requires-Python >=3.6,<3.9; 0.3.0 Requires-Python >=3.6,<3.10; 0.3.0 Requires-Python >=3.6,<3.9; 0.3.0.dev0 Requires-Python >=3.5,<3.9; 0.3.0.dev0 Requires-Python >=3.6,<3.10; 0.3.0.dev1 Requires-Python >=3.6,<3.9; 0.3.0.post1 Requires-Python >=3.6,<3.10; 0.3.1 Requires-Python >=3.5,<3.8; 0.3.1 Requires-Python >=3.6,<3.9; 0.3.1.dev0 Requires-Python >=3.5,<3.8; 0.3.1.dev0 Requires-Python >=3.6,<3.9; 0.3.1.dev1 Requires-Python >=3.6,<3.9; 0.3.1.dev2 Requires-Python >=3.6,<3.9; 0.3.2 Requires-Python >=3.5,<3.9; 0.3.2.dev0 Requires-Python >=3.5,<3.8; 0.3.2.dev0 Requires-Python >=3.6,<3.9; 0.3.2.dev1 Requires-Python >=3.5,<3.9; 0.3.3 Requires-Python >=3.5,<3.9; 0.3.3.dev0 Requires-Python >=3.5,<3.9; 0.4.0 Requires-Python >=3.5,<3.9; 0.4.0 Requires-Python >=3.6,<3.9; 0.4.0.dev0 Requires-Python >=3.5,<3.9; 0.4.0.dev0 Requires-Python >=3.6,<3.9; 0.4.0.dev1 Requires-Python >=3.6,<3.9; 0.4.1 Requires-Python >=3.6,<3.9; 0.4.1.dev0 Requires-Python >=3.6,<3.9; 0.4.1.dev1 Requires-Python >=3.6,<3.9; 0.4.2 Requires-Python >=3.6,<3.9; 0.4.2.dev0 Requires-Python >=3.6,<3.9; 0.4.3 Requires-Python >=3.6,<3.9; 0.4.3.dev0 Requires-Python >=3.6,<3.9; 0.4.3.dev1 Requires-Python >=3.6,<3.9; 0.4.4.dev0 Requires-Python >=3.6,<3.9; 0.5.0 Requires-Python >=3.6,<3.10; 0.5.0 Requires-Python >=3.6,<3.9; 0.5.0.dev0 Requires-Python >=3.6,<3.9; 0.5.0.dev1 Requires-Python >=3.6,<3.10; 0.5.0.dev1 Requires-Python >=3.6,<3.9; 0.5.1 
Requires-Python >=3.6,<3.10; 0.5.1 Requires-Python >=3.6,<3.9; 0.5.1.dev0 Requires-Python >=3.6,<3.10; 0.5.1.dev0 Requires-Python >=3.6,<3.9; 0.5.1.dev1 Requires-Python >=3.6,<3.10; 0.5.1.dev1 Requires-Python >=3.6,<3.9; 0.5.1.dev2 Requires-Python >=3.6,<3.10; 0.5.1.dev3 Requires-Python >=3.6,<3.10; 0.5.2 Requires-Python >=3.6,<3.10; 0.5.2.dev0 Requires-Python >=3.6,<3.10; 0.5.2.dev0 Requires-Python >=3.6,<3.9; 0.5.2.dev1 Requires-Python >=3.6,<3.10; 0.5.3.dev0 Requires-Python >=3.6,<3.10; 0.6.0 Requires-Python >=3.6,<3.10; 0.6.0.dev0 Requires-Python >=3.6,<3.10; 0.6.1 Requires-Python >=3.6,<3.10; 0.6.1.dev0 Requires-Python >=3.6,<3.10; 0.7.0 Requires-Python >=3.6,<3.10; 0.7.0.dev0 Requires-Python >=3.6,<3.10
ERROR: Could not find a version that satisfies the requirement pywin32==306 (from versions: none)
ERROR: No matching distribution found for pywin32==306

In the actual run, SDV did not get installed.
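
The pywin32 failure is what one would expect when a pip freeze taken on Windows is installed on Linux. Besides deleting the line, another option is a PEP 508 environment marker, so that pip skips the package on other platforms. The sketch below is only an illustration: `add_platform_markers` and the `WINDOWS_ONLY` set are assumptions made for this example, not something that exists in the repository.

```python
# Assumed set of Windows-only distributions; extend it as needed.
WINDOWS_ONLY = {"pywin32"}


def add_platform_markers(requirements_text: str) -> str:
    """Append a win32 marker to Windows-only pins; leave other lines as-is."""
    output = []
    for line in requirements_text.splitlines():
        name = line.split("==", 1)[0].strip().lower()
        if name in WINDOWS_ONLY and ";" not in line:
            line = f'{line.strip()}; sys_platform == "win32"'
        output.append(line)
    return "\n".join(output)


print(add_platform_markers("pywin32==306\ntorch==2.2.1"))
```

With the marker in place, the same requirements file can be installed on both Windows and Linux without the "No matching distribution found" error for pywin32.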

@matheme-justyn
Contributor

matheme-justyn commented Mar 8, 2024

Update: removed the pywin32 requirement and tried to continue installing the rest on SageMaker, but it still failed - 364e860

> !pip install -r ../../requirements.txt
Requirement already satisfied: pexpect>4.3 in /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages (from ipython==8.22.2->-r ../../requirements.txt (line 29)) (4.9.0)
INFO: pip is looking at multiple versions of rdt to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r ../../requirements.txt (line 66) and Faker==15.3.4 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested Faker==15.3.4
    rdt 1.9.2 depends on Faker<20 and >=17

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

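Since requirements.txt alone cannot install all the dependencies that the current functionality needs, I am concerned that we should not be telling users to "just pip install in order". I am asking @mileschangmoda for advice; until the concerns about requirement.txt are resolved, I am inclined not to approve this PR.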

@ghost
Author

ghost commented Mar 8, 2024

If conda is used to read requirements.txt and create the environment, the following error message appears:

LibMambaUnsatisfiableError: Encountered problems while solving:
  - nothing provides requested //github.com/ryan112358/private-pgm.git@5b9126295c110b741e5426ddbff419ea1e60e788
  - nothing provides requested anonymeter 1.0.0
  - nothing provides requested graphviz 0.17
  - nothing provides requested opacus 0.14.0
  - nothing provides requested opendp 0.8.0
  - nothing provides requested pac-synth 0.0.8
  - nothing provides requested prompt-toolkit 3.0.43
  - nothing provides requested pure-eval 0.2.2
  - nothing provides requested python-dateutil 2.9.0.post0
  - nothing provides requested smartnoise-sql 1.0.3
  - nothing provides requested smartnoise-synth 1.0.3
  - nothing provides requested stack-data 0.6.3
  - nothing provides requested torch 2.2.1
  - nothing provides requested tzdata 2024.1
  - package rdt-1.9.2-pyhd8ed1ab_0 requires faker >=17,<20, but none of the providers can be installed

Could not solve for environment specs
The following packages are incompatible
├─ //github.com/ryan112358/private-pgm.git@5b9126295c110b741e5426ddbff419ea1e60e788 does not exist (perhaps a typo or a missing channel);
├─ anonymeter 1.0.0  does not exist (perhaps a typo or a missing channel);
├─ faker 15.3.4  is requested and can be installed;
├─ graphviz 0.17  does not exist (perhaps a typo or a missing channel);
├─ opacus 0.14.0  does not exist (perhaps a typo or a missing channel);
├─ opendp 0.8.0  does not exist (perhaps a typo or a missing channel);
├─ pac-synth 0.0.8  does not exist (perhaps a typo or a missing channel);
├─ prompt-toolkit 3.0.43  does not exist (perhaps a typo or a missing channel);
├─ pure-eval 0.2.2  does not exist (perhaps a typo or a missing channel);
├─ python-dateutil 2.9.0.post0  does not exist (perhaps a typo or a missing channel);
├─ rdt 1.9.2  is not installable because it requires
│  └─ faker >=17,<20 , which conflicts with any installable versions previously reported;
├─ smartnoise-sql 1.0.3  does not exist (perhaps a typo or a missing channel);
├─ smartnoise-synth 1.0.3  does not exist (perhaps a typo or a missing channel);
├─ stack-data 0.6.3  does not exist (perhaps a typo or a missing channel);
├─ torch 2.2.1  does not exist (perhaps a typo or a missing channel);
└─ tzdata 2024.1  does not exist (perhaps a typo or a missing channel).

@mileschangmoda
Collaborator

"Just pip install in order"

This is feasible, but we must make sure that what pip install does stays the same over time,
for example by switching to pip install <package_name>== and the like
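
A minimal sketch of what such pinned, ordered instructions could look like. `pinned_install_commands` is a hypothetical helper; the versions and the private-pgm commit hash are the ones that appear in the conda error output earlier in this thread, and the order follows the install steps posted above.

```python
def pinned_install_commands(requirements):
    """Turn an ordered list of pins into reproducible pip commands.

    Each item is either a (name, version) pair or a direct URL string.
    """
    commands = []
    for item in requirements:
        if isinstance(item, tuple):
            name, version = item
            commands.append(f"pip install {name}=={version}")
        else:
            commands.append(f"pip install {item}")
    return commands


ORDERED_PINS = [
    ("smartnoise-synth", "1.0.3"),
    ("anonymeter", "1.0.0"),
    # A git dependency can be pinned to a commit instead of a version.
    "git+https://github.com/ryan112358/private-pgm.git"
    "@5b9126295c110b741e5426ddbff419ea1e60e788",
    ("torch", "2.2.1"),
]

for command in pinned_install_commands(ORDERED_PINS):
    print(command)
```

Pinning every step this way keeps the known conflicts (Faker, torch) identical from one install to the next, even though pip will still report them.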

@matheme-justyn
Contributor

The version conflicts have been solved in #355 (CI check) and will be addressed in #349 (README.md), so after discussion, I believe we can merge this branch.

@matheme-justyn matheme-justyn left a comment

LGTM

@matheme-justyn matheme-justyn merged commit 3069eb2 into dev Mar 11, 2024
@matheme-justyn matheme-justyn deleted the 332-executable-smartnoise branch March 11, 2024 02:16
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Executable Smartnoise on adult
2 participants