Getting KeyError while generation of data (synthesizer.sample()) - sdv==1.12.1 #2026

Closed
burhanuddin-123 opened this issue May 23, 2024 · 3 comments
Labels
feature:sampling Related to generating synthetic data after a model is built resolution:resolved The issue was fixed, the question was answered, etc.

burhanuddin-123 commented May 23, 2024

Environment details

If you are already running SDV, please indicate the following details about the environment in
which you are running it:

  • SDV version: 1.12.1
  • Python version: 3.11.1
  • Operating System: Windows

Problem description

I want to generate synthetic data at scale for two related tables, Customers and Orders, where Customers is the parent and Orders is the child. After validating the MultiTableMetadata and applying constraints, I was able to fit the HMASynthesizer on the real data.

However, when I generate the sample data, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\burha\Mentorskool\Synthetic Data Vault\new-venv\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 7

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 2
      1 # Step 3: Generate synthetic data
----> 2 synthetic_data = synthesizer.sample(scale=0.01)  # it gives error

File c:\Users\burha\Mentorskool\Synthetic Data Vault\new-venv\Lib\site-packages\sdv\multi_table\base.py:423, in BaseMultiTableSynthesizer.sample(self, scale)
...
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 7

I tried sampling several times and got a different key each time (KeyError: 4, KeyError: 7, and so on), which makes the root cause difficult to identify.
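
For readers unfamiliar with this class of error: the first traceback ends in `Index.get_loc` and `Int64HashTable.get_item`, which is the path pandas takes when an integer label is looked up in an index that does not contain it. The sketch below reproduces that kind of failure in isolation. The `lookup` series is a hypothetical stand-in, and this does not claim to show what SDV does internally or why the label is missing.

```python
import pandas as pd

# Hypothetical stand-in: a Series whose integer index holds labels 0..4 only.
lookup = pd.Series(["a", "b", "c", "d", "e"], index=[0, 1, 2, 3, 4])

# Looking up a label that exists works as expected.
found = lookup.loc[3]

# Looking up a label that is absent raises KeyError carrying that label.
try:
    lookup.loc[7]
    message = None
except KeyError as err:
    message = str(err)

print(found, message)
```

Because the missing label is whatever value happened to be requested, a lookup driven by random sampling would plausibly report a different key on each run, which is consistent with the varying `KeyError: 4` and `KeyError: 7` described above.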

burhanuddin-123 added the new and question labels on May 23, 2024
Contributor

srinify commented May 30, 2024

Hi @burhanuddin-123 👋

Do you mind sharing more context with us so we can try to reproduce the issue on our end?

  • What does your metadata look like? You can run this method to get a nice diagram to share
  • Are you able to share your code itself so I can understand any preprocessing or other transformation you may have done?
  • Can you share the full stack trace of your error?
  • How many rows are in the Customers table and the Orders tables?

One thing I want to rule out is missing referential integrity. Referential integrity means that every value in a foreign key column references a valid, existing primary key value. We created a function in our utils library to help process your data before model fitting. Try running that step before fitting and sampling. I doubt this is the issue, since SDV usually checks for referential integrity, but I still want to rule it out first.
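
As a rough illustration of the referential integrity check described above, here is a plain-pandas sketch. It is not the SDV utility function mentioned in this comment. The table contents and the column names `customer_id` and `order_id` are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data; the real Customers and Orders tables are not shown in this issue.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12, 13],
    "customer_id": [1, 2, 2, 99],  # 99 has no matching customer
})

def find_orphans(child, parent, foreign_key, primary_key):
    """Return child rows whose foreign key value is absent from the parent's primary key."""
    return child[~child[foreign_key].isin(parent[primary_key])]

orphans = find_orphans(orders, customers, "customer_id", "customer_id")

# Drop the orphaned rows so every remaining order points at an existing customer.
clean_orders = orders.drop(index=orphans.index)

print(orphans)
print(len(clean_orders))
```

If `orphans` is empty for the real data, missing referential integrity can likely be ruled out as the cause.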

srinify added the feature:evaluation, under discussion, and feature: modeling labels and removed the question, new, and feature:evaluation labels on May 30, 2024
Contributor

srinify commented Jun 4, 2024

Hi there @burhanuddin-123, are you still running into this issue?

Another user ran into a very similar issue, and in their case it seems to be related to the scale parameter. What value are you using for scale when sampling from the HMASynthesizer?
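
To get a feel for why a small scale value might matter, the sketch below estimates how many rows each table would receive under a given scale. It assumes the target row count is roughly the real row count multiplied by scale and truncated to an integer. That rounding rule, the helper `estimate_rows`, and the row counts are all assumptions for illustration, not confirmed SDV behavior.

```python
# Hypothetical real row counts; the actual table sizes were not shared in this issue.
real_rows = {"customers": 1000, "orders": 40}

def estimate_rows(row_counts, scale):
    """Estimate synthetic row counts as real count * scale, truncated to an int (assumed rule)."""
    return {name: int(count * scale) for name, count in row_counts.items()}

estimated = estimate_rows(real_rows, scale=0.01)

# Tables that would end up with no rows at this scale are worth a closer look.
empty_tables = [name for name, rows in estimated.items() if rows == 0]

print(estimated, empty_tables)
```

With `scale=0.01`, as used in the original report, any table with fewer than 100 real rows would be estimated at zero rows under this assumption, so checking the real table sizes against the chosen scale is a cheap first diagnostic.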

We opened this new issue to track the bug with the proposed solution as well: #2045

srinify added the feature:sampling label and removed the feature: modeling label on Jun 4, 2024
Contributor

srinify commented Jun 13, 2024

Hi there @burhanuddin-123, I haven't heard from you in a while, so I'm going to go ahead and close this issue. If you're still running into it, please see the suggested workaround: #2045 (comment)

srinify closed this as completed on Jun 13, 2024
srinify added the resolution:resolved label and removed the under discussion label on Jun 13, 2024