PARSynthesizer errors during `fit` if sequence_index is numerical sdtype #2079

frances-h · 2024-06-18T17:48:44Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version:
Python version:
Operating System:

Error Description

In #2043, we fixed an issue where enforce_min_max_values was being set to True by default for the sequence_index transformer. However, if no transformer is assigned to the sequence_index (i.e. if the sequence index already has a numerical sdtype), fit now errors.

To fix, we should check that (1) a transformer has been assigned (transformer is not None) and (2) the transformer has the enforce_min_max_values attribute. Instead of adding an additional check, we could use getattr with a False default value in place of directly accessing the attribute.
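The getattr approach can be sketched in isolation. This is a minimal illustration, not the actual change made in SDV: the two stand-in classes and the helper name `disable_min_max_enforcement` are hypothetical and exist only to show that a single `getattr` call with a False default covers both the `None` case and the missing-attribute case.

```python
class FakeNumericTransformer:
    """Hypothetical stand-in for a transformer exposing enforce_min_max_values."""

    def __init__(self):
        self.enforce_min_max_values = True


class FakeOtherTransformer:
    """Hypothetical stand-in for a transformer without that attribute."""


def disable_min_max_enforcement(transformer):
    """Turn off enforce_min_max_values when the transformer has it enabled.

    getattr with a False default returns False both when the transformer
    is None and when it lacks the attribute, so no separate checks are needed.
    """
    if getattr(transformer, 'enforce_min_max_values', False):
        transformer.enforce_min_max_values = False
    return transformer


# No transformer assigned (numerical sequence index): no AttributeError is raised.
disable_min_max_enforcement(None)

# Transformer with the attribute: the flag is switched off.
numeric = disable_min_max_enforcement(FakeNumericTransformer())
print(numeric.enforce_min_max_values)
```

With this pattern, the attribute is only written when it already exists and is truthy, so transformers that do not define it are left untouched.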

Steps to reproduce

from sdv.datasets.demo import download_demo
from sdv.sequential import PARSynthesizer

data, metadata = download_demo('sequential', 'CMAPPSJetEngine')
s1 = PARSynthesizer(metadata)
s1.fit(data)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 s1.fit(data1)
      2 s1.sample(10)

File ~/Documents/SDV/sdv/single_table/base.py:471, in BaseSynthesizer.fit(self, data)
    469 self._data_processor.reset_sampling()
    470 self._random_state_set = False
--> 471 processed_data = self.preprocess(data)
    472 self.fit_processed_data(processed_data)

File ~/Documents/SDV/sdv/single_table/base.py:407, in BaseSynthesizer.preprocess(self, data)
    400     warnings.warn(
    401         'This model has already been fitted. To use the new preprocessed data, '
    402         "please refit the model using 'fit' or 'fit_processed_data'."
    403     )
    405 is_converted = self._store_and_convert_original_cols(data)
--> 407 preprocess_data = self._preprocess(data)
    409 if is_converted:
    410     data.columns = self._original_columns

File ~/Documents/SDV/sdv/sequential/par.py:286, in PARSynthesizer._preprocess(self, data)
    284 sequence_key_transformers = {sequence_key: None for sequence_key in self._sequence_key}
    285 if not self._data_processor._prepared_for_fitting:
--> 286     self.auto_assign_transformers(data)
    288 self.update_transformers(sequence_key_transformers)
    289 preprocessed = super()._preprocess(data)

File ~/Documents/SDV/sdv/sequential/par.py:267, in PARSynthesizer.auto_assign_transformers(self, data)
    265 if self._sequence_index:
    266     sequence_index_transformer = self.get_transformers()[self._sequence_index]
--> 267     if sequence_index_transformer.enforce_min_max_values:
    268         sequence_index_transformer.enforce_min_max_values = False

AttributeError: 'NoneType' object has no attribute 'enforce_min_max_values'


ryantimjohn · 2024-06-26T19:04:45Z

Is there a workaround end users can apply in the meantime, until this release drops, @lajohn4747?

npatki · 2024-06-26T21:13:55Z

Hi @ryantimjohn, sure thing. The bug only appears when sequence_index is a numerical sdtype, but it works just fine if the sdtype is datetime. So one workaround would be to convert your numerical column into datetimes. In the example below, I am converting a numerical column to datetimes by adding the # of days to Jan 1, 2000:

import pandas as pd
from sdv.sequential import PARSynthesizer

index_name = 'COLUMN_NAME' # replace with the name of your numerical sequence index column

# convert the sequence index to datetime and update metadata to match
data[index_name] = pd.to_datetime('2000-01-01') + pd.to_timedelta(data[index_name], unit='d')
metadata.update_column(
    column_name=index_name,
    sdtype='datetime'
)

# now you can model and sample synthetic data using PAR
synthesizer = PARSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=10)

# be sure to convert the datetimes back into numbers (whole days since Jan 1, 2000)
synthetic_data[index_name] = (synthetic_data[index_name] - pd.to_datetime('2000-01-01')).dt.days

This is a bit hacky, but after the next release, you will not need to apply the workaround. Hope that helps!
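The round trip used by this workaround can be checked with pandas alone. The following is a minimal sketch, assuming the sequence index holds whole-number day offsets; the sample values and variable names are made up for illustration.

```python
import pandas as pd

# Arbitrary reference date; any fixed timestamp works as long as it is reused.
base = pd.Timestamp('2000-01-01')

# Made-up numerical sequence index values (interpreted as days).
numeric_index = pd.Series([0, 1, 2, 10])

# Forward conversion: numbers -> datetimes.
as_datetime = base + pd.to_timedelta(numeric_index, unit='D')

# Reverse conversion: subtracting the base yields timedeltas,
# and .dt.days turns those back into integers.
recovered = (as_datetime - base).dt.days

print(recovered.tolist())
```

Subtracting the base date on its own leaves a timedelta column, so the `.dt.days` step is what restores plain integers.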

ryantimjohn · 2024-06-27T16:58:54Z

Came up with the same solution, thank you!

ryantimjohn · 2024-06-27T17:20:05Z

@npatki Unfortunately, when I did this, I got another error. Sorry to ask for help troubleshooting, but I wondered if you could help, since I saw you dealt with a similar error here:
#1214

After converting the sequence index column to a datetime and updating the metadata, when I run the dataframe through the PARSynthesizer, I get this error:
AttributeError: 'NoneType' object has no attribute 'is_generator'

Does any reason for this come to mind?

Thanks very much for your help!

Full stack trace:

Cell In[15], line 5
      1 from sdv.sequential import PARSynthesizer
      2 synthesizer = PARSynthesizer(
      3     modified_metadata,context_columns=context_columns,enforce_min_max_values=False,
      4         verbose=True)
----> 5 synthesizer.fit(modified_data)

File /opt/conda/lib/python3.10/site-packages/sdv/single_table/base.py:460, in BaseSynthesizer.fit(self, data)
    458 self._data_processor.reset_sampling()
    459 self._random_state_set = False
--> 460 processed_data = self.preprocess(data)
    461 self.fit_processed_data(processed_data)

File /opt/conda/lib/python3.10/site-packages/sdv/single_table/base.py:396, in BaseSynthesizer.preprocess(self, data)
    389     warnings.warn(
    390         'This model has already been fitted. To use the new preprocessed data, '
    391         "please refit the model using 'fit' or 'fit_processed_data'."
    392     )
    394 is_converted = self._store_and_convert_original_cols(data)
--> 396 preprocess_data = self._preprocess(data)
    398 if is_converted:
    399     data.columns = self._original_columns

File /opt/conda/lib/python3.10/site-packages/sdv/sequential/par.py:280, in PARSynthesizer._preprocess(self, data)
    277 if not self._data_processor._prepared_for_fitting:
    278     self.auto_assign_transformers(data)
--> 280 self.update_transformers(sequence_key_transformers)
    281 preprocessed = super()._preprocess(data)
    283 if self._sequence_index:

File /opt/conda/lib/python3.10/site-packages/sdv/sequential/par.py:303, in PARSynthesizer.update_transformers(self, column_name_to_transformer)
    299 if set(column_name_to_transformer).intersection(set(self.context_columns)):
    300     raise SynthesizerInputError(
    301         'Transformers for context columns are not allowed to be updated.')
--> 303 super().update_transformers(column_name_to_transformer)

File /opt/conda/lib/python3.10/site-packages/sdv/single_table/base.py:228, in BaseSynthesizer.update_transformers(self, column_name_to_transformer)
    226 self._validate_transformers(column_name_to_transformer)
    227 self._warn_for_update_transformers(column_name_to_transformer)
--> 228 self._data_processor.update_transformers(column_name_to_transformer)
    229 if self._fitted:
    230     msg = 'For this change to take effect, please refit the synthesizer using `fit`.'

File /opt/conda/lib/python3.10/site-packages/sdv/data_processing/data_processor.py:652, in DataProcessor.update_transformers(self, column_name_to_transformer)
    646     raise NotFittedError(
    647         'The DataProcessor must be prepared for fitting before the transformers can be '
    648         'updated.'
    649     )
    651 for column, transformer in column_name_to_transformer.items():
--> 652     if column in self._keys and not transformer.is_generator():
    653         raise SynthesizerInputError(
    654             f"Invalid transformer '{transformer.__class__.__name__}' for a primary "
    655             f"or alternate key '{column}'. Please use a generator transformer instead."
    656         )
    658 with warnings.catch_warnings():

npatki · 2024-06-27T19:54:42Z

Hi @ryantimjohn, no problem. I suspect this is unrelated to the fit error and has something to do with the data/metadata itself. Would you mind filing a new bug with this info? It would also be helpful if you could share the (updated) metadata and perhaps an example of the data itself that has the datetime index column. Thanks!

frances-h added the bug Something isn't working label Jun 18, 2024

lajohn4747 self-assigned this Jun 19, 2024

lajohn4747 added this to the 1.14.1 milestone Jun 19, 2024

lajohn4747 mentioned this issue Jun 19, 2024

Do not error if sequence_index is numerical #2080

Merged

lajohn4747 closed this as completed in #2080 Jun 20, 2024

npatki mentioned this issue Jul 2, 2024

SDV 1.14: PAR Synthesizer can't fit if metadata has a sequence_index set #2103

Closed
