PARSynthesizer: Duplicate sequence index values when sequence_length is higher than real data #2031
Environment Details
Error Description
When the requested sequence length is higher than the real data's sequence length and min-max enforcement is enabled, PARSynthesizer can generate duplicate values in the sequence index column. This seems to happen especially when the sequence index column is a datetime column. When synthesizing values for the sequence index column, PARSynthesizer runs into the max value seen in the real data and repeats it.
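To make the failure mode concrete, here is a toy model of the behavior described above. It is not PARSynthesizer's actual implementation: it assumes, purely for illustration, that index values grow by a fixed step and are then clipped to the max observed in the real data. The function name clamp_sequence and the dates are hypothetical.

```python
from datetime import date, timedelta


def clamp_sequence(start, step_days, length, max_value):
    """Toy model (assumption): grow dates by a fixed step, clipping at max_value."""
    values = []
    current = start
    for _ in range(length):
        # Any value past the observed max collapses onto the max itself.
        values.append(min(current, max_value))
        current += timedelta(days=step_days)
    return values


# Hypothetical real data: 5 daily visits, so the observed max is day 5.
real_max = date(2024, 1, 5)

same_length = clamp_sequence(date(2024, 1, 1), 1, 5, real_max)
longer = clamp_sequence(date(2024, 1, 1), 1, 7, real_max)

print(len(same_length), len(set(same_length)))
print(len(longer), len(set(longer)))
```

In this sketch, asking for 7 rows when only 5 distinct values fit under the ceiling leaves the last rows pinned to the max, which matches the repeated values reported here.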
Steps to reproduce
Original Data
2 sequences, each with 5 unique values for the visits column (the sequence index)
Synthetic Data
Synthetic Data example when you set the sequence_length parameter to 25:
Synthetic Data example when you set the sequence_length parameter to 7:
Full code in Internal Colab Notebook here
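A small check like the following can confirm whether sampled output contains repeated sequence index values within a sequence. This is only a sketch: the rows are made-up illustrative data (not the output from the notebook), patient_id is a hypothetical sequence key name, and duplicated_index_values is a helper written for this example, not part of SDV.

```python
from collections import Counter


def duplicated_index_values(rows, key_col, index_col):
    """Return {sequence key: {index value: count}} for values repeated within a sequence."""
    counts = Counter((row[key_col], row[index_col]) for row in rows)
    duplicates = {}
    for (key, value), n in counts.items():
        if n > 1:
            duplicates.setdefault(key, {})[value] = n
    return duplicates


# Illustrative rows only; "patient_id" is an assumed sequence key column.
synthetic_rows = [
    {"patient_id": "A", "visits": "2024-01-01"},
    {"patient_id": "A", "visits": "2024-01-05"},
    {"patient_id": "A", "visits": "2024-01-05"},
    {"patient_id": "B", "visits": "2024-01-05"},
]

duplicates = duplicated_index_values(synthetic_rows, "patient_id", "visits")
print(duplicates)
```

The same value appearing in two different sequences is not flagged; only repeats inside one sequence count as duplicates.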
Workarounds
Set enforce_min_max_values to False. This removes the max value ceiling for the datetime sequence index column. However, the synthesized data will be less representative of your real data, so this is a big tradeoff until this bug is fixed.
Set sequence_length to the number of rows in your smallest, least unique (in terms of the sequence index column) sequence from your real data. E.g. if you have a small sequence with 5 unique values for the sequence index, don't generate more than 5 rows per sequence. This is also a limitation of SDV until this bug is fixed.
Original Discussion here: #2004
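For the second workaround, the cap on rows per sequence can be computed from the real data instead of by eye. The sketch below assumes the data is available as a list of dicts; max_safe_sequence_length is a hypothetical helper, patient_id is an assumed sequence key name, and the rows are illustrative.

```python
from collections import defaultdict


def max_safe_sequence_length(rows, key_col, index_col):
    """Smallest number of unique index values found in any single real sequence."""
    unique_values = defaultdict(set)
    for row in rows:
        unique_values[row[key_col]].add(row[index_col])
    return min(len(values) for values in unique_values.values())


# Illustrative real data: sequence A has 5 unique visits, B has only 3.
real_rows = (
    [{"patient_id": "A", "visits": day} for day in ["01", "02", "03", "04", "05"]]
    + [{"patient_id": "B", "visits": day} for day in ["01", "02", "02", "03"]]
)

safe_length = max_safe_sequence_length(real_rows, "patient_id", "visits")
print(safe_length)

# The result would then be passed as the sequence_length argument when sampling,
# e.g. synthesizer.sample(num_sequences=2, sequence_length=safe_length).
```

Counting unique values rather than rows matters here, since a real sequence that already repeats an index value leaves even less room under the ceiling.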