Add a utility function get_random_sequence_subset #2085

Closed
npatki opened this issue Jun 20, 2024 · 1 comment · Fixed by #2098
Labels
data:sequential Related to timeseries datasets feature request Request for a new feature

npatki commented Jun 20, 2024

Problem Description

Subsetting single and multi-table data is easy by using existing functions such as get_random_subset.

But subsetting sequential data is not as easy. Because rows can belong together (within the same sequence) and have an order, it's not possible to simply select random rows. For such data, it would be helpful to have a utility function to perform the subsetting.

Expected behavior

Add a function to utils called get_random_sequence_subset for use with sequential data.

Parameters:

  • (required) data: A pandas DataFrame with the sequential data
  • (required) metadata: A SingleTableMetadata object describing the data
  • (required) num_sequences: The number of sequences to subsample
  • max_sequence_length: The maximum length each subsampled sequence is allowed to be
    • (default) None: Do not enforce any max length, meaning that entire sequences will be sampled
    • int: All subsampled sequences must be <= the provided length
  • long_sequence_subsampling_method: The method to use when a selected sequence is too long
    • (default) first_rows: Keep the first n rows of the sequence, where n is the max sequence length
    • last_rows: Keep the last n rows of the sequence, where n is the max sequence length
    • random: Randomly choose n rows to keep within the sequence. It is important to keep the randomly chosen rows in the same order as they appear in the original data.
from sdv.utils import get_random_sequence_subset

data_subset = get_random_sequence_subset(data, metadata,
  num_sequences=100, 
  max_sequence_length=1000,
  long_sequence_subsampling_method='last_rows')

The function would do the following:

  • Randomly select sequences according to the num_sequences parameter. (Note that the sequence_key is used to determine which rows belong to which sequence.)
  • For each selected sequence, ensure that the length is <= max_sequence_length. If a sequence is longer, use the long_sequence_subsampling_method to shorten it.

Return the shortened pandas DataFrame with the subsampled data. Ensure that the index of the DataFrame has been reset.
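
The behavior described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the SDV implementation: the helper name subsample_sequences is hypothetical, it takes the sequence key column name directly instead of a metadata object so that it runs with only pandas and NumPy, and the seed parameter is an addition for reproducibility that the proposal does not mention.

```python
import numpy as np
import pandas as pd

VALID_METHODS = ('first_rows', 'last_rows', 'random')


def subsample_sequences(data, sequence_key, num_sequences,
                        max_sequence_length=None,
                        long_sequence_subsampling_method='first_rows',
                        seed=None):
    """Hypothetical sketch: sequence_key is passed as a column name (assumption)."""
    if long_sequence_subsampling_method not in VALID_METHODS:
        raise ValueError(f'Unknown method: {long_sequence_subsampling_method!r}')

    rng = np.random.default_rng(seed)
    unique_keys = data[sequence_key].unique()
    # Sample without replacement so that num_sequences distinct sequences are kept.
    chosen = rng.choice(unique_keys, size=num_sequences, replace=False)
    subset = data[data[sequence_key].isin(chosen)].reset_index(drop=True)
    if max_sequence_length is None:
        return subset

    keep = []
    for _, seq in subset.groupby(sequence_key, sort=False):
        labels = seq.index.to_numpy()
        if len(labels) > max_sequence_length:
            if long_sequence_subsampling_method == 'first_rows':
                labels = labels[:max_sequence_length]
            elif long_sequence_subsampling_method == 'last_rows':
                labels = labels[len(labels) - max_sequence_length:]
            else:  # 'random'
                labels = rng.choice(labels, size=max_sequence_length, replace=False)
        keep.extend(labels)

    # Sorting the kept row labels restores the original row order, which
    # matters most for the 'random' method.
    return subset.loc[sorted(keep)].reset_index(drop=True)
```

Collecting row labels and sorting them once at the end is one way to keep the randomly chosen rows in their original order; other approaches, such as sorting positions within each group, would work as well.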

Additional context

  • The metadata must contain a sequence_key. Otherwise it is not multi-sequence data and not really eligible for this type of subsampling. If there is no sequence_key, raise an error.
  • As a starting point, below is some code we've provided to a user to sample entire sequences. Note that this code does not consider max sequence length at all.
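import numpy as np

def get_random_sequence_subset(data, metadata, num_sequences):
  # The sequence key identifies which rows belong to the same sequence
  sequence_key = metadata.to_dict()['sequence_key']
  unique_sequences = data[sequence_key].unique()
  # Sample without replacement so that num_sequences distinct sequences are selected
  sequence_subset = np.random.choice(unique_sequences, size=num_sequences, replace=False)
  # Keep every row of the selected sequences, in their original order
  subsetted_data = data[data[sequence_key].isin(sequence_subset)].reset_index(drop=True)
  return subsetted_data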
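
The sequence_key requirement mentioned in the additional context could be checked along these lines. This is a minimal sketch under assumptions: the helper name get_sequence_key is hypothetical, it works on the plain dictionary returned by metadata.to_dict(), it assumes a missing key shows up as an absent or None entry, and the choice of ValueError is illustrative since the proposal only says to throw an error.

```python
def get_sequence_key(metadata_dict):
    """Hypothetical helper: return the sequence key or fail with a clear error.

    Assumes metadata_dict is the result of metadata.to_dict() and that a
    missing sequence key appears as an absent or None 'sequence_key' entry.
    """
    sequence_key = metadata_dict.get('sequence_key')
    if sequence_key is None:
        # ValueError is an assumption; the proposal does not name an error type.
        raise ValueError(
            'The metadata does not define a sequence_key, so the data cannot '
            'be subsampled by sequence.'
        )
    return sequence_key
```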
@npatki npatki added feature request Request for a new feature data:sequential Related to timeseries datasets labels Jun 20, 2024

amontanez24 commented Jun 26, 2024

@npatki Should this also be in poc?
