allow training labels to be categorical #901

Closed
sammlapp opened this issue Oct 26, 2023 · 1 comment · Fixed by #1053

@sammlapp (Collaborator)

Requiring binary-encoded ("one-hot") labels can mean holding a very large, sparse array that might not even fit in memory when there are many samples and species. If the labels could instead be provided as categorical, e.g. ['a'] instead of [0,1,0,0,0], they would be a much smaller object to hold in memory. Conversion to binary encoding could then happen within each batch (see the sketch below).
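For illustration, a minimal sketch of what the per-batch conversion could look like (the collate function and its argument names here are hypothetical, not an existing API):

import numpy as np

def collate_with_one_hot(batch, class_list):
    """Collate (sample, categorical_labels) pairs, one-hot encoding only this batch.

    Hypothetical helper for illustration: builds the dense one-hot array
    per batch rather than for the entire dataset.
    """
    class_to_idx = {c: i for i, c in enumerate(class_list)}
    samples = [sample for sample, _ in batch]
    one_hot = np.zeros((len(batch), len(class_list)), dtype=np.float32)
    for row, (_, labels) in enumerate(batch):
        for label in labels:
            if label in class_to_idx:  # ignore labels outside class_list
                one_hot[row, class_to_idx[label]] = 1.0
    return samples, one_hot

# a batch of two samples with categorical labels
batch = [("clip1.wav", ["a"]), ("clip2.wav", ["b", "c"])]
samples, labels = collate_with_one_hot(batch, class_list=["a", "b", "c"])
print(labels)  # [[1. 0. 0.] [0. 1. 1.]]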

sammlapp added the feature request and performance labels Oct 26, 2023
@sammlapp (Collaborator, Author)

Alternatively, use sparse arrays:

import itertools

import scipy.sparse


def categorical_to_one_hot(labels, class_subset=None):
    """transform multi-target categorical labels (list of lists) to a one-hot array

    Args:
        labels: list of lists of categorical labels, e.g.
            [['white','red'],['green','white']] or [[0,1,2],[3]]
        class_subset=None: list of classes for one-hot labels. If None,
            taken to be the unique set of values in `labels`
    Returns:
        one_hot: 2d sparse array with 0 for absent and 1 for present
        class_subset: list of classes corresponding to columns in the array
    """
    if class_subset is None:
        class_subset = list(set(itertools.chain(*labels)))

    label_idx_dict = {l: i for i, l in enumerate(class_subset)}
    vals = []
    rows = []
    cols = []

    # record the (row, column) position of each present label
    for i, sample_labels in enumerate(labels):
        for label in sample_labels:
            if label in class_subset:
                vals.append(True)
                rows.append(i)
                cols.append(label_idx_dict[label])

    one_hot = scipy.sparse.csr_matrix(
        (vals, (rows, cols)), shape=(len(labels), len(class_subset)), dtype=bool
    )

    return one_hot, class_subset

The sparse scipy CSR matrix can be converted to a sparse pandas DataFrame with:

pd.DataFrame.sparse.from_spmatrix
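For instance, a minimal usage sketch combining the two (passing `class_subset` explicitly, since the ordering of a Python `set` is not guaranteed):

import pandas as pd

labels = [["white", "red"], ["green", "white"]]
one_hot, classes = categorical_to_one_hot(
    labels, class_subset=["green", "red", "white"]
)

# wrap the sparse matrix in a DataFrame without densifying it
label_df = pd.DataFrame.sparse.from_spmatrix(one_hot, columns=classes)
print(label_df)
#    green    red  white
# 0  False   True   True
# 1   True  False   True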

We can also provide resampling for categorical labels:

import itertools

import pandas as pd


def resample_categorical(
    df,
    n,
    n_downsample=None,
    upsample=True,
    downsample=True,
    with_replace=False,
    random_state=None,
    label_column=None,
    class_list=None,
):
    """resample a categorical label df for a target number of samples per class

    Args:
        df: dataframe with a column listing the class of each sample
        n: target number of samples per class when upsampling
        n_downsample [default: None]: target number of samples per class when
            downsampling; defaults to the same value as `n` if None
        upsample: if True, duplicate samples for classes with <n samples to get to n samples
        downsample: if True, randomly sample classes with >n samples to get to n samples
        with_replace: flag to enable sampling of the same row more than once, default False
        random_state: passed to `.sample()` calls. If None, random state is not fixed.
        label_column: column to use as labels; if None, uses the first column
        class_list: if values are not in this list, they are excluded from the returned
            dataframe. None retains all unique labels.

    Note: The algorithm assumes that the label df is single-label.
    If the label df is multi-label, some classes can end up over-represented.

    Note 2: The resulting df will have samples ordered by class label, even if the input df
    had samples in a random order.
    """
    # if n_downsample is not specified (is None), use n
    n_downsample = n_downsample or n

    # if label_column is not specified, use the first column
    label_column = label_column or df.columns[0]

    # keep track of index columns; reset the index and restore it later
    # (assumes the input df has a named index, e.g. file paths)
    index_cols = df.index.names

    class_dfs = []
    for class_name, sub_df in df.reset_index().groupby(label_column):
        # skip classes that are not in class_list, if one was given
        if class_list is not None and class_name not in class_list:
            continue

        n_class_samples = sub_df.shape[0]

        if n_class_samples < n:
            if not upsample:
                # we don't want to upsample, so just keep these samples
                class_dfs.append(sub_df)
            else:
                # upsample to get to n samples: repeat all samples
                # num_replicates times, then add a random sample of size
                # `remainder` so the class totals exactly n
                num_replicates, remainder = divmod(n, n_class_samples)

                # the samples that get one 'extra' copy beyond num_replicates
                random_df = sub_df.sample(
                    n=remainder, replace=with_replace, random_state=random_state
                )

                if num_replicates > 0:
                    repeat_df = pd.concat(itertools.repeat(sub_df, num_replicates))
                    # don't run pd.concat in a for loop (https://stackoverflow.com/a/36489724/6591124)
                    class_dfs.extend([repeat_df, random_df])
                else:
                    class_dfs.append(random_df)

        # downsample to `n_downsample`, which is not necessarily the same as n
        elif n_class_samples > n_downsample:
            if not downsample:
                # we don't want to downsample, so just keep all of the samples
                class_dfs.append(sub_df)
            else:
                # take a random sample of size n_downsample
                class_dfs.append(
                    sub_df.sample(
                        n=n_downsample, replace=with_replace, random_state=random_state
                    )
                )
        else:  # n_class_samples is between n and n_downsample: keep as-is
            class_dfs.append(sub_df)

    return pd.concat(class_dfs).set_index(index_cols)
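As a minimal usage sketch (the file names here are made up), resampling a toy single-label df to 2 samples per class:

import pandas as pd

toy_df = pd.DataFrame(
    {"class": ["a", "b", "b", "c", "c", "c", "c"]},
    index=pd.Index([f"clip{i}.wav" for i in range(7)], name="file"),
)

resampled = resample_categorical(toy_df, n=2, random_state=0)
print(resampled["class"].value_counts())
# each class now has 2 samples: 'a' was upsampled from 1,
# 'b' was kept as-is, and 'c' was downsampled from 4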

sammlapp added this to the 0.11.0 milestone Apr 12, 2024
sammlapp added the module:ml label May 13, 2024
sammlapp added the resolved_in_branch label Sep 6, 2024
sammlapp linked a pull request Sep 10, 2024 that will close this issue
sammlapp added the resolved_in_develop label and removed the resolved_in_branch label Sep 10, 2024