allow training labels to be categorical #901

Closed
sammlapp opened this issue Oct 26, 2023 · 1 comment · Fixed by #1053

@sammlapp (Collaborator)

Requiring binary-encoded ("one-hot") labels can mean holding a very large, sparse array that might not even fit in memory when there are many samples and species. If the labels could instead be provided as categorical, e.g. ['a'] instead of [0,1,0,0,0], they would be a much smaller object to hold in memory. Conversion to binary encoding could then happen within each batch (see the sketch below).
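For illustration, a minimal sketch of what the per-batch conversion could look like (the collate function and its argument names here are hypothetical, not an existing API):

import numpy as np

def collate_with_one_hot(batch, class_list):
    """Collate (sample, categorical_labels) pairs, one-hot encoding only this batch.

    Hypothetical helper for illustration: builds the dense one-hot array
    per batch rather than for the entire dataset.
    """
    class_to_idx = {c: i for i, c in enumerate(class_list)}
    samples = [sample for sample, _ in batch]
    one_hot = np.zeros((len(batch), len(class_list)), dtype=np.float32)
    for row, (_, labels) in enumerate(batch):
        for label in labels:
            if label in class_to_idx:  # ignore labels outside class_list
                one_hot[row, class_to_idx[label]] = 1.0
    return samples, one_hot

# a batch of two samples with categorical labels
batch = [("clip1.wav", ["a"]), ("clip2.wav", ["b", "c"])]
samples, labels = collate_with_one_hot(batch, class_list=["a", "b", "c"])
print(labels)  # [[1. 0. 0.] [0. 1. 1.]]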

sammlapp added the feature request and performance labels Oct 26, 2023
@sammlapp (Collaborator, Author)

Alternatively, use sparse arrays:

import itertools

import scipy.sparse


def categorical_to_one_hot(labels, class_subset=None):
    """transform multi-target categorical labels (list of lists) to a one-hot array

    Args:
        labels: list of lists of categorical labels, e.g.
            [['white','red'],['green','white']] or [[0,1,2],[3]]
        class_subset=None: list of classes for one-hot labels. If None,
            taken to be the unique set of values in `labels`
    Returns:
        one_hot: 2d sparse array with 0 for absent and 1 for present
        class_subset: list of classes corresponding to columns in the array
    """
    if class_subset is None:
        class_subset = list(set(itertools.chain(*labels)))

    label_idx_dict = {l: i for i, l in enumerate(class_subset)}
    vals = []
    rows = []
    cols = []

    # record the (row, column) position of each present label
    for i, sample_labels in enumerate(labels):
        for label in sample_labels:
            if label in class_subset:
                vals.append(True)
                rows.append(i)
                cols.append(label_idx_dict[label])

    one_hot = scipy.sparse.csr_matrix(
        (vals, (rows, cols)), shape=(len(labels), len(class_subset)), dtype=bool
    )

    return one_hot, class_subset

The sparse scipy CSR matrix can be converted to a sparse pandas DataFrame with:

pd.DataFrame.sparse.from_spmatrix
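For instance, a minimal usage sketch combining the two (passing `class_subset` explicitly, since the ordering of a Python `set` is not guaranteed):

import pandas as pd

labels = [["white", "red"], ["green", "white"]]
one_hot, classes = categorical_to_one_hot(
    labels, class_subset=["green", "red", "white"]
)

# wrap the sparse matrix in a DataFrame without densifying it
label_df = pd.DataFrame.sparse.from_spmatrix(one_hot, columns=classes)
print(label_df)
#    green    red  white
# 0  False   True   True
# 1   True  False   True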

We can also provide resampling for categorical labels:

import itertools

import pandas as pd


def resample_categorical(
    df,
    n,
    n_downsample=None,
    upsample=True,
    downsample=True,
    with_replace=False,
    random_state=None,
    label_column=None,
    class_list=None,
):
    """resample a categorical label df for a target number of samples per class

    Args:
        df: dataframe with a column listing the class of each sample
        n: target number of samples per class when upsampling
        n_downsample [default: None]: target number of samples per class when
            downsampling; defaults to the same value as `n` if None
        upsample: if True, duplicate samples for classes with <n samples to get to n samples
        downsample: if True, randomly sample classes with >n samples to get to n samples
        with_replace: flag to enable sampling of the same row more than once, default False
        random_state: passed to `.sample()` calls. If None, random state is not fixed.
        label_column: column to use as labels; if None, uses the first column
        class_list: if values are not in this list, they are excluded from the returned
            dataframe. None retains all unique labels.

    Note: The algorithm assumes that the label df is single-label.
    If the label df is multi-label, some classes can end up over-represented.

    Note 2: The resulting df will have samples ordered by class label, even if the input df
    had samples in a random order.
    """
    # if n_downsample is not specified (is None), use n
    n_downsample = n_downsample or n

    # if label_column is not specified, use the first column
    label_column = label_column or df.columns[0]

    # keep track of index columns; reset the index and restore it later
    # (assumes the input df has a named index, e.g. file paths)
    index_cols = df.index.names

    class_dfs = []
    for class_name, sub_df in df.reset_index().groupby(label_column):
        # skip classes that are not in class_list, if one was given
        if class_list is not None and class_name not in class_list:
            continue

        n_class_samples = sub_df.shape[0]

        if n_class_samples < n:
            if not upsample:
                # we don't want to upsample, so just keep these samples
                class_dfs.append(sub_df)
            else:
                # upsample to get to n samples: repeat all samples
                # num_replicates times, then add a random sample of size
                # `remainder` so the class totals exactly n
                num_replicates, remainder = divmod(n, n_class_samples)

                # the samples that get one 'extra' copy beyond num_replicates
                random_df = sub_df.sample(
                    n=remainder, replace=with_replace, random_state=random_state
                )

                if num_replicates > 0:
                    repeat_df = pd.concat(itertools.repeat(sub_df, num_replicates))
                    # don't run pd.concat in a for loop (https://stackoverflow.com/a/36489724/6591124)
                    class_dfs.extend([repeat_df, random_df])
                else:
                    class_dfs.append(random_df)

        # downsample to `n_downsample`, which is not necessarily the same as n
        elif n_class_samples > n_downsample:
            if not downsample:
                # we don't want to downsample, so just keep all of the samples
                class_dfs.append(sub_df)
            else:
                # take a random sample of size n_downsample
                class_dfs.append(
                    sub_df.sample(
                        n=n_downsample, replace=with_replace, random_state=random_state
                    )
                )
        else:  # n_class_samples is between n and n_downsample: keep as-is
            class_dfs.append(sub_df)

    return pd.concat(class_dfs).set_index(index_cols)
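As a minimal usage sketch (the file names here are made up), resampling a toy single-label df to 2 samples per class:

import pandas as pd

toy_df = pd.DataFrame(
    {"class": ["a", "b", "b", "c", "c", "c", "c"]},
    index=pd.Index([f"clip{i}.wav" for i in range(7)], name="file"),
)

resampled = resample_categorical(toy_df, n=2, random_state=0)
print(resampled["class"].value_counts())
# each class now has 2 samples: 'a' was upsampled from 1,
# 'b' was kept as-is, and 'c' was downsampled from 4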

sammlapp added this to the 0.11.0 milestone Apr 12, 2024
sammlapp added the module:ml label May 13, 2024
sammlapp added the resolved_in_branch label Sep 6, 2024
sammlapp linked a pull request Sep 10, 2024 that will close this issue
sammlapp added the resolved_in_develop label and removed the resolved_in_branch label Sep 10, 2024