allow training labels to be categorical #901
Labels: feature request · module:ml (Machine Learning with PyTorch) · performance (speed something up) · resolved_in_develop (resolved in the develop branch, but not yet in master)

Requiring binary-encoded ("one-hot") labels can mean holding a very large, sparse array that may even be too big for memory when there are many samples and species. If the labels could instead be provided as categorical, e.g. ['a'] rather than [0, 1, 0, 0, 0], they would be a much smaller object to keep in memory. Conversion to binary encoding could then happen within each batch.
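As a rough sketch of what that per-batch conversion could look like (the `collate_batch` helper and its signature are hypothetical, not part of the package):

```python
import numpy as np

def collate_batch(audio_tensors, batch_labels, class_list):
    """Hypothetical collate step: expand categorical labels to one-hot
    only for the current batch, so only a (batch_size x n_classes)
    array is ever held in memory."""
    class_idx = {c: i for i, c in enumerate(class_list)}
    one_hot = np.zeros((len(batch_labels), len(class_list)), dtype=np.float32)
    for row, sample_labels in enumerate(batch_labels):
        for label in sample_labels:
            one_hot[row, class_idx[label]] = 1.0
    return audio_tensors, one_hot
```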
alternatively use sparse arrays:

```python
import itertools

import scipy.sparse


def categorical_to_one_hot(labels, class_subset=None):
    """Transform multi-target categorical labels (list of lists) to a one-hot array.

    Args:
        labels: list of lists of categorical labels, e.g.
            [['white','red'],['green','white']] or [[0,1,2],[3]]
        class_subset: list of classes for one-hot labels. If None,
            taken to be the unique set of values in `labels`.

    Returns:
        one_hot: 2d sparse array with 0 for absent and 1 for present
        class_subset: list of classes corresponding to columns in the array
    """
    if class_subset is None:
        class_subset = list(set(itertools.chain(*labels)))

    # map each class to its column index
    label_idx_dict = {label: i for i, label in enumerate(class_subset)}

    # collect the coordinates of the nonzero ("present") entries
    vals, rows, cols = [], [], []
    for i, sample_labels in enumerate(labels):
        for label in sample_labels:
            if label in label_idx_dict:
                vals.append(True)
                rows.append(i)
                cols.append(label_idx_dict[label])

    one_hot = scipy.sparse.csr_matrix(
        (vals, (rows, cols)), shape=(len(labels), len(class_subset)), dtype=bool
    )
    return one_hot, class_subset
```

The sparse scipy csr matrix can be converted to a sparse Pandas df with:
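One possibility, assuming pandas ≥ 0.25 where `pd.DataFrame.sparse.from_spmatrix` is available:

```python
import pandas as pd

one_hot, classes = categorical_to_one_hot([["white", "red"], ["green", "white"]])
# DataFrame with sparse boolean columns; one column per class
sparse_df = pd.DataFrame.sparse.from_spmatrix(one_hot, columns=classes)
```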
we can also provide resampling for categorical labels:

```python
import itertools

import pandas as pd


def resample_categorical(
    df,
    n,
    n_downsample=None,
    upsample=True,
    downsample=True,
    with_replace=False,
    random_state=None,
    label_column=None,
    class_list=None,
):
    """Resample a categorical label df for a target number of samples per class.

    Args:
        df: dataframe with one column listing the class of each sample
        n: target number of samples per class when upsampling
        n_downsample: target number of samples per class when downsampling;
            defaults to the same value as `n` if None
        upsample: if True, duplicate samples for classes with <n samples
            to get to n samples
        downsample: if True, randomly sample classes with >n_downsample samples
            to get to n_downsample samples
        with_replace: flag to enable sampling the same row more than once,
            default False
        random_state: passed to np.random calls. If None, random state is not fixed.
        label_column: column to use as labels; if None, uses the first column
        class_list: if values are not in this list, they are excluded from the
            returned dataframe. None retains all unique labels.

    Note: The algorithm assumes that the label df is single-label.
        If the label df is multi-label, some classes can end up over-represented.

    Note 2: The resulting df will have samples ordered by class label, even if
        the input df had samples in a random order.
    """
    # if n_downsample is not specified (is None), use n
    n_downsample = n_downsample or n
    # if label_column is not specified, use the first column
    label_column = label_column or df.columns[0]
    # keep track of the index columns; reset the index and restore it later
    index_cols = df.index.names

    class_dfs = []
    for class_name, sub_df in df.reset_index().groupby(label_column):
        # skip classes that are not in class_list, if one was given
        if class_list is not None and class_name not in class_list:
            continue
        n_class_samples = sub_df.shape[0]

        if n_class_samples < n:
            if not upsample:
                # we don't want to upsample, so just keep these samples
                class_dfs.append(sub_df)
            else:
                # upsample to get to n samples: repeat the full set
                # num_replicates times, then take a random sample of size
                # `remainder` for the samples with an 'extra' representation
                num_replicates, remainder = divmod(n, n_class_samples)
                random_df = sub_df.sample(
                    n=remainder, replace=with_replace, random_state=random_state
                )
                # build the repeats in a single pd.concat call rather than in a
                # for loop (https://stackoverflow.com/a/36489724/6591124)
                if num_replicates > 0:
                    repeat_df = pd.concat(itertools.repeat(sub_df, num_replicates))
                    class_dfs.extend([repeat_df, random_df])
                else:
                    class_dfs.append(random_df)
        elif n_class_samples > n_downsample:
            # downsample to `n_downsample`, which is not necessarily n
            if not downsample:
                # we don't want to downsample, so keep all of the samples
                class_dfs.append(sub_df)
            else:
                # take a random sample of size n_downsample
                class_dfs.append(
                    sub_df.sample(
                        n=n_downsample, replace=with_replace, random_state=random_state
                    )
                )
        else:
            # class size is already between n and n_downsample: keep as-is
            class_dfs.append(sub_df)

    return pd.concat(class_dfs).set_index(index_cols)
```
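For illustration, a call on a toy label dataframe (the file names and classes are made up; note the index should be named so it can be restored after resampling):

```python
import pandas as pd

label_df = pd.DataFrame(
    {"class": ["a", "a", "a", "b", "b", "c"]},
    index=pd.Index([f"f{i}.wav" for i in range(6)], name="file"),
)
# upsample classes with <3 samples and downsample classes with >3 samples
balanced = resample_categorical(label_df, n=3, random_state=0)
# each class now appears exactly 3 times
print(balanced["class"].value_counts())
```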