Unbalanced dataset splits #125

Merged: 178 commits, Aug 25, 2021

@itrushkin itrushkin commented Jul 15, 2021

This PR introduces a data splitting interface with the following implementations:

  • NumPy arrays
    • Equal split
    • Random split
    • LogNormal split
    • Dirichlet split
  • PyTorch datasets
    • Equal split
      • in-memory
      • on-disk
    • Random split
      • in-memory
      • on-disk
    • LogNormal split
      • in-memory
      • on-disk
    • Dirichlet split
      • in-memory
      • on-disk
  • Interactive API
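
As an illustration of what such a splitting interface could look like for NumPy arrays, here is a minimal sketch. The class names, constructor parameters, and the `split(data, num_collaborators)` signature are assumptions made for this example and are not taken from the PR's code; the LogNormal variant is left out for brevity. Each splitter returns one index array per collaborator.

```python
from abc import ABC, abstractmethod
from typing import List

import numpy as np


class DataSplitter(ABC):
    """Hypothetical base class: maps a dataset to per-collaborator index arrays."""

    @abstractmethod
    def split(self, data, num_collaborators: int) -> List[np.ndarray]:
        """Return a list with one index array per collaborator."""


class EqualSplitter(DataSplitter):
    """Shards of (nearly) equal size."""

    def __init__(self, shuffle: bool = True, seed: int = 0):
        self.shuffle = shuffle
        self.seed = seed

    def split(self, data, num_collaborators):
        idx = np.arange(len(data))
        if self.shuffle:
            np.random.default_rng(self.seed).shuffle(idx)
        return np.array_split(idx, num_collaborators)


class RandomSplitter(DataSplitter):
    """Shards of random size: cut points are drawn without replacement."""

    def __init__(self, seed: int = 0):
        self.seed = seed

    def split(self, data, num_collaborators):
        rng = np.random.default_rng(self.seed)
        idx = rng.permutation(len(data))
        # Distinct cut points in [1, len(data) - 1] keep every shard non-empty.
        cuts = np.sort(rng.choice(np.arange(1, len(data)),
                                  num_collaborators - 1, replace=False))
        return np.split(idx, cuts)


class DirichletSplitter(DataSplitter):
    """Label-skewed shards: each class is divided by Dirichlet proportions."""

    def __init__(self, alpha: float = 0.5, seed: int = 0):
        self.alpha = alpha
        self.seed = seed

    def split(self, data, num_collaborators):
        rng = np.random.default_rng(self.seed)
        labels = np.asarray(data)
        shards = [[] for _ in range(num_collaborators)]
        for cls in np.unique(labels):
            cls_idx = rng.permutation(np.flatnonzero(labels == cls))
            proportions = rng.dirichlet(np.full(num_collaborators, self.alpha))
            # Convert cumulative proportions into cut positions for this class.
            cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
            for shard, part in zip(shards, np.split(cls_idx, cuts)):
                shard.extend(part.tolist())
        return [np.array(s, dtype=int) for s in shards]
```

A smaller `alpha` in the Dirichlet sketch concentrates each class on fewer collaborators, so the shards become more unbalanced in their label distribution.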

igor-davidyuk and others added 30 commits April 23, 2021 08:39
Pulling openfl develop changes
Rename tensorboard option

Co-authored-by: Ilya Trushkin <76161256+itrushkin@users.noreply.github.com>
If we don't define this callback in the notebook, it will be taken from the workspace, if it exists
Pulling changes from the main repo
It's necessary for logger to parse this tag
Collaborator logs metric task result, but sometimes collaborator and aggregator have different consoles (e.g. in the interactive API)
@itrushkin itrushkin marked this pull request as ready for review August 2, 2021 13:04

@igor-davidyuk igor-davidyuk left a comment

Please fix the integration with the shard_descriptor part.

enforce_image_hw: str = None) -> None:
"""Initialize KvasirShardDescriptor."""
super().__init__()
class KvasirDataset(Dataset):

There is no need for a PyTorch dataset here. I suggest removing all PyTorch mentions.


# Sharding
shard_idx = data_splitter.split(labels, self.world_size)[self.rank]
self.shard = Subset(dataset, shard_idx)

This line could be just:
self.images_names = [self.images_names[i] for i in shard_idx]
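
The suggestion above can be sketched without any PyTorch dependency as follows. The `EqualSplitter` stand-in, the `select_shard` helper, and the zero-based `rank` are assumptions made for this example; only the list comprehension mirrors the reviewer's line.

```python
import numpy as np


class EqualSplitter:
    """Hypothetical stand-in for the PR's data_splitter object."""

    def split(self, labels, num_collaborators):
        # One index array per collaborator, as equal in size as possible.
        return np.array_split(np.arange(len(labels)), num_collaborators)


def select_shard(images_names, labels, rank, world_size, data_splitter):
    """Keep only the file names that belong to this collaborator's shard."""
    shard_idx = data_splitter.split(labels, world_size)[rank]  # rank assumed 0-based
    return [images_names[i] for i in shard_idx]


names = [f"img_{i}.jpg" for i in range(10)]
labels = [0] * 10
shard = select_shard(names, labels, rank=1, world_size=3,
                     data_splitter=EqualSplitter())
print(shard)  # ['img_4.jpg', 'img_5.jpg', 'img_6.jpg']
```

Filtering the list of names keeps the shard descriptor framework-agnostic, since no `Subset` wrapper or `Dataset` class is needed.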

"""
self.shuffle = shuffle

def split(self, labels, num_collaborators):

It seems the signature has changed in this subclass.
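
One way to keep the signature stable, sketched under assumptions: the base class fixes `split(data, num_collaborators)` and each subclass takes its own options in `__init__`. The names `DataSplitter` and `SizeSkewedSplitter`, their parameters, and the size-skew logic (shard sizes proportional to log-normal draws) are hypothetical and do not reproduce the PR's implementation.

```python
from abc import ABC, abstractmethod

import numpy as np


class DataSplitter(ABC):
    """Hypothetical base class that fixes the split() signature."""

    @abstractmethod
    def split(self, data, num_collaborators):
        """Return one index array per collaborator."""


class SizeSkewedSplitter(DataSplitter):
    """Subclass-specific options live in __init__, not in split()."""

    def __init__(self, mu=0.0, sigma=1.0, seed=0):
        self.mu = mu
        self.sigma = sigma
        self.seed = seed

    def split(self, data, num_collaborators):  # same signature as the base class
        rng = np.random.default_rng(self.seed)
        # Shard sizes are proportional to log-normal draws (an illustrative choice).
        weights = rng.lognormal(self.mu, self.sigma, num_collaborators)
        cuts = (np.cumsum(weights / weights.sum())[:-1] * len(data)).astype(int)
        return np.split(rng.permutation(len(data)), cuts)


def shard_for(splitter: DataSplitter, data, rank, world_size):
    """Caller code works with any splitter because the signature is shared."""
    return splitter.split(data, world_size)[rank]
```

With a shared signature, a shard descriptor can accept any splitter instance and call it the same way, which is what makes the implementations interchangeable.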

@igor-davidyuk igor-davidyuk left a comment

Looks good to me!

@itrushkin

Jenkins please retry a build

@alexey-gruzdev alexey-gruzdev merged commit 44f42cc into develop Aug 25, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Aug 25, 2021
@alexey-gruzdev alexey-gruzdev deleted the unbalanced_federated_dataset branch August 26, 2021 21:04