Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Within Session splitter #664

Open
wants to merge 49 commits into
base: develop
Choose a base branch
from

Conversation

brunaafl
Copy link
Collaborator

This PR is a follow-up on PR #624 and is related to issue #612. It includes just the implementation for the WithinSessionSplitter data splitter.

brunaafl and others added 30 commits June 5, 2024 21:56
Deleting unified_eval, so it can be addressed on another pr. Working on tests.
Signed-off-by: Bruna Junqueira Lopes <99977808+brunaafl@users.noreply.github.com>
# Conflicts:
#	moabb/evaluations/metasplitters.py
#	moabb/tests/metasplits.py
Add shuffle and random_state parameters to WithinSession
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better represent only one session @brunaafl

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can evaluate this splitter now @tomMoral

Copy link
Collaborator

@PierreGtch PierreGtch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @brunaafl,
Thanks for this PR, it looks good!
I left one comment regarding a test I think you should add

@pytest.mark.parametrize("shuffle", [True, False])
@pytest.mark.parametrize("random_state", [0, 42])
def test_within_session(shuffle, random_state):
X, y, metadata = paradigm.get_data(dataset=dataset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important to check if the split is the same when we load the data of one/a few subject(s) only, paradigm.get_data(dataset=dataset, subjects=[m, n...])

Copy link
Collaborator

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A frist batch of comments. Overall, it looks good but I think it could be improved by adding the possibility to pass in a cv object, that would allow to control the intrasession splits (for instance doing TimeSeriesSplit, which makes sense in an online setting).

Comment on lines +9 to +10
and test sets on separate session for each subject. This splitter assumes that
all data from all subjects is already known and loaded.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and test sets on separate session for each subject. This splitter assumes that
all data from all subjects is already known and loaded.
and test sets for each subject in each session. This splitter
assumes that all data from all subjects is already known and loaded.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not using only the other version?

Parameters
----------
n_folds : int
Number of folds. Must be at least 2.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Number of folds. Must be at least 2.
Number of folds for the within-session k-fold split. Must be at least 2.

Comment on lines +66 to +68
random_state: int = 42,
shuffle_subjects: bool = False,
shuffle_session: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default random_state should be None. Also, convention would be to put it at the end of the argument list.

Suggested change
random_state: int = 42,
shuffle_subjects: bool = False,
shuffle_session: bool = True,
shuffle_subjects: bool = False,
shuffle_session: bool = True,
random_state: int = None,

Comment on lines +21 to +23
random_state: int, RandomState instance or None, default=None
Important when `shuffle` is True. Controls the randomness of splits.
Pass an int for reproducible output across multiple function calls.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move at the end.

Suggested change
random_state: int, RandomState instance or None, default=None
Important when `shuffle` is True. Controls the randomness of splits.
Pass an int for reproducible output across multiple function calls.
random_state: int, RandomState instance or None, default=None
Controls the randomness of splits. Only used when `shuffle` is True.
Pass an int for reproducible output across multiple function calls.

Comment on lines +24 to +29
shuffle_session : bool, default=True
Whether to shuffle each class's samples before splitting into batches.
Note that the samples within each split will not be shuffled.
shuffle_subjects : bool, default=False
Apply shuffle in mixing subjects and sessions, this parameter allows
sample iterations of the sppliter.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is necessary to have both? I don't really see any use case where I would only use one no?
I would only have a shuffle

self.n_folds = n_folds
self.shuffle_subjects = shuffle_subjects
self.shuffle_session = shuffle_session
self.random_state = check_random_state(random_state)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use it like that, this is not a random_state anymore but a rng.


def split(self, y, metadata, **kwargs):
all_index = metadata.index.values
subjects = metadata.subject.unique()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, avoid using the .col notation in pandas, this is super error prone and hard to read, use the indexation:

Suggested change
subjects = metadata.subject.unique()
subjects = metadata['subject'].unique()

Comment on lines +87 to +109
for subject in subjects:
subject_mask = metadata.subject == subject
subject_indices = all_index[subject_mask]
subject_metadata = metadata[subject_mask]
sessions = subject_metadata.session.unique()

# Shuffle sessions if required
if self.shuffle_session:
self.random_state.shuffle(sessions)

for session in sessions:
session_mask = subject_metadata.session == session
indices = subject_indices[session_mask]
group_y = y[indices]

# Use StratifiedKFold with the group-specific random state
cv = StratifiedKFold(
n_splits=self.n_folds,
shuffle=self.shuffle_session,
random_state=self.random_state,
)
for ix_train, ix_test in cv.split(indices, group_y):
yield indices[ix_train], indices[ix_test]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talk a bit with @sylvchev and I think the best would be to modify this, to take a cv object in the constructor (default would be StratifiedKFold), clone it with a different random seed for each group subject, session, and then yield the right indices.

That way, we can do a real shuffle, with shuffling the groups from which we retrieve the next split.
Would this make sense?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, I'm working on that, thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! Thank you so much! And thanks for your patience :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants