Feature/setfithead multi target #272

Merged (17 commits, Jan 19, 2023)

Conversation

Yongtae723
Copy link
Contributor

I tried to resolve the merge conflicts and fix some errors.

Please check whether this PR is what you intended.

Also, some tests failed for me... and I cannot figure out the reason...

Yongtae and others added 2 commits January 14, 2023 05:01
Member

@tomaarsen tomaarsen left a comment


Thanks for these changes! I think we're getting really close. I made some comments based on some recent PRs that got merged since #212 was made. In particular, removing support for numpy arrays in the differentiable head, and removing out_features=1: 2 is now the minimum.

I've made these changes and pushed them to this PR. Make sure to git pull them if you want to make more changes of your own. In short, all of the comments that I made with this review are now resolved (but you can still look at them if you want details on why I made some changes in 36f65bb).

As for your comments regarding the trainer.freeze(), I'm not sure what caused the issue, but it seems to be gone after I made my changes.

@tomaarsen
Member

I ran some experiments using multi-label classification with the different heads.

Dataset

Dataset generation script
from setfit import SetFitModel, SetFitTrainer, sample_dataset
from datasets import load_dataset

dataset = load_dataset("SetFit/hate_speech_offensive")


def to_multiclass(sample):
    """
    from
        (0: 'hate-speech', 1: 'offensive-language' or 2: 'neither')
    to
        ([1, 0]: 'hate-speech', [0, 1]: 'offensive-language' or [0, 0]: 'neither')
    """
    label = sample["label"]
    sample["label"] = [1 if label == 0 else 0, 1 if label == 1 else 0]
    return sample


# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8).map(to_multiclass)
eval_dataset = dataset["test"].map(to_multiclass)
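As a quick sanity check (not part of the original script), the converted labels can be inspected directly; the values shown are illustrative and assume the script above has been run:

# Illustrative check, assuming train_dataset/eval_dataset were built as above:
# each label is now a 2-element binary vector instead of a single class id.
print(train_dataset[0]["label"])  # e.g. [1, 0] ('hate-speech'), [0, 1] ('offensive-language') or [0, 0] ('neither')
print(eval_dataset[0]["label"])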

I want to point out that this isn't a very natural multi-label use of the dataset. That said, I couldn't find an actual multi-label dataset on the Hub.

Training Scripts

Logistic Regression head
trainer = SetFitTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
metrics = trainer.evaluate()
Differentiable head
trainer = SetFitTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train and evaluate
trainer.freeze() # Freeze the head
trainer.train() # Train only the body

# Unfreeze the head and freeze the body -> head-only training
trainer.unfreeze(keep_body_frozen=True)

trainer.train(
    num_epochs=25, # The number of epochs to train the head or the whole model (body and head)
    batch_size=16,
    body_learning_rate=1e-5, # The body's learning rate
    learning_rate=1e-2, # The head's learning rate
    l2_weight=0.0, # Weight decay on **both** the body and head. If `None`, will use 0.01.
)

And the model for testing was "sentence-transformers/paraphrase-mpnet-base-v2".
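For completeness, here is roughly how model would have been instantiated for the two setups above; this is a sketch based on the SetFit API of that era (multi_target_strategy, use_differentiable_head, head_params), not a verbatim excerpt from the experiment:

from setfit import SetFitModel

# Logistic regression head with a multi-target strategy
# ("multi-output", "one-vs-rest" or "classifier-chain"):
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    multi_target_strategy="one-vs-rest",
)

# Differentiable SetFitHead with two output targets:
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    multi_target_strategy="multi-output",
    head_params={"out_features": 2},
)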

Results

Model                                                  Evaluation Accuracy
Multilabel Logistic Regression ("multi-output")        0.5550 (0.0000)
Multilabel Logistic Regression ("one-vs-rest")         0.5550 (0.0000)
Multilabel Logistic Regression ("classifier-chain")    0.5580 (0.0000)
Multilabel Differentiable Head                         0.5849 (0.0044)

Notes:

  1. Evaluation accuracy is displayed as the mean (and standard deviation) over 5 executions.
  2. The same seed is used in SetFitTrainer and sample_dataset across all executions, so all executions work on the same training data.
  3. The Logistic Regression performance was identical across executions (perhaps due to the shared seed).
  4. For the differentiable head, there is no difference between the "multi-output" and "one-vs-rest" strategies, and "classifier-chain" is unsupported, so I only ran experiments with "multi-output".

To me, this indicates that the multilabel differentiable head performs equivalently to the logistic regression head. In other words, this PR seems to have successfully added multi-label classification support to SetFitHead! 🎉

  • Tom Aarsen

@tomaarsen tomaarsen added the "enhancement" (New feature or request) label on Jan 14, 2023
@Yongtae723
Contributor Author

Yongtae723 commented Jan 15, 2023

I'm always thankful for your thoughtful comments and edits (also your comments on previous issues).
Not only that, you even ran an experiment for us! Thank you!!!!

I will check your edits!
Thank you!

@Yongtae723
Contributor Author

@tomaarsen
I also think the README should be updated.

I think it would be better to submit the README edits in a separate PR.
What do you think?

@Yongtae723
Contributor Author

Yongtae723 commented Jan 15, 2023

I confirmed your changes! Thank you!

Also, I ran a similar multi-label experiment and got similar results.

So I think you can merge this into main!

Thank you!

@tomaarsen
Member

tomaarsen commented Jan 15, 2023

@tomaarsen I also think the README should be updated.

I think it would be better to submit the README edits in a separate PR. What do you think?

If the changes you have planned for the README relate to the changes from this PR, then I think they should be included in this PR. That way, the code and README get updated at the same time.

I'm glad to hear that your experiments work too!

@Yongtae723
Contributor Author

I got it!

I edited the README to reflect our changes.
Since I am not a native English speaker, my written English might be strange.
So please feel free to rewrite the README if my English is not correct.

@tomaarsen tomaarsen linked an issue Jan 18, 2023 that may be closed by this pull request
Member

@tomaarsen tomaarsen left a comment


I'm satisfied with almost everything in this PR, with one exception. I'm not sure what the best course of action is, nor what the "normal" approach for this is. Perhaps we could provide the SetFitDataset with a label_postprocessing function that converts the labels either to floats or to longs, depending on what is needed?
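To make that idea concrete, here is a rough sketch of what such a hook could look like; the names and the constructor wiring are hypothetical and only illustrate the suggestion, not what was eventually merged:

import torch

# Hypothetical label_postprocessing callables that SetFitDataset could accept
# and apply inside collate_fn, depending on the head/loss in use:
def to_long(labels: torch.Tensor) -> torch.Tensor:
    return labels.long()   # single-label classification (CrossEntropyLoss expects long class ids)

def to_float(labels: torch.Tensor) -> torch.Tensor:
    return labels.float()  # multi-label classification (BCEWithLogitsLoss expects float targets)

# Hypothetical usage: SetFitDataset(x_train, y_train, tokenizer, label_postprocessing=to_float)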

@Yongtae723
Contributor Author

Yongtae723 commented Jan 19, 2023

I understand your concern and agree with you.

I fixed some code to do what you want.

@@ -277,6 +277,7 @@ def collate_fn(batch):

     # convert to tensors
     features = {k: torch.Tensor(v).int() for k, v in features.items()}
-    labels = torch.Tensor(labels).long()
+    labels = torch.Tensor(labels)
+    labels = labels.long() if isinstance(label, int) else labels.float()
Contributor Author


My suggestion is to use the type of the label.

The type of a label should be 'int' for single-label classification, but 'List' for multi-label classification.

We could write

labels = torch.Tensor(labels).long() if isinstance(label, int) else torch.Tensor(labels).float()

but I felt that is too long, so I fixed it as pushed.

Contributor Author

@Yongtae723 Yongtae723 Jan 19, 2023


Or

labels = labels.long() if len(labels.size()) == 1 else labels.float()

whichever you want!

Member


I think the second solution is best!

labels = labels.long() if len(labels.size()) == 1 else labels.float()

That should accurately measure whether we are in a multitarget situation, even if the user accidentally supplies floats instead of integers.
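As a small standalone illustration (not code from the PR) of why the dimensionality check distinguishes the two cases even when the labels arrive as floats:

import torch

# Single-label batch: one class id per example -> 1-D tensor -> cast to long
single = torch.Tensor([0, 2, 1])
print(single.size())  # torch.Size([3])
print((single.long() if len(single.size()) == 1 else single.float()).dtype)  # torch.int64

# Multi-label batch: one binary vector per example -> 2-D tensor -> keep as float
multi = torch.Tensor([[1, 0], [0, 1], [0, 0]])
print(multi.size())  # torch.Size([3, 2])
print((multi.long() if len(multi.size()) == 1 else multi.float()).dtype)  # torch.float32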

Contributor Author


Oh, I see.

Your suggestion makes sense to me!
I will fix that!

Member

@tomaarsen tomaarsen left a comment


Looks good to me now! Thanks for making all of these changes! 🎉

Labels: enhancement (New feature or request)
Projects: None yet
Development: successfully merging this pull request may close the issue "I want to change the loss during multi label classification."
3 participants