How to generate prediction files with probabilities instead of chosen label for cost sensitive multi class regression #1082

Closed
dhruvghulati-zz opened this issue Aug 18, 2016 · 25 comments

Comments

@dhruvghulati-zz

dhruvghulati-zz commented Aug 18, 2016

I am using: https://github.com/JohnLangford/vowpal_wabbit/wiki/Cost-Sensitive-One-Against-All-%28csoaa%29-multi-class-example

With training of (example line):

1:1 2:1 3:0 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 0| cat mat sat

And test of (example line):

0| the mat is red

Currently using:

vw --csoaa --loss_function=logistic 16 open_cost_1.dat -f csoaa.model
vw -t -i --link=logistic csoaa.model test.dat -p open_cost_1.predict

But this is outputting a single prediction of a label from 1 to 16 for each test instance, instead of something like (test instance 1):

1:0.53 2:0.21 3:0.121 4:0.98 5:0.12 6:0.78 7:0.11 8:0.89 9:0.34 10:0.03 11:0.09 12:0.08 13:0.89 14:0.13 15:0.034 16:0.078 0

These would be the softmax probabilities of each label for that instance (apologies if they don't sum to 1).

@martinpopel
Contributor

martinpopel commented Aug 19, 2016

Have you tried --probabilities? It works with --csoaa_ldf=mc, see test 110. It could be added for --csoaa as well, if someone comes up with a use case where it is needed.

@dhruvghulati-zz
Author

I am doing cost-sensitive multi-class sentence classification using logistic regression, and I find that recall is very low (as with normal logistic regression). However, if I threshold the probabilities (so that anything below 5% probability is treated as an unknown class and is therefore always wrong), I can use the threshold as a hyperparameter and raise my recall to an acceptable rate. I did this for the same problem in scikit-learn using the LogisticRegression.predict_proba() method and it works well.

@dhruvghulati-zz
Author

dhruvghulati-zz commented Aug 19, 2016

Also, a small point, but could you please put --probabilities in the documentation? If you google "Vowpal Probabilities instead of Predictions", the --link=logistic parameter comes up, and it is never made clear where on the command line it should appear :)

@dhruvghulati-zz
Author

@martinpopel what would my updated original command be with --csoaa_ldf=mc?

@martinpopel
Contributor

> could you put --probabilities in the documentation?

OK, I did now: https://github.com/JohnLangford/vowpal_wabbit/wiki/Predicting-probabilities
(It's a wiki, so you or anyone else can improve it.)

@martinpopel
Contributor

martinpopel commented Aug 19, 2016

> what would my updated original command be with --csoaa_ldf=mc?

train.dat:

shared | cat mat sat
1:1 | label1
2:1 | label2
3:0 | label3
4:1 | label4

shared | another example
1:1 | label1
2:0 | label2
3:1 | label3
4:1 | label4

label3 is the correct one as it has cost=0, all other labels have cost=1. For brevity, I used just 4 labels, but you should include all 16. Each example must end with an empty line.

test.dat:

shared | the mat is red
1 | label1
2 | label2
3 | label3
4 | label4

The format is the same as for the training data, except that for real prediction you omit the costs (which are unknown).

Commands:
vw --csoaa_ldf=mc --loss_function=logistic -d train.dat -f csoaa_ldf.model --probabilities
vw -t -i csoaa_ldf.model -d test.dat -p probs.predict --probabilities

When using --probabilities during training (or testing with gold costs available), a multi-class logistic loss is reported.

Edit: in newer VW versions, --probabilities should not be used when testing (this option is stored in the model).
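If your data is already in the --csoaa format, the conversion to the layout shown above can be scripted. The following is only a minimal sketch, not part of VW: the function name csoaa_to_ldf and the label1, label2, ... feature names are illustrative assumptions (the latter simply mirror the example above). It also assumes that any token before the "|" that has no colon is a tag, and it drops that tag.

```python
def csoaa_to_ldf(line, num_classes=None):
    """Convert one --csoaa example line into a multi-line csoaa_ldf example.

    Hypothetical helper: tokens such as "3:0" before the "|" are read as
    label:cost pairs; tokens without ":" (assumed to be a tag) are dropped.
    """
    labels_part, _, features = line.partition("|")
    costs = {}
    for token in labels_part.split():
        if ":" in token:
            label, cost = token.split(":", 1)
            costs[int(label)] = cost
    if costs:
        labels = sorted(costs)
    elif num_classes is not None:
        # Test data without costs: emit one line per possible label.
        labels = range(1, num_classes + 1)
    else:
        raise ValueError("no costs found; pass num_classes for test data")

    out = ["shared | " + features.strip()]
    for label in labels:
        if label in costs:
            out.append(f"{label}:{costs[label]} | label{label}")
        else:
            out.append(f"{label} | label{label}")
    out.append("")  # each example must end with an empty line
    return "\n".join(out) + "\n"


print(csoaa_to_ldf("1:1 2:1 3:0 4:1 0| cat mat sat"), end="")
print(csoaa_to_ldf("0| the mat is red", num_classes=4), end="")
```

For the real data, num_classes would be 16 and every training line would carry all 16 label:cost pairs.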

@dhruvghulati-zz
Author

dhruvghulati-zz commented Aug 19, 2016

@martinpopel thanks for writing up in that Wiki, I will do that for next time! When predicting using your method I get:

final_regressor = data/output/zero/cost_test/csoaa_ldf.model
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = data/output/zero/cost_test/closed_cost_1closed_ld.dat
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
1.000000 1.000000            1            1.0    known        1      640
0.500000 0.000000            2            2.0    known        3      144
0.750000 1.000000            4            4.0    known        3      384
0.875000 1.000000            8            8.0    known        7      192
0.937500 1.000000           16           16.0    known        3      672
0.906250 0.875000           32           32.0    known        6      272
0.906250 0.906250           64           64.0    known       11      992
0.890625 0.875000          128          128.0    known       11      416
0.867188 0.843750          256          256.0    known       11      176
0.865234 0.863281          512          512.0    known       11      256
0.858398 0.851562         1024         1024.0    known       11      432
0.849121 0.839844         2048         2048.0    known       11      192
0.840088 0.831055         4096         4096.0    known       11      608
0.841431 0.842773         8192         8192.0    known       11      160

finished run
number of examples per pass = 15000
passes used = 1
weighted example sum = 15000.000000
weighted label sum = 0.000000
average loss = 0.842800
average multiclass log loss = 2.565777
total feature number = 5659248
only testing
predictions = data/output/zero/cost_test/probs.predict

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = nan
total feature number = 0
vw: option '--probabilities' cannot be specified more than once

Thus my prediction file is just blank.

@JohnLangford
Member

FYI for Martin: I changed the prediction type for probabilities from float*
to v_array. This makes the array self-delimiting and makes it so we
don't need to alloc/dealloc for each prediction.

-John


@martinpopel
Contributor

> vw: option '--probabilities' cannot be specified more than once

The test command should be just

vw -t -i csoaa_ldf.model -d test.dat -p probs.predict

(i.e. without the --probabilities).

I am sorry for the confusion. I was using an older version of VW. In the newest version, the option --probabilities is stored in the model, so it does not need to be repeated when testing (in fact it cannot be repeated, otherwise you get the "option '--probabilities' cannot be specified more than once" error you have seen). I've updated the wiki page accordingly.

@arielf
Collaborator

arielf commented Aug 19, 2016

Seeing how many users get hit by the "cannot be specified more than once" error, I'm thinking maybe we should rethink it.

Some options:

  • Last one wins (command line overrides model) with possible undesired consequences
  • Check and if the two 'specified' values for the option are the same, don't abort
  • A better idea?

@JohnLangford
Member

I agree.

I think we need 3 kinds of arguments: static + saved (option cannot
change), mutable + saved (option can change), and not saved.

-John


@martinpopel
Contributor

Let's move the discussion about "cannot be specified more than once" to a separate issue - #1084.

This issue is about predicting probabilities for --csoaa, which is not supported at the moment (a workaround exists with --csoaa_ldf=mc --probabilities, but it requires changing the input data format).

When implementing --probabilities, I originally wanted to make it work for --csoaa as well. However, I was lazy (I needed csoaa_ldf=mc only for myself) and I knew it makes sense only if the cost-sensitivity is not actually used (i.e. if exactly one label is always the "correct" one with cost=0 and the rest have cost=1).

@dhruvghulati mentioned a use case with 5% threshold, but I think this could be done with --oaa --probabilities.
If this is true and if there is no use case for --csoaa --probabilities, I think we can close this issue.

@dhruvghulati-zz
Author

dhruvghulati-zz commented Aug 20, 2016

Hi @martinpopel, just to understand: do you mean that csoaa_ldf is not actually taking the costs into account? The example in this issue is just a naive baseline I use, but I actually do need the costs to be taken into account, e.g. I have versions like 1:0.211 2:0.34 3:0.056 4:0.03. In that case, if I do:

vw --oaa --loss_function=logistic -d train.dat -f oaa.model --probabilities
vw -t -i oaa.model -d test.dat -p probs.predict

With the same data format I had for csoaa, will I achieve what I want? Or is csoaa_ldf=mc what I need?

@arielf
Collaborator

arielf commented Aug 20, 2016

@dhruvghulati the command-lines you keep posting (since the start of this thread) give errors. e.g:

vw --oaa --loss_function=logistic -d train.dat -f oaa.model --probabilities
...
Error: the argument ('--loss_function=logistic') for option '--oaa' is invalid

Please post reproducible issues with the exact command lines and data sets that you actually run. It is hard to help otherwise.

The number of classes <k> must always follow the multiclass option --oaa or --csoaa, --csoaa_ldf etc. You may use vw --help for options usage.

There's a reference that can help you here:

https://www.umiacs.umd.edu/~hal/tmp/multiclassVW.html

(I post it this way because GH markdown URL-encodes tildes and firefox + that umd.edu server give a broken link)

@martinpopel
Contributor

@dhruvghulati If you have costs such as 0.211 and 0.34, then I'd say you want to predict these costs (and not probabilities; I am not sure how a probability would be defined in such a case). The purpose of csoaa is that it outputs the label with the lowest cost. If you don't need this and want to report the cost for each label, you can treat it as standard regression, that is

0.211 | cat mat sat label1
0.340 | cat mat sat label2

etc.

and then don't use any csoaa and use squared loss (which is the default) instead of the logistic loss.
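Expanding existing --csoaa lines into this regression layout can be scripted. Below is a minimal sketch under stated assumptions: the function name csoaa_to_regression is hypothetical, the label1, label2, ... feature names follow the two example lines above, and any token without a colon before the "|" is assumed to be a tag and is dropped.

```python
def csoaa_to_regression(line):
    """Expand one --csoaa line into one regression line per label.

    Hypothetical helper: each "label:cost" pair becomes
    "<cost> | <shared features> label<label>".
    """
    labels_part, _, features = line.partition("|")
    features = features.strip()
    rows = []
    for token in labels_part.split():
        if ":" not in token:
            continue  # assumed to be a tag, not a label:cost pair
        label, cost = token.split(":", 1)
        rows.append(f"{cost} | {features} label{label}")
    return rows


for row in csoaa_to_regression("1:0.211 2:0.34 | cat mat sat"):
    print(row)
```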

@dhruvghulati-zz
Author

@martinpopel you've hit on what I want: I want a cost-sensitive model to output the label it predicts, taking into account the costs I provided as inputs, together with some sort of score or probability it assigned to that prediction. I can then use those scores/probabilities to slightly change the label prediction. If the model only outputs a score/probability for the label with the highest score/probability (i.e. not for all the potential labels), that is fine.

So let's say that for test row 1 the cost-sensitive model predicted label "14" with a probability/score of 0.03, but for test row 2 it predicted "8" with a score of 0.98. I can set a threshold of 0.5 and then say that the row 1 prediction is actually "Null", but the second row prediction is valid. This basically helps me affect the recall/precision of my overall classification model.
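The thresholding step described here can be sketched in a few lines. This is only an illustration: the function name apply_threshold and the (label, score) tuple layout are assumptions, not something VW produces directly.

```python
def apply_threshold(predictions, threshold=0.5, null_label="Null"):
    """Replace low-confidence predictions with a null label.

    predictions: list of (label, score) pairs, one per test row
    (a hypothetical layout chosen for this sketch).
    """
    return [
        label if score >= threshold else null_label
        for label, score in predictions
    ]


# The two rows from the example: label "14" at 0.03 and label "8" at 0.98.
print(apply_threshold([("14", 0.03), ("8", 0.98)]))
```

The threshold can then be tuned as a hyperparameter on held-out data.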

@arielf the command line stuff I put at the beginning of this thread does contain the number of potential labels (16) in the arguments. As for other examples, I merely followed what @martinpopel suggested for his adjusted commands with --csoaa_ldf=mc which didn't contain 16 or a number of potential labels. Are you saying what Martin suggested is wrong?

If I have the above points clarified this issue should be closed (sorry for taking all your time).

@arielf
Collaborator

arielf commented Aug 21, 2016

@dhruvghulati

  1. Trying to reproduce your example, verbatim (cut and pasted from your original issue):
$ cat open_cost_1.dat
1:1 2:1 3:0 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 0| cat mat sat

$ vw --csoaa --loss_function=logistic 16 open_cost_1.dat -f csoaa.model
final_regressor = csoaa.model
Error: the argument ('--loss_function=logistic') for option '--csoaa' is invalid

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
total feature number = 0
vw: the argument ('--loss_function=logistic') for option '--csoaa' is invalid
  2. As for why it is failing, please reread my previous comment.

  3. Have you tried to read the reference I gave above? I think it can help you understand vw multiclass formats and avoid errors (and questions).

Thanks

@martinpopel
Contributor

--csoaa needs the number of classes (e.g. --csoaa 16).
--csoaa_ldf=mc does not need the number of classes, as this can be different for each example and it is inferred from the number of lines of the given example.

@martinpopel
Contributor

@dhruvghulati you can try --csoaa 16 -r raw_predictions.txt (I cannot test it now, but it should give scores for each class).

@dhruvghulati-zz
Author

dhruvghulati-zz commented Aug 21, 2016

@arielf sorry, I think there is a miscommunication here. I know the right command for the first thing in this thread (it's all good and working).

vw --csoaa 16 open_cost_1.dat -f csoaa.model
vw -t -i csoaa.model test.dat -p open_cost_1.predict

The reason for putting in those additional lines was showing what I tried to get these raw predictions.

@martinpopel yes, that makes sense and this is probably what I was looking for all along. I tried this for this input data (first 2 lines of training set only):

1:1 2:1 3:0 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 0| cents idc chart include sales 
1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:0 12:1 13:1 14:1 15:1 16:1 1| location_slot turkey france 

And this test format (example first 2 lines):

0| population location_slot estimated 
1| location_slot pump price

And these were my command line args:

vw --csoaa 16 -d closed_cost_1.dat -f csoaa.model
vw -t -i csoaa.model -d test.dat -r closed_cost_1_raw.dat

And unfortunately the raw predictions file was (all lines are like this, these are 1st 2 lines):

 0
 1

i.e. the prediction for each row is blank, instead of being something like (let's say I had 4 classes overall, not 16):

1:0.04 2:0.067 3:0.56 4:0.91 0
1:0.32 2:0.181 3:0.091 4:0.053 1

What am I doing wrong? The data format is the usual csoaa data format.

I am trying to get the exact functionality of predict_proba in http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

@martinpopel
Contributor

Now I see, -r raw_predictions.txt does not work with --csoaa. So the options are:

  • Keep the csoaa data format, use --audit when testing and grep the raw scores from stdout (e.g. vw -t -i csoaa.model -d test.dat --audit | grep -P '^\d' > scores.txt).
  • Change the data format to the one expected by csoaa_ldf and use --probabilities, as I suggested above.
  • Change the data format to simple regression (repeating the shared features for all classes), as I suggested above.

@dhruvghulati-zz
Author

Hi @martinpopel, the first (--audit) method did not work; I ended up with a blank file. For sanity I'm including the full training and test files for both the csoaa_ldf format and the normal csoaa format.

For the second method with csoaa_ldf, I obtained an output file in exactly the format I wanted (thank you!!), but for some reason label 11 has the highest probability for every test instance, so it is always the prediction?


0.040561
0.066941
0.063240
0.117838
0.028892
0.109743
0.049672
0.009981
0.116857
0.050539
0.158474
0.030862
0.051209
0.050329
0.046200
0.008664

0.040187
0.066655
0.062926
0.118478
0.028563
0.110168
0.049299
0.009832
0.117469
0.050167
0.160582
0.030521
0.050839
0.049958
0.045823
0.008533

And so on for every instance. As a check, I ran the normal csoaa code without probabilities and have attached my predictions (which are not all 11s, unlike the csoaa_ldf case).

For the regression method, given my code this would be very hard to refactor - I would have to collapse every row back together to give the highest probability prediction for a given set of test features.
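The collapsing step may be smaller than it sounds if the regression rows for each test instance are written consecutively and in label order. The sketch below assumes exactly that layout and a fixed number of classes; the function name lowest_cost_labels is hypothetical. Since the regression approach predicts costs, it picks the label with the lowest predicted value rather than the highest.

```python
def lowest_cost_labels(predicted_costs, num_classes):
    """Collapse per-(instance, label) predictions back to one label per instance.

    predicted_costs: flat list of floats, num_classes consecutive values per
    test instance, ordered by label 1..num_classes (an assumption of this
    sketch). Returns a list of (label, predicted_cost) pairs.
    """
    if len(predicted_costs) % num_classes != 0:
        raise ValueError("number of predictions is not a multiple of num_classes")
    results = []
    for start in range(0, len(predicted_costs), num_classes):
        group = predicted_costs[start:start + num_classes]
        best = min(range(num_classes), key=group.__getitem__)
        results.append((best + 1, group[best]))  # labels are 1-based
    return results


print(lowest_cost_labels([0.9, 0.2, 0.7, 0.1, 0.8, 0.6], num_classes=3))
```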

I should have changed my use case before, but I actually don't mind if the probability prediction just outputs the predicted label with its probability/score:

3:0.56 testinstance1
1:0.32 testinstance2
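That "label:probability" summary can be derived from the probabilities file shown earlier in this comment, which has one probability per line and a blank line between instances. The following is a minimal sketch; the function names are hypothetical, and it assumes the i-th line of each block belongs to label i (the order of the label lines in the test example).

```python
def read_probability_blocks(text):
    """Split the predictions text into one list of floats per instance."""
    blocks, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(float(line))
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def top_label(probs):
    """Return (1-based label, probability) for the most probable label."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best]


sample = "0.1\n0.6\n0.3\n\n0.7\n0.2\n0.1\n"
for i, probs in enumerate(read_probability_blocks(sample), start=1):
    label, prob = top_label(probs)
    print(f"{label}:{prob} testinstance{i}")
```

For a real run, the text would come from the file passed to -p (probs.predict in the commands above).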

I've attached all my data so you can see for yourself - why is everything predicted as 11?
closed_cost_1_words_ld.txt
closed_cost_1_words.txt
probs_csoaa.txt
probs_ld_actual.txt
probs_ld.txt
words_ldf_closed_test.txt
words_test.txt

@MNovak12

I have a similar question. I'm using the Java wrapper, and csoaa. I'm using 10-30 word sentences as features, so I don't want to use csoaa_ldf. I need to use the costs for each class instead of just the class with the lowest cost. My problem is that I need to write the raw prediction to a file from your C++ code and then read from that file in my Java code, which slows down my program.

I'm creating a test-only learner using a model like this:
VWLearners.create("-i src/test/resources/multiclass.model -t -r /<filename>")

Are there any plans to implement --probabilities for csoaa, or some way for Vowpal to return the raw scores instead of class with the lowest cost?

@martinpopel
Contributor

> I'm using 10-30 word sentences as features, so I don't want to use csoaa_ldf

Why? You don't need to repeat the features for each label. Just put them into a shared namespace (see my examples above).

@JohnLangford
Member

The original question seems to really be "How do you do cost sensitive classification and get all the scores returned"? That's not currently supported, although it's not hard to imagine doing so. This is effectively supported now in the oaa code itself (as of yesterday). If anyone wants to tweak to support this, go ahead.
