How to generate prediction files with probabilities instead of chosen label for cost sensitive multi class regression #1082
Have you tried `--probabilities`?
I am doing cost-sensitive classification for multi-class sentence classification using logistic regression, and find there is very low recall (as with normal logistic regression). However, if I threshold the probabilities (so anything below 5% probability is an unknown class and so always wrong), I can use the threshold as a hyperparameter and raise my recall to an acceptable rate. I did this with the same problem in scikit-learn using the `predict_proba` method.
Also, a small point, but please could you put this example on the wiki?
@martinpopel what would my updated original command be with `--probabilities`?
OK, I did now: https://github.com/JohnLangford/vowpal_wabbit/wiki/Predicting-probabilities
train.dat:
label3 is the correct one as it has cost=0; all other labels have cost=1. For brevity I used just 4 labels, but you should include all 16. Each example must end with an empty line. test.dat:
The format is the same as the train data, just in real prediction you omit the costs (which are unknown). Edit: in newer VW versions, use `--probabilities`.
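A sketch of the csoaa_ldf data and commands being described (feature names and file names here are illustrative, not the exact ones from the wiki page):

```
# train.dat -- one "label:cost" line per candidate label,
# an empty line terminates each example
1:1 | feature_a feature_b
2:1 | feature_a
3:0 | feature_b feature_c
4:1 | feature_c

```

```
# test.dat -- same format, costs omitted
1 | feature_a feature_c
2 | feature_b
3 | feature_a
4 | feature_b feature_c

```

```
# train with logistic loss so scores can be mapped to probabilities;
# in newer VW versions, --probabilities is given at training time
vw --csoaa_ldf mc --loss_function=logistic -d train.dat -f model.vw
vw -t -i model.vw -d test.dat -p probs.txt
```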
@martinpopel thanks for writing that up in the wiki, I will do that next time! When predicting using your method I get:
Thus my prediction file is just blank.
FYI for Martin: I changed the prediction type for probabilities from float*. -John
The test command should be just
(i.e. without the `--probabilities` option). I am sorry for the confusion; I was using an older version of VW. Now in the newest version, the option is stored in the model, so it cannot be specified again at test time.
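With assumed file names, the test command being described is presumably just:

```
vw -t -i model.vw -d test.dat -p probs.txt
```

Since `--probabilities` was given at training time, it is saved in the model; repeating it at test time is what triggers the "cannot be specified more than once" error discussed below.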
Seeing how many users get hit by the "cannot be specified more than once" error, here are some options:
I agree. I think we need 3 kinds of arguments: static + saved (option cannot be specified more than once). -John
Let's move the discussion about "cannot be specified more than once" to a separate issue - #1084. This issue is about predicting probabilities for csoaa. @dhruvghulati mentioned a use case with a 5% threshold, but I think this could be done with simple post-processing of the predicted probabilities.
Hi @martinpopel, just to understand: do you mean that
In the same data format I had for csoaa?
@dhruvghulati the command lines you keep posting (since the start of this thread) give errors, e.g.:
Please post reproducible issues with the exact command lines and data sets that you actually run; it is hard to help otherwise. The number of classes must be given on the command line. There's a reference that can help you here:
(I post it this way because GH markdown URL-encodes tildes and firefox + that umd.edu server give a broken link)
@dhruvghulati If you have costs such as 0.211 and 0.34, then I'd say you want to predict these costs (and not probabilities - I am not sure how a probability would even be defined in such a case). The purpose of csoaa is that it outputs the label with the lowest cost. If you don't need this and want to report the cost for each label, you can treat it as standard regression, that is:
etc., and then don't use any csoaa option.
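A sketch of that regression encoding (costs and feature names are made up): each (sentence, label) pair becomes one plain-regression example whose target is the cost, with the label itself encoded as a feature:

```
0.211 | label_1 word_a word_b
0.34  | label_2 word_a word_b
0     | label_3 word_a word_b
```

Trained with plain `vw -d train.dat -f model.vw` (default squared loss), `-p` then emits one predicted cost per line; picking the label with the lowest predicted cost per instance recovers a csoaa-style decision.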
@martinpopel you've hit on what I want - I want a cost-sensitive model to output the label it thinks is correct, taking into account the costs I provided as inputs, and also output some sort of score or probability it assigned to that label prediction. I can then use those scores/probabilities to adjust the label prediction. If the model only outputs a score/probability for the label with the highest score or probability, that is fine (i.e. not outputting scores/probabilities for all the potential labels).

So let's say for test row 1 the cost-sensitive prediction is label "14" with probability/score 0.03, but in test row 2 it predicts "8" with score 0.98. I can set a threshold of 0.5 and then say that the row-1 prediction is actually "Null", but the second row's prediction is valid. This basically lets me control the recall/precision of my overall classification model.

@arielf the command-line stuff I put at the beginning of this thread does contain the number of potential labels (16) in the arguments. As for other examples, I merely followed what @martinpopel suggested for his adjusted commands with `--probabilities`.

If I have the above points clarified, this issue should be closed (sorry for taking all your time).
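The thresholding described here can be done as post-processing of the predictions file; a minimal sketch in Python, assuming the predictions have already been parsed into (label, probability) pairs and using "Null" as the catch-all label:

```python
def apply_threshold(predictions, threshold=0.5, null_label="Null"):
    """Replace low-confidence predictions with a catch-all label.

    `predictions` is a list of (label, probability) pairs, one per test row.
    """
    return [label if prob >= threshold else null_label
            for label, prob in predictions]

# The two rows from the comment: label "14" at 0.03 falls below the 0.5
# threshold and becomes "Null"; label "8" at 0.98 is kept.
rows = apply_threshold([("14", 0.03), ("8", 0.98)])
# rows == ["Null", "8"]
```

Sweeping `threshold` over held-out data is then exactly the hyperparameter search mentioned at the start of the thread.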
Thanks.
@dhruvghulati you can try `-r raw_predictions.txt` (raw predictions).
@arielf sorry, I think there's a miscommunication here. I know the right command for the first thing in this thread (it's all good and working).
The reason for putting in those additional lines was to show what I tried in order to get these raw predictions. @martinpopel yes, that makes sense and this is probably what I was looking for all along. I tried this for this input data (first 2 lines of the training set only):
And this test format (first 2 lines as an example):
And these were my command-line args:
And unfortunately the raw predictions file was (all lines are like this; these are the first 2 lines):
i.e. blank for the prediction on each row, instead of maybe being something like (let's say I had 4 classes overall, not 16):
What am I doing wrong? The data format is the usual csoaa format. I am trying to get the exact functionality of scikit-learn's `predict_proba`.
Now I see,
Hi @martinpopel, the first method I already tried above. For the second method, with regression, I get:
And so on for every instance. For a check, I also ran the normal prediction. For the regression method, given my code this would be very hard to refactor - I would have to collapse every row back together to give the highest-probability prediction for a given set of test features. I should have clarified my use case before, but I actually don't mind if the probability prediction just outputs the predicted label with its probability/score:
I've attached all my data so you can see for yourself - why is everything predicted as 11?
I have a similar question. I'm using the Java wrapper, and I'm creating a test-only learner using a model like this: Are there any plans to implement `--probabilities` there?
Why? You don't need to repeat the features for each label. Just put the shared features into a `shared` line.
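If this refers to the `shared` mechanism of VW's label-dependent-features format (an assumption on my part), the common features can be written once per example instead of being repeated on every label line:

```
shared | word_a word_b word_c
1:1 | label_1
2:0 | label_2
3:1 | label_3

```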
The original question seems to really be "How do you do cost-sensitive classification and get all the scores returned?" That's not currently supported, although it's not hard to imagine doing so. This is effectively supported now in the oaa code itself (as of yesterday). If anyone wants to tweak it to support this, go ahead.
I am using: https://github.com/JohnLangford/vowpal_wabbit/wiki/Cost-Sensitive-One-Against-All-%28csoaa%29-multi-class-example
With training of (example line):
And test of (example line):
Currently using:
But this is outputting a single predicted label from 1 to 16 for each test instance, instead of something like (test instance 1):
Which would be the softmax probabilities of each label for that instance (apologies if they don't sum to 1).
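The softmax normalization mentioned here can be reproduced in a few lines of Python, e.g. to post-process raw per-label scores (the scores below are made up, for 4 labels rather than 16):

```python
import math

def softmax(scores):
    """Convert raw per-label scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for 4 labels; label 4 (index 3) has the highest score,
# so it also receives the highest probability.
probs = softmax([1.2, 0.3, -0.8, 2.1])
```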