Participant %4: @tdurieux, INRIA #16
Comments
CodRep is like CandyCrush, there are plenty of easy tasks at the beginning but it is super hard at |
Hey @monperrus, the ranking does not contain my latest score (see the Google group); you put the old results in with this commit: 4b3af7a Dataset 1: 0.1143165305556653 |
could you do a pull request?
|
Thanks for the pull request to update your score! |
the readme is now up to date with your top scores! |
@chenzimin @monperrus
|
@tdurieux |
I did a small update on my project, here are my results.
@cesarsotovalero is still better on dataset 4. Execution time: < 1 min per dataset. Interesting fact: when the same line appears several times in the file, it is better to select the last one. I have no clue why. WDYT? |
Interesting fact: when the same line appears several times in the file, it is better to select
the last one.
Indeed fun
I have no clue why. WDYT?
One idea: because code tends to be added to the end of the file, and new code tends to be more buggy
than old code.
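For readers who want to try the "select the last one" tie-break, here is a minimal sketch. It assumes each line of the file already has a numeric score (higher means more likely to be the replaced line); the function name `pick_last_best` and the score list are hypothetical and not taken from any participant's code.

```python
def pick_last_best(scores):
    """Return the 1-based line number with the highest score.

    When several lines share the top score (e.g. identical lines),
    the last one wins, as suggested in the thread.
    Returns None for an empty file.
    """
    best_line = None
    best_score = float("-inf")
    for line_no, score in enumerate(scores, start=1):
        # ">=" rather than ">" so that a later line replaces an earlier tie.
        if score >= best_score:
            best_line, best_score = line_no, score
    return best_line


if __name__ == "__main__":
    # Lines 2 and 4 are tied for the best score; line 4 is chosen.
    print(pick_last_best([0.2, 0.9, 0.1, 0.9]))
```

Switching `>=` to `>` gives the "first occurrence" behaviour, which makes it easy to compare the two choices on a dataset.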
|
My final results:
|
not bad!
|
I don't think we can do much better with my approach; I achieved more than 90% perfect predictions on the 4 benchmarks. |
Hi @tdurieux, Congrats on the smart features and overall technique! After you released your code I went I also did a small grid-search over some hyper-parameters of this model to get a better understanding of what parameters work well. A model with the following hyper-parameters trained for a relatively short time over your features
# Hyper-parameters / Model
# Leaves 128
# Trees 2000
# Min. Support 8
# Stop At 30
# Model LambdaMART
root@442d3980f24a:/> python3 /app/src/guesser.py /tmp/scores.txt /data/Dataset5/Tasks | python3 /app/src/evaluate.py -d /data/Dataset5
Total files: 18366
Average line error: 0.07180536507565463 (the lower, the better)
Recall@1: 0.9273657846019819 (the higher, the better)
I think this could be pushed even further by offering more features (essentially all the features we |
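As a rough illustration of the two numbers reported above, the sketch below computes an average line error and Recall@1 from predicted and expected line numbers. It assumes the per-task line error is `tanh(abs(expected - predicted))`, which is my recollection of the CodRep loss and should be checked against the official `evaluate.py`; the function name `evaluate` and its inputs are hypothetical.

```python
import math


def evaluate(predicted, expected):
    """Return (average line error, Recall@1) for parallel lists of line numbers.

    Assumption: the per-task error is tanh(|expected - predicted|), so a
    perfect guess costs 0 and a far-off guess costs close to 1.
    """
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    errors = [math.tanh(abs(e - p)) for p, e in zip(predicted, expected)]
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return sum(errors) / len(errors), hits / len(expected)


if __name__ == "__main__":
    avg_error, recall_at_1 = evaluate([12, 40, 7], [12, 42, 7])
    print(f"Average line error: {avg_error} (the lower, the better)")
    print(f"Recall@1: {recall_at_1} (the higher, the better)")
```

Under this assumption the two metrics move together: every exact hit contributes 0 to the error, so a Recall@1 above 0.9 bounds the average error below 0.1.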
Hi @jjhenkel, It is really nice that you looked at my features (sorry, they are a little bit messy) and succeeded in creating a model from them. It is impressive. Is there a way for you to know which features are most important for the prediction? To extract the features, I extract a high-level AST of each line (basically the type of each code element: VARIABLE, CLASS, STRING, COMMENT, NUMBER, TOKEN, KEYWORD). Do you think it can be used in a model? |
Hi @tdurieux, The feature importance is a good question! I'm not entirely sure yet---I think RankLib has a set of tools for looking at the importance of individual features (for certain supported models) but I haven't tried doing anything with that yet. The approximate parse you do for each line is a nice asset! They could maybe be used to create some set similarity based features. There are some techniques that could take advantage of that information and also avoid any hand-designed features/filters. I haven't investigated any of that seriously yet but there's been some interesting work on embedding edits that could maybe be applied to this challenge. |
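One way to read "set similarity based features" is a Jaccard similarity between the element types of the candidate line and those of the replacement line. The sketch below is only an illustration under that assumption: the type labels are the ones listed earlier in the thread, while the function name `jaccard` and the example type sequences are hypothetical.

```python
def jaccard(types_a, types_b):
    """Jaccard similarity between two collections of element types.

    Returns |A & B| / |A | B|, and 1.0 when both collections are empty
    (two empty lines are treated as identical).
    """
    a, b = set(types_a), set(types_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical type sequences for an old line and its replacement.
    old_line = ["KEYWORD", "VARIABLE", "TOKEN", "NUMBER"]
    new_line = ["KEYWORD", "VARIABLE", "TOKEN", "STRING"]
    print(jaccard(old_line, new_line))
```

The resulting value could be appended to the feature vector given to the ranker; a multiset variant would additionally capture how often each type occurs.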
Thanks Jordan for the follow-up, as Thomas said, that's indeed really interesting.
there's been some interesting work on embedding edits
Which papers are you thinking of?
|
This paper is the one I'm thinking of: https://arxiv.org/abs/1810.13337. The embedded edits could be fed directly to the learning to rank algorithms or to some other neural architecture and, in doing so, any hand-made features could be avoided. No idea if this would work---but, it's an interesting approach I've wanted to try. |
This paper is on my radar! We need an implementation of it for Java patches.
|
@monperrus you should update the ranking with the new best score ;) |
@tdurieux Done! |
Hi all,
I just did a quick naive solution based on string distance:
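The snippet that followed this comment is not preserved here. Purely as an illustration of what a string-distance baseline can look like (this is not the poster's code), the sketch below picks the file line most similar to the replacement line, using the similarity ratio from Python's standard `difflib.SequenceMatcher` in place of an explicit edit distance. The function name `guess_line` and the sample file are hypothetical.

```python
from difflib import SequenceMatcher


def guess_line(replacement, file_lines):
    """Return the 1-based number of the line most similar to `replacement`.

    Similarity is difflib's ratio in [0, 1] on whitespace-stripped text.
    Ties go to the last matching line. Returns None for an empty file.
    """
    target = replacement.strip()
    best_line, best_ratio = None, -1.0
    for line_no, line in enumerate(file_lines, start=1):
        ratio = SequenceMatcher(None, line.strip(), target).ratio()
        if ratio >= best_ratio:
            best_line, best_ratio = line_no, ratio
    return best_line


if __name__ == "__main__":
    source = ["int a = 0;", "int b = 1;", "return a + b;"]
    print(guess_line("return a - b;", source))
```

A one-character change leaves most of the line intact, so the replaced line usually has the highest ratio; the heuristic fails mainly when the file contains near-duplicate lines.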