Participant %4: @tdurieux, INRIA #16

Closed
tdurieux opened this issue Apr 19, 2018 · 20 comments
Labels
participant Participant of the CodRep-competition

Comments

@tdurieux
Contributor

Hi all,

I just put together a quick naive solution based on string distance:

| Dataset | Perfect Match | In Top 10 | Recall | Loss |
|---------|---------------|-----------|--------|------|
| Bench 1 | 3791 | 4322 (98%) | 0.86 | 0.13615878141899027 |
| Bench 2 | 9910 | 10805 (97%) | 0.89 | 0.10263617900182995 |
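
Roughly, the idea is to score every line of the file by its similarity to the proposed change and rank lines by that score. Here is a minimal sketch of that baseline (using Python's standard-library difflib as the similarity measure; the actual distance in my tool may differ):

# Minimal sketch of the string-distance baseline; SequenceMatcher
# stands in for whatever distance the real tool uses.
import difflib

def rank_lines(change, file_lines):
    # Score each candidate line by its similarity to the proposed change.
    scored = [
        (difflib.SequenceMatcher(None, change, line).ratio(), i + 1)
        for i, line in enumerate(file_lines)
    ]
    # Most similar first; ties broken by smaller line number.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [line_no for _, line_no in scored]  # 1-indexed, best first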
@chenzimin changed the title Result Naive Solution → Participant %4: Thomas, INRIA Apr 19, 2018
@chenzimin added the participant Participant of the CodRep-competition label Apr 19, 2018
@monperrus
Collaborator

CodRep is like CandyCrush: there are plenty of easy tasks at the beginning, but it is super hard at the end :-)

@monperrus changed the title Participant %4: Thomas, INRIA → Participant %4: @tdurieux, INRIA May 18, 2018
@tdurieux
Contributor Author

tdurieux commented May 18, 2018

Hey @monperrus,

The ranking does not contain my latest score (see the Google group); the old results were put back with this commit: 4b3af7a

Dataset 1: 0.1143165305556653
Dataset 2: 0.08536293667171425

@monperrus
Collaborator

monperrus commented May 18, 2018 via email

@tdurieux reopened this May 21, 2018
@monperrus
Collaborator

Thanks for the pull request to update your score!

@monperrus
Collaborator

The README is now up to date with your top scores!

@tdurieux
Contributor Author

tdurieux commented May 25, 2018

@chenzimin @monperrus
My results for the new benchmark:

| Dataset | Perfect Match | In Top 10 | Loss |
|---------|---------------|-----------|------|
| Dataset 3 | 17410/18633 (93%) | 18438 (98%) | 0.06407823026301902 |

@chenzimin
Collaborator

@tdurieux
Good job, I have updated the ranking accordingly.

@tdurieux
Contributor Author

tdurieux commented Aug 22, 2018

I did a small update to my project; here are my results.
(I have to keep the best score, haha 😄)

| Dataset | Perfect Match | In Top 10 | Loss |
|---------|---------------|-----------|------|
| Dataset 1 | 3934 | 4334 (98%) | 0.10400918447628195 |
| Dataset 2 | 10123 | 10850 (98%) | 0.08405196035050681 |
| Dataset 3 | 17520 | 18454 (99%) | 0.05824092991175515 |
| Dataset 4 | 15786 | 16936 (98%) | 0.07700153400317002 |

@cesarsotovalero is still better on Dataset 4.

Execution time: < 1 min per dataset

Interesting fact: when the same line appears several times in the file, it is better to select the last occurrence. I have no clue why. WDYT?
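
A tiny illustration of that tie-breaking rule (hypothetical code, not from my tool): among equally scored lines, keep the one with the largest line number.

# Hypothetical tie-breaking sketch: among lines with the same best score,
# prefer the last occurrence in the file. A plain scores.index(max(scores))
# would return the first occurrence instead.
def best_line(scores):
    best = max(scores)
    return max(i + 1 for i, s in enumerate(scores) if s == best)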

@monperrus
Collaborator

monperrus commented Aug 22, 2018 via email

@tdurieux
Contributor Author

tdurieux commented Oct 1, 2018

My final results:

| Dataset | Perfect Match | In Top 10 | Loss |
|---------|---------------|-----------|------|
| Dataset 1 | 3967 | 4346 (98%) | 0.09654287550022056 |
| Dataset 2 | 11069 | 10908 (98%) | 0.07930336605281313 |
| Dataset 3 | 17753 | 18498 (99%) | 0.04631417984476575 |
| Dataset 4 | 16074 | 17031 (99%) | 0.06053843584998035 |

@monperrus
Collaborator

monperrus commented Oct 1, 2018 via email

@tdurieux
Contributor Author

tdurieux commented Oct 1, 2018

I don't think we can do much better with my approach; I achieved more than 90% perfect predictions on the 4 benchmarks.
I think it may be possible to do better with ML using the features I extracted.

@jjhenkel

Hi @tdurieux,

Congrats on the smart features and overall technique! After you released your code I went
ahead and made some small modifications so that I could test the performance of a hybrid model
that uses your features (plus some more simple text-distance features) and our learning-to-rank
approach.

I also did a small grid-search over some hyper-parameters of this model to get a better understanding of what parameters work well.

A model with the following hyper-parameters trained for a relatively short time over your features
can produce an incremental improvement on the 5th dataset:

# Hyper-parameters / Model
# Leaves       128
# Trees        2000
# Min. Support 8
# Stop At      30 
# Model        LambdaMART
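
For reference, those settings map roughly onto a RankLib LambdaMART invocation like the one below (my best guess at the flag mapping, not the exact command I ran):

# Assumed RankLib command; -ranker 6 selects LambdaMART.
java -jar RankLib.jar -train train.letor -validate valid.letor \
     -ranker 6 -leaf 128 -tree 2000 -mls 8 -estop 30 \
     -metric2t MAP -save model.txt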

root@442d3980f24a:/> python3 /app/src/guesser.py /tmp/scores.txt /data/Dataset5/Tasks | python3 /app/src/evaluate.py -d /data/Dataset5
Total files: 18366
Average line error: 0.07180536507565463 (the lower, the better)
Recall@1: 0.9273657846019819 (the higher, the better)

I think this could be pushed even further by offering more features (essentially all the features we both used) and perhaps by increasing the capacity of the model. (From my grid search it seems that, in general, incremental increases in the number of leaves allowed in the trees improve performance on the validation set. To avoid overfitting, I've switched from training/validating on datasets 1, 2, 3, and 4 to training on datasets 1, 2, and 3 and validating on dataset 4.)
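
For anyone unfamiliar with the learning-to-rank setup: the training data is one row per candidate line, grouped by task, in the LETOR/SVMrank format that RankLib reads. Illustrative rows only; the feature values and task ids are made up:

# <label> qid:<task id> <feature>:<value> ...
# label 1 = the line the patch actually changed, 0 = any other line
1 qid:101 1:0.93 2:0.12 3:0.40 # task 101, line 37
0 qid:101 1:0.55 2:0.88 3:0.10 # task 101, line 38
0 qid:102 1:0.20 2:0.05 3:0.70 # task 102, line 4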

@tdurieux
Contributor Author

Hi @jjhenkel,

It is really nice that you looked at my features (sorry, the code is a little bit messy) and succeeded in building a model from them. It is impressive.

Is there a way for you to tell which features matter most for the prediction?

For feature extraction, I compute a high-level AST of each line (basically the type of each code element: VARIABLE, CLASS, STRING, COMMENT, NUMBER, TOKEN, KEYWORD). Do you think it could be used in a model?

@jjhenkel

Hi @tdurieux,

The feature importance is a good question! I'm not entirely sure yet. I think RankLib has a set of tools for looking at the importance of individual features (for certain supported models), but I haven't tried doing anything with that yet.

The approximate parse you do for each line is a nice asset! It could maybe be used to create some set-similarity features. There are techniques that could take advantage of that information and also avoid any hand-designed features/filters. I haven't investigated any of that seriously yet, but there has been some interesting work on embedding edits that could maybe be applied to this challenge.
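
As a concrete (hypothetical) example, a set-similarity feature over your per-line token types could be as simple as the Jaccard index:

# Hypothetical set-similarity feature over per-line token types
# (VARIABLE, CLASS, STRING, ...) produced by the approximate parse.
def jaccard(a, b):
    # Jaccard index of two sets; define it as 1.0 when both are empty.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# e.g. jaccard({"VARIABLE", "KEYWORD"}, {"VARIABLE", "STRING"}) == 1/3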

@monperrus
Collaborator

monperrus commented Nov 13, 2018 via email

@jjhenkel

This paper is the one I'm thinking of: https://arxiv.org/abs/1810.13337. The embedded edits could be fed directly to the learning-to-rank algorithms or to some other neural architecture, and in doing so any hand-made features could be avoided. No idea if this would work, but it's an interesting approach I've wanted to try.

@monperrus
Collaborator

monperrus commented Nov 13, 2018 via email

@tdurieux
Contributor Author

@monperrus you should update the ranking with the new best score ;)

@chenzimin
Collaborator

@tdurieux Done!
