Participant #10: Team Ericsson-RISE, Ericsson & RISE #25

Open
chenzimin opened this issue Jun 13, 2018 · 3 comments
Labels
participant Participant of the CodRep-competition

Comments

@chenzimin
Collaborator

Created for Jesper and Olof from Ericsson and RISE for discussions. Welcome!

@chenzimin added the participant label on Jun 13, 2018
@jderehag

jderehag commented Oct 8, 2018


Copy-paste from an email sent to the organizers, shared here for the sake of openness:

I am sorry to say that, despite our efforts to beat string distance, we have failed.

We will therefore not submit any new final version; it would at best perform on par with our previous model while wasting a lot more cycles.

Just some background on what we have done for the sake of openness:

We set a challenging condition for ourselves: whatever model we trained, we wanted it to be a learned model that does not rely on any parser or AST.

The main model was a character-based bidirectional RNN with which we encoded each replacement line and each candidate line (±1 line for context).
We then used cosine similarity between these two embeddings to determine whether the candidate line should be replaced or not.
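
A minimal sketch of what such a siamese character-level encoder and cosine-similarity scoring might look like (PyTorch assumed; the layer sizes, ASCII encoding, and names below are illustrative guesses, not the team's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharLineEncoder(nn.Module):
    """Embed a line of code character by character with a bidirectional GRU."""
    def __init__(self, vocab_size=128, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        x = self.embed(char_ids)                  # (batch, seq_len, emb_dim)
        _, h = self.rnn(x)                        # h: (2, batch, hidden_dim)
        return torch.cat([h[0], h[1]], dim=-1)    # (batch, 2 * hidden_dim)

def to_char_ids(line, max_len=200):
    """Map characters to ASCII codes (assumption: ASCII-only input)."""
    return torch.tensor([[min(ord(c), 127) for c in line[:max_len]]])

encoder = CharLineEncoder()
replacement = encoder(to_char_ids("int total = a + b;"))
candidate = encoder(to_char_ids("int total=a+b ;"))
# Higher cosine similarity -> more likely that this candidate is the line to replace.
score = F.cosine_similarity(replacement, candidate)
print(float(score))
```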
Unfortunately it did not perform very well: that simple model had a recall@1 score of about 0.65 on dataset 4 (significantly lower than string distance), and its predictions had a very high correlation with string distance.
I think we are fairly confident that whatever it learned was something very closely related to string distance.
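
For comparison, ranking by plain string distance and measuring recall@1 can be illustrated roughly as follows (difflib's ratio stands in for whatever distance the official baseline uses; the data layout is hypothetical):

```python
import difflib

def similarity(a, b):
    """Plain string similarity in [0, 1]; 1.0 means identical lines."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def rank_candidates(replacement, candidates):
    """Indices of candidate lines, most similar to the replacement line first."""
    return sorted(range(len(candidates)),
                  key=lambda i: similarity(replacement, candidates[i]),
                  reverse=True)

def recall_at_1(tasks):
    """tasks: list of (replacement_line, candidate_lines, true_index) triples."""
    hits = sum(rank_candidates(r, c)[0] == t for r, c, t in tasks)
    return hits / len(tasks)

# Toy task with made-up data:
tasks = [("int total = a + b;",
          ["int total = a+b;", "return total;", "System.out.println(a);"],
          0)]
print(recall_at_1(tasks))  # 1.0 on this toy example
```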

We also tried to boost that with an ensemble architecture that took the replacement line as input (this time as bag-of-words features) and predicted which model (out of the several we tried) would perform best on that particular replacement line.
But since the RNN's predictions correlated so strongly with string distance, the two models were not very distinguishable.
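
A rough sketch of that model-selection idea (scikit-learn assumed; the training labels, tokenization, and classifier choice are illustrative assumptions, not the team's setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: replacement lines paired with the name of the
# model that performed best on each of them (labels would come from evaluation).
lines = ["return x + 1;", "public void setName(String name) {",
         "i++;", "} catch (IOException e) {"]
best_model = ["string_distance", "rnn", "string_distance", "rnn"]

# Bag-of-words over the replacement line -> which model to trust for that line.
selector = make_pipeline(CountVectorizer(token_pattern=r"\S+"),
                         LogisticRegression(max_iter=1000))
selector.fit(lines, best_model)

# At prediction time, route each replacement line to the predicted model.
print(selector.predict(["return y - 2;"]))
```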

That means that we will have to give up at this point.

Thank you for organizing this challenge and I think we have a new respect for ML in learning formal languages.

Thanks again,
Jesper & Olof

@monperrus
Collaborator

monperrus commented Oct 8, 2018 via email

@jderehag

jderehag commented Oct 9, 2018

I think we are fairly confident that whatever it learned was something very closely related to string distance.
This is interesting per se: that we can learn meaningful string-distance metrics, perhaps something equivalent to tf-idf, in a black-box manner.

Maybe, although we don't really know that; we haven't looked into the details that deeply.
Worth mentioning, though, is that we used a character-based model, so it does not necessarily learn words.
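
For what it's worth, that distinction can be illustrated with scikit-learn: character n-gram tf-idf sees two lines as close even when their whitespace-delimited tokens differ, whereas word-level tf-idf does not (the tf-idf framing is only an analogy from the discussion, not a confirmed account of what the RNN learned):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lines = ["int total = a + b;", "int total=a+b;", "return name.length();"]

# Character n-grams: no notion of words at all.
char_vecs = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(lines)
# Whitespace-delimited tokens as "words".
word_vecs = TfidfVectorizer(token_pattern=r"\S+").fit_transform(lines)

print(cosine_similarity(char_vecs[0], char_vecs[1]))  # high: nearly the same characters
print(cosine_similarity(word_vecs[0], word_vecs[1]))  # lower: only the token "int" is shared
```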
