Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaping colons for Solr - fields erroneously detected #84

Open
AlexAddisonLN opened this issue Apr 11, 2019 · 7 comments
Open

Escaping colons for Solr - fields erroneously detected #84

AlexAddisonLN opened this issue Apr 11, 2019 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@AlexAddisonLN
Copy link

We keep running into failures where Solr thinks we're specifying a field. While we could escape our data to avoid this, it's not trivial and the likely case is that this is happening accidentally for other users as well.

Would it be desirable to escape colons for Solr by default?
Would it be more desirable to add a flag to specify that colons should be escaped for Solr?

Happy to put the work in, but not sure what would be the most desirable behaviour.

@agazzarini
Copy link
Member

Hi @AlexAddisonLN thanks for entering this. I'm not sure I got your case. Could you please expand a bit?

@agazzarini agazzarini self-assigned this Apr 11, 2019
@agazzarini agazzarini added the bug Something isn't working label Apr 11, 2019
@AlexAddisonLN
Copy link
Author

Where a query has a colon in it, Solr is detecting it as a field and returning an error. The ratings file entry would look like this:

{
    "name": "delivered_document_search",
    "description": "Wonderings: a title",
    "queries": [
    {
        "template": "query_template.json",
        "placeholders": {
        "$query": "Wonderings: a title"
        }
    }
    ],
    "relevant_documents": {
    "3": [
        "id1",
        "id2",
        "id3",
        "id4"
    ]
    }
},

and the error looks like this:

[ERROR] Caught Solr exception :: Error from server at http://solr-location:8983/solr: undefined field Wonderings

In our case searches with colons are most likely because that's a document title in our data. It's not clear to me that we should necessarily remove punctuation altogether, as we may at some point have to deal with citations, which can have a wide range of punctuation in them.

While we can do some work to escape punctuation in our rated queries, I was wondering if it would be appropriate to do this escaping in RRE instead, perhaps controlled by a command-line flag or a line in the pom file.

@agazzarini
Copy link
Member

Great, so I agree with you, we need to escape such content.

Thanks again

@AlexAddisonLN
Copy link
Author

In that case, want me to work on it and submit a pull request when I think it's ready? I've been warned that it may not be possible or practical to fix it. I've also been told the same issue applies for ElasticSearch, although I don't have any personal experience of that.

I'm also concerned that addressing it may be a breaking change, unless we put it behind a flag of some form.

@worleydl
Copy link
Contributor

I'd be in favor of supporting escaping queries in RRE via configuration but default to having this functionality turned off. Lucene supports special characters for specifying various clauses and I think it'll cause headaches for users with such queries if we escape by default.

@AlexAddisonLN
Copy link
Author

I see - escaping every string that is injected into the template is not necessarily correct, as the user has freedom to put any string for any purpose in there. For simple purposes, like ours, escaping it is the right thing to do, but someone out there is guaranteed to have all of their stuff broken by this if it's turned on by default, and to make matters worse, it will be much harder for them to debug, because it's the complex cases that would break.

I can see three ways that we could present this change:

  • enable by a global flag
  • explicit escape flag next to the placeholder (making the JSON more complicated?)
  • some separate mechanism for simple queries, leaving placeholders as they are for the more complex cases

I think my favourite would be a global flag to enable placeholder escaping, either in the pom.xml or the ratings file.

@irwinmx-lng-con
Copy link

Had some more discussion on this and we think a flag for Engine-query mode vs User-query mode would be a good way to define this. The Engine-query mode (default) would not alter the query in any way. The latter User-query mode would escape punctuation. This is pretty much the same as Alex's proposal but makes a formal definition on the use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants