Time value changes #861

Open
kwalcock opened this issue May 18, 2020 · 10 comments

@kwalcock
Member

I read the same set of documents on Thursday (May 14) and again on Monday (May 18), and the normalized times in the output have changed. For the sentence

    February 21, 2015 (ADDIS ABABA) - South Sudan peace talks aimed at ending the more than 14-month-long conflict in the young East African nation have been postponed until Monday.

the first reading is

        "@type" : "Word",
        "@id" : "_:Word_36",
        "text" : "Monday",
        "tag" : "NNP",
        "entity" : "DATE",
        "startOffset" : 191,
        "endOffset" : 197,
        "lemma" : "Monday",
        "chunk" : "B-NP",
        "norm" : "2015-02-23"
      }, {

and the second reading is

        "@type" : "Word",
        "@id" : "_:Word_36",
        "text" : "Monday",
        "tag" : "NNP",
        "entity" : "DATE",
        "startOffset" : 191,
        "endOffset" : 197,
        "lemma" : "Monday",
        "chunk" : "B-NP",
        "norm" : "2015-02-16"
      }, {

The document does have a DCT:

    "dct" : {
      "@type" : "DCT",
      "@id" : "_:DCT_1",
      "text" : "2015-02-22",
      "start" : "2015-02-22T00:00",
      "end" : "2015-02-23T00:00"
    },

This is with useNeuralParser = false. I don't think that anything has been changed in the configuration. Any idea what might cause this?

@kwalcock
Member Author

In case it helps, here is another example:

        "@type" : "Word",
        "@id" : "_:Word_65",
        "text" : "Tuesday",
        "tag" : "NNP",
        "entity" : "DATE",
        "startOffset" : 359,
        "endOffset" : 366,
        "lemma" : "Tuesday",
        "chunk" : "B-NP",
        "norm" : "2012-09-25"
      }, {

becomes

        "@type" : "Word",
        "@id" : "_:Word_65",
        "text" : "Tuesday",
        "tag" : "NNP",
        "entity" : "DATE",
        "startOffset" : 359,
        "endOffset" : 366,
        "lemma" : "Tuesday",
        "chunk" : "B-NP",
        "norm" : "2012-09-18"
      }, {

despite a DCT of

    "dct" : {
      "@type" : "DCT",
      "@id" : "_:DCT_1",
      "text" : "2012-09-22",
      "start" : "2012-09-22T00:00",
      "end" : "2012-09-23T00:00"
    },

The sentence is

    The flight, which was funded by a donation from the Netherlands, follows two other Dutch-funded charters on Tuesday 18 and Wednesday 19 last week carrying another 551 migrants.

@kwalcock
Member Author

        "@type" : "Word",
        "@id" : "_:Word_68",
        "text" : "Friday",
        "tag" : "NNP",
        "entity" : "DATE",
        "startOffset" : 363,
        "endOffset" : 369,
        "lemma" : "Friday",
        "chunk" : "B-NP",
        "norm" : "2010-07-16"
      }, {

becomes

        "@type" : "Word",
        "@id" : "_:Word_68",
        "text" : "Friday",
        "tag" : "NNP",
        "entity" : "DATE",
        "startOffset" : 363,
        "endOffset" : 369,
        "lemma" : "Friday",
        "chunk" : "B-NP",
        "norm" : "2010-07-23"
      }, {

for the text

    According to a statement by the Administration for Refugees and Returnees Affairs (ARRA) seen by Sudan tribune, some 122 Eritrean refugees were flown to the United States on Friday to lead a new life in there after being exiled for years in different camps in the northern Ethiopia not far from the borders to Eritrea.

given the DCT

    "dct" : {
      "@type" : "DCT",
      "@id" : "_:DCT_1",
      "text" : "2010-07-22",
      "start" : "2010-07-22T00:00",
      "end" : "2010-07-23T00:00"
    },

@kwalcock
Member Author

So it always seems to involve a day of the week, with possible confusion about whether the previous or the next occurrence of that day is chosen. I don't recall checking in any recent code changes in this area, especially not since last Thursday.
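
To make the two candidate readings concrete, here is a minimal Python sketch that resolves a weekday name against a DCT in two ways: to the nearest occurrence and to the most recent earlier occurrence. It is not SUTime's implementation. The function name, the mode strings, and the choice to treat "past" as strictly before the DCT are assumptions made for illustration. For all three examples above, the two values it produces are the two norms that show up in the output.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def resolve_weekday(name, dct, mode):
    """Resolve a weekday name relative to a document creation time (DCT).

    mode is "closest" (nearest occurrence in either direction) or
    "past" (most recent occurrence strictly before the DCT).
    Both modes are illustrative assumptions, not SUTime's actual rules.
    """
    target = WEEKDAYS.index(name)
    back = (dct.weekday() - target) % 7      # days back to the previous occurrence
    forward = (target - dct.weekday()) % 7   # days forward to the next occurrence
    if mode == "past":
        return dct - timedelta(days=back or 7)
    if mode == "closest":
        if back <= forward:
            return dct - timedelta(days=back)
        return dct + timedelta(days=forward)
    raise ValueError(f"unknown mode: {mode}")

examples = [
    ("Monday", date(2015, 2, 22)),
    ("Tuesday", date(2012, 9, 22)),
    ("Friday", date(2010, 7, 22)),
]
for name, dct in examples:
    closest = resolve_weekday(name, dct, "closest")
    past = resolve_weekday(name, dct, "past")
    print(f"{name} {dct} closest={closest} past={past}")
# Monday 2015-02-22 closest=2015-02-23 past=2015-02-16
# Tuesday 2012-09-22 closest=2012-09-25 past=2012-09-18
# Friday 2010-07-22 closest=2010-07-23 past=2010-07-16
```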

EgoLaparra transferred this issue from clulab/timenorm on May 18, 2020
@EgoLaparra
Contributor

@kwalcock, I've moved the issue here since the problem is not caused by the neural parser.
I will do some tests, but, since norm values come from processors, this seems to be caused by SUTime.

@MihaiSurdeanu
Contributor

Fwiw, it seems to me that the second reading should be the correct one, since it references a time before publication.

I think this is related to the heuristic in SUTime that resolves days of the week such as "Monday". But I can't see why this would change, if we didn't change CoreNLP versions...
@BeckySharp : do you know?

@kwalcock
Member Author

Thanks for moving it to the right place. I'll try to see if it can be reproduced, perhaps on the same day, so that I'm absolutely certain that the code hasn't changed.

@kwalcock
Member Author

This phenomenon does appear to be repeatable. I'm trying to isolate the situation.

@kwalcock
Member Author

If Eidos reads, serially, the file 1742d787c22e9873c4bf9558e456ddd2, then 73f374515fed56aac5979d847591a7f8, and then 1742d787c22e9873c4bf9558e456ddd2 again, the two reads of the same file differ. Something must be keeping state around. The last time something like this happened, a Stanford component was running into an unknown word, noting it, and then no longer treating it as unknown the next time around, so it behaved differently. I think that problem would show up when the same file was read twice in a row. That's not the case here.

1742d787c22e9873c4bf9558e456ddd2.json.txt
73f374515fed56aac5979d847591a7f8.json.txt
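
The distinction between "same file twice in a row" and "same file with another file in between" can be checked mechanically. The following is a rough Python sketch of such a check, assuming only that the annotator is a callable from text to a comparable result. The `make_leaky_annotator` helper is a hypothetical stand-in with a made-up trigger. It is not Eidos, Processors, or SUTime.

```python
def reads_agree(annotate, target, between=()):
    """Annotate target, then every text in between, then target again.

    Returns True when both annotations of target are equal, i.e. when
    nothing read in the meantime changed how target is handled.
    """
    first = annotate(target)
    for other in between:
        annotate(other)
    second = annotate(target)
    return first == second

def make_leaky_annotator():
    """Hypothetical annotator whose output depends on earlier inputs."""
    state = {"resolve_to": "closest"}

    def annotate(text):
        used = state["resolve_to"]
        if "last week" in text:  # made-up trigger, for illustration only
            state["resolve_to"] = "past"
        return (text, used)

    return annotate

a = "Libya has collapsed, a UNHCR spokeswoman said on Tuesday."
b = ("The flight follows two other Dutch-funded charters on "
     "Tuesday 18 and Wednesday 19 last week.")

print(reads_agree(make_leaky_annotator(), a))       # True: back-to-back reads match
print(reads_agree(make_leaky_annotator(), a, [b]))  # False: reading b in between changes a
```

Running the real pipeline through a harness of this shape for every pair of documents would be one way to narrow down which input triggers the change.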

@kwalcock
Member Author

These two texts are enough to reproduce the problem:

The flight follows two other Dutch-funded charters on Tuesday 18 and Wednesday 19 last week. A third charter left Yemen earlier this month.

and

Libya has collapsed, a UNHCR spokeswoman said on Tuesday.

They do not need to go through Eidos. A pass through Processors is enough. Only these stages are necessary:

  • tagPartsOfSpeech(doc)
  • lemmatize(doc)
  • recognizeNamedEntities(doc)

@kwalcock
Member Author

It looks like an edu.stanford.nlp.ling.tokensregex.Env is being maintained. It has a variable for TUESDAY whose value in turn has tags. The resolveTo tag is initially missing, so the default value of SUTime.RESOLVE_TO_CLOSEST is used. Sometime later in execution, that gets changed to RESOLVE_TO_PAST. It seems that this setting gets incorporated into the environment and is then not reset or cleared properly.
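
The suspected mechanism can be illustrated with a toy model in Python. This is only a sketch of the pattern described above: a long-lived environment holds per-variable tags, a missing tag falls back to a default, and a rule later writes the tag into the shared object. `ToyEnv`, `resolve_to`, and `apply_rule` are hypothetical names, and nothing here reproduces the real tokensregex or SUTime API. The last part shows one possible remedy, working on a copy so that the shared environment stays pristine.

```python
import copy

RESOLVE_TO_CLOSEST = "closest"
RESOLVE_TO_PAST = "past"

class ToyEnv:
    """Toy stand-in for a rule environment that outlives a single document."""
    def __init__(self):
        # Each variable maps to its tags; TUESDAY starts with no resolveTo tag.
        self.tags = {"TUESDAY": {}}

def resolve_to(env, name):
    """Return the variable's resolveTo tag, or the default when it is missing."""
    return env.tags[name].get("resolveTo", RESOLVE_TO_CLOSEST)

def apply_rule(env, name):
    """Stand-in for whatever rule sets the tag while annotating some document."""
    env.tags[name]["resolveTo"] = RESOLVE_TO_PAST

# Shared environment: the tag written for one document is seen by the next.
shared = ToyEnv()
before = resolve_to(shared, "TUESDAY")
apply_rule(shared, "TUESDAY")
after = resolve_to(shared, "TUESDAY")

# Isolated use: mutate a deep copy and leave the original untouched.
pristine = ToyEnv()
scratch = copy.deepcopy(pristine)
apply_rule(scratch, "TUESDAY")
isolated = resolve_to(pristine, "TUESDAY")

print(before, after, isolated)  # closest past closest
```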
