I want to know the correct syntax for data to be indexed in ES. #3
Comments
@mohaNnnn docs in your |
Hi,
# Elasticsearch server
elasticsearch:
  # host where Elasticsearch server is running
  host: "127.0.0.1"
  # port on which Elasticsearch server is listening
  port: 9200
#redis storage
# Corpus to use
corpus:
  # name of the Elasticsearch index where the corpus is stored
  index: "products"
  # name of the Elasticsearch document type where the corpus is stored
  type: "product"
  # list of document fields to generate phrases from
  text_fields: ["description"]
timeoutMonitorFrequency: 3600000
# number of documents to process at a time
processingPageSize: 10
# indicate whether to start annotating from scratch
annotateFromScratch: True
# indicate whether to generate shingles
indexPhrases: True
# indicate whether to generate postags
getPosTags: True
# Processors (add custom processors to list of modules)
processor:
  # name of the Elasticsearch index where annotated text is stored by the processors
  index: "products__annotated"
  # name of the Elasticsearch document type where annotated text is stored by the processors
  type: "product"
  # list of processor modules
  modules:
# Generation
generator:
  # training set file path (relative to the location of this config file)
  trainingPhrasesFilePath: "training-phrases.csv"
  # hold-out set file path (relative to the location of this config file)
  holdOutPhrasesFilePath: "hold-out-phrases.csv"
  # maximum number of words in generated phrase
  maxShingleSize: 3
  # minimum number of words in generated phrase
  minShingleSize: 2
  # list of features to extract
  features:
  # precision of numerical features
  floatPrecision: 4
# logger config
logger:
  # directory where log files are written (relative to the location of this config file)
  logsDir: "Documents/bayzee/config/logs" |
I see that your |
Hi, |
When I run the generate Phrases part of worker, I am getting the following error. |
Hi, |
You should've another type called Following is an example of how these documents should look. This is how an example corpus looks (note: index is named
{
"_index": "bayzee_test",
"_type": "quote",
"_id": "1",
"_source": {
"text": "If I had asked people what they wanted, they would have said, \"a faster horse\""
}
}
This is how a document that stores the pos tagged sentences looks:
{
"_index": "bayzee_test__annotated",
"_type": "quote",
"_id": "1",
"_source": {
"pos_tagged_sentences": [
[
["if", "IN"],
["i", "VBN"],
["had", "VBD"],
["asked", "VBN"],
["people", "NNS"],
["what", "WP"],
["they", "PRP"],
["wanted", "VBD"],
[",", ","],
["they", "PRP"],
["would", "MD"],
["have", "VB"],
["said", "VBD"],
[",", ","],
["``", "``"],
["a", "DT"],
["faster", "JJ"],
["horse", "NN"],
["''", "''"]
]
]
}
}
This is how a document with the phrase with its extracted features looks:
{
"_index": "bayzee_test__annotated",
"_type": "quote__phrase",
"_id": "would-have-said",
"_source": {
"phrase": "would have said",
"features": {
"avg_term_frequency": "1.0000",
"non_alpha_chars": "0",
"middle_pos_tag": "VB",
"max_term_frequency": "1.0000",
"first_pos_tag": "MD",
"last_pos_tag": "VB",
"pos_tags": "MDVBVB",
"doc_count": "1.0000",
"avg_score": "0.2301",
"avg_word_length": "4.33",
"max_score": "0.2301"
},
"document_id": "1",
"phrase__not_analyzed": "would have said"
}
} |
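The three sample documents above can be turned into a quick sanity check. The following is a minimal Python sketch, not bayzee's own code: the helper names (`annotated_index`, `phrase_type`, `is_valid_pos_doc`, `is_valid_phrase_doc`) are hypothetical, and the `__annotated` / `__phrase` suffixes are an assumption read off the sample index and type names shown in this thread.

```python
def annotated_index(corpus_index):
    # Assumption: the annotated index is the corpus index plus "__annotated",
    # as in "bayzee_test" -> "bayzee_test__annotated" in the samples.
    return corpus_index + "__annotated"


def phrase_type(corpus_type):
    # Assumption: phrase documents use the corpus type plus "__phrase",
    # as in "quote" -> "quote__phrase" in the samples.
    return corpus_type + "__phrase"


def is_valid_pos_doc(source):
    # A POS-tagged document holds a non-empty list of sentences; each
    # sentence is a non-empty list of [token, tag] string pairs.
    sentences = source.get("pos_tagged_sentences")
    if not isinstance(sentences, list) or not sentences:
        return False
    for sentence in sentences:
        if not isinstance(sentence, list) or not sentence:
            return False
        for pair in sentence:
            if not (isinstance(pair, list) and len(pair) == 2
                    and all(isinstance(item, str) for item in pair)):
                return False
    return True


def is_valid_phrase_doc(source):
    # A phrase document carries the keys seen in the sample, with
    # "features" being a mapping of feature name to value.
    required = ("phrase", "features", "document_id", "phrase__not_analyzed")
    return (all(key in source for key in required)
            and isinstance(source["features"], dict))
```

Running `is_valid_pos_doc` over the `_source` of each annotated document is a quick way to spot documents whose `pos_tagged_sentences` list came back empty.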
Hi, |
Do the docs in your |
Yes. Below is a sample. |
I think you are using the example training and holdout phrases. These files should have phrases from your corpus that are manually labelled by you. The ones in the repo are given as an example. Once you populate these files with a sampled set of phrases for training and hold-out, please run generation again. |
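One way to prepare those two files is to sample phrases from your own corpus, split them, and then label them by hand. The sketch below is only an illustration: `split_phrases` and `write_for_labelling` are hypothetical helpers, and the two-column layout (phrase, then an empty label to fill in manually) is an assumption, so match whatever layout the example CSV files in the repo actually use.

```python
import csv
import random


def split_phrases(phrases, holdout_fraction=0.2, seed=0):
    # Shuffle a copy with a fixed seed so the split is reproducible,
    # then carve off the hold-out portion from the front.
    shuffled = list(phrases)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def write_for_labelling(path, phrases):
    # Assumed layout: one phrase per row with an empty second column
    # that is filled in manually afterwards.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for phrase in phrases:
            writer.writerow([phrase, ""])
```

For example, `training, hold_out = split_phrases(sampled_phrases)` followed by `write_for_labelling("training-phrases.csv", training)` and `write_for_labelling("hold-out-phrases.csv", hold_out)` would produce files with the names used in the config posted earlier in this thread.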
Thanks for the help. It worked. |
Hi,
My products index is shown below:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "products",
"_type" : "product",
"_id" : "1",
"_score" : 1.0,
"_source":{"annotation": "doc1", "text": "I bought new bread"}
}, {
"_index" : "products",
"_type" : "product",
"_id" : "2",
"_score" : 1.0,
"_source":{"annotation": "doc2", "text": "John bought honey carob powder and it was good"}
}, {
"_index" : "products",
"_type" : "product",
"_id" : "3",
"_score" : 1.0,
"_source":{"text": "John bought honey carob powder and it was good.Check the curry instant at the store.", "annotation": "doc3"}
} ]
}
}
Can someone tell me whether this format is correct? If it is wrong, please share the correct syntax.
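For reference, documents shaped like the hits above could be indexed through the Elasticsearch bulk endpoint (`_bulk`), which takes newline-delimited JSON: one action line followed by one source line per document. The following is a minimal Python sketch of building that request body; `build_bulk_body` is a hypothetical helper, and the `_type` key in the action line assumes an older Elasticsearch version that still supports document types, as the one in this thread does.

```python
import json


def build_bulk_body(docs, index, doc_type):
    # docs is an iterable of (document id, source dict) pairs.
    lines = []
    for doc_id, source in docs:
        # Action line: tells Elasticsearch where the next line is indexed.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    [
        ("1", {"annotation": "doc1", "text": "I bought new bread"}),
        ("2", {"annotation": "doc2",
               "text": "John bought honey carob powder and it was good"}),
    ],
    index="products",
    doc_type="product",
)
```

The resulting `body` string can then be sent as the payload of a POST request to the `_bulk` endpoint of the server.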
My annotator code runs, but the generator code stops partway through.
This is what my products__annotated index looks like:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "1",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
}, {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "2",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
}, {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "3",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
} ]
}
}
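Every `pos_tagged_sentences` list in the output above is empty. One thing worth checking is that the config posted in this thread sets `text_fields: ["description"]`, while the product documents keep their text in a field named `text`. Whether that mismatch is what leaves the annotations empty is an assumption and is not confirmed anywhere in the thread. The Python sketch below, with hypothetical helper names, shows how both conditions could be checked.

```python
def missing_text_fields(text_fields, source):
    # Configured fields that are absent or empty in a corpus document.
    return [field for field in text_fields if not source.get(field)]


def empty_annotation_ids(hits):
    # Ids of annotated documents whose pos_tagged_sentences list is
    # missing or empty.
    return [hit["_id"] for hit in hits
            if not hit["_source"].get("pos_tagged_sentences")]


# Values copied from the config and search responses in this thread.
configured_fields = ["description"]
corpus_source = {"annotation": "doc1", "text": "I bought new bread"}
annotated_hits = [
    {"_id": "1", "_source": {"pos_tagged_sentences": []}},
    {"_id": "2", "_source": {"pos_tagged_sentences": []}},
    {"_id": "3", "_source": {"pos_tagged_sentences": []}},
]

print(missing_text_fields(configured_fields, corpus_source))
print(empty_annotation_ids(annotated_hits))
```

If the first call reports a configured field as missing, either the documents or the `text_fields` setting would need to change so that the two agree.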