I want to know the correct syntax for data to be indexed in ES. #3

mohanarunachalam · 2016-07-26T06:05:14Z

Hi,
My products index is shown below:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "products",
"_type" : "product",
"_id" : "1",
"_score" : 1.0,
"_source":{"annotation": "doc1", "text": "I bought new bread"}
}, {
"_index" : "products",
"_type" : "product",
"_id" : "2",
"_score" : 1.0,
"_source":{"annotation": "doc2", "text": "John bought honey carob powder and it was good"}
}, {
"_index" : "products",
"_type" : "product",
"_id" : "3",
"_score" : 1.0,
"_source":{"text": "John bought honey carob powder and it was good.Check the curry instant at the store.", "annotation": "doc3"}
} ]
}
}
Can someone tell me whether this format is correct? If it is wrong, please share the correct syntax.
My annotator code runs, but the generator code stops partway through.
This is what my products__annotated index looks like:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "1",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
}, {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "2",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
}, {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "3",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
} ]
}
}

maheshyellai · 2016-07-27T04:57:52Z

@mohaNnnn The docs in your products__annotated index don't look right. It's possible you are using an incorrect config. Can you share your config.yml? Please refer to the Configuration section of the README for info on how to configure bayzee.

mohanarunachalam · 2016-07-27T11:58:14Z

Hi,
PFA the config file.
It's pretty much the same as yours.
I am connecting to a local ES only.
I mainly want to know the initial step, i.e. manually indexing the data in ES.
Can you please provide a sample of it?

# Elasticsearch server
elasticsearch:
  # host where Elasticsearch server is running
  host: "127.0.0.1"
  # port on which Elasticsearch server is listening
  port: 9200

# redis storage
redis:
  host: "127.0.0.1"
  port: 6379

# Corpus to use
corpus:
  # name of the Elasticsearch index where the corpus is stored
  index: "products"
  # name of the Elasticsearch document type where the corpus is stored
  type: "product"
  # list of document fields to generate phrases from
  text_fields: ["description"]

timeoutMonitorFrequency: 3600000
# number of documents to process at a time
processingPageSize: 10
# indicate whether to start annotating from scratch
annotateFromScratch: True
# indicate whether to generate shingles
indexPhrases: True
# indicate whether to generate postags
getPosTags: True

# Processors (add custom processors to list of modules)
processor:
  # name of the Elasticsearch index where annotated text is stored by the processors
  index: "products__annotated"
  # name of the Elasticsearch document type where annotated text is stored by the processors
  type: "product"
  # list of processor modules
  modules:
    # standard bayzee processor to POS tag english text
    # name of the processor
    - name: "pos_processor"
      # path to the python module (relative to the location of this config file)
      path: "../lib/pos-processor.py"
      # features that this processor extracts
      features:
        - name: "pos_tags"
          isNumerical: False
        - name: "first_pos_tag"
          isNumerical: False
        - name: "middle_pos_tag"
          isNumerical: False
        - name: "last_pos_tag"
          isNumerical: False
        - name: "avg_word_length"
          isNumerical: True
        - name: "non_alpha_chars"
          isNumerical: True

# Generation
generator:
  # training set file path (relative to the location of this config file)
  trainingPhrasesFilePath: "training-phrases.csv"
  # hold-out set file path (relative to the location of this config file)
  holdOutPhrasesFilePath: "hold-out-phrases.csv"
  # maximum number of words in generated phrase
  maxShingleSize: 3
  # minimum number of words in generated phrase
  minShingleSize: 2
  # list of features to extract
  features:
    - name: "doc_count"
      isNumerical: True
    - name: "max_term_frequency"
      isNumerical: True
    - name: "avg_term_frequency"
      isNumerical: True
    - name: "max_score"
      isNumerical: True
    - name: "avg_score"
      isNumerical: True
  # precision of numerical features
  floatPrecision: 4

# logger config
logger:
  # directory where log files are written (relative to the location of this config file)
  logsDir: "Documents/bayzee/config/logs"

maheshyellai · 2016-07-29T05:59:01Z

I see that your product type has a field called text, but text_fields in your configuration points to description. Please change text_fields in your config file to text so that it matches the field in the product type.
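
A mismatch like this can be caught before running the annotator. The following is only a sketch and not part of bayzee: `missing_text_fields` is a hypothetical helper that compares a document's `_source` with the `text_fields` list from the config.

```python
def missing_text_fields(source, text_fields):
    """Return the configured fields that are absent or empty in a document's _source."""
    return [field for field in text_fields if not source.get(field)]


# _source of document 1 from the products index posted earlier in this thread
doc_source = {"annotation": "doc1", "text": "I bought new bread"}

# With text_fields: ["description"] the configured field is reported as missing
print(missing_text_fields(doc_source, ["description"]))
# With text_fields: ["text"] nothing is reported
print(missing_text_fields(doc_source, ["text"]))
```

An empty result means every configured field has some text to annotate.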

mohanarunachalam · 2016-08-01T06:10:33Z

Hi,
I am posting the updated products__annotated index below. Please let me know if this is correct.
"hits" : [ {
"_index" : "products__annotated",
"_type" : "product__phrase",
"_id" : "john-bought",
"_score" : 1.0,
"_source":{"phrase": "john bought", "document_id": "2", "phrase__not_analyzed": "john bought"}
}, {
"_index" : "products__annotated",
"_type" : "product__phrase",
"_id" : "bought-honey-carob",
"_score" : 1.0,
"_source":{"phrase": "bought honey carob", "document_id": "2", "phrase__not_analyzed": "bought honey carob"}
}, {
"_index" : "products__annotated",
"_type" : "product__phrase",
"_id" : "honey-carob",
"_score" : 1.0,
"_source":{"phrase": "honey carob", "document_id": "2", "phrase__not_analyzed": "honey carob"}
}
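
For readers wondering where phrases such as "john bought" and "bought honey carob" come from: they are word shingles whose length lies between minShingleSize (2) and maxShingleSize (3) from the config. The sketch below only illustrates what a shingle is. The `shingles` function is hypothetical, and bayzee's own tokenization and filtering may well differ.

```python
def shingles(tokens, min_size=2, max_size=3):
    """Yield every run of min_size..max_size consecutive tokens as one phrase."""
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


# Lower-cased tokens from the start of document 2
tokens = "john bought honey carob".split()
phrases = list(shingles(tokens))
print(phrases)
```

The result contains more candidates than the three hits posted above, which show only part of the index.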

mohanarunachalam · 2016-08-01T06:14:00Z

When I run the generate Phrases part of the worker, I get the following error:
elasticsearch.exceptions.NotFoundError: TransportError(404, u'{"_index":"products__annotated","_type":"product","_id":"2","found":false}')

mohanarunachalam · 2016-08-01T07:19:54Z

Hi,
Can you please share the data you used, if possible?

maheshyellai · 2016-08-02T13:11:03Z

You should have another type called product in the products__annotated index. Can you check whether it is populated?
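
One rough way to check this is to group the hits of a search response by `_type`. This is a sketch with a hypothetical helper name, `count_hits_by_type`. Keep in mind that a search returns only a page of hits by default, so a type missing from one page is a hint rather than proof.

```python
from collections import Counter


def count_hits_by_type(response):
    """Count the hits in an Elasticsearch search response, grouped by _type."""
    return Counter(hit["_type"] for hit in response["hits"]["hits"])


# Trimmed-down response holding only phrase documents, like the output posted above
response = {"hits": {"hits": [
    {"_type": "product__phrase", "_id": "john-bought"},
    {"_type": "product__phrase", "_id": "honey-carob"},
]}}

counts = count_hits_by_type(response)
print(counts["product__phrase"])
print(counts["product"])  # 0 here: this sample has no hit of type "product"
```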

Following is an example of how these documents should look.

This is how an example corpus looks (note: index is named bayzee_test and type is named quote)

{
   "_index": "bayzee_test",
   "_type": "quote",
   "_id": "1",
   "_source": {
      "text": "If I had asked people what they wanted, they would have said, \"a faster horse\""
   }
}
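
Since the original question was how to index the corpus by hand, here is a minimal sketch that builds a request body for the Elasticsearch `_bulk` endpoint from documents shaped like the one above. `build_bulk_payload` is a hypothetical helper, not something bayzee provides. The index and type names are the ones used earlier in this thread, and `_type` assumes an Elasticsearch version that still supports mapping types.

```python
import json


def build_bulk_payload(docs, index, doc_type):
    """Build a newline-delimited body for the Elasticsearch _bulk API."""
    lines = []
    for doc_id, source in docs:
        # Action line: names the index, type, and id that the next line belongs to
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # Source line: the document itself
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline
    return "\n".join(lines) + "\n"


docs = [
    ("1", {"text": "I bought new bread"}),
    ("2", {"text": "John bought honey carob powder and it was good"}),
]
payload = build_bulk_payload(docs, "products", "product")
print(payload)
```

The printed text can be saved to a file and sent to the `_bulk` endpoint of the local server, for example with curl and its `--data-binary` option.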

This is how a document that stores the pos tagged sentences looks:

{
   "_index": "bayzee_test__annotated",
   "_type": "quote",
   "_id": "1",
   "_source": {
      "pos_tagged_sentences": [
         [
            [
               "if",
               "IN"
            ],
            [
               "i",
               "VBN"
            ],
            [
               "had",
               "VBD"
            ],
            [
               "asked",
               "VBN"
            ],
            [
               "people",
               "NNS"
            ],
            [
               "what",
               "WP"
            ],
            [
               "they",
               "PRP"
            ],
            [
               "wanted",
               "VBD"
            ],
            [
               ",",
               ","
            ],
            [
               "they",
               "PRP"
            ],
            [
               "would",
               "MD"
            ],
            [
               "have",
               "VB"
            ],
            [
               "said",
               "VBD"
            ],
            [
               ",",
               ","
            ],
            [
               "``",
               "``"
            ],
            [
               "a",
               "DT"
            ],
            [
               "faster",
               "JJ"
            ],
            [
               "horse",
               "NN"
            ],
            [
               "''",
               "''"
            ]
         ]
      ]
   }
}
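
The documents posted at the start of this thread had an empty pos_tagged_sentences list, which is how the misconfiguration showed up. A small sketch for spotting such documents in a list of hits follows. `unannotated_ids` is a hypothetical name, not a bayzee function.

```python
def unannotated_ids(hits):
    """Return the ids of hits whose pos_tagged_sentences list is missing or empty."""
    return [hit["_id"] for hit in hits
            if not hit["_source"].get("pos_tagged_sentences")]


hits = [
    # Shaped like the empty documents from the first post
    {"_id": "1", "_source": {"pos_tagged_sentences": []}},
    # Shortened version of the correctly annotated document above
    {"_id": "2", "_source": {"pos_tagged_sentences": [[["if", "IN"], ["i", "VBN"]]]}},
]
print(unannotated_ids(hits))
```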

This is how a document holding a phrase and its extracted features looks:

{
   "_index": "bayzee_test__annotated",
   "_type": "quote__phrase",
   "_id": "would-have-said",
   "_source": {
      "phrase": "would have said",
      "features": {
         "avg_term_frequency": "1.0000",
         "non_alpha_chars": "0",
         "middle_pos_tag": "VB",
         "max_term_frequency": "1.0000",
         "first_pos_tag": "MD",
         "last_pos_tag": "VB",
         "pos_tags": "MDVBVB",
         "doc_count": "1.0000",
         "avg_score": "0.2301",
         "avg_word_length": "4.33",
         "max_score": "0.2301"
      },
      "document_id": "1",
      "phrase__not_analyzed": "would have said"
   }
}

mohanarunachalam · 2016-08-03T07:58:43Z

Hi,
Thanks for the help. My documents now match the formats above.
When I run the worker for classification, I get the following error:
orange.KernelException: 'orange.EntropyDiscretization': no examples or all values of attribute 'doc_count' are unknown
Any idea about where it went wrong?

maheshyellai · 2016-08-05T09:36:04Z

Do the docs in your __annotated index have doc_count populated?
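
To check this for many phrase docs at once, something along these lines could work. It is a sketch with a hypothetical helper, `phrases_missing_feature`, and it assumes the features are stored as numeric strings, as in the example document earlier in this thread.

```python
def phrases_missing_feature(hits, feature="doc_count"):
    """Return the ids of phrase docs where the feature is absent or not a number."""
    missing = []
    for hit in hits:
        value = hit["_source"].get("features", {}).get(feature)
        try:
            float(value)  # None raises TypeError, non-numeric text raises ValueError
        except (TypeError, ValueError):
            missing.append(hit["_id"])
    return missing


hits = [
    # A phrase doc without any features yet
    {"_id": "john-bought", "_source": {"phrase": "john bought"}},
    # A phrase doc with doc_count populated
    {"_id": "would-have-said", "_source": {"features": {"doc_count": "1.0000"}}},
]
print(phrases_missing_feature(hits))
```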

mohanarunachalam · 2016-08-05T09:41:21Z

Yes. Below is a sample.
"_source":{"phrase": "had asked",
"features":
{"avg_term_frequency": "1.0000",
"non_alpha_chars": "0",
"middle_pos_tag": "X",
"max_term_frequency": "1.0000",
"first_pos_tag": "VB",
"last_pos_tag": "VB",
"pos_tags": "VBVB",
"doc_count": "1.0000",
"avg_score": "0.1534",
"avg_word_length": "4.0",
"max_score": "0.1534"
},
"document_id": "1",
"phrase__not_analyzed": "had asked"
}

maheshyellai · 2016-08-05T10:30:07Z

I think you are using the example training and holdout phrases.

These files should have phrases from your corpus that are manually labelled by you. The ones in the repo are given as an example. Once you populate these files with a sampled set of phrases for training and hold-out, please run generation again.
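
A possible way to pick the phrases to label is to draw two disjoint random samples from the generated phrases. This is only a sketch: `split_for_labelling` is a hypothetical helper. The column layout of training-phrases.csv and hold-out-phrases.csv is not shown in this thread, so the labelling step and the file format should follow the example files in the repo.

```python
import random


def split_for_labelling(phrases, n_train, n_hold_out, seed=0):
    """Randomly pick disjoint training and hold-out samples from the phrases."""
    if n_train + n_hold_out > len(phrases):
        raise ValueError("not enough phrases for the requested sample sizes")
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    sample = rng.sample(phrases, n_train + n_hold_out)
    return sample[:n_train], sample[n_train:]


phrases = ["john bought", "bought honey carob", "honey carob",
           "would have said", "had asked"]
train, hold_out = split_for_labelling(phrases, n_train=3, n_hold_out=2)
print(train)
print(hold_out)
```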

mohanarunachalam · 2016-08-05T10:48:24Z

Thanks for the help. It worked.
Sorry for bothering you with simple doubts.

mohanarunachalam closed this as completed Aug 5, 2016
