I want to know the correct syntax for data to be indexed in ES. #3

Closed
mohanarunachalam opened this issue Jul 26, 2016 · 12 comments

Comments

@mohanarunachalam

Hi,
My products index is shown below:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "products",
"_type" : "product",
"_id" : "1",
"_score" : 1.0,
"_source":{"annotation": "doc1", "text": "I bought new bread"}
}, {
"_index" : "products",
"_type" : "product",
"_id" : "2",
"_score" : 1.0,
"_source":{"annotation": "doc2", "text": "John bought honey carob powder and it was good"}
}, {
"_index" : "products",
"_type" : "product",
"_id" : "3",
"_score" : 1.0,
"_source":{"text": "John bought honey carob powder and it was good.Check the curry instant at the store.", "annotation": "doc3"}
} ]
}
}
Can someone tell me whether this format is correct? If it is wrong, please share the correct syntax.
My annotator code runs, but the generator code stops partway through.
This is what my products__annotated index looks like:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "1",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
}, {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "2",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
}, {
"_index" : "products__annotated",
"_type" : "product",
"_id" : "3",
"_score" : 1.0,
"_source":{"pos_tagged_sentences": []}
} ]
}
}

@maheshyellai
Contributor

@mohaNnnn The docs in your products__annotated index don't look right. It's possible you are using an incorrect config. Can you share your config.yml? Please refer to the Configuration section of the README for info on how to configure bayzee.

@mohanarunachalam
Author

mohanarunachalam commented Jul 27, 2016

Hi,
Please find the config file below.
It's pretty much the same as yours.
I am connecting to a local ES only.
I mainly want to know the initial step, i.e., manually indexing the data in ES.
Can you please provide a sample of it?


# Elasticsearch server
elasticsearch:
  # host where Elasticsearch server is running
  host: "127.0.0.1"
  # port on which Elasticsearch server is listening
  port: 9200

# redis storage
redis:
  host: "127.0.0.1"
  port: 6379

# Corpus to use
corpus:
  # name of the Elasticsearch index where the corpus is stored
  index: "products"
  # name of the Elasticsearch document type where the corpus is stored
  type: "product"
  # list of document fields to generate phrases from
  text_fields: ["description"]

timeoutMonitorFrequency: 3600000

# number of documents to process at a time
processingPageSize: 10

# indicate whether to start annotating from scratch
annotateFromScratch: True

# indicate whether to generate shingles
indexPhrases: True

# indicate whether to generate postags
getPosTags: True

# Processors (add custom processors to list of modules)
processor:
  # name of the Elasticsearch index where annotated text is stored by the processors
  index: "products__annotated"
  # name of the Elasticsearch document type where annotated text is stored by the processors
  type: "product"
  # list of processor modules
  modules:
    # standard bayzee processor to POS tag english text
    # name of the processor
    - name: "pos_processor"
      # path to the python module (relative to the location of this config file)
      path: "../lib/pos-processor.py"
      # features that this processor extracts
      features:
        - name: "pos_tags"
          isNumerical: False
        - name: "first_pos_tag"
          isNumerical: False
        - name: "middle_pos_tag"
          isNumerical: False
        - name: "last_pos_tag"
          isNumerical: False
        - name: "avg_word_length"
          isNumerical: True
        - name: "non_alpha_chars"
          isNumerical: True

# Generation
generator:
  # training set file path (relative to the location of this config file)
  trainingPhrasesFilePath: "training-phrases.csv"
  # hold-out set file path (relative to the location of this config file)
  holdOutPhrasesFilePath: "hold-out-phrases.csv"
  # maximum number of words in generated phrase
  maxShingleSize: 3
  # minimum number of words in generated phrase
  minShingleSize: 2
  # list of features to extract
  features:
    - name: "doc_count"
      isNumerical: True
    - name: "max_term_frequency"
      isNumerical: True
    - name: "avg_term_frequency"
      isNumerical: True
    - name: "max_score"
      isNumerical: True
    - name: "avg_score"
      isNumerical: True
  # precision of numerical features
  floatPrecision: 4

# logger config
logger:
  # directory where log files are written (relative to the location of this config file)
  logsDir: "Documents/bayzee/config/logs"

@maheshyellai
Contributor

I see that your product type has a field called text, but your configuration's text_fields setting says the field is description. Please change text_fields in your config file to point at the text field of the product type.
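
This kind of mismatch can be caught before running the annotator by comparing the configured text_fields with the keys of a corpus document's _source. A minimal sketch, assuming the config value and one sample document are already loaded into Python objects; `missing_text_fields` is a hypothetical helper and not part of bayzee:

```python
def missing_text_fields(source, text_fields):
    """Return the configured text fields that are absent from a document's _source."""
    return [field for field in text_fields if field not in source]


# _source of document 1 in the products index, as posted earlier in this thread
sample_source = {"annotation": "doc1", "text": "I bought new bread"}

# text_fields as posted in the config, then the corrected value
print(missing_text_fields(sample_source, ["description"]))  # ['description']
print(missing_text_fields(sample_source, ["text"]))         # []
```

An empty list means every configured field exists in the sampled document, so the annotator should have text to tag.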

@mohanarunachalam
Author

Hi,
I am posting the updated products__annotated index below. Please let me know if this is correct.
"hits" : [ {
"_index" : "products__annotated",
"_type" : "product__phrase",
"_id" : "john-bought",
"_score" : 1.0,
"_source":{"phrase": "john bought", "document_id": "2", "phrase__not_analyzed": "john bought"}
}, {
"_index" : "products__annotated",
"_type" : "product__phrase",
"_id" : "bought-honey-carob",
"_score" : 1.0,
"_source":{"phrase": "bought honey carob", "document_id": "2", "phrase__not_analyzed": "bought honey carob"}
}, {
"_index" : "products__annotated",
"_type" : "product__phrase",
"_id" : "honey-carob",
"_score" : 1.0,
"_source":{"phrase": "honey carob", "document_id": "2", "phrase__not_analyzed": "honey carob"}
}

@mohanarunachalam
Author

When I run the generate phrases part of the worker, I get the following error:
elasticsearch.exceptions.NotFoundError: TransportError(404, u'{"_index":"products__annotated","_type":"product","_id":"2","found":false}')

@mohanarunachalam
Author

mohanarunachalam commented Aug 1, 2016

Hi,
Can you please share the data you used, if possible?

@maheshyellai
Contributor

You should have another type called product in the products__annotated index. Can you see if that's populated?
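
One way to check is to tally the _type of each hit in a search response for the index. A minimal sketch over an already-parsed response; `count_hits_by_type` is a hypothetical helper, and the response below is abbreviated from the one posted earlier in this thread:

```python
from collections import Counter


def count_hits_by_type(response):
    """Count documents per _type in a parsed Elasticsearch search response."""
    return Counter(hit["_type"] for hit in response["hits"]["hits"])


response = {"hits": {"hits": [
    {"_index": "products__annotated", "_type": "product__phrase", "_id": "john-bought"},
    {"_index": "products__annotated", "_type": "product__phrase", "_id": "bought-honey-carob"},
    {"_index": "products__annotated", "_type": "product__phrase", "_id": "honey-carob"},
]}}

counts = count_hits_by_type(response)
# A Counter returns 0 for a key it has never seen
print(counts["product__phrase"], counts["product"])
```

A zero count for product would be consistent with the 404 for product id 2 reported above. Keep in mind that a search returns only the first page of hits by default, so the tally is only as complete as the response it is given.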

Following is an example of how these documents should look.

This is how an example corpus looks (note: index is named bayzee_test and type is named quote)

{
   "_index": "bayzee_test",
   "_type": "quote",
   "_id": "1",
   "_source": {
      "text": "If I had asked people what they wanted, they would have said, \"a faster horse\""
   }
}
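
For the manual indexing step, corpus documents shaped like the one above could be sent through Elasticsearch's _bulk endpoint, which takes one action line followed by one source line per document. A minimal sketch that only builds the request body; `bulk_index_payload` is a hypothetical helper, the index and type names are the ones from the config posted in this thread, and the `_type` metadata assumes an Elasticsearch version from the era of this thread that still supports document types:

```python
import json


def bulk_index_payload(index, doc_type, docs):
    """Build a newline-delimited body for the Elasticsearch _bulk endpoint.

    docs maps a document id to its _source dict.
    """
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


docs = {
    "1": {"text": "I bought new bread"},
    "2": {"text": "John bought honey carob powder and it was good"},
}
payload = bulk_index_payload("products", "product", docs)
print(payload)
```

The resulting body could then be POSTed to the _bulk endpoint of the local server (127.0.0.1:9200 in the config above). The field name used in each source, text here, has to match the text_fields setting.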

This is how a document that stores the pos tagged sentences looks:

{
   "_index": "bayzee_test__annotated",
   "_type": "quote",
   "_id": "1",
   "_source": {
      "pos_tagged_sentences": [
         [
            [
               "if",
               "IN"
            ],
            [
               "i",
               "VBN"
            ],
            [
               "had",
               "VBD"
            ],
            [
               "asked",
               "VBN"
            ],
            [
               "people",
               "NNS"
            ],
            [
               "what",
               "WP"
            ],
            [
               "they",
               "PRP"
            ],
            [
               "wanted",
               "VBD"
            ],
            [
               ",",
               ","
            ],
            [
               "they",
               "PRP"
            ],
            [
               "would",
               "MD"
            ],
            [
               "have",
               "VB"
            ],
            [
               "said",
               "VBD"
            ],
            [
               ",",
               ","
            ],
            [
               "``",
               "``"
            ],
            [
               "a",
               "DT"
            ],
            [
               "faster",
               "JJ"
            ],
            [
               "horse",
               "NN"
            ],
            [
               "''",
               "''"
            ]
         ]
      ]
   }
}

This is how a document with the phrase with its extracted features looks:

{
   "_index": "bayzee_test__annotated",
   "_type": "quote__phrase",
   "_id": "would-have-said",
   "_source": {
      "phrase": "would have said",
      "features": {
         "avg_term_frequency": "1.0000",
         "non_alpha_chars": "0",
         "middle_pos_tag": "VB",
         "max_term_frequency": "1.0000",
         "first_pos_tag": "MD",
         "last_pos_tag": "VB",
         "pos_tags": "MDVBVB",
         "doc_count": "1.0000",
         "avg_score": "0.2301",
         "avg_word_length": "4.33",
         "max_score": "0.2301"
      },
      "document_id": "1",
      "phrase__not_analyzed": "would have said"
   }
}
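
The POS-related features in the document above can be related back to the tagged sentence. The following is a rough sketch inferred only from the examples in this thread, not taken from bayzee's pos-processor: it assumes Python 3, that tags are cut to their first two characters (VBD becomes VB), that a phrase with no middle word gets the placeholder "X", and that non_alpha_chars counts non-alphabetic characters inside the words. `pos_features` is a hypothetical name.

```python
def pos_features(tagged_phrase):
    """Derive POS-based features for a phrase given as [(word, tag), ...]."""
    words = [word for word, _ in tagged_phrase]
    # Assumption: tags are truncated to two characters, as the examples suggest
    tags = [tag[:2] for _, tag in tagged_phrase]
    middle = tags[1:-1]
    avg_length = sum(len(word) for word in words) / len(words)
    return {
        "pos_tags": "".join(tags),
        "first_pos_tag": tags[0],
        # Assumption: "X" marks a phrase that has no middle word
        "middle_pos_tag": "".join(middle) if middle else "X",
        "last_pos_tag": tags[-1],
        "avg_word_length": str(round(avg_length, 2)),
        # Assumption: counts characters within words that are not letters
        "non_alpha_chars": str(sum(1 for w in words for ch in w if not ch.isalpha())),
    }


# Words and tags taken from the pos_tagged_sentences example above
phrase = [("would", "MD"), ("have", "VB"), ("said", "VBD")]
features = pos_features(phrase)
print(features["pos_tags"], features["avg_word_length"])  # MDVBVB 4.33
```

If the values computed this way differ from what ends up in the index, the processor's actual rules are the authority, not this sketch.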

@mohanarunachalam
Author

Hi,
Thanks for the help. My documents now match the formats above.
When I run the worker for classification, I get the following error:
orange.KernelException: 'orange.EntropyDiscretization': no examples or all values of attribute 'doc_count' are unknown
Any idea where it went wrong?

@maheshyellai
Contributor

Do the docs in your __annotated index have doc_count populated?

@mohanarunachalam
Author

Yes. Below is a sample.
"_source":{"phrase": "had asked",
"features":
{"avg_term_frequency": "1.0000",
"non_alpha_chars": "0",
"middle_pos_tag": "X",
"max_term_frequency": "1.0000",
"first_pos_tag": "VB",
"last_pos_tag": "VB",
"pos_tags": "VBVB",
"doc_count": "1.0000",
"avg_score": "0.1534",
"avg_word_length": "4.0",
"max_score": "0.1534"
},
"document_id": "1",
"phrase__not_analyzed": "had asked"
}

@maheshyellai
Contributor

I think you are using the example training and holdout phrases.

These files should have phrases from your corpus that are manually labelled by you. The ones in the repo are given as an example. Once you populate these files with a sampled set of phrases for training and hold-out, please run generation again.
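
A quick sanity check before rerunning generation is to confirm that the labelled phrases actually occur among the phrases indexed from the corpus. A minimal sketch, with loud assumptions: the CSV layout (phrase in the first column, label in the second, no header row) is a guess rather than bayzee's documented format, the label values are made up for illustration, and `phrases_missing_from_corpus` is a hypothetical helper:

```python
import csv
import io


def phrases_missing_from_corpus(csv_text, corpus_phrases):
    """Return labelled phrases (first CSV column) that are not among the corpus phrases."""
    known = {phrase.strip().lower() for phrase in corpus_phrases}
    missing = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0].strip().lower() not in known:
            missing.append(row[0].strip())
    return missing


# Phrases indexed from the products corpus, as posted earlier in this thread
corpus_phrases = ["john bought", "bought honey carob", "honey carob"]

# Hypothetical training file contents; the second column is an illustrative label
training_csv = "honey carob,1\nfaster horse,1\njohn bought,0\n"

print(phrases_missing_from_corpus(training_csv, corpus_phrases))  # ['faster horse']
```

If every training phrase shows up as missing, the classifier has no examples with known feature values, which would fit the EntropyDiscretization error reported above.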

@mohanarunachalam
Author

Thanks for the help. It worked.
Sorry for bothering you with simple doubts.
