Nl #426

Merged
merged 38 commits into from
Jul 21, 2016

Changes from 16 commits
5fb9def
Add samples for natural language api.
Apr 6, 2016
aff93ad
fixed variable name error
puneithk Jul 20, 2016
03cb67b
logged error message with exception
puneithk Jul 20, 2016
e2cc3d5
fixed variable unused error
puneithk Jul 20, 2016
1789677
Refactor for clarity.
Jul 20, 2016
1242524
cast samples to int
puneithk Jul 20, 2016
3e4deb1
added sample variable
puneithk Jul 20, 2016
4ffc546
fixed indentation bug
puneithk Jul 20, 2016
7ca9918
Fix lint errors
Jul 20, 2016
0be30eb
Remove movie_nl sample until it's more stable.
Jul 20, 2016
47c7c63
Revert "Remove movie_nl sample until it's more stable."
Jul 20, 2016
0a8d040
catch HttpError and log it
puneithk Jul 20, 2016
8076a63
added retry in the request
puneithk Jul 20, 2016
f39bd6e
Merge branch 'nl' of github.com:GoogleCloudPlatform/python-docs-sampl…
puneithk Jul 20, 2016
c9ac6d3
resolved README conflicts
puneithk Jul 20, 2016
d465566
removed reverse bool
puneithk Jul 20, 2016
cf12fce
fixed PR comments
puneithk Jul 20, 2016
729e96e
fixed nox issues
puneithk Jul 20, 2016
483e66d
changed from io.StringIO to StringIO.StringIO
puneithk Jul 20, 2016
8d974f3
fixed nox issues
puneithk Jul 20, 2016
f2930fe
added rank_entities tests
puneithk Jul 20, 2016
ec185c7
removed urlparse
puneithk Jul 20, 2016
7a8a962
fixed PR comments
puneithk Jul 20, 2016
e77231d
renamed e_tuple to better name
puneithk Jul 20, 2016
11d6b0b
fixed nox issues for main_test
puneithk Jul 21, 2016
0bea72b
replaced StringIO.StringIO with io.BytesIO
puneithk Jul 21, 2016
5242b74
changed order of io
puneithk Jul 21, 2016
f5f1ec1
used capsys to capture stdout output
puneithk Jul 21, 2016
6730388
fixed docstring
puneithk Jul 21, 2016
6f01839
imported six.StringIO
puneithk Jul 21, 2016
7e2cde5
fixed ordering of the expected out
puneithk Jul 21, 2016
b31a76e
fixed sorted string
puneithk Jul 21, 2016
0820734
added docstrings for the argument
puneithk Jul 21, 2016
769eaed
added arguments description
puneithk Jul 21, 2016
850771d
added wikipedia url example
puneithk Jul 21, 2016
36ffcb1
removed File type from argsparse
puneithk Jul 21, 2016
9c610f0
updated README.md to work due to wrong input name
puneithk Jul 21, 2016
ae433ee
removed FileType from argsparse
puneithk Jul 21, 2016
3 changes: 3 additions & 0 deletions language/README.md
@@ -5,6 +5,9 @@ This directory contains Python examples that use the

- [api](api) has a simple command line tool that shows off the API's features.

- [movie_nl](movie_nl) combines sentiment and entity analysis to find the
actors/directors who are the most and least popular in IMDb movie reviews.

- [ocr_nl](ocr_nl) uses the [Cloud Vision API](https://cloud.google.com/vision/)
to extract text from images, then uses the NL API to extract entity information
from those texts, and stores the extracted information in a database in support
152 changes: 152 additions & 0 deletions language/movie_nl/README.md
@@ -0,0 +1,152 @@
# Introduction
This sample is an application of the Google Cloud Platform Natural Language API.
It runs sentiment and entity analysis on the
[IMDb movie reviews data set](https://www.cs.cornell.edu/people/pabo/movie-review-data/)
from [Cornell University](http://www.cs.cornell.edu/), and combines the two
to find the actors/directors who are the most and least popular.

### Set Up to Authenticate With Your Project's Credentials

Please follow the [Set Up Your Project](https://cloud.google.com/natural-language/docs/getting-started#set_up_your_project)
steps in the Quickstart doc to create a project and enable the
Cloud Natural Language API. Following those steps, make sure that you
[Set Up a Service Account](https://cloud.google.com/natural-language/docs/common/auth#set_up_a_service_account),
and export the following environment variable:

```
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-project-credentials.json
```

**Note:** If you get an error saying your API hasn't been enabled, make sure
that you have correctly set this environment variable, and that the project that
you got the service account from has the Natural Language API enabled.
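
Before running the sample, it can help to confirm that the variable is set and
points at a real file. The check below is a minimal sketch and is not part of
the sample; `credentials_path` is a hypothetical helper name.

```python
import os


def credentials_path(environ=None):
    """Return the credentials file path, or raise if it is missing."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("No credentials file found at %s" % path)
    return path


if __name__ == "__main__":
    try:
        print("Using credentials from", credentials_path())
    except RuntimeError as error:
        print("Credentials problem:", error)
```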

## How it works
This sample uses the Natural Language API to annotate the input text. The
movie review document is broken into sentences using the `extract_syntax` feature.
Each sentence is then sent to the API for sentiment analysis, and the resulting
positive and negative sentence values are added together to give a single
overall sentiment for the movie document.

In addition to the sentiment, the program extracts entities of type `PERSON`,
which usually name the actors, the director, and other people mentioned in the
review. Each of these entities is assigned the sentiment value of its document,
and the values are used to find the most and least popular actors/directors.
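
The `PERSON` filtering step might look roughly like the sketch below. It is an
illustration only: `person_names` is a hypothetical helper, and the entity
dictionaries are a hand-written stand-in whose `name` and `type` keys are an
assumed shape, not output captured from the API. The `LOCATION` entry is made up.

```python
def person_names(entities):
    """Return the names of PERSON entities, keeping first-seen order."""
    names = []
    for entity in entities:
        # Skip non-person entities and names we have already collected.
        if entity.get("type") == "PERSON" and entity["name"] not in names:
            names.append(entity["name"])
    return names


# Hand-written stand-in for the entity list of one movie document.
response_entities = [
    {"name": "Sean Patrick Flanery", "type": "PERSON"},
    {"name": "Hollywood", "type": "LOCATION"},
    {"name": "Sean Patrick Flanery", "type": "PERSON"},
]
print(person_names(response_entities))
```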

### Movie document
We define a movie document as a set of reviews. Each review is an individual
sentence, and we use the NL API to split the document into sentences. An
example movie document is shown below.

```
Sample review sentence 1. Sample review sentence 2. Sample review sentence 3.
```

### Sentences and Sentiment
Each sentence from the document above is assigned a sentiment, as shown below.

```
Sample review sentence 1 => Sentiment 1
Sample review sentence 2 => Sentiment 2
Sample review sentence 3 => Sentiment 3
```

### Sentiment computation
The final sentiment is computed by simply adding the sentence sentiments.

```
Total Sentiment = Sentiment 1 + Sentiment 2 + Sentiment 3
```
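
The summation above can be sketched in a few lines. This is a minimal
illustration, not the sample's actual code: `total_sentiment` is a hypothetical
helper, and the scores are made-up numbers standing in for the per-sentence
values the NL API would return.

```python
def total_sentiment(sentence_sentiments):
    """Add per-sentence sentiment scores into one document score."""
    return sum(sentence_sentiments)


# Made-up scores for three sentences; negative values are unfavorable.
scores = [0.5, -0.25, 1.0]
print(total_sentiment(scores))
```

Because the scores are simply added, a long document with many mildly positive
sentences can outscore a short, strongly positive one.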


### Entity extraction and Sentiment assignment
Entities with type `PERSON` are extracted from the movie document using the NL
API. Because each entity is mentioned in a particular movie document, it is
associated with that document's sentiment.

```
Document 1 => Sentiment 1

Person 1
Person 2
Person 3

Document 2 => Sentiment 2

Person 2
Person 4
Person 5
```

Person 2 appears in both documents, so the sentiment associated with Person 2 is the sum of the two document sentiments:

```
Person 2 => (Sentiment 1 + Sentiment 2)
```
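
The per-person aggregation could be sketched as follows. `entity_sentiments` is
a hypothetical helper and the sentiment values are made up. Counting a person
once per document, even when mentioned several times, is an assumption of this
sketch and not something the README states.

```python
from collections import defaultdict


def entity_sentiments(documents):
    """Sum document sentiments for every person mentioned.

    `documents` is an iterable of (sentiment, people) pairs.
    """
    totals = defaultdict(float)
    for sentiment, people in documents:
        # Assumption: each person counts once per document.
        for person in set(people):
            totals[person] += sentiment
    return dict(totals)


docs = [
    (2.0, ["Person 1", "Person 2", "Person 3"]),   # Document 1
    (-0.5, ["Person 2", "Person 4", "Person 5"]),  # Document 2
]
totals = entity_sentiments(docs)
print(totals["Person 2"])  # Sentiment 1 + Sentiment 2
```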

## Movie Data Set
We use the Cornell Movie Review data as our input. Follow the instructions below to download and extract the data.

### Download Instructions

```
$ curl -O http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens.zip
$ unzip mix20_rand700_tokens.zip
```

## Command Line Usage
To use the movie analyzer, follow the instructions below. Note that the `--sample` parameter limits the run to
fewer documents; omit it to run on the entire corpus.

### Install Dependencies

Install [pip](https://pip.pypa.io/en/stable/installing) if not already installed.

Then, install dependencies by running the following pip command:

```
$ pip install -r requirements.txt
```

### How to Run

```
$ python main.py analyze --inp "tokens/*/*" \
    --sout sentiment.json \
    --eout entity.json \
    --sample 5
```
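
The `--inp` pattern is quoted, which suggests the script expands it itself
instead of relying on the shell. A rough sketch of how a pattern and a sample
size could select input files is shown below. `select_documents` is a
hypothetical helper, and sorting the paths before truncating is an assumption
of this sketch; the real script may pick its sample differently.

```python
import glob


def select_documents(pattern, sample=None):
    """Return paths matching `pattern`, limited to `sample` files if given."""
    paths = sorted(glob.glob(pattern))
    if sample is None:
        return paths
    # Command-line values arrive as strings, so cast before slicing.
    return paths[:int(sample)]


if __name__ == "__main__":
    for path in select_documents("tokens/*/*", sample=5):
        print(path)
```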

You should see the log file `movie.log` created.

## Output Data
The program produces sentiment and entity output in JSON format. For example:

### Sentiment Output
```
{
  "doc_id": "cv310_tok-16557.txt",
  "sentiment": 3.099,
  "label": -1
}
```

### Entity Output

```
{
  "name": "Sean Patrick Flanery",
  "wiki_url": "http://en.wikipedia.org/wiki/Sean_Patrick_Flanery",
  "sentiment": 3.099
}
```

### Entity Output Sorting
To sort and rank the generated entities, use the same `main.py` script. For example,
the following prints the top 5 actors with negative sentiment:

```
$ python main.py rank entity.json \
    --sentiment neg \
    --reverse True \
    --sample 5
```
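
For readers who want to post-process `entity.json` themselves, the ranking idea
can be sketched as below. Several things here are assumptions: that the file
holds one JSON object per line, that ranking is a plain sort on `sentiment`,
and the name `rank_entities` as used in this sketch. The sketch does not model
the `--reverse` flag, and the entity values are made up.

```python
import json


def rank_entities(lines, sentiment="neg", sample=5):
    """Return the `sample` most negative (or most positive) entities."""
    entities = [json.loads(line) for line in lines if line.strip()]
    # Ascending order puts the most negative scores first.
    ranked = sorted(entities, key=lambda entity: entity["sentiment"],
                    reverse=(sentiment == "pos"))
    return ranked[:sample]


# Made-up entity records, one JSON object per line.
lines = [
    '{"name": "Person 1", "sentiment": 3.099}',
    '{"name": "Person 2", "sentiment": -1.5}',
    '{"name": "Person 3", "sentiment": -4.0}',
]
for entity in rank_entities(lines, sentiment="neg", sample=2):
    print(entity["name"], entity["sentiment"])
```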