diff --git a/language/README.md b/language/README.md index 130ce66ff83e..e63d45eb9a6a 100644 --- a/language/README.md +++ b/language/README.md @@ -5,6 +5,9 @@ This directory contains Python examples that use the - [api](api) has a simple command line tool that shows off the API's features. +- [movie_nl](movie_nl) combines sentiment and entity analysis to identify +the most and least popular actors/directors in the IMDb movie reviews. + - [ocr_nl](ocr_nl) uses the [Cloud Vision API](https://cloud.google.com/vision/) to extract text from images, then uses the NL API to extract entity information from those texts, and stores the extracted information in a database in support diff --git a/language/movie_nl/README.md b/language/movie_nl/README.md new file mode 100644 index 000000000000..687a6c4058ab --- /dev/null +++ b/language/movie_nl/README.md @@ -0,0 +1,152 @@ +# Introduction +This sample is an application of the Google Cloud Platform Natural Language API. +It uses the [IMDb movie reviews data set](https://www.cs.cornell.edu/people/pabo/movie-review-data/) +from [Cornell University](http://www.cs.cornell.edu/) and performs sentiment and entity +analysis on it. It combines the capabilities of sentiment analysis and entity recognition +to identify the most and least popular actors/directors. + +### Set Up to Authenticate With Your Project's Credentials + +Please follow the [Set Up Your Project](https://cloud.google.com/natural-language/docs/getting-started#set_up_your_project) +steps in the Quickstart doc to create a project and enable the +Cloud Natural Language API.
Following those steps, make sure that you +[Set Up a Service Account](https://cloud.google.com/natural-language/docs/common/auth#set_up_a_service_account), +and export the following environment variable: + +``` +export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-project-credentials.json +``` + +**Note:** If you get an error saying your API hasn't been enabled, make sure +that you have correctly set this environment variable, and that the project that +you got the service account from has the Natural Language API enabled. + +## How it works +This sample uses the Natural Language API to annotate the input text. The +movie review document is broken into sentences using the `extract_syntax` feature. +Each sentence is sent to the API for sentiment analysis. The positive and negative +sentiment values are combined into a single overall sentiment for the +movie document. + +In addition to the sentiment, the program also extracts entities of type +`PERSON`, such as the actors and the director of the movie. Each entity is assigned the sentiment value of its document in order to +determine the most and least popular actors/directors. + +### Movie document +We define a movie document as a set of reviews for one movie, and we use the +NL API to extract the individual review sentences from the document. See +an example movie document below. + +``` + Sample review sentence 1. Sample review sentence 2. Sample review sentence 3. +``` + +### Sentences and Sentiment +Each sentence from the above document is assigned a sentiment, as below. + +``` + Sample review sentence 1 => Sentiment 1 + Sample review sentence 2 => Sentiment 2 + Sample review sentence 3 => Sentiment 3 +``` + +### Sentiment computation +The final sentiment is computed by simply adding the sentence sentiments.
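This aggregation can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the sample, and the polarity values are made up:

```python
def aggregate_sentiment(sentence_polarities):
    """Sum per-sentence polarity scores into one document score."""
    # Negative and positive contributions are summed separately to
    # mirror the description above; the result equals a plain sum.
    negative = sum(p for p in sentence_polarities if p < 0.0)
    positive = sum(p for p in sentence_polarities if p > 0.0)
    return positive + negative

# Three review sentences with hypothetical polarities:
print(aggregate_sentiment([1.0, -0.5, 0.5]))  # 1.0
```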
+ +``` + Total Sentiment = Sentiment 1 + Sentiment 2 + Sentiment 3 +``` + + +### Entity extraction and Sentiment assignment +Entities with type `PERSON` are extracted from the movie document using the NL +API. Since these entities are mentioned in their respective movie document, +they are associated with the document sentiment. + +``` + Document 1 => Sentiment 1 + + Person 1 + Person 2 + Person 3 + + Document 2 => Sentiment 2 + + Person 2 + Person 4 + Person 5 +``` + +Based on the above data we can calculate the sentiment associated with Person 2: + +``` + Person 2 => (Sentiment 1 + Sentiment 2) +``` + +## Movie Data Set +We have used the Cornell Movie Review data as our input. Please follow the instructions below to download and extract the data. + +### Download Instructions + +``` + $ curl -O http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens.zip + $ unzip mix20_rand700_tokens.zip +``` + +## Command Line Usage +In order to use the movie analyzer, follow the instructions below. (Note that the `--sample` parameter below runs the script on +fewer documents, and can be omitted to run it on the entire corpus) + +### Install Dependencies + +Install [pip](https://pip.pypa.io/en/stable/installing) if not already installed. + +Then, install dependencies by running the following pip command: + +``` +$ pip install -r requirements.txt +``` +### How to Run + +``` +$ python main.py analyze --inp "tokens/*/*" \ + --sout sentiment.json \ + --eout entity.json \ + --sample 5 +``` + +You should see the log file `movie.log` created. + +## Output Data +The program produces sentiment and entity output in json format. 
For example: + +### Sentiment Output +``` + { + "doc_id": "cv310_tok-16557.txt", + "sentiment": 3.099, + "label": -1 + } +``` + +### Entity Output + +``` + { + "name": "Sean Patrick Flanery", + "wiki_url": "http://en.wikipedia.org/wiki/Sean_Patrick_Flanery", + "sentiment": 3.099 + } +``` + +### Entity Output Sorting +In order to sort and rank the entities generated, use the same `main.py` script. For example, +this will print the top 5 actors with negative sentiment: + +``` +$ python main.py rank --entity_input entity.json \ + --sentiment neg \ + --reverse True \ + --sample 5 +``` diff --git a/language/movie_nl/main.py b/language/movie_nl/main.py new file mode 100644 index 000000000000..ba5c63b60b98 --- /dev/null +++ b/language/movie_nl/main.py @@ -0,0 +1,383 @@ +# Copyright 2016 Google, Inc +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import codecs +import glob +import json +import logging +import os + +from googleapiclient import discovery +from googleapiclient.errors import HttpError +from oauth2client.client import GoogleCredentials +import requests + + +def analyze_document(service, document): + """Analyze the document and get the distribution of sentiments and + the movie name.""" + logging.info('Analyzing {}'.format(document.doc_id)) + + sentences, entities = document.extract_all_sentences(service) + + sentiments = [get_sentiment(service, sentence) for sentence in sentences] + + return sentiments, entities + + +def get_request_body(text, syntax=True, entities=True, sentiment=True): + """Creates the body of the request to the language api in + order to get an appropriate api response.""" + body = { + 'document': { + 'type': 'PLAIN_TEXT', + 'content': text, + }, + 'features': { + 'extract_syntax': syntax, + 'extract_entities': entities, + 'extract_document_sentiment': sentiment, + }, + 'encoding_type': 'UTF32' + } + + return body + + +def get_sentiment(service, sentence): + """Get the sentence-level sentiment.""" + body = get_request_body( + sentence, syntax=False, entities=True, sentiment=True) + + docs = service.documents() + request = docs.annotateText(body=body) + + response = request.execute(num_retries=3) + + sentiment = response.get('documentSentiment') + + if sentiment is None: + return (None, None) + else: + pol = sentiment.get('polarity') + mag = sentiment.get('magnitude') + + if pol is None and mag is not None: + pol = 0 + return (pol, mag) + + +class Document(object): + """Document class captures a single document of movie reviews.""" + + def __init__(self, text, doc_id, doc_path): + self.text = text + self.doc_id = doc_id + self.doc_path = doc_path + self.sentence_entity_pair = None + self.label = None + + def extract_all_sentences(self, service): + """Extract the sentences in a document.""" + + if self.sentence_entity_pair is not None: + return 
self.sentence_entity_pair + + docs = service.documents() + request_body = get_request_body( + self.text, + syntax=True, + entities=True, + sentiment=False) + request = docs.annotateText(body=request_body) + + ent_list = [] + + response = request.execute() + entities = response.get('entities', []) + sentences = response.get('sentences', []) + + sent_list = [ + sentence.get('text', {}).get('content') for sentence in sentences + ] + + for entity in entities: + ent_type = entity.get('type') + wiki_url = entity.get('metadata', {}).get('wikipedia_url') + + if ent_type == 'PERSON' and wiki_url is not None: + ent_list.append(wiki_url) + + self.sentence_entity_pair = (sent_list, ent_list) + + return self.sentence_entity_pair + + +def to_sentiment_json(doc_id, sent, label): + """Convert the sentiment info to json. + + Args: + doc_id: Document id + sent: Overall Sentiment for the document + label: Actual label +1, 0, -1 for the document + + Returns: + String json representation of the input + + """ + json_doc = {} + + json_doc['doc_id'] = doc_id + json_doc['sentiment'] = float('%.3f' % sent) + json_doc['label'] = label + + return json.dumps(json_doc) + + +def get_wiki_title(wiki_url): + """Get the wikipedia page title for a given wikipedia URL. + + Args: + wiki_url: Wikipedia URL e.g., http://en.wikipedia.org/wiki/Sean_Connery + + Returns: + Wikipedia canonical name e.g., Sean Connery + + """ + try: + content = requests.get(wiki_url).text + return content.split('title')[1].split('-')[0].split('>')[1].strip() + except Exception: + return os.path.basename(wiki_url).replace('_', ' ') + + +def to_entity_json(entity, entity_sentiment, entity_frequency): + """Convert entities and their associated sentiment to json.
+ + Args: + entity: Wikipedia entity name + entity_sentiment: Sentiment associated with the entity + entity_frequency: Frequency of the entity in the corpus + + Returns: + Json string representation of input + + """ + json_doc = {} + + avg_sentiment = float(entity_sentiment) / float(entity_frequency) + + json_doc['wiki_url'] = entity + json_doc['name'] = get_wiki_title(entity) + json_doc['sentiment'] = float('%.3f' % entity_sentiment) + json_doc['avg_sentiment'] = float('%.3f' % avg_sentiment) + + return json.dumps(json_doc) + + +def get_sentiment_entities(service, document): + """Compute the overall sentiment volume in the document. + + Args: + service: Client to Google Natural Language API + document: Movie review document (See Document object) + + Returns: + Tuple of total sentiment and entities found in the document + + """ + + sentiments, entities = analyze_document(service, document) + + sentiments = [sent for sent in sentiments if sent[0] is not None] + negative_sentiments = [ + polarity for polarity, magnitude in sentiments if polarity < 0.0] + positive_sentiments = [ + polarity for polarity, magnitude in sentiments if polarity > 0.0] + + negative = sum(negative_sentiments) + positive = sum(positive_sentiments) + total = positive + negative + + return (total, entities) + + +def get_sentiment_label(sentiment): + """Return the sentiment label based on the sentiment quantity.""" + if sentiment < 0: + return -1 + elif sentiment > 0: + return 1 + else: + return 0 + + +def process_movie_reviews(service, reader, sentiment_writer, entity_writer): + """Perform some sentiment math and come up with movie review.""" + collected_entities = {} + + for document in reader: + try: + sentiment_total, entities = get_sentiment_entities( + service, document) + except HttpError as e: + logging.error('Error process_movie_reviews {}'.format(e.content)) + continue + + document.label = get_sentiment_label(sentiment_total) + + sentiment_writer.write( + to_sentiment_json( + 
document.doc_id, + sentiment_total, + document.label + ) + ) + + sentiment_writer.write('\n') + + for ent in entities: + ent_sent, frequency = collected_entities.get(ent, (0, 0)) + ent_sent += sentiment_total + frequency += 1 + + collected_entities[ent] = (ent_sent, frequency) + + for entity, sentiment_frequency in collected_entities.items(): + entity_writer.write(to_entity_json(entity, sentiment_frequency[0], + sentiment_frequency[1])) + entity_writer.write('\n') + + sentiment_writer.flush() + entity_writer.flush() + + +def document_generator(dir_path_pattern, count=None): + """Generator for the input movie documents. + + Args: + dir_path_pattern: Input dir pattern e.g., "foo/bar/*/*" + count: Number of documents to read else everything if None + + Returns: + Generator which contains Document (See above) + + """ + for running_count, item in enumerate(glob.iglob(dir_path_pattern)): + if count and running_count >= count: + # Use return, not raise StopIteration(), to end a generator (PEP 479) + return + + doc_id = os.path.basename(item) + + with codecs.open(item, encoding='utf-8') as f: + try: + text = f.read() + except UnicodeDecodeError: + continue + + yield Document(text, doc_id, item) + + +def rank_entities(reader, sentiment=None, topn=None, reverse_bool=False): + """Rank the entities (actors) based on their sentiment + assigned from the movie.""" + + items = [] + for item in reader: + json_item = json.loads(item) + sent = json_item.get('sentiment') + entity_item = (sent, json_item) + + if sentiment: + if sentiment == 'pos' and sent > 0: + items.append(entity_item) + elif sentiment == 'neg' and sent < 0: + items.append(entity_item) + else: + items.append(entity_item) + + items.sort(reverse=reverse_bool) + items = [json.dumps(item[1]) for item in items] + + print('\n'.join(items[:topn])) + + +def get_service(): + """Build a client to the Google Cloud Natural Language API.""" + + credentials = GoogleCredentials.get_application_default() + + return discovery.build('language', 'v1beta1', + credentials=credentials) + + +def
analyze(input_dir, sentiment_writer, entity_writer, sample, log_file): + """Analyze the document for sentiment and entities""" + + # Create logger settings + logging.basicConfig(filename=log_file, level=logging.DEBUG) + + # Create a Google Service object + service = get_service() + + reader = document_generator(input_dir, sample) + + # Process the movie documents + process_movie_reviews(service, reader, sentiment_writer, entity_writer) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser( + description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + + subparsers = parser.add_subparsers(dest='command') + + rank_parser = subparsers.add_parser('rank') + + rank_parser.add_argument( + '--entity_input', help='location of entity input') + rank_parser.add_argument( + '--sentiment', help='filter sentiment as "neg" or "pos"') + rank_parser.add_argument( + '--reverse', help='reverse the order of the items', type=bool, + default=False + ) + rank_parser.add_argument( + '--sample', help='number of top items to process', type=int, + default=None + ) + + analyze_parser = subparsers.add_parser('analyze') + + analyze_parser.add_argument( + '--inp', help='location of the input', required=True) + analyze_parser.add_argument( + '--sout', help='location of the sentiment output', required=True) + analyze_parser.add_argument( + '--eout', help='location of the entity output', required=True) + analyze_parser.add_argument( + '--sample', help='number of top items to process', type=int) + analyze_parser.add_argument('--log_file', default='movie.log') + + args = parser.parse_args() + + if args.command == 'analyze': + with open(args.sout, 'w') as sout, open(args.eout, 'w') as eout: + analyze(args.inp, sout, eout, args.sample, args.log_file) + elif args.command == 'rank': + with open(args.entity_input, 'r') as entity_input: + rank_entities( + entity_input, args.sentiment, args.sample, args.reverse) diff --git a/language/movie_nl/main_test.py 
b/language/movie_nl/main_test.py new file mode 100644 index 000000000000..fc69e9bccfea --- /dev/null +++ b/language/movie_nl/main_test.py @@ -0,0 +1,128 @@ +# Copyright 2016 Google, Inc +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json + +import main +import six + + +def test_get_request_body(): + text = 'hello world' + body = main.get_request_body(text, syntax=True, entities=True, + sentiment=False) + assert body.get('document').get('content') == text + + assert body.get('features').get('extract_syntax') is True + assert body.get('features').get('extract_entities') is True + assert body.get('features').get('extract_document_sentiment') is False + + +def test_get_sentiment_label(): + assert main.get_sentiment_label(20.50) == 1 + assert main.get_sentiment_label(-42.34) == -1 + + +def test_to_sentiment_json(): + doc_id = '12345' + sentiment = 23.344564 + label = 1 + + sentiment_json = json.loads( + main.to_sentiment_json(doc_id, sentiment, label) + ) + + assert sentiment_json.get('doc_id') == doc_id + assert sentiment_json.get('sentiment') == 23.345 + assert sentiment_json.get('label') == label + + +def test_process_movie_reviews(): + service = main.get_service() + + doc1 = main.Document('Top Gun was awesome and Tom Cruise rocked!', 'doc1', + 'doc1') + doc2 = main.Document('Tom Cruise is a great actor.', 'doc2', 'doc2') + + reader = [doc1, doc2] + swriter = six.StringIO() + ewriter = six.StringIO() + + main.process_movie_reviews(service, reader, swriter, 
ewriter) + + sentiments = swriter.getvalue().strip().split('\n') + entities = ewriter.getvalue().strip().split('\n') + + sentiments = [json.loads(sentiment) for sentiment in sentiments] + entities = [json.loads(entity) for entity in entities] + + # assert sentiments + assert sentiments[0].get('sentiment') == 1.0 + assert sentiments[0].get('label') == 1 + + assert sentiments[1].get('sentiment') == 1.0 + assert sentiments[1].get('label') == 1 + + # assert entities + assert len(entities) == 1 + assert entities[0].get('name') == 'Tom Cruise' + assert (entities[0].get('wiki_url') == + 'http://en.wikipedia.org/wiki/Tom_Cruise') + assert entities[0].get('sentiment') == 2.0 + + +def test_rank_positive_entities(capsys): + reader = [ + ('{"avg_sentiment": -12.0, ' + '"name": "Patrick Macnee", "sentiment": -12.0}'), + ('{"avg_sentiment": 5.0, ' + '"name": "Paul Rudd", "sentiment": 5.0}'), + ('{"avg_sentiment": -5.0, ' + '"name": "Martha Plimpton", "sentiment": -5.0}'), + ('{"avg_sentiment": 7.0, ' + '"name": "Lucy (2014 film)", "sentiment": 7.0}') + ] + + main.rank_entities(reader, 'pos', topn=1, reverse_bool=False) + out, err = capsys.readouterr() + + expected = ('{"avg_sentiment": 5.0, ' + '"name": "Paul Rudd", "sentiment": 5.0}') + + expected = ''.join(sorted(expected)) + out = ''.join(sorted(out.strip())) + assert out == expected + + +def test_rank_negative_entities(capsys): + reader = [ + ('{"avg_sentiment": -12.0, ' + '"name": "Patrick Macnee", "sentiment": -12.0}'), + ('{"avg_sentiment": 5.0, ' + '"name": "Paul Rudd", "sentiment": 5.0}'), + ('{"avg_sentiment": -5.0, ' + '"name": "Martha Plimpton", "sentiment": -5.0}'), + ('{"avg_sentiment": 7.0, ' + '"name": "Lucy (2014 film)", "sentiment": 7.0}') + ] + + main.rank_entities(reader, 'neg', topn=1, reverse_bool=True) + out, err = capsys.readouterr() + + expected = ('{"avg_sentiment": -5.0, ' + '"name": "Martha Plimpton", "sentiment": -5.0}') + + expected = ''.join(sorted(expected)) + out = ''.join(sorted(out.strip())) + 
assert out == expected diff --git a/language/movie_nl/requirements.txt b/language/movie_nl/requirements.txt new file mode 100644 index 000000000000..c385fb4e4e03 --- /dev/null +++ b/language/movie_nl/requirements.txt @@ -0,0 +1,2 @@ +google-api-python-client==1.5.1 +requests==2.10.0
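The entity aggregation described in the README above (each `PERSON` entity inherits the sentiment of every document that mentions it) can be sketched as follows. This is an illustration of the bookkeeping only, with made-up document sentiments, not code from the sample:

```python
from collections import defaultdict

def aggregate_entity_sentiment(documents):
    """documents: iterable of (doc_sentiment, entities) pairs.

    Returns a dict mapping each entity to a (total_sentiment, frequency)
    tuple, mirroring how process_movie_reviews accumulates
    collected_entities.
    """
    totals = defaultdict(lambda: (0.0, 0))
    for doc_sentiment, entities in documents:
        for entity in entities:
            total, freq = totals[entity]
            totals[entity] = (total + doc_sentiment, freq + 1)
    return dict(totals)

docs = [
    (1.5, ['Person 1', 'Person 2', 'Person 3']),
    (-0.5, ['Person 2', 'Person 4']),
]
# Person 2 appears in both documents, so it accumulates both sentiments:
print(aggregate_entity_sentiment(docs)['Person 2'])  # (1.0, 2)
```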