In this code pattern, we will create a web app for visualizing unstructured data using Watson Natural Language Understanding, Apache Tika, and D3.js. After a user uploads a local file of their choosing, the application uses Apache Tika to extract text from the unstructured data file. The text is then passed to Watson Natural Language Understanding, which extracts entities and concepts. Finally, the application uses the D3.js library to display the results to the user.
The main benefit of using the Watson Natural Language Understanding service is its analytics engine, which provides cognitive enrichments and insights into your data. The key enrichments that are extracted include:
- Entities: people, companies, organizations, cities, and more.
- Keywords: important topics typically used to index or search the data.
- Concepts: identified general concepts that aren't necessarily referenced in the data.
- Sentiment: the overall positive or negative sentiment of the data.
The enrichments will be displayed using D3.js, a JavaScript library that provides powerful visualization techniques that help bring data to life. In this app, we will use it to display each of the enrichments in an interactive bubble cloud, with each element's size and location determined by its relative significance.
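As a rough illustration of sizing by relative significance, the sketch below maps a relevance score to a bubble radius so that bubble area grows linearly with the score. It is written in Java for consistency with the rest of the app; the app's own sizing logic lives in its D3.js code and may differ. The class name `BubbleScale`, the assumed score range of 0 to 1, and the pixel bounds are all hypothetical.

```java
/** Hypothetical helper: converts a relevance score into a bubble radius. */
final class BubbleScale {
    private final double minRadius;
    private final double maxRadius;

    BubbleScale(double minRadius, double maxRadius) {
        this.minRadius = minRadius;
        this.maxRadius = maxRadius;
    }

    /** Maps a score (assumed to lie in [0, 1]) to a radius in pixels. */
    double radiusFor(double score) {
        // Clamp out-of-range scores rather than failing.
        double clamped = Math.max(0.0, Math.min(1.0, score));
        // Interpolate on squared radii so that area (pi * r^2) is linear in the score.
        double minSq = minRadius * minRadius;
        double maxSq = maxRadius * maxRadius;
        return Math.sqrt(minSq + clamped * (maxSq - minSq));
    }
}
```

With bounds of 10 and 50 pixels, a score of 0.5 gives a radius of about 36 pixels. Scaling the area instead of the radius keeps high-scoring items from visually dominating the cloud.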
When the reader has completed this code pattern, they will understand how to:
- Create and use an instance of Watson Natural Language Understanding
- Leverage Apache Tika to extract text from unstructured files
- Use D3.js for displaying the visuals
- User configures credentials for the Watson NLU service and starts the app.
- User selects a data file to process and load.
- Text is extracted from the data file using Apache Tika.
- Extracted text is passed to Watson NLU for enrichment.
- Enriched data is visualized in the UI using the D3.js library.
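To make the enrichment step more concrete, here is a minimal sketch of how the extracted text could be sent to the NLU `/v1/analyze` REST endpoint using the JDK's built-in HTTP client (Java 11 or later). This is an illustration only: the app in this repo may well use the Watson Java SDK instead. The class name `NluRequestSketch` and the version date are assumptions; the sketch builds the request but does not send it, and it uses basic authentication with the user name `apikey`.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper that builds (but does not send) an NLU analyze request. */
final class NluRequestSketch {
    // Assumed API version date; check the NLU documentation for the current one.
    static final String VERSION = "2022-04-07";

    /** Escapes a string so it can be embedded in a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** JSON body asking for the four enrichments this pattern displays. */
    static String body(String text) {
        return "{\"text\":\"" + jsonEscape(text) + "\","
             + "\"features\":{\"entities\":{},\"keywords\":{},"
             + "\"concepts\":{},\"sentiment\":{}}}";
    }

    /** Builds a POST request authenticated with an IAM API key via basic auth. */
    static HttpRequest build(String serviceUrl, String apiKey, String text) {
        String credentials = Base64.getEncoder()
            .encodeToString(("apikey:" + apiKey).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(
                URI.create(serviceUrl + "/v1/analyze?version=" + VERSION))
            .header("Content-Type", "application/json")
            .header("Authorization", "Basic " + credentials)
            .POST(HttpRequest.BodyPublishers.ofString(body(text)))
            .build();
    }
}
```

The resulting request can be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which returns the enrichments as a JSON string.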
This video is from a webinar produced for the "Building With Watson" series.
Clone the visualize-unstructured-data-with-watson repo locally. In a terminal, run:
git clone https://github.com/IBM/visualize-unstructured-data-with-watson
Create the following service:
- Watson Natural Language Understanding
The credentials for IBM Cloud services can be found in the Services menu in IBM Cloud by selecting the Service Credentials option for each service.
Use those values to update the config.properties file located in the src/main/resources directory. Replace the default values with the appropriate credentials (either an API key or a username and password). Note that quotes are not required.
# Watson Natural Language Understanding
NATURAL_LANGUAGE_UNDERSTANDING_URL=<add_nlu_url>
## Un-comment and use either username+password or IAM apikey.
NATURAL_LANGUAGE_UNDERSTANDING_IAM_APIKEY=<add_nlu_iam_apikey>
#NATURAL_LANGUAGE_UNDERSTANDING_USERNAME=<add_nlu_username>
#NATURAL_LANGUAGE_UNDERSTANDING_PASSWORD=<add_nlu_password>
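For readers curious how such a file can be consumed, the following is a minimal sketch of reading these keys with `java.util.Properties` and choosing between the IAM API key and username/password. It is not the repo's actual loading code. The class name `NluConfigSketch` is hypothetical, and treating values that still look like `<add_...>` placeholders as unset is an assumption made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the NLU settings in config.properties. */
final class NluConfigSketch {
    private static final String PREFIX = "NATURAL_LANGUAGE_UNDERSTANDING_";

    final String url;
    final String apiKey;   // null when username/password is used
    final String username; // null when an IAM API key is used
    final String password; // null when an IAM API key is used

    private NluConfigSketch(String url, String apiKey, String username, String password) {
        this.url = url;
        this.apiKey = apiKey;
        this.username = username;
        this.password = password;
    }

    /** Parses the text of a properties file; lines starting with '#' are comments. */
    static NluConfigSketch parse(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String url = require(props, PREFIX + "URL");
        String apiKey = props.getProperty(PREFIX + "IAM_APIKEY");
        if (isSet(apiKey)) {
            // Prefer the IAM API key when it is present.
            return new NluConfigSketch(url, apiKey.trim(), null, null);
        }
        return new NluConfigSketch(url, null,
            require(props, PREFIX + "USERNAME"), require(props, PREFIX + "PASSWORD"));
    }

    /** Assumption: an unedited placeholder such as <add_nlu_url> counts as unset. */
    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty() && !value.trim().startsWith("<");
    }

    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (!isSet(value)) {
            throw new IllegalStateException("Missing value for " + key);
        }
        return value.trim();
    }
}
```

To read the real file, the contents could be obtained with `Files.readString(Path.of("src/main/resources/config.properties"))` and passed to `parse`. Failing fast on a leftover placeholder surfaces a configuration mistake at startup rather than on the first upload.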
Maven 3.5 or later is used to build, test, and run the app. Check your Maven version using the following command:
mvn -v
To download and install Maven, see the Apache Maven website.
Note: If you would prefer not to download Maven, you can substitute the mvn portion of any Maven command with either ./mvnw (on Linux or Mac), or mvnw.cmd (on Windows). This will run a pre-installed local version of Maven that is included in this repo.
- Install and package the Java app by running the following Maven command (remember, you can substitute mvn with mvnw if you do not have Maven installed):
mvn clean install
- Start the app by running:
java -jar target/nlu-visual-1.0.jar
- Browse to http://localhost:8080 to see the app.
- To start the visualization process, select and upload a data file from your local file system. Note that while Apache Tika supports over a thousand different file types, this app has only been tested with a small set of standard document formats. For your convenience, we have included a few sample poems in the data subdirectory of this repo.
From the home page, you will be prompted to choose a file from your local system:
Select a file and press the Upload button. In this example, the file "The Raven.pdf" was selected from the data folder:
If you click on the Sentiments tab, you will see:
This code pattern is licensed under the Apache License, Version 2. Separate third-party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 and the Apache License, Version 2.