This project is by no means a fully Spark-aware implementation of cTAKES. It is meant to fuse the cTAKES processing engine with the parallelism provided by Spark/Hadoop, so that cTAKES can work at scale and access data stored in HDFS-like storage. It can be viewed as an attempt to provide an alternative to UIMA DUCC.
To make things completely Spark-aware, one would need to change the cTAKES processing engine code directly. For example, NER lookups would need to change from using HSQL (or any other relational database) to using data frames/RDDs for performance.
- Download and install Apache cTAKES v4.0.0 as shown below. It is important to install v4.0.0 as this is expected later on.
$ sudo su
$ cd /usr/local
$ wget "http://archive.apache.org/dist/ctakes/ctakes-4.0.0/apache-ctakes-4.0.0-bin.tar.gz"
$ tar -zxvf apache-ctakes-4.0.0-bin.tar.gz
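Before moving on, it can help to confirm that the archive was unpacked where later steps expect it. The snippet below is a minimal sketch, not part of the project: `check_ctakes_home` is a hypothetical helper, and the default path assumes the tarball unpacked to `/usr/local/apache-ctakes-4.0.0`.

```shell
# Hypothetical helper: reports whether a cTAKES install directory exists.
check_ctakes_home() {
  if [ -d "$1" ]; then
    echo "cTAKES found at $1"
  else
    echo "cTAKES not found at $1" >&2
    return 1
  fi
}

# Assumption: the archive unpacked to /usr/local/apache-ctakes-4.0.0.
# Override by exporting CTAKES_HOME before running this check.
check_ctakes_home "${CTAKES_HOME:-/usr/local/apache-ctakes-4.0.0}" || true
```

If the check reports that the directory is missing, re-run the `tar` step from `/usr/local` or point `CTAKES_HOME` at the actual location.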
cTAKES source code is available here.
$ sudo su
$ cd /usr/local
$ git clone https://github.com/yugagarin/ctakesspark.git
$ cd ctakesspark
$ apt install maven
cTAKES depends on descriptor files and resource databases for NER. Descriptor files can be downloaded from the official cTAKES source code, and resource databases can be obtained here.
Update CtakesFunction.java with your UMLS username and password. These can be obtained by registering with and signing the UMLS Metathesaurus License. Once you have them, proceed as below:
$ vim src/main/java/org/poc/ctakes/spark/CtakesFunction.java
or
$ vim src/main/java/org/poc/ctakes/spark/CtakesFlatMapFunction.java
Then populate the following two properties with your username and password respectively:
private void setup() throws UIMAException {
System.setProperty("ctakes.umlsuser", "");
System.setProperty("ctakes.umlspw", "");
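A build with blank credentials will compile, so the omission is easy to miss. As a rough pre-build check (a sketch only, run from the project root; `has_empty_umls_credentials` is a hypothetical helper, and the pattern assumes the properties are written exactly as in the snippet above), you could scan both files for values that are still empty:

```shell
# Hypothetical helper: succeeds if either UMLS property is still set to "".
has_empty_umls_credentials() {
  grep -Eq 'System\.setProperty\("ctakes\.umls(user|pw)",[[:space:]]*""\)' "$1"
}

# Check both source files mentioned above; files that do not exist are skipped.
for f in src/main/java/org/poc/ctakes/spark/CtakesFunction.java \
         src/main/java/org/poc/ctakes/spark/CtakesFlatMapFunction.java; do
  if [ -f "$f" ] && has_empty_umls_credentials "$f"; then
    echo "UMLS credentials are still empty in $f" >&2
  fi
done
```

The check only looks for empty string literals; it does not validate the credentials against UMLS.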
Then build the project.
$ mvn clean install
$ cd ./target
When you built the project as above, you may have noticed that the build also generates a spark-ctakes-0.1-shaded.jar artifact.
This artifact is a self-contained job JAR (uber jar) that contains all the dependencies required to run our application on an existing cluster.
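To confirm that the build actually produced the shaded JAR, a small check such as the following can be run from the `target` directory. It is a sketch under assumptions: `find_shaded_jar` is a hypothetical helper, and it relies only on the `-shaded.jar` suffix seen in the artifact name above.

```shell
# Hypothetical helper: prints the first *-shaded.jar found in the given directory.
find_shaded_jar() {
  for jar in "$1"/*-shaded.jar; do
    # If the glob matches nothing it stays literal, so test for a real file.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  return 1
}

# Run from ./target after the Maven build.
find_shaded_jar . || echo "no shaded jar found; check the Maven build output" >&2
```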
The Spark Documentation provides excellent context on how to submit your jobs. A bare bones example is provided below:
- Typical syntax using spark-submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Now an example for our application:
$ /usr/bin/spark-submit \
--class org.poc.ctakes.spark.CtakesSparkMain \
--master yarn --deploy-mode cluster \
--conf spark.executor.extraClassPath=/tmp/ctakesdependencies/ \
--conf spark.driver.extraClassPath=/tmp/ctakesdependencies/ \
--conf spark.driver.memory=5g --executor-memory=10g \
spark-ctakes-0.1-shaded.jar
Ctakesspark is licensed under the Apache License v2.0. A copy of that license is shipped with this source code.