This code is a modification of the original code from the PLStream repository, so all credit goes to the authors of the original code. You can check out the paper for which the code was written here.
The original code assumed it would be executed on a cluster of four fairly powerful nodes. The purpose of this repo is to make it possible to run the code on a local machine. More specifically, the code was tested in the following environment:
- OS: macOS Monterey v12.2.1
- Processor: 2 GHz Quad-Core Intel Core i5
- Memory: 16 GB 3733 MHz
To run the code, first make sure you have the software mentioned below installed.
Assuming you have brew installed, run the following:

```sh
brew install apache-flink
```
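To confirm that the installation worked, you can print the Flink version (assuming the brew formula put the flink client on your PATH; the exact output format differs between Flink releases):

```sh
flink --version
```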
Check if you have Java v8 installed (note that Java 8 only understands the single-dash form of the flag):

```sh
java -version
```
If not, then you can install it according to this tutorial.
Again, check that you have Python installed with at least version 3.7:

```sh
python --version
```
If not, you can follow the official docs to install it.
Assuming you are at the root of the repo, first download the Yelp review data and install the Python dependencies:

```sh
source setup.sh
```
> [!WARNING]
> You might run into issues when installing the dependencies, since you may already have versions of some packages installed that differ from the ones pinned in the requirements file. For now, just ignore these conflicts, but make sure that every package listed in requirements.txt ends up installed.
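One way to double-check that the dependencies resolved (this assumes setup.sh installs them with pip, which is not stated explicitly in the original repo):

```sh
pip install -r requirements.txt   # already-satisfied packages are skipped
pip check                         # reports broken or conflicting requirements
```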
Before executing the scripts, please run redis in a separate terminal window:

```sh
redis-server
```
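If you want to confirm that the server is up, you can ping it from another terminal using the standard redis-cli client that ships with Redis:

```sh
redis-cli ping   # should reply with PONG
```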
You might need to install it separately with:

```sh
brew install redis
```
Finally, you can run the code and get reviews with their corresponding labels:

```sh
cd src
plreview
```
If you get an error related to nltk, start the Python interpreter and run:

```python
import nltk
nltk.download('stopwords')
```
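Alternatively, the same download can be done in one line from the shell (assuming python points to the interpreter used by the project):

```sh
python -c "import nltk; nltk.download('stopwords')"
```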
The output is stored in a folder called output inside the src directory. In the output folder you will find sub-directories named in the form YY-mm-hh. If you enter the corresponding sub-folder, you can check its contents with:

```sh
ls -a
```
As you can see, this sub-folder contains several files. If you pick the most recently created one, you can inspect the output, for example with head.
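For instance (replace the placeholder in square brackets with the name of the most recent file, which changes from run to run):

```sh
head -n 20 [name_of_the_raw_file]
```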
As mentioned in the original docs:

> The outputs' form is "original text" + "label" + "@@@@". With help of a split("@@@@") function we can further reorganize the labelled dataset.
So, for example, you can run the following command to get the result in a nicer form in a file called result.out (make sure you replace the variable in square brackets; the -s flag squeezes each run of @ separators into a single newline, so you do not end up with blank lines between records):

```sh
cat [name_of_the_raw_file] | tr -s "@" "\n" > result.out
```
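If you prefer to do the reorganization in Python, here is a minimal sketch based on the split("@@@@") idea from the original docs (the split_raw_output helper and the example file path are made up for illustration):

```python
def split_raw_output(path):
    """Split raw PLStream output into labelled records using the "@@@@" separator."""
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()
    # Each record is "original text" + "label" followed by "@@@@"; drop empty pieces.
    return [record.strip() for record in raw.split("@@@@") if record.strip()]

# Hypothetical path -- use the actual raw file produced by your run.
records = split_raw_output("output/22-05-10/some_raw_file")
for record in records[:5]:
    print(record)
```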
You can check how the accuracy of the model evolves as you feed in more data by running the following:

```sh
cd src
placc
```
To check the output, follow steps similar to those in the previous section.
Here are some possible improvements:
- Add better logging, including the time it takes to execute particular parts
- Figure out how to better save the results (e.g. better naming of the output files)
- Figure out how to use all available cores (see the sketch below)
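For the last point, one possible starting direction (not part of the original repo, just a sketch using the public PyFlink API) would be to raise the job's parallelism to the number of local cores:

```python
import os

from pyflink.datastream import StreamExecutionEnvironment

# Sketch: ask Flink to run one parallel subtask per local CPU core.
# Whether this actually helps depends on how the PLStream job is structured.
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(os.cpu_count())
```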
If you encounter any problems, please report an issue.