Our project team consists of 4 people.
Type the following command to install the required packages:

```
sudo pip3 install -r requirements.txt
```
Create the following directory to store the scripts' output:

```
mkdir plots
```
Download the required data sets, unzip them, and store the results in the `data` directory.
This component splits the data into a test set and a data set. The test set is later used to validate and rank the analysis approaches.
You must first run:

```
python3 -m utils.data_exploration -aggregate
```

to aggregate the data into the `aggregated` directory.
The component consists of the class `Test_set_splitter` and its public methods `SplitAll()` and `SplitAndAppendToExisting()`. To use them, the data must be in `.csv` format in the `aggregated` folder. The component then splits the data into `test_set/test_set.csv` and `data_set/data_set.csv`.
To split all data, use:

```
python3 -m utils.test_set_splitter
```

To get more verbose output, use:

```
python3 -m utils.test_set_splitter --log
```

To append new data to the existing sets, use:

```
python3 -m utils.test_set_splitter --append [path_to_file]
```
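For illustration, the split itself can be sketched in a few lines. This is a minimal sketch, not the actual `Test_set_splitter` implementation; the 80/20 ratio, the `split_csv` name, and the fixed seed are all assumptions:

```python
import csv
import random
from pathlib import Path

def split_csv(src, test_dir="test_set", data_dir="data_set",
              test_ratio=0.2, seed=42):
    """Randomly split rows of one aggregated CSV into a test set and a data set."""
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)          # first row is assumed to be the header
        rows = list(reader)

    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    cut = int(len(rows) * test_ratio)
    parts = [(test_dir, "test_set.csv", rows[:cut]),
             (data_dir, "data_set.csv", rows[cut:])]

    for directory, name, part in parts:
        Path(directory).mkdir(parents=True, exist_ok=True)
        with open(Path(directory) / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return cut, len(rows) - cut
```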
- Aggregating data in the `data` folder, which consists of multiple `.xml` files separated by domain. The script then saves the data as one `.csv` file per domain in the `aggregated` directory.
- Preparing a `results.json` file for each domain. The file contains the results of a preliminary analysis of the data sets. Examples of attributes exported to the JSON file:
  - numberOfPositives
  - numberOfNegatives
  - averageTextLengthWhenPolarityPositiveChars
  - averageTextLengthWhenPolarityNegativeChars
  - etc.
- Preparing plots of selected attributes to visualise the data.
- `-debug` - print error logs
- `-dump` - dump the analysis results to the `results.json` file
- `-plot` - show the analysis plots and save them to the `plots` directory
- `-aggregate` - aggregate data into the `aggregated` directory
Note: at this point the results analysis sometimes crashes. The exception is caught and the value `21.37` is temporarily saved to the results.
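The XML-to-CSV aggregation step described above might look roughly like this sketch. The `<review>`, `<text>` and `<polarity>` element names and the `aggregate_domain` function are assumptions, since the real XML schema is not shown here:

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

def aggregate_domain(domain_dir, out_dir="aggregated"):
    """Collect every .xml file of one domain folder into a single CSV."""
    rows = []
    for xml_file in sorted(Path(domain_dir).glob("*.xml")):
        root = ET.parse(xml_file).getroot()
        # Assumed schema: <review><text>...</text><polarity>...</polarity></review>
        for review in root.iter("review"):
            rows.append([review.findtext("text", ""),
                         review.findtext("polarity", "")])

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(out_dir) / (Path(domain_dir).name + ".csv")
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "polarity"])
        writer.writerows(rows)
    return out_path, len(rows)
```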
This model uses linear regression, trained with SGD (stochastic gradient descent) to minimise the loss function.
- have generated `test_set` and `data_set`
To teach the model and save it to a binary file for later use, type:

```
python3 -m baselines.mean_length_baseline -teach
```

To evaluate the model on the test set:

```
python3 -m baselines.mean_length_baseline -evaluate
```
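A single-feature linear model of that kind, trained with SGD on squared loss, could be sketched as below. The centred length feature, the learning rate, and the `teach`/`evaluate` names are assumptions, not the actual `mean_length_baseline` code:

```python
import random

def teach(lengths, labels, epochs=200, lr=1e-4, seed=0):
    """Fit score = w * (length - mu) + b with SGD on squared loss.

    labels are +1 ('positive') / -1 ('negative'); the feature is the
    text length in characters, centred so that SGD stays stable.
    """
    mu = sum(lengths) / len(lengths)
    w, b = 0.0, 0.0
    rng = random.Random(seed)
    idx = list(range(len(lengths)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            x = lengths[i] - mu
            err = w * x + b - labels[i]   # gradient of 0.5 * err**2
            w -= lr * err * x
            b -= lr * err
    return w, b, mu

def evaluate(w, b, mu, lengths, labels):
    """Accuracy when classifying by the sign of the linear score."""
    hits = sum(1 for x, y in zip(lengths, labels)
               if (w * (x - mu) + b >= 0) == (y > 0))
    return hits / len(labels)
```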
This model calculates a border value based on the text score, which is later used to decide whether a text has 'positive' or 'negative' polarity. The border value is simply the mean of all text scores from `data_set`.
- have generated `test_set` and `data_set`
To teach the model and save it to a binary file for later use, type:

```
python3 -m baselines.word_polarity_counting -teach
```

To evaluate the model on the test set:

```
python3 -m baselines.word_polarity_counting -evaluate
```
Additional options:
- `--log` - get more information about the script's progress
- `--parallel` - teach the model on multiple processors; by default only one is used
- `--help` - get more information
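The border-value decision described above can be illustrated with a small sketch; the function names here are hypothetical:

```python
def teach_border(scores):
    """The border value is the mean of all text scores in the data set."""
    return sum(scores) / len(scores)

def predict_polarity(border, score):
    """Texts scoring above the border are 'positive', the rest 'negative'."""
    return "positive" if score > border else "negative"

def evaluate_border(border, scores, labels):
    """Fraction of texts whose predicted polarity matches the gold label."""
    hits = sum(1 for s, y in zip(scores, labels)
               if predict_polarity(border, s) == y)
    return hits / len(labels)
```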
- have generated `test_set` and `data_set`
If you have not installed FastText yet, just run the script:

```
fasttext_tool/setup.sh
```

It will download and install FastText (one command requires sudo privileges).
To test and run FastText, just type:

```
python3 -m fasttext_tool.fast
```

The script generates the `data.train` and `data.test` files needed for further training and evaluation. If such files already exist in `fasttext_tool/`, the script skips the generation step. After that, the script trains a model on `data.train` and saves it to `model.bin`; if that file already exists in `fasttext_tool/`, this step is skipped too. The last step evaluates the model on `data.train`. The evaluation results are saved to `fasttext_tool/results.txt` and printed to the console.
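The skip-if-exists flow described above can be summarised in a short skeleton. The three callables are placeholders for the real FastText data-generation, training and evaluation code, and evaluating on `data.test` is an assumption of this sketch:

```python
import os

def run_pipeline(generate, train, evaluate, workdir="fasttext_tool"):
    """Run generate/train/evaluate, skipping any step whose output exists."""
    os.makedirs(workdir, exist_ok=True)
    train_file = os.path.join(workdir, "data.train")
    test_file = os.path.join(workdir, "data.test")
    model_file = os.path.join(workdir, "model.bin")

    steps_run = []
    if not (os.path.exists(train_file) and os.path.exists(test_file)):
        generate(train_file, test_file)       # write data.train / data.test
        steps_run.append("generate")
    if not os.path.exists(model_file):
        train(train_file, model_file)         # save trained model to model.bin
        steps_run.append("train")
    results = evaluate(model_file, test_file) # evaluation always runs
    steps_run.append("evaluate")

    with open(os.path.join(workdir, "results.txt"), "w") as f:
        f.write(str(results))                 # mirror of the console output
    print(results)
    return steps_run
```

On a second run, with `data.train`, `data.test` and `model.bin` already present, only the evaluation step executes.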