Microposts original data not matching with what gerbil expects #206

Closed
sagnik opened this issue Aug 3, 2017 · 18 comments

@sagnik

sagnik commented Aug 3, 2017

Please refer to #41. The wiki mentions that GERBIL expects the data in certain formats, specifically:

Microposts2013

gerbil_data/datasets/microposts2013/goldStandard.tsv
gerbil_data/datasets/microposts2013/testSet.tsv
gerbil_data/datasets/microposts2013/TweetsTrainingSetCH.tsv

Microposts2014

gerbil_data/datasets/microposts2014/Microposts2014-NEEL_challenge_TweetsTestSet.csv
gerbil_data/datasets/microposts2014/Microposts2014-NEEL_challenge_TweetsTrainingSet.csv

Microposts2015

gerbil_data/datasets/microposts2015/dev/NEEL2015-dev-gold_v3.tsv
gerbil_data/datasets/microposts2015/dev/NEEL2015-dev-tweets.tsv
gerbil_data/datasets/microposts2015/test/NEEL2015-test-gold_v2.tsv
gerbil_data/datasets/microposts2015/test/NEEL2015-test-tweets.tsv
gerbil_data/datasets/microposts2015/training/NEEL2015-training-gold_v4.tsv
gerbil_data/datasets/microposts2015/training/NEEL2015-training-tweets_v2.tsv

Microposts2016

gerbil_data/datasets/microposts2016/Dev Set/NEEL2016-dev.tsv
gerbil_data/datasets/microposts2016/Dev Set/NEEL2016-dev_neel.gs
gerbil_data/datasets/microposts2016/Test Set/NEEL2016-test.tsv
gerbil_data/datasets/microposts2016/Test Set/NEEL2016-test_neel.gs
gerbil_data/datasets/microposts2016/Training Set/NEEL2016-training.tsv
gerbil_data/datasets/microposts2016/Training Set/NEEL2016-training_neel.gs

I downloaded microposts data from the following sources:

For 2013, the contents of the zip file do match. For the others, the contents are as follows:

2014

Microposts2014-NEEL_Dataset-Test.GS
Microposts2014-NEEL_challenge_README  
Microposts2014-NEEL_Dataset-Train.GS

2015

AUTHOR.txt1
microposts2015-neel_challenge_gs_cc-by4.0_license.txt
NEEL2015-dev-gold.tsv
NEEL2015-dev-tweets-ids.tsv
NEEL2015-README
NEEL2015-test-gold.tsv
NEEL2015-test-tweets-ids.tsv
NEEL2015-training-gold.tsv
NEEL2015-training-tweets-ids.tsv

2016

microposts2016-neel-dev_neel.gs
microposts2016-neel-dev-tweets-ids.tsv
microposts2016-neel-README
microposts2016-neel-test_neel.gs
microposts2016-neel-test-tweets-ids.tsv
microposts2016-neel-training_neel.gs
microposts2016-neel-training-tweets-ids.tsv

This is clearly different from what GERBIL expects. If you have any suggestions, please let me know. Also, as @TortugaAttack suggested in #41, I went through the logs on my local machine, and it does seem that the microposts data is not loaded:

sagnik@research:~/gerbil$ cat gerbil.log | grep -i micropost 
2017-08-02 01:03:28,793 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2014-Train" failed. It won't be available.>
2017-08-02 01:03:28,794 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2013-Test" failed. It won't be available.>
2017-08-02 01:03:28,794 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2013-Train" failed. It won't be available.>
2017-08-02 01:03:28,796 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2016-Train" failed. It won't be available.>
2017-08-02 01:03:28,798 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2015-Train" failed. It won't be available.>
2017-08-02 01:03:28,799 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2016-Dev" failed. It won't be available.>
2017-08-02 01:03:28,799 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2016-Test" failed. It won't be available.>
2017-08-02 01:03:28,799 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2015-Dev" failed. It won't be available.>
2017-08-02 01:03:28,800 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2015-Test" failed. It won't be available.>
2017-08-02 01:03:28,800 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2014-Test" failed. It won't be available.>
2017-08-02 02:34:10,926 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2014-Train" failed. It won't be available.>
2017-08-02 02:34:10,926 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2013-Test" failed. It won't be available.>
2017-08-02 02:34:10,927 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2013-Train" failed. It won't be available.>
2017-08-02 02:34:10,928 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2016-Train" failed. It won't be available.>
2017-08-02 02:34:10,929 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2015-Train" failed. It won't be available.>
2017-08-02 02:34:10,930 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2016-Dev" failed. It won't be available.>
2017-08-02 02:34:10,930 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2016-Test" failed. It won't be available.>
2017-08-02 02:34:10,930 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2015-Dev" failed. It won't be available.>
2017-08-02 02:34:10,930 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2015-Test" failed. It won't be available.>
2017-08-02 02:34:10,931 [localhost-startStop-1] INFO [org.aksw.gerbil.web.config.DatasetsConfig] - <Check for dataset "Microposts2014-Test" failed. It won't be available.>

@TortugaAttack
Contributor

Thanks for creating a new issue! :)

Hmm, it looks all right. The datasets you linked are correct and should work with the implemented wrapper.
The logs suggest that the Microposts files can't be found at the paths given in the properties.
I will look into it more closely tomorrow or this evening!

@TortugaAttack TortugaAttack self-assigned this Aug 3, 2017
@MichaelRoeder
Member

I assume that you simply have to rename the files and move them into the directory where they are expected. Can you please try this?

The "check" that GERBIL does when starting the server is very simple: it only looks for the files that it expects. If there were a problem with the data inside the files, you would encounter it when you try to benchmark something with these datasets.
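
A rough sketch of that kind of existence check, for anyone who wants to verify their local layout before starting the server. This is not GERBIL's code: the function name find_missing and the EXPECTED_FILES list are hypothetical, and only the two 2014 paths from the wiki list quoted above are filled in.

```python
from pathlib import Path

# Paths copied from the wiki list quoted in this issue (2014 entries only;
# the other years can be added the same way).
EXPECTED_FILES = [
    "gerbil_data/datasets/microposts2014/Microposts2014-NEEL_challenge_TweetsTestSet.csv",
    "gerbil_data/datasets/microposts2014/Microposts2014-NEEL_challenge_TweetsTrainingSet.csv",
]


def find_missing(expected, root="."):
    """Return the expected relative paths that are not regular files under root."""
    base = Path(root)
    return [path for path in expected if not (base / path).is_file()]


if __name__ == "__main__":
    # Run from the directory that contains gerbil_data/.
    for path in find_missing(EXPECTED_FILES):
        print("missing:", path)
```

Like the startup check, this only says whether a file with the right name exists; it says nothing about whether the content can be parsed.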

@TortugaAttack
Contributor

I used the Microposts 2014 set you provided, added the files to gerbil_data, and changed the file names. It worked.

So I guess what MichaelRoeder just said is correct.

@sagnik
Author

sagnik commented Aug 3, 2017 via email

@sagnik
Author

sagnik commented Aug 3, 2017

This is what I have done:

2014
----
mv Microposts2014-NEEL_Dataset-Test.GS Microposts2014-NEEL_challenge_TweetsTestSet.csv
mv Microposts2014-NEEL_Dataset-Train.GS Microposts2014-NEEL_challenge_TweetsTrainingSet.csv

2015
-------
mv NEEL2015-dev-gold.tsv dev/NEEL2015-dev-gold_v3.tsv
mv NEEL2015-dev-tweets-ids.tsv dev/NEEL2015-dev-tweets.tsv
mv NEEL2015-test-gold.tsv test/NEEL2015-test-gold_v2.tsv
mv NEEL2015-test-tweets-ids.tsv test/NEEL2015-test-tweets.tsv
mv NEEL2015-training-gold.tsv training/NEEL2015-training-gold_v4.tsv
mv NEEL2015-training-tweets-ids.tsv training/NEEL2015-training-tweets_v2.tsv


2016
------

mv microposts2016-neel-dev_neel.gs Dev\ Set/NEEL2016-dev_neel.gs
mv microposts2016-neel-dev-tweets-ids.tsv Dev\ Set/NEEL2016-dev.tsv
mv microposts2016-neel-test_neel.gs Test\ Set/NEEL2016-test_neel.gs
mv microposts2016-neel-test-tweets-ids.tsv Test\ Set/NEEL2016-test.tsv 
mv microposts2016-neel-training_neel.gs Training\ Set/NEEL2016-training_neel.gs
mv microposts2016-neel-training-tweets-ids.tsv Training\ Set/NEEL2016-training.tsv

Is the mapping correct? I ask because this is the output I get:

Annotator Dataset Micro F1
AIDA Microposts2014-Test The dataset couldn't be loaded.
AIDA Microposts2014-Train The dataset couldn't be loaded.
AIDA Microposts2015-Dev Got an unexpected exception while running the experiment.
AIDA Microposts2015-Test Got an unexpected exception while running the experiment.
AIDA Microposts2015-Train Got an unexpected exception while running the experiment.
AIDA Microposts2016-Dev Got an unexpected exception while running the experiment.
AIDA Microposts2016-Test Got an unexpected exception while running the experiment.
AIDA Microposts2016-Train Got an unexpected exception while running the experiment.
Babelfy Microposts2014-Test The dataset couldn't be loaded.
Babelfy Microposts2014-Train The dataset couldn't be loaded.
Babelfy Microposts2015-Dev Got an unexpected exception while running the experiment.
Babelfy Microposts2015-Test Got an unexpected exception while running the experiment.
Babelfy Microposts2015-Train Got an unexpected exception while running the experiment.
Babelfy Microposts2016-Dev Got an unexpected exception while running the experiment.
Babelfy Microposts2016-Test Got an unexpected exception while running the experiment.
Babelfy Microposts2016-Train Got an unexpected exception while running the experiment.
DBpedia Spotlight Microposts2014-Test The dataset couldn't be loaded.
DBpedia Spotlight Microposts2014-Train The dataset couldn't be loaded.
DBpedia Spotlight Microposts2015-Dev Got an unexpected exception while running the experiment.
DBpedia Spotlight Microposts2015-Test Got an unexpected exception while running the experiment.
DBpedia Spotlight Microposts2015-Train Got an unexpected exception while running the experiment.
DBpedia Spotlight Microposts2016-Dev Got an unexpected exception while running the experiment.
DBpedia Spotlight Microposts2016-Test Got an unexpected exception while running the experiment.
DBpedia Spotlight Microposts2016-Train Got an unexpected exception while running the experiment.
Dexter Microposts2014-Test The dataset couldn't be loaded.
Dexter Microposts2014-Train The dataset couldn't be loaded.
Dexter Microposts2015-Dev Got an unexpected exception while running the experiment.
Dexter Microposts2015-Test Got an unexpected exception while running the experiment.
Dexter Microposts2015-Train Got an unexpected exception while running the experiment.
Dexter Microposts2016-Dev Got an unexpected exception while running the experiment.
Dexter Microposts2016-Test Got an unexpected exception while running the experiment.
Dexter Microposts2016-Train Got an unexpected exception while running the experiment.
FOX Microposts2014-Test The dataset couldn't be loaded.
FOX Microposts2014-Train The dataset couldn't be loaded.
FOX Microposts2015-Dev Got an unexpected exception while running the experiment.
FOX Microposts2015-Test Got an unexpected exception while running the experiment.
FOX Microposts2015-Train Got an unexpected exception while running the experiment.
FOX Microposts2016-Dev Got an unexpected exception while running the experiment.
FOX Microposts2016-Test Got an unexpected exception while running the experiment.
FOX Microposts2016-Train Got an unexpected exception while running the experiment.
FRED Microposts2014-Test The dataset couldn't be loaded.
FRED Microposts2014-Train The dataset couldn't be loaded.
FRED Microposts2015-Dev Got an unexpected exception while running the experiment.
FRED Microposts2015-Test Got an unexpected exception while running the experiment.
FRED Microposts2015-Train Got an unexpected exception while running the experiment.
FRED Microposts2016-Dev Got an unexpected exception while running the experiment.
FRED Microposts2016-Test Got an unexpected exception while running the experiment.
FRED Microposts2016-Train Got an unexpected exception while running the experiment.
FREME NER Microposts2014-Test The dataset couldn't be loaded.
FREME NER Microposts2014-Train The dataset couldn't be loaded.
FREME NER Microposts2015-Dev Got an unexpected exception while running the experiment.
FREME NER Microposts2015-Test Got an unexpected exception while running the experiment.
FREME NER Microposts2015-Train Got an unexpected exception while running the experiment.
FREME NER Microposts2016-Dev Got an unexpected exception while running the experiment.
FREME NER Microposts2016-Test Got an unexpected exception while running the experiment.
FREME NER Microposts2016-Train Got an unexpected exception while running the experiment.
WAT Microposts2014-Test The dataset couldn't be loaded.
WAT Microposts2014-Train The dataset couldn't be loaded.
WAT Microposts2015-Dev Got an unexpected exception while running the experiment.
WAT Microposts2015-Test Got an unexpected exception while running the experiment.
WAT Microposts2015-Train Got an unexpected exception while running the experiment.
WAT Microposts2016-Dev Got an unexpected exception while running the experiment.
WAT Microposts2016-Test Got an unexpected exception while running the experiment.
WAT Microposts2016-Train Got an unexpected exception while running the experiment.
xLisa-NER Microposts2014-Test The dataset couldn't be loaded.
xLisa-NER Microposts2014-Train The dataset couldn't be loaded.
xLisa-NER Microposts2015-Dev Got an unexpected exception while running the experiment.
xLisa-NER Microposts2015-Test Got an unexpected exception while running the experiment.
xLisa-NER Microposts2015-Train Got an unexpected exception while running the experiment.
xLisa-NER Microposts2016-Dev Got an unexpected exception while running the experiment.
xLisa-NER Microposts2016-Test Got an unexpected exception while running the experiment.
xLisa-NER Microposts2016-Train Got an unexpected exception while running the experiment.
xLisa-NGRAM Microposts2014-Test The dataset couldn't be loaded.
xLisa-NGRAM Microposts2014-Train The dataset couldn't be loaded.
xLisa-NGRAM Microposts2015-Dev Got an unexpected exception while running the experiment.
xLisa-NGRAM Microposts2015-Test Got an unexpected exception while running the experiment.
xLisa-NGRAM Microposts2015-Train Got an unexpected exception while running the experiment.
xLisa-NGRAM Microposts2016-Dev Got an unexpected exception while running the experiment.
xLisa-NGRAM Microposts2016-Test Got an unexpected exception while running the experiment.
xLisa-NGRAM Microposts2016-Train Got an unexpected exception while running the experiment.

@TortugaAttack
Contributor

cp Microposts2014-NEEL_Dataset-Test.GS gerbil_data/datasets/microposts2014/Microposts2014-NEEL_challenge_TweetsTestSet.csv

@TortugaAttack
Contributor

TortugaAttack commented Aug 3, 2017

Also, the filenames are not hardcoded; they are set in the dataset.properties file in src/main/properties ;)

@sagnik
Author

sagnik commented Aug 3, 2017

I followed your instructions exactly. This is what I get in the log when trying to run an experiment with type=A2KB; match=weak; annotator=aida; dataset=Microposts 2014 Train, Microposts 2014 Test:

2017-08-03 17:57:03,700 [pool-1-thread-2] ERROR [org.aksw.gerbil.execute.ExperimentTask] - <Got an error while running the task. Storing the error code in the db...>
GerbilException: GerbilException: Dataset is malformed. Each line shoud have an even number of cells. Malformed line = [86309321994022913] (error type -104: The dataset couldn't be loaded.) (error type -104: The dataset couldn't be loaded.)
        at org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getDataset(AbstractDatasetConfiguration.java:52)
        at org.aksw.gerbil.execute.ExperimentTask.run(ExperimentTask.java:102)
        at org.aksw.simba.topicmodeling.concurrent.workers.WorkerImpl.run(WorkerImpl.java:44)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: GerbilException: Dataset is malformed. Each line shoud have an even number of cells. Malformed line = [86309321994022913] (error type -104: The dataset couldn't be loaded.)
        at org.aksw.gerbil.dataset.impl.micro.Microposts2014Dataset.loadDocuments(Microposts2014Dataset.java:105)
        at org.aksw.gerbil.dataset.impl.micro.Microposts2014Dataset.init(Microposts2014Dataset.java:77)
        at org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getPreparedDataset(AbstractDatasetConfiguration.java:62)
        at org.aksw.gerbil.dataset.SingletonDatasetConfigImpl.getPreparedDataset(SingletonDatasetConfigImpl.java:50)
        at org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getDataset(AbstractDatasetConfiguration.java:50)
        ... 5 more

@TortugaAttack
Contributor

Hmm, the dataset has empty lines (only IDs).
The wrapper was probably written with a cleaned-up dataset.

I will handle it in the code, but I'm not sure right now what the best way is,
so I will take care of it tomorrow.
For now, simply remove all those empty lines. ;)

This seems to be the problem for all the datasets.

@sagnik
Author

sagnik commented Aug 3, 2017

Yup. It fails with empty id lines. Can confirm. But more importantly, for 2014, the filetype is TSV, not CSV. If an id has multiple named entities, each gets separated by a tab.
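
To find the lines that would trip the "even number of cells" rule from the stack trace above, a small diagnostic along these lines may help. It is only a sketch: the function name odd_cell_lines is made up, tab is assumed as the separator, and the code mirrors the rule quoted in the error message rather than GERBIL's actual parser.

```python
import sys


def odd_cell_lines(lines, sep="\t"):
    """Yield (line_number, cell_count) for every line with an odd cell count.

    Completely blank lines are skipped. Only the rule quoted in the error
    message is checked; nothing else about the format is validated.
    """
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        if not stripped:
            continue
        count = len(stripped.split(sep))
        if count % 2 == 1:
            yield number, count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <dataset file>
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, count in odd_cell_lines(handle):
            print(f"line {number}: {count} cells")
```

An ID-only line shows up with a count of 1, which matches the malformed line reported in the log.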

@TortugaAttack
Contributor

This Python script should do the line-removal trick (tested only with MP2014):

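import re
import sys

# Usage: python removeLines.py <input file> <output file>
input_path = sys.argv[1]
output_path = sys.argv[2]

# A line that consists only of digits is a tweet ID without any annotation.
id_only = re.compile(r'[0-9]+$')

# The with statement closes both files even if an error occurs.
with open(input_path) as source, open(output_path, 'w') as out:
    for line in source:
        # Copy every line except the ID-only ones.
        if not id_only.match(line):
            out.write(line)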

python removeLines.py Microposts2014-NEEL_Dataset-Test.GS Microposts2014-NEEL_challenge_TweetsTestSet.csv

@sagnik
Author

sagnik commented Aug 3, 2017

You have been very helpful, @TortugaAttack! I am assuming you are converting the micropost data to NIF format? If you could point me to the code, I could give it a try. Thanks anyway.

TortugaAttack pushed a commit that referenced this issue Aug 3, 2017
@TortugaAttack
Contributor

TortugaAttack commented Aug 3, 2017

Ah, thanks!
But I just did it :D
If you are still interested in it: https://github.com/dice-group/gerbil/blob/version1.2.6/src/main/java/org/aksw/gerbil/dataset/impl/micro/Microposts2014Dataset.java

2014 should work now.
I'm not sure about 2015 and 2016, since I cannot download them. Can you provide the logs for them if they still do not work?

@sagnik
Author

sagnik commented Aug 3, 2017

OK, I'll test and keep you updated.

@sagnik
Author

sagnik commented Aug 6, 2017

Extremely sorry for the late reply, but this is the result I get, which I don't think is correct:

Annotator Dataset Micro F1 Micro Precision Micro Recall Macro F1 Macro Precision Macro Recall Error Count avg millis/doc confidence threshold Timestamp GERBIL version
AIDA Microposts2014-Test 1 1 1 1 1 1 0 895 0 2017-08-06 19:49:38 1.2.5
AIDA Microposts2014-Test Entity Recognition 1 1 1 1 1 1 0 2017-08-06 19:49:38 1.2.5
AIDA Microposts2014-Test D2KB 1 1 1 1 1 1 0 2017-08-06 19:49:38 1.2.5

@TortugaAttack
Contributor

Nope, that is not correct :D
http://gerbil.aksw.org/gerbil/experiment?id=201708070004
The global version works.
Hence I think there is a difference between the dataset we use globally and the dataset you use.
I will look into that.

@TortugaAttack
Contributor

TortugaAttack commented Sep 1, 2017

Sorry it took so long!
I was finally able to check the datasets against each other.
The ones you use are quite different from the ones we have.

Yours are missing the tweet itself (this is why the dataset has empty lines) and thus cannot be used with the MP2014 wrapper.
What happens is that the first annotation is used as the tweet and is sent to the annotator (this results in the 1s with AIDA).
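
A toy illustration of that effect, assuming a line layout of ID, tweet text, then annotation cells. The layout is inferred from this thread and not copied from Microposts2014Dataset.java; split_record and the sample line are made up for the example.

```python
def split_record(line, sep="\t"):
    """Split a line assumed to be laid out as: ID, tweet text, annotation cells.

    Returns (tweet_id, text, remaining_cells). The text is an empty string
    when the line holds nothing but the ID.
    """
    cells = line.rstrip("\r\n").split(sep)
    tweet_id = cells[0]
    text = cells[1] if len(cells) > 1 else ""
    return tweet_id, text, cells[2:]


# Made-up line without the tweet text: ID, mention, URI.
tweet_id, text, remaining = split_record("123\tObama\thttp://example.org/Obama")
print(text)  # the mention ends up in the slot where the tweet text is expected
```

With the tweet text missing, every cell shifts one position to the left, so the mention is what ends up being treated as the document text.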

Sorry this took so long!

@MichaelRoeder
Member

@sagnik If the last post answered your question, please close this issue. Thanks.
