Increase Database performance #260

MichaelRoeder · 2018-06-26T16:55:39Z

Description

The current structure of the database leads to very slow performance; e.g., some experiments cannot even be loaded anymore (http://gerbil.aksw.org/gerbil/experiment?id=201603140002). We need to improve performance by restructuring the database and making it more flexible for future additions of data.

This should take #79 into account.

Solution

Restructure the database
Offer a class that transforms old databases of GERBIL into the new format

RicardoUsbeck · 2018-06-26T17:11:17Z

Is there an ETA?
How many experiments are in the current DB?
Do you need help programming? I could probably organize something.
Can we in the meantime increase the RAM of the java process to help?

MichaelRoeder · 2018-06-27T09:04:00Z

I don't think that it is even restricted.

Restarting the application was helpful. At least the experiment is shown now. However, this is not a long-term solution 😉

MichaelRoeder · 2018-06-28T15:43:10Z

I have an idea of what the new DB could look like:

-- table for experiments remains as it is
CREATE TABLE IF NOT EXISTS Experiments (
  id VARCHAR(300) NOT NULL,
  taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
  PRIMARY KEY (id, taskId)
);

-- table for experiment tasks will be shortened. annotatorName has been renamed to systemName
CREATE TABLE IF NOT EXISTS ExperimentTasks (
id int GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1) PRIMARY KEY,
experimentType VARCHAR(20),
matching VARCHAR(50),
systemName VARCHAR(100),
datasetName VARCHAR(100),
state int,
lastChanged TIMESTAMP,
version VARCHAR(20)
);

-- index on experiment tasks won't be changed (unless somebody has an idea how to improve it ;) )
DROP INDEX IF EXISTS ExperimentTaskConfig;
CREATE INDEX ExperimentTaskConfig ON ExperimentTasks (matching,experimentType,systemName,datasetName);

-- ExperimentTasks_Version table will be dropped

-- ExperimentTasks_AdditionalResults will be renamed to ExperimentTasks_DoubleResults
CREATE TABLE IF NOT EXISTS ExperimentTasks_DoubleResults (
resultId int NOT NULL FOREIGN KEY REFERENCES ResultNames(id),
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
value double,
PRIMARY KEY (resultId, taskId)
);

-- New table added for int results (e.g., number of errors)
CREATE TABLE IF NOT EXISTS ExperimentTasks_IntResults (
resultId int NOT NULL FOREIGN KEY REFERENCES ResultNames(id),
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
value int,
PRIMARY KEY (resultId, taskId)
);

-- New table added for blob results (e.g., ROC)
CREATE TABLE IF NOT EXISTS ExperimentTasks_BlobResults (
resultId int NOT NULL FOREIGN KEY REFERENCES ResultNames(id),
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
value BLOB,
PRIMARY KEY (resultId, taskId)
);

-- New table added for mapping from resultId to resultName (optional but would make the solution cleaner)
CREATE TABLE IF NOT EXISTS ResultNames (
id int GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1) PRIMARY KEY,
name VARCHAR(50)
);

-- SubTask table remains the same
CREATE TABLE IF NOT EXISTS ExperimentTasks_SubTasks (
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
subTaskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
PRIMARY KEY (taskId, subTaskId)
);

With the solution above we have a clear separation between the data every experiment task has and the results (which can vary between the different types of experiments). Additionally, we could query all the results for a given experiment task (e.g., the task with id 123) easily with

SELECT type, name, value
FROM (
    SELECT 'double' AS type, resultId, value FROM ExperimentTasks_DoubleResults WHERE taskId=123
    UNION
    SELECT 'int' AS type, resultId, value FROM ExperimentTasks_IntResults WHERE taskId=123
    UNION
    SELECT 'blob' AS type, resultId, value FROM ExperimentTasks_BlobResults WHERE taskId=123
) AS results JOIN ResultNames ON results.resultId = ResultNames.id

and the program code can use the type to determine how to handle the content of value.
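As a rough illustration of the query above and the type-based handling, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for HSQLDB (an assumption made only so the example runs without extra setup), a reduced version of the result tables, and made-up result names and values. The function name load_results is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for HSQLDB; only the result tables are created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ResultNames (id INTEGER PRIMARY KEY, name VARCHAR(50));
CREATE TABLE ExperimentTasks_DoubleResults (
    resultId INTEGER NOT NULL REFERENCES ResultNames(id),
    taskId INTEGER NOT NULL, value DOUBLE,
    PRIMARY KEY (resultId, taskId));
CREATE TABLE ExperimentTasks_IntResults (
    resultId INTEGER NOT NULL REFERENCES ResultNames(id),
    taskId INTEGER NOT NULL, value INTEGER,
    PRIMARY KEY (resultId, taskId));
CREATE TABLE ExperimentTasks_BlobResults (
    resultId INTEGER NOT NULL REFERENCES ResultNames(id),
    taskId INTEGER NOT NULL, value BLOB,
    PRIMARY KEY (resultId, taskId));
""")

# Made-up sample data for two tasks (123 and 124).
conn.executemany("INSERT INTO ResultNames VALUES (?, ?)",
                 [(1, "Micro F1"), (2, "Error count"), (3, "ROC")])
conn.executemany("INSERT INTO ExperimentTasks_DoubleResults VALUES (?, ?, ?)",
                 [(1, 123, 0.75), (1, 124, 0.5)])
conn.execute("INSERT INTO ExperimentTasks_IntResults VALUES (2, 123, 3)")
conn.execute("INSERT INTO ExperimentTasks_BlobResults VALUES (3, 123, ?)",
             (b"\x01\x02",))

QUERY = """
SELECT r.type, n.name, r.value
FROM (
    SELECT 'double' AS type, resultId, value
      FROM ExperimentTasks_DoubleResults WHERE taskId = ?
    UNION ALL
    SELECT 'int' AS type, resultId, value
      FROM ExperimentTasks_IntResults WHERE taskId = ?
    UNION ALL
    SELECT 'blob' AS type, resultId, value
      FROM ExperimentTasks_BlobResults WHERE taskId = ?
) AS r
JOIN ResultNames AS n ON n.id = r.resultId
ORDER BY n.id
"""

# One converter per value of the type column.
CONVERTERS = {"double": float, "int": int, "blob": bytes}


def load_results(connection, task_id):
    """Return {result name: value} for one task, converted according to type."""
    rows = connection.execute(QUERY, (task_id, task_id, task_id))
    return {name: CONVERTERS[kind](value) for kind, name, value in rows}


print(load_results(conn, 123))
```

Note that SQLite is dynamically typed, so it accepts a UNION over DOUBLE, INTEGER, and BLOB columns; a stricter database may require the three value columns to be cast to a common type or fetched with separate queries.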

Opinions? Ideas? @RicardoUsbeck ?
We should also make sure that we use indexes wherever possible 🤔

RicardoUsbeck · 2018-06-29T07:41:56Z

I very much like that solution.

Maybe here we can change annotatorName to systemName
Also, we should add more indices than the one above, based on the queries in the system. For example, for the query above, an index over taskId in all three result tables would make sense.
Will the database system be the same?
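To make the indexing suggestion concrete: the composite primary key (resultId, taskId) leads with resultId, so a lookup by taskId alone may not be able to use it. The sketch below checks this in Python against an in-memory SQLite database, which is an assumption standing in for HSQLDB; the index name DoubleResults_TaskId and the helper query_plan are hypothetical, and other databases may plan the query differently.

```python
import sqlite3


def query_plan(connection, sql, params=()):
    """Return the planner's description of how it would execute sql."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


# Assumption: SQLite stands in for HSQLDB; only one result table is modelled.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE ExperimentTasks_DoubleResults (
    resultId INTEGER NOT NULL,
    taskId INTEGER NOT NULL,
    value DOUBLE,
    PRIMARY KEY (resultId, taskId)
)""")

LOOKUP = ("SELECT resultId, value FROM ExperimentTasks_DoubleResults "
          "WHERE taskId = ?")

# Plan with only the composite primary key available.
plan_before = query_plan(conn, LOOKUP, (123,))

# Hypothetical index on taskId alone, as suggested in the comment above.
conn.execute("CREATE INDEX DoubleResults_TaskId "
             "ON ExperimentTasks_DoubleResults (taskId)")
plan_after = query_plan(conn, LOOKUP, (123,))

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two printed plans shows whether the new index is picked up for the taskId lookup; the same check would apply to the int and blob result tables.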

MichaelRoeder · 2018-06-29T10:53:51Z

True. According to the HSQLDB documentation, the database creates indexes for foreign keys. I updated the schema accordingly.

I think the indexes are simply missing at the moment, which causes these huge delays... 🤔

RicardoUsbeck · 2018-07-18T08:31:23Z

Can this be closed?

MichaelRoeder · 2018-07-26T14:32:45Z

Documentation:
Instead of a class, there is a SQL script for updating old databases: https://github.com/dice-group/gerbil/blob/master/src/main/resources/spring/database/schema/update-experiment-database.sql

MichaelRoeder added type:bug type:enhancement labels Jun 26, 2018

MichaelRoeder assigned nikit-srivastava Jun 26, 2018

nikit-srivastava mentioned this issue Jul 13, 2018: Issue 260 #261 (Merged)

RicardoUsbeck mentioned this issue Jul 18, 2018: [QA/normal] More insights into results #268 (Open, 1 task)

MichaelRoeder closed this as completed Jul 26, 2018
