Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase Database performance #260

Closed
2 tasks done
MichaelRoeder opened this issue Jun 26, 2018 · 7 comments
Closed
2 tasks done

Increase Database performance #260

MichaelRoeder opened this issue Jun 26, 2018 · 7 comments

Comments

@MichaelRoeder
Copy link
Member

MichaelRoeder commented Jun 26, 2018

Description

The current structure of the database leads to a very slow performance, e.g., some experiments can not even be loaded anymore (http://gerbil.aksw.org/gerbil/experiment?id=201603140002). We need to increase the performance by restructuring the database and make it more flexible for future additions of data.

This should take #79 into account.

Solution

  • Restructure the database
  • Offer a class that transforms old databases of GERBIL into the new format
@RicardoUsbeck
Copy link
Collaborator

RicardoUsbeck commented Jun 26, 2018

  1. Is there an ETA?
  2. How many experiments are in the current DB?
  3. Can you need help programming? I could probably organize something.
  4. Can we in the meantime increase the RAM of the java process to help?

@MichaelRoeder
Copy link
Member Author

MichaelRoeder commented Jun 27, 2018

  1. I don't think that it is even restricted.

Restarting the application was helpful. At least the experiment is shown, now. However, this is not a long term solution 😉

@MichaelRoeder
Copy link
Member Author

MichaelRoeder commented Jun 28, 2018

I have an idea in mind how the new DB could look like:

-- table for experiments remains as it is
CREATE TABLE IF NOT EXISTS Experiments (
  id VARCHAR(300) NOT NULL,
  taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
  PRIMARY KEY (id, taskId)
);

-- table for experiment tasks will be shortened. annotatorName has been renamed to systemName
CREATE TABLE IF NOT EXISTS ExperimentTasks (
id int GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1) PRIMARY KEY,
experimentType VARCHAR(20),
matching VARCHAR(50),
systemName VARCHAR(100),
datasetName VARCHAR(100),
state int,
lastChanged TIMESTAMP,
version VARCHAR(20)
);

-- index on experiment tasks won't be changed (unless somebody has an idea how to improve it ;) )
DROP INDEX IF EXISTS ExperimentTaskConfig;
CREATE INDEX ExperimentTaskConfig ON ExperimentTasks (matching,experimentType,annotatorName,datasetName);

-- ExperimentTasks_Version table will be dropped

-- ExperimentTasks_AdditionalResults will be renamed to ExperimentTasks_DoubleResults
CREATE TABLE IF NOT EXISTS ExperimentTasks_DoubleResults (
resultId int NOT NULL FOREIGN KEY REFERENCES ResultNames(id),
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
value double,
PRIMARY KEY (resultId, taskId)
);

-- New table added for int results (e.g., number of errors)
CREATE TABLE IF NOT EXISTS ExperimentTasks_IntResults (
resultId int NOT NULL FOREIGN KEY REFERENCES ResultNames(id),
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
value int,
PRIMARY KEY (resultId, taskId)
);

-- New table added for blob results (e.g., ROC)
CREATE TABLE IF NOT EXISTS ExperimentTasks_BlobResults (
resultId int NOT NULL FOREIGN KEY REFERENCES ResultNames(id),
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
value BLOB,
PRIMARY KEY (resultId, taskId)
);

-- New table added for mapping from resultId to resultName (optional but would make the solution cleaner)
CREATE TABLE IF NOT EXISTS ResultNames (
id int GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1) PRIMARY KEY,
name VARCHAR(50)
);

-- SubTask table remains the same
CREATE TABLE IF NOT EXISTS ExperimentTasks_SubTasks (
taskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
subTaskId int NOT NULL FOREIGN KEY REFERENCES ExperimentTasks(id),
PRIMARY KEY (taskId, subTaskId)
);

With the solution above we have a clear separation between the data every experiment task has and the results (which can vary between the different types of experiments). Additionally, we could query all the results for a given experiment (e.g., 123) easily with

SELECT type, name, value
FROM (
    SELECT 'double' AS type, name, value FROM ExperimentTasks_DoubleResults WHERE taskId=123
    UNION
    SELECT 'int' AS type, name, value FROM ExperimentTasks_IntResults WHERE taskId=123
    UNION
    SELECT 'blob' AS type, name, value FROM ExperimentTasks_BlobResults WHERE taskId=123
) JOIN ResultNames ON resultId

and the program code use the type to determine how to handle the content of value.

Opinions? Ideas? @RicardoUsbeck ?
We should also make sure that we have made usage of indexes where possible 🤔

@RicardoUsbeck
Copy link
Collaborator

RicardoUsbeck commented Jun 29, 2018

I very much like that solution.

  1. Maybe here we can change annotatorName to systemName
  2. Also, we should add more indices than the one above based on the queries in the system. For example for the query above, an index over taskID in all three tables would make sense.
  3. Will the database system be the same?

@MichaelRoeder
Copy link
Member Author

True. Following the HSQLDB documentation the database would create indexes for foreign keys. I updated the schema accordingly.

I think the indexes are simply missing at the moment which causes these huge delays... 🤔

@RicardoUsbeck
Copy link
Collaborator

Can this be closed?

@MichaelRoeder
Copy link
Member Author

Documentation:
Instead of a class, there is a SQL script for updating old databases: https://github.com/dice-group/gerbil/blob/master/src/main/resources/spring/database/schema/update-experiment-database.sql

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants