Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postprocessdeduplication #129

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

abhinavhegde
Copy link

Implementation for post process deduplication added. Sparql based solution has been given for storing hash values, graphId in metadata and fetching uris based on hash values and graph ids.
Also tests have been written for the same.
Dropping of graphs based on graph id and updation of with the old graph id also achieved.

abhinavhegde and others added 18 commits November 18, 2019 15:34
1. Triple links of old URI added to new URI.
2. sendTriples() added to interface AdvancedTripleBasedSink.
3. New Test class created for DeduplicatorComponent.
4. Refactored some code.
…integrated.

2. Test Cases written to test deduplication functionality.
…adata and the uris with same hashes are fetched. Also, tests are written for the same
…a, removing the duplicate graphs based on hash value check and updation of graph id of the new uris to the graph id of old uri.
Copy link
Member

@MichaelRoeder MichaelRoeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The work seems to progress into the right direction. I added some comments regarding possible improvements.

@@ -50,7 +50,7 @@ services:
# - ADMIN_PASSWORD=pw123
# - JVM_ARGS=-Xmx2g

mongodb:
db:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be careful with renaming the containers in the compose-file. Renaming this container for example will lead to a failure in the frontier. Please undo this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted.

@@ -39,6 +39,8 @@

public static final String URI_HASH_KEY = "HashValue";
public static final String UUID_KEY = "UUID";
public static final String URI_GRAPH = "graphID";
public static final String URI_DUPLICATE = "duplicate-uri";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URI_DUPLICATE shouldn't be necessary, right? 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted.

/**
Set<CrawleableUri> getUrisWithSameHashValues(String hashValue);

/**<
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a < symbol?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was a mistake, removed.

@@ -21,7 +21,9 @@
*/
Set<CrawleableUri> getUrisWithSameHashValues(Set<HashValue> hashValuesForComparison);

/**
Set<CrawleableUri> getUrisWithSameHashValues(String hashValue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Javadoc comment describing the method.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc added.

private String graphID;

private String hashValue;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these two new attributes and use the data map instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

queryString.append(uri + "> } } WHERE { GRAPH <");
queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <");
queryString.append(Squirrel.containsDataOf + "> <");
queryString.append(uri + "> } }");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ in append

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected.

queryString.append("SELECT ?subject WHERE { GRAPH <");
queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <");
queryString.append(Squirrel.containsDataOf + "> <");
queryString.append(uri.getUri().toString() + ">} }");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected.


protected UpdateExecutionFactory updateExecFactory = null;
protected static UpdateExecutionFactory updateExecFactory = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should the two factories be static? This shouldn't be necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

LOGGER.error("Exception during updating", ex);
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated above, these methods shouldn't be part of the sink.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SparqlBasedSink restored to earlier version.

if(numberOfTriples == 0) {
//in case of adding a triple at a later stage, numberOfTriples will be 0. So fetching it from the CrawlingActivity
numberOfTriples = ((CrawlingActivity) uri.getData().get(Constants.URI_CRAWLING_ACTIVITY)).getNumberOfTriples();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case of adding a triple at a later stage, numberOfTriples will be 0

That is not true. Please have a look at the usage of the TripleBuffer class in the AbstractBufferingTripleBasedSink. The setting the internal counter numberOfTriples to the value of the activity does not really make sense to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted.

DeduplicationSink renamed to SparqlBasedGraphHandler
getTriplesForGraph() moved to deduplication module since it has been used only there. AdvancedTripleBasedSink has been deleted since it only contained getTriplesForGraph().
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants