-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Postprocessdeduplication #129
base: master
Are you sure you want to change the base?
Postprocessdeduplication #129
Conversation
1. Triple links of old URI added to new URI. 2. sendTriples() added to interface AdvancedTripleBasedSink. 3. New Test class created for DeduplicatorComponent. 4. Refactored some code.
…connection and URI filter.
…integrated. 2. Test Cases written to test deduplication functionality.
…adata and the uris with same hashes are fetched. Also, tests are written for the same
…ostprocessdeduplication
…a, removing the duplicate graphs based on hash value check and updation of graph id of the new uris to the graph id of old uri.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The work seems to progress into the right direction. I added some comments regarding possible improvements.
docker-compose-sparql.yml
Outdated
@@ -50,7 +50,7 @@ services: | |||
# - ADMIN_PASSWORD=pw123 | |||
# - JVM_ARGS=-Xmx2g | |||
|
|||
mongodb: | |||
db: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please be careful with renaming the containers in the compose-file. Renaming this container for example will lead to a failure in the frontier. Please undo this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted.
@@ -39,6 +39,8 @@ | |||
|
|||
public static final String URI_HASH_KEY = "HashValue"; | |||
public static final String UUID_KEY = "UUID"; | |||
public static final String URI_GRAPH = "graphID"; | |||
public static final String URI_DUPLICATE = "duplicate-uri"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URI_DUPLICATE
shouldn't be necessary, right? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted.
/** | ||
Set<CrawleableUri> getUrisWithSameHashValues(String hashValue); | ||
|
||
/**< |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a <
symbol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was a mistake, removed.
@@ -21,7 +21,9 @@ | |||
*/ | |||
Set<CrawleableUri> getUrisWithSameHashValues(Set<HashValue> hashValuesForComparison); | |||
|
|||
/** | |||
Set<CrawleableUri> getUrisWithSameHashValues(String hashValue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a Javadoc comment describing the method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc added.
private String graphID; | ||
|
||
private String hashValue; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove these two new attributes and use the data map instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
queryString.append(uri + "> } } WHERE { GRAPH <"); | ||
queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <"); | ||
queryString.append(Squirrel.containsDataOf + "> <"); | ||
queryString.append(uri + "> } }"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+
in append
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected.
queryString.append("SELECT ?subject WHERE { GRAPH <"); | ||
queryString.append(Constants.DEFAULT_META_DATA_GRAPH_URI + "> { ?subject <"); | ||
queryString.append(Squirrel.containsDataOf + "> <"); | ||
queryString.append(uri.getUri().toString() + ">} }"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected.
|
||
protected UpdateExecutionFactory updateExecFactory = null; | ||
protected static UpdateExecutionFactory updateExecFactory = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should the two factories be static? This shouldn't be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
LOGGER.error("Exception during updating", ex); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated above, these methods shouldn't be part of the sink.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparqlBasedSink restored to earlier version.
if(numberOfTriples == 0) { | ||
//in case of adding a triple at a later stage, numberOfTriples will be 0. So fetching it from the CrawlingActivity | ||
numberOfTriples = ((CrawlingActivity) uri.getData().get(Constants.URI_CRAWLING_ACTIVITY)).getNumberOfTriples(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in case of adding a triple at a later stage, numberOfTriples will be 0
That is not true. Please have a look at the usage of the TripleBuffer class in the AbstractBufferingTripleBasedSink. The setting the internal counter numberOfTriples
to the value of the activity does not really make sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted.
DeduplicationSink renamed to SparqlBasedGraphHandler getTriplesForGraph() moved to deduplication module since it has been used only there. AdvancedTripleBasedSink has been deleted since it only contained getTriplesForGraph().
Implementation for post process deduplication added. Sparql based solution has been given for storing hash values, graphId in metadata and fetching uris based on hash values and graph ids.
Also tests have been written for the same.
Dropping of graphs based on graph id and updation of with the old graph id also achieved.