-
Notifications
You must be signed in to change notification settings - Fork 246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Task/refactor proto handler #1830
base: integration
Are you sure you want to change the base?
Conversation
…SecurityAgency/datawave into bugfix/edgePrecondition
Conflicts: warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
...t-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
Outdated
Show resolved
Hide resolved
...t-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
Outdated
Show resolved
Hide resolved
...t-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
Outdated
Show resolved
Hide resolved
...t-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
Outdated
Show resolved
Hide resolved
…alSecurityAgency/datawave into task/refactorProtoHandler
* Fix nested field group names, edge efficiencies * More efficient pattern check plus tests * formatter * modifications for trim and extract methods * PR feedback * Address #1905 more surgical approach to group trimming for edge generation * Address #1905 MR comments --------- Co-authored-by: hlgp <hlgp@users.noreply.github.com> Co-authored-by: hgklohr <hgklohr@gmail.com> Conflicts: warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDeleteModeTest.java
Conflicts: warehouse/ingest-core/src/main/java/datawave/ingest/data/TypeRegistry.java warehouse/ingest-core/src/main/java/datawave/ingest/data/normalizer/SimpleGroupFieldNameParser.java warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/define/EdgeDataBundle.java warehouse/ingest-core/src/test/java/datawave/ingest/data/normalizer/SimpleGroupFieldNameParserTest.java warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDeleteModeTest.java warehouse/ingest-csv/src/main/java/datawave/ingest/csv/config/helper/ExtendedCSVIngestHelper.java warehouse/ingest-json/src/main/java/datawave/ingest/json/config/helper/JsonIngestHelper.java
Conflicts: warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
Conflicts: warehouse/ingest-configuration/src/main/resources/config/edge-definitions.xml warehouse/ingest-configuration/src/main/resources/config/edge-ingest-config.xml warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/EdgeHandlerTestUtil.java warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgePreconditionTest.java warehouse/ingest-core/src/test/resources/config/edge-ingest-config.xml warehouse/ingest-csv/src/test/resources/config/ingest/edge-ingest-config.xml warehouse/ingest-json/src/test/resources/config/ingest/edge-ingest-config.xml
@@ -1,7 +1,10 @@ | |||
package datawave.ingest.mapreduce.handler.edge; | |||
|
|||
import java.io.IOException; | |||
import java.util.ArrayList; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new imports appear to be unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this and a few other suggested removals in https://github.com/NationalSecurityAgency/datawave/pull/2269/files
@@ -97,6 +99,17 @@ public EdgeDataBundle(EdgeDefinition edgeDef, NormalizedContentInterface ifaceSo | |||
this.initMarkings(getSource().getMarkings(), getSink().getMarkings()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor public EdgeDataBundle(EdgeDefinition edgeDef...
can be deleted now that it is unused. Its code now seems to exist in the data type handler's createEdge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this and a few other suggested removals in https://github.com/NationalSecurityAgency/datawave/pull/2269/files
@@ -983,28 +623,6 @@ private NormalizedContentInterface getNullKeyedNCI(String fieldValue, Multimap<S | |||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getNullKeyedNCI
in PEDTHandler.java is now unused and can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this and a few other suggested removals in https://github.com/NationalSecurityAgency/datawave/pull/2269/files
public void setEdgeDefinition(EdgeDefinition edgeDef) { | ||
this.edgeDefinition = edgeDef; | ||
this.edgeDirection = edgeDef.getDirection(); | ||
this.edgeType = edgeDef.getEdgeType().toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.edgeType = edgeDef.getEdgeType().toString(); | |
this.edgeType = edgeDef.getEdgeType(); |
@@ -243,141 +230,12 @@ public void setup(TaskAttemptContext context) { | |||
public long process(KEYIN key, RawRecordContainer event, Multimap<String,NormalizedContentInterface> fields, | |||
TaskInputOutputContext<KEYIN,? extends RawRecordContainer,KEYOUT,VALUEOUT> context, ContextWriter<KEYOUT,VALUEOUT> contextWriter) | |||
throws IOException, InterruptedException { | |||
long edgesCreated = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.deleteMode
(above) is never used. I'm wondering if MyDerivedProtobufEdgeDataTypeHandler could be removed.
|
||
trimFieldGroup = context.getConfiguration().getBoolean(typeName + TRIM_FIELD_GROUP, false); | ||
// Track metadata for this event | ||
eventMetadataRegistry = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest initializing this at its declaration and clearing the Map each time process is called, to avoid having to create a new HashMap in each call to process.
loadDateStr = getLoadDateString(fields); | ||
|
||
boolean trimGroups = context.getConfiguration().getBoolean(typeName + EdgeIngestConfiguration.TRIM_FIELD_GROUP, false); | ||
EdgeEventFieldUtil edgeEventFieldUtil = new EdgeEventFieldUtil(trimGroups); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also to avoid object creation and collection, I would consider creating a reusable EdgeEventFieldUtil and clearing it here instead of creating a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this suggestion is overkill. Either way, with the refactor it is probably worth observing performance impact. It would be cool if we had a JMH test handy for the datatype handler that would give us microbenchmarking information for process, given that it's called so much.
postProcessEdges(event, context, contextWriter, edgesCreated, baseEdgeBundle); | ||
|
||
return edgesCreated; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much cleaner, thank you!
} | ||
|
||
public EdgeDataBundle setEdgeInfoFromEventFields(EdgeDataBundle bundle, EdgeDefinitionConfigurationHelper edgeDefConfigs, RawRecordContainer event, | ||
EdgeIngestConfiguration edgeConfig, long newFormatStartDate, Configuration conf, String typeName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typeName
is unused. Perhaps it should be?
bundle.setDateType(dateType); | ||
} | ||
|
||
public EdgeDataBundle setEdgeInfoFromEventFields(EdgeDataBundle bundle, EdgeDefinitionConfigurationHelper edgeDefConfigs, RawRecordContainer event, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method returns the bundle it received by reference. The return value itself is never used. Perhaps the method should just create a new bundle from the provided event and drop the first parameter.
bundle.setEdgeAttribute2(getEdgeAttr2(edgeDefConfigs)); | ||
bundle.setEdgeAttribute3(getEdgeAttr3(edgeDefConfigs)); | ||
bundle.setRequiresMasking(event.isRequiresMasking()); | ||
bundle.setHelper(event.getDataType().getIngestHelper(conf)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking for opportunities to drop more method parameters - it looks like the conf
parameter is only used here. I suggest updating EdgeIngestConfiguration to save its constructor argument (which is the same as this conf, retrieved from context.getConfiguration()), so that it's only necessary to use EdgeIngestConfiguration.
import datawave.ingest.time.Now; | ||
import datawave.util.time.DateHelper; | ||
|
||
public class EdgeEventFieldUtil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this class. I wonder if the name could be improved, mostly because Util raises some flags in my mind - but naming is hard so honestly I'm not sure what's better. Its primary behavior is to extract edge information from fields. So maybe FieldToEdgeDataGenerator or EdgeFieldExtractor. Feel free to ignore.
Conflicts: warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
Some efficiency gains by not performing JexlEvaluation for all edge definitions on an event that doesn't even contain the necessary fields, general cleanup, etc.