Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task/refactor proto handler #1830

Open
wants to merge 34 commits into
base: integration
Choose a base branch
from
Open

Conversation

hlgp
Copy link
Collaborator

@hlgp hlgp commented Jan 31, 2023

Some efficiency gains by not performing JexlEvaluation for all edge definitions on an event that doesn't even contain the necessary fields, general cleanup, etc.

matthpeterson and others added 13 commits February 28, 2023 12:30
* Fix nested field group names, edge efficiencies

* More efficient pattern check plus tests

* formatter

* modifications for trim and extract methods

* PR feedback

* Address #1905 more surgical approach to group trimming for edge generation

* Address #1905 MR comments

---------

Co-authored-by: hlgp <hlgp@users.noreply.github.com>
Co-authored-by: hgklohr <hgklohr@gmail.com>
Conflicts:
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDeleteModeTest.java
Conflicts:
	warehouse/ingest-core/src/main/java/datawave/ingest/data/TypeRegistry.java
	warehouse/ingest-core/src/main/java/datawave/ingest/data/normalizer/SimpleGroupFieldNameParser.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/define/EdgeDataBundle.java
	warehouse/ingest-core/src/test/java/datawave/ingest/data/normalizer/SimpleGroupFieldNameParserTest.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDeleteModeTest.java
	warehouse/ingest-csv/src/main/java/datawave/ingest/csv/config/helper/ExtendedCSVIngestHelper.java
	warehouse/ingest-json/src/main/java/datawave/ingest/json/config/helper/JsonIngestHelper.java
Conflicts:
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
hlgp and others added 2 commits January 12, 2024 17:32
Conflicts:
	warehouse/ingest-configuration/src/main/resources/config/edge-definitions.xml
	warehouse/ingest-configuration/src/main/resources/config/edge-ingest-config.xml
	warehouse/ingest-core/src/main/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgeDataTypeHandler.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/EdgeHandlerTestUtil.java
	warehouse/ingest-core/src/test/java/datawave/ingest/mapreduce/handler/edge/ProtobufEdgePreconditionTest.java
	warehouse/ingest-core/src/test/resources/config/edge-ingest-config.xml
	warehouse/ingest-csv/src/test/resources/config/ingest/edge-ingest-config.xml
	warehouse/ingest-json/src/test/resources/config/ingest/edge-ingest-config.xml
@hlgp hlgp changed the title WIP: Task/refactor proto handler Task/refactor proto handler Jan 24, 2024
@@ -1,7 +1,10 @@
package datawave.ingest.mapreduce.handler.edge;

import java.io.IOException;
import java.util.ArrayList;
Copy link
Collaborator

@matthpeterson matthpeterson Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new imports appear to be unused.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this and a few other suggested removals in https://github.com/NationalSecurityAgency/datawave/pull/2269/files

@@ -97,6 +99,17 @@ public EdgeDataBundle(EdgeDefinition edgeDef, NormalizedContentInterface ifaceSo
this.initMarkings(getSource().getMarkings(), getSink().getMarkings());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor public EdgeDataBundle(EdgeDefinition edgeDef... can be deleted now that it is unused. Its code now seems to exist in the data type handler's createEdge.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this and a few other suggested removals in https://github.com/NationalSecurityAgency/datawave/pull/2269/files

@@ -983,28 +623,6 @@ private NormalizedContentInterface getNullKeyedNCI(String fieldValue, Multimap<S
return null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getNullKeyedNCI in PEDTHandler.java is now unused and can be deleted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put this and a few other suggested removals in https://github.com/NationalSecurityAgency/datawave/pull/2269/files

public void setEdgeDefinition(EdgeDefinition edgeDef) {
this.edgeDefinition = edgeDef;
this.edgeDirection = edgeDef.getDirection();
this.edgeType = edgeDef.getEdgeType().toString();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.edgeType = edgeDef.getEdgeType().toString();
this.edgeType = edgeDef.getEdgeType();

@@ -243,141 +230,12 @@ public void setup(TaskAttemptContext context) {
public long process(KEYIN key, RawRecordContainer event, Multimap<String,NormalizedContentInterface> fields,
TaskInputOutputContext<KEYIN,? extends RawRecordContainer,KEYOUT,VALUEOUT> context, ContextWriter<KEYOUT,VALUEOUT> contextWriter)
throws IOException, InterruptedException {
long edgesCreated = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.deleteMode (above) is never used. I'm wondering if MyDerivedProtobufEdgeDataTypeHandler could be removed.


trimFieldGroup = context.getConfiguration().getBoolean(typeName + TRIM_FIELD_GROUP, false);
// Track metadata for this event
eventMetadataRegistry = new HashMap<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest initializing this at its declaration and clearing the Map each time process is called, to avoid having to create a new HashMap in each call to process.

loadDateStr = getLoadDateString(fields);

boolean trimGroups = context.getConfiguration().getBoolean(typeName + EdgeIngestConfiguration.TRIM_FIELD_GROUP, false);
EdgeEventFieldUtil edgeEventFieldUtil = new EdgeEventFieldUtil(trimGroups);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to avoid object creation and collection, I would consider creating a reusable EdgeEventFieldUtil and clearing it here instead of creating a new one.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this suggestion is overkill. Either way, with the refactor it is probably worth observing performance impact. It would be cool if we had a JMH test handy for the datatype handler that would give us microbenchmarking information for process, given that it's called so much.

postProcessEdges(event, context, contextWriter, edgesCreated, baseEdgeBundle);

return edgesCreated;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much cleaner, thank you!

}

public EdgeDataBundle setEdgeInfoFromEventFields(EdgeDataBundle bundle, EdgeDefinitionConfigurationHelper edgeDefConfigs, RawRecordContainer event,
EdgeIngestConfiguration edgeConfig, long newFormatStartDate, Configuration conf, String typeName) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typeName is unused. Perhaps it should be?

bundle.setDateType(dateType);
}

public EdgeDataBundle setEdgeInfoFromEventFields(EdgeDataBundle bundle, EdgeDefinitionConfigurationHelper edgeDefConfigs, RawRecordContainer event,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method returns the bundle it received by reference. The return value itself is never used. Perhaps the method should just create a new bundle from the provided event and drop the first parameter.

bundle.setEdgeAttribute2(getEdgeAttr2(edgeDefConfigs));
bundle.setEdgeAttribute3(getEdgeAttr3(edgeDefConfigs));
bundle.setRequiresMasking(event.isRequiresMasking());
bundle.setHelper(event.getDataType().getIngestHelper(conf));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking for opportunities to drop more method parameters - it looks like the conf parameter is only used here. I suggest updating EdgeIngestConfiguration to save its constructor argument (which is the same as this conf, retrieved from context.getConfiguration()), so that it's only necessary to use EdgeIngestConfiguration.

import datawave.ingest.time.Now;
import datawave.util.time.DateHelper;

public class EdgeEventFieldUtil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this class. I wonder if the name could be improved, mostly because Util raises some flags in my mind - but naming is hard so honestly I'm not sure what's better. Its primary behavior is to extract edge information from fields. So maybe FieldToEdgeDataGenerator or EdgeFieldExtractor. Feel free to ignore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants