Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5191] Fix compatibility with avro 1.10 #7175

Merged
merged 5 commits into from
Nov 12, 2022

Conversation

Zouxxyy
Copy link
Contributor

@Zouxxyy Zouxxyy commented Nov 10, 2022

Change Logs

After spark3.2, avro was upgraded to 1.10.2+, some functions changed, which which can cause some test failures.
E.g:

  1. If the Key is not found, GenericData#get will throw a  AvroRuntimeException("Not a valid schema field: " + key), while it will  return null before;

  2. When Schema is initialized, the default value is checked for validity;

  3. Schema object's size changed;

Impact

Fix test cases with avro 1.10.2+
Add hudi-commen test in CI

Risk level (write none, low medium or high below)

low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@Zouxxyy Zouxxyy changed the title [5191] Fix avro compatibility with spark3.2+ [HUDI-5191] Fix avro compatibility with spark3.2+ Nov 10, 2022
@Zouxxyy Zouxxyy changed the title [HUDI-5191] Fix avro compatibility with spark3.2+ [HUDI-5191] Fix avro compatibility with 1.10 Nov 10, 2022
@Zouxxyy Zouxxyy changed the title [HUDI-5191] Fix avro compatibility with 1.10 [HUDI-5191] Fix compatibility with avro 1.10 Nov 10, 2022
@YannByron
Copy link
Contributor

nice work. Checks and CI haven't covered spark-common and spark-client yet. cc @xushiyan

@@ -253,7 +254,13 @@ public void testRemoveFields() {
assertEquals("key1", rec1.get("_row_key"));
assertEquals("val1", rec1.get("non_pii_col"));
assertEquals(3.5, rec1.get("timestamp"));
assertNull(rec1.get("pii_col"));
Object removed;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe test by the different avro version, like:

if (HoodieAvroUtils.gteqAvro1_10()) {
  // assert to throw AvroRuntimeException.
  // some codes...
} else {
    assertNull(rec1.get("pii_col"));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

new Schema.Field("name", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
new Schema.Field("age", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE),
new Schema.Field("job", Schema.createUnion(Schema.create(Schema.Type.STRING), Schema.create(Schema.Type.NULL)), "", JsonProperties.NULL_VALUE)
new Schema.Field("id", Schema.createUnion(Schema.create(Schema.Type.NULL), Schema.create(Schema.Type.STRING)), "", JsonProperties.NULL_VALUE),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need to change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After avro 1.10, the defaultValue is checked when the Schema is initialized, and here are the rules of the union

Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.
(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing “null”, the “null” is usually listed first, since the default value of such unions is typically null.)

@@ -55,7 +55,7 @@ public class TestPartialUpdateAvroPayload {
+ " {\"name\": \"id\", \"type\": [\"null\", \"string\"]},\n"
+ " {\"name\": \"partition\", \"type\": [\"null\", \"string\"]},\n"
+ " {\"name\": \"ts\", \"type\": [\"null\", \"long\"]},\n"
+ " {\"name\": \"_hoodie_is_deleted\", \"type\": [\"null\", \"boolean\"], \"default\":false},\n"
+ " {\"name\": \"_hoodie_is_deleted\", \"type\": [\"boolean\", \"null\"], \"default\":false},\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually _hoodie_is_deleted should never be null. let's make it just boolean type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually _hoodie_is_deleted should never be null. let's make it just boolean type?

done

@@ -73,7 +74,8 @@ public void testGetObjectSize() {
assertEquals(32, getObjectSize(emptyClass));
assertEquals(40, getObjectSize(stringClass));
assertEquals(40, getObjectSize(payloadClass));
assertEquals(1240, getObjectSize(Schema.create(Schema.Type.STRING)));
assertEquals(HoodieAvroUtils.gteqAvro1_10() ? 1320 : 1240,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please comment the avro pr to explain this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Object value;
try {
value = record.get(DebeziumConstants.FLATTENED_OP_COL_NAME);
} catch (AvroRuntimeException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe wrap a static method that have a param to indicate whether return null if key is not exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great to cover these in CI. can you add a section in .github/workflows/bot.yml before Spark SQL Test ? we should run mvn test -Punit-tests -pl hudi-common for these changes

@@ -55,7 +55,7 @@ public class TestPartialUpdateAvroPayload {
+ " {\"name\": \"id\", \"type\": [\"null\", \"string\"]},\n"
+ " {\"name\": \"partition\", \"type\": [\"null\", \"string\"]},\n"
+ " {\"name\": \"ts\", \"type\": [\"null\", \"long\"]},\n"
+ " {\"name\": \"_hoodie_is_deleted\", \"type\": [\"null\", \"boolean\"], \"default\":false},\n"
+ " {\"name\": \"_hoodie_is_deleted\", \"type\": [\"boolean\", \"null\"], \"default\":false},\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually _hoodie_is_deleted should never be null. let's make it just boolean type?

@Zouxxyy
Copy link
Contributor Author

Zouxxyy commented Nov 12, 2022

would be great to cover these in CI. can you add a section in .github/workflows/bot.yml before Spark SQL Test ? we should run mvn test -Punit-tests -pl hudi-common for these changes

@xushiyan done

@apache apache deleted a comment from hudi-bot Nov 12, 2022
@xushiyan xushiyan added priority:critical production down; pipelines stalled; Need help asap. spark Issues related to spark labels Nov 12, 2022
@xushiyan xushiyan self-assigned this Nov 12, 2022
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@YannByron
Copy link
Contributor

LGTM. Let's land this.

@YannByron YannByron merged commit 4da18ab into apache:master Nov 12, 2022
satishkotha pushed a commit to satishkotha/incubator-hudi that referenced this pull request Dec 12, 2022
[HUDI-5191] Fix avro compatibility with spark3.2+
@@ -73,6 +73,14 @@ jobs:
run: |
HUDI_VERSION=$(mvn help:evaluate -Dexpression=project.version -q -DforceStdout)
./packaging/bundle-validation/ci_run.sh $HUDI_VERSION
- name: Common Test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zouxxyy what are we specifically looking for to be tested in here? We need to be careful in expanding the scope here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the cmd, it seems to test the hudi-common module specifically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykudinkin Because the patch repairs the avro compatibility of the test cases in the hudi-common module under spark3, the test of the hudi-common module is added in bot

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zouxxyy my point being -- hudi-common components (or tests) should not depend on particular Spark version. If something like in this case relies on particular Spark version (or other dependency that Spark brings) we should move it from hudi-common to hudi-spark

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a deeper issue with this - hudi common is tightly coupled with avro models, which variates wrt spark profiles. currently hudi-common jar won't be compatible across all engine profiles: e.g., if built with spark3.3 (avro 1.11), it won't work with spark 2 (avro 1.8) or flink (avro 1.10). this needs to be decoupled first.

alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
[HUDI-5191] Fix avro compatibility with spark3.2+
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
[HUDI-5191] Fix avro compatibility with spark3.2+
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
[HUDI-5191] Fix avro compatibility with spark3.2+
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
[HUDI-5191] Fix avro compatibility with spark3.2+
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
[HUDI-5191] Fix avro compatibility with spark3.2+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:critical production down; pipelines stalled; Need help asap. spark Issues related to spark
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants