
[Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type #353

Merged
2 commits merged into apache:main on Feb 1, 2024

Conversation

@acezen (Contributor) commented Feb 1, 2024

Proposed changes

Related issue: #347

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc...

Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>
```scala
val person_df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("query", "MATCH (n:Person) RETURN n.name AS name, n.born as born")
  .option("schema.flatten.limit", 1)
```
Contributor

I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?

@acezen (Contributor Author) commented Feb 1, 2024

We cannot avoid that case, since Neo4j deduces the schema from the query result and we cannot change Neo4j's mechanism; short of using APOC, this is the best solution we can provide for the example.

What test case do you suggest adding?

Contributor

My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.
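(For illustration, a minimal sketch of the scenario being discussed; the Cypher setup and assertion are hypothetical and assume the read shown in the example above, where only the first sampled record drives the inference:)

```scala
// Hypothetical data: the first Person has no `born` property (returned as null),
// the second has a Long value.
//   CREATE (:Person {name: 'Alice'});
//   CREATE (:Person {name: 'Bob', born: 1964});
import org.apache.spark.sql.types.LongType

// With schema.flatten.limit = 1, check what the connector inferred for `born`.
// This may fail if the record with the null value happened to be sampled first.
assert(person_df.schema("born").dataType == LongType)
```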

@acezen (Contributor Author) commented Feb 1, 2024

The schema.strategy makes Neo4j sample schema.flatten.limit records to infer the type. This change simply lowers schema.flatten.limit (the default is 10) so that Neo4j samples the right record as far as possible (we assume that correct records are in the majority). Sadly, if we add a test, we cannot guarantee it would always sample a Long record.

Contributor

I understand Neo4j's current approach to type inference, and that correct records are in the majority. Is it possible to use the first non-null value for type inference? That could make the current schema more robust.

@acezen (Contributor Author)

As far as I know, Neo4j does not provide such an approach to inferring the data type, except through APOC (but the APOC feature is only available in the Enterprise Edition).

Contributor

I think a possible way is to apply Spark's cast method with a DataType to the relevant column of the DataFrame. If implementing this is deemed too complex for our current scope, we can record the issue for future reference and potential enhancement.
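(A minimal sketch of the cast being suggested; the column name and target type are just examples carried over from the snippet above:)

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Explicitly cast the column to the expected type, regardless of what the connector inferred.
val person_df_fixed = person_df.withColumn("born", col("born").cast(LongType))
```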

@acezen (Contributor Author) commented Feb 1, 2024

In my opinion, it's Neo4j's duty to ensure the schema is correct (and putting that feature in the Enterprise Edition, so that users pay for it, is reasonable). We should not pull that responsibility into GraphAr; it's outside GraphAr's scope. GraphAr just needs to make sure the schema stays consistent from the DataFrame to the GraphAr files.

Contributor

I understand your perspective. With respect to the schema.flatten.limit setting, although adjusting it might help in this particular scenario, and I agree with modifying it in this example, I don't think it signifies a bug within GraphAr.

@acezen (Contributor Author)

Agreed, I will retitle the PR as an improvement.

@acezen changed the title from "[BugFix][Spark] Fix neo4j generate DataFrame got wrong data type bug" to "[BugFix][Spark] Try to fix neo4j generate DataFrame got wrong data type bug" on Feb 1, 2024
@acezen changed the title from "[BugFix][Spark] Try to fix neo4j generate DataFrame got wrong data type bug" to "[Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type" on Feb 1, 2024
@acezen merged commit 4889596 into apache:main on Feb 1, 2024
4 checks passed
@acezen deleted the 347-neo4j2graph-fix branch on February 1, 2024 at 09:23