[Improvement][Spark] Try to make neo4j generate DataFrame with the correct data type #353
Conversation
Signed-off-by: acezen <qiaozi.zwb@alibaba-inc.com>
val person_df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("query", "MATCH (n:Person) RETURN n.name AS name, n.born as born")
  .option("schema.flatten.limit", 1)
I'm concerned that if the first value is null, the returned result might be incorrect. Could we add a test case to verify this case?
We cannot avoid such a case, since Neo4j uses the result set to deduce the schema and we cannot change Neo4j's mechanism; this is the best solution we can provide for the example (short of using APOC).
What case do you suggest we add a test for?
My concern is specifically with cases where the first value is null, but subsequent values are of type Long. Ideally, we should still infer the type as Long. However, I suspect that with schema.flatten.limit=1, the inferred type might default to String. I recommend we add a test to confirm the actual behavior in this scenario.
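A minimal sketch of the check I have in mind, assuming a local Neo4j at bolt://localhost:7687 that is already seeded so the first Person returned has a null born (the URL and credentials below are illustrative assumptions, not the project's test setup):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Read with the lowered sampling limit; the first sampled record is assumed to have a null born.
val df = spark.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "bolt://localhost:7687")            // assumed local instance
  .option("authentication.basic.username", "neo4j")  // illustrative credentials
  .option("authentication.basic.password", "neo4j")
  .option("query", "MATCH (n:Person) RETURN n.name AS name, n.born AS born")
  .option("schema.flatten.limit", 1)
  .load()

// Inspect what the connector inferred for born; the concern is that it may
// fall back to StringType instead of LongType when the sampled value is null.
println(df.schema("born").dataType)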
The schema.strategy makes Neo4j sample schema.flatten.limit records to infer the type. This change just lowers schema.flatten.limit (the default is 10) so that Neo4j samples the right record as far as possible (we assume that correct records are in the majority). Sadly, we cannot guarantee that the test would always sample a Long record if we add one.
I understand Neo4j's current approach to type inference, and that correct records are in the majority. Is it possible to use the first non-null value for type inference? This could enhance the robustness of the current schema.
As far as I know, Neo4j does not provide such an approach to inferring the data type, except via APOC (but the APOC feature is only available in the Enterprise Edition).
I think a possible way is to apply Spark's cast method, with the target DataType, on the column of the DataFrame. If implementing this solution is deemed too complex for our current scope, we can record this issue for future reference and potential enhancement.
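A quick sketch of that casting approach, reusing person_df from the example above (the target type LongType for born is an assumption based on the sample query):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Force the column to the type expected by the GraphAr schema, regardless of
// what the Neo4j connector inferred during sampling.
val fixed_df = person_df.withColumn("born", col("born").cast(LongType))
fixed_df.printSchema()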
In my opinion, it's Neo4j's duty to ensure the schema is correct (and putting that feature in the Enterprise Edition, so users pay for it, is reasonable); we should not pull that work into GraphAr, as it's outside GraphAr's scope. GraphAr just needs to make sure that the schema stays consistent from the DataFrame to the GraphAr files.
I understand your perspective. With respect to the schema.flatten.limit setting, although adjusting it might help in this particular scenario, and I agree with modifying it in this example, I think it does not signify a bug within GraphAr.
Agreed, I would retitle the PR as an improvement.
Proposed changes
Related issue: #347
Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.
Further comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc...