Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use parameters to allow special characters in neo4j cypher statement #382

Merged
merged 1 commit into from
Oct 23, 2020

Conversation

rogertangcn
Copy link
Contributor

@rogertangcn rogertangcn commented Oct 14, 2020

Summary of Changes

This change fixes the issue of Use query parameters rather than string building in Neo4jCsvPublisher

  • Replace csv package with pandas so that it can infer data type out of csv files
  • Replace string.Template with jinja2.Template so that it can support conditional tokens.
  • Use parameters instead of string concatenation to construct cypher statement

Tests

N/A

Documentation

N/A

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

@feng-tao feng-tao added the keep fresh Disables stalebot from closing an issue label Oct 14, 2020
@@ -232,7 +212,7 @@ def _create_indices(self, node_file: str) -> None:
LOGGER.info('Creating indices. (Existing indices will be ignored)')

with open(node_file, 'r', encoding='utf8') as node_csv:
for node_record in csv.DictReader(node_csv):
for node_record in pandas.read_csv(node_csv).to_dict(orient='records'):
Copy link
Contributor Author

@rogertangcn rogertangcn Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Record field types are string when using csv.DictReader which is inadequate when switching to use parameter. It will create issue that property in numeric type in neo4j is suppled with string value.

pandas has ability to infer data type when reading csv. Parameters generated from pandas record for neo4j will have correct type that can prevent the issue.

template = Template("""
MERGE (node:{{ LABEL }} {key: $KEY})
ON CREATE SET {{ PROP_BODY }}
{% if update %} ON MATCH SET {{ PROP_BODY }} {% endif %}
Copy link
Contributor Author

@rogertangcn rogertangcn Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace string.Template with jinja.Template in which if-condition is supported. Putting condition clause in template is to help readability

Also, {{ }} syntax in jinja is different from $VAR syntax in cypher (which is also the syntax for string.Template, source of confusion). The distinction is helpful for readability too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the info

@rogertangcn rogertangcn changed the title fix - use parameters to allow special characters in neo4j cypher statement fix: use parameters to allow special characters in neo4j cypher statement Oct 14, 2020

template_params[k] = v
props.append('{id}.{key} = {val}'.format(id=identifier, key=k, val=f'${k}'))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The property value is taken out wherein a param token (i.e. $param) is put in.

@@ -3,3 +3,4 @@ hive,gold,test_schema,test_table1,"1st test table","tag1,tag2,pii,high_quality",
dynamo,gold,test_schema,test_table2,"2nd test table","high_quality,recommended",false,
hive,gold,test_schema,test_view1,"1st test view","tag1",true,
hive,gold,test_schema,test_table3,"3rd test","needs_documentation",false,
hive,gold,test_schema,"test's_table4","4th test","needs_documentation",false,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test data from here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rogertangcn rogertangcn force-pushed the special-char-as-params branch 4 times, most recently from 3cb86fe to d7bb9df Compare October 14, 2020 23:34
@feng-tao
Copy link
Member

@allisonsuarez will do some test in Lyft staging.

Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @allisonsuarez for quick test, lgtm as well with a few small nits

# Setting field_size_limit to solve the error below
# _csv.Error: field larger than field limit (131072)
# https://stackoverflow.com/a/54517228/5972935
csv.field_size_limit(int(ctypes.c_ulong(-1).value // 2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, we no longer set this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. csv module is no longer used in this module therefore taking it out.

I'm wondering if csv.field_size_limit() is a global setting. If that's the case, it could impact other modules that uses csv though. I will put it back to make sure it's backward compatible. It could be better if it's moved into somewhere that is "global" later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, let's keep it here for now and we could remove it later. cc @jinhyukchang

MERGE (n1)-[r1:$TYPE]->(n2)-[r2:$REVERSE_TYPE]->(n1)
$PROP_STMT RETURN n1.key, n2.key""")

CREATE_UNIQUE_INDEX_TEMPLATE = Template('CREATE CONSTRAINT ON (node:${LABEL}) ASSERT node.key IS UNIQUE')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @rogertangcn , any reason to delete all the templates constant here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, nvm, read your latter comment

template = Template("""
MERGE (node:{{ LABEL }} {key: $KEY})
ON CREATE SET {{ PROP_BODY }}
{% if update %} ON MATCH SET {{ PROP_BODY }} {% endif %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the info

requirements.txt Outdated
@@ -57,6 +57,8 @@ unicodecsv==0.14.1,<1.0

httplib2>=0.18.0
unidecode
Jinja2==2.11.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the databuilder is a lib, we should make it a range

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about larger than 2.10.0 and less than 2.12 for now

requirements.txt Outdated
@@ -57,6 +57,8 @@ unicodecsv==0.14.1,<1.0

httplib2>=0.18.0
unidecode
Jinja2==2.11.2
pandas==1.1.3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about larger than 0.21.0 and less than 1.2.0

@@ -3,3 +3,4 @@ hive,gold,test_schema,test_table1,"1st test table","tag1,tag2,pii,high_quality",
dynamo,gold,test_schema,test_table2,"2nd test table","high_quality,recommended",false,
hive,gold,test_schema,test_view1,"1st test view","tag1",true,
hive,gold,test_schema,test_table3,"3rd test","needs_documentation",false,
hive,gold,test_schema,"test's_table4","4th test","needs_documentation",false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Roger Tang <roger.tang@workday.com>
@rogertangcn rogertangcn force-pushed the special-char-as-params branch from d7bb9df to 3653b25 Compare October 23, 2020 05:32
@feng-tao
Copy link
Member

thanks @rogertangcn for the update! @instazackwu the pr is going to merge, let us know if it resolves your issue!

@feng-tao feng-tao merged commit 6fd5035 into amundsen-io:master Oct 23, 2020
@rogertangcn rogertangcn deleted the special-char-as-params branch October 23, 2020 16:59
dikshathakur3119 pushed a commit that referenced this pull request Oct 26, 2020
…ement (#382)

Signed-off-by: Roger Tang <roger.tang@workday.com>

Co-authored-by: Roger Tang <roger.tang@workday.com>
Signed-off-by: dikshathakur3119 <dikshathakur@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep fresh Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants