Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: tweaks for gremlin support #60

Merged
merged 3 commits into from
Sep 17, 2020

Conversation

friendtocephalopods
Copy link
Contributor

@friendtocephalopods friendtocephalopods commented Sep 14, 2020

Summary of Changes

add optional key parameter to column, table. rename Statistics class to Stat.
add fixtures, streaming, utils to common.

Update mypy to 761 to fix false errors.
Added ignore_missing_imports to ignore marshmallow dependency typing complaints.
Fix flake8 complaints about extra line in init.py files.
Added type ignore for thorny mypy complaint in User.py

As with the rest of the gremlin code I'm upstreaming, this was a team effort. Kudos to @dsimms, @alran, @worldwise001.

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.

add optional key parameter to column, table. rename Statistics -> Stat.
add fixtures, streaming, utils to common.
Update mypy to 761 to fix false errors.
Fix flake8 complaints about extra line in __init__.py files

Signed-off-by: Joshua Hoskins <hoskins@squareup.com>


class Fixtures:
counter = 1000
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to add some documentation somewhere on how to use this Fixtures class (a summary of some kind so that people don't have to read all the code to figure it out)

Copy link
Contributor Author

@friendtocephalopods friendtocephalopods Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • document fixtures

@alran
Copy link

alran commented Sep 15, 2020

This looks really good to me @friendtocephalopods! Had one comment about potentially adding more documentation for the casual browser. Also I know that I have tried to upstream that Statistics -> Stat change before. I'd love to see it go through :)

@cpu
Copy link

cpu commented Sep 16, 2020

This looked good to me as well. Nice work!

One top-level question: why the rename of the Statistics class? It might be nice to call out the rationale for that change since it's less obvious than the other diffs in this branch (imo).

@friendtocephalopods
Copy link
Contributor Author

This looked good to me as well. Nice work!

One top-level question: why the rename of the Statistics class? It might be nice to call out the rationale for that change since it's less obvious than the other diffs in this branch (imo).

Oh, good question! Each Statistics class just contains just one statistic. Additionally, the parameters are all named stat_x and the field is referenced as a list as stats -- so Stat seems a bit more appropriate.

Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for putting up the pr. haven't time to go through, put some initial comments, let me know what you think.

from amundsen_common.models.user import User


class Fixtures:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fixtures only for testing purpose? if so, we should move it under the tests folder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, it's for testing! But it's intended to be useful in multiple repos to create test objects easily. I don't think the tests/ folder gets carried over cleanly when importing in other repos.

Maybe amundsen_common/tests/fixtures.py would be preferable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, amundsen_common/tests/fixtures.py would be better.

Copy link
Contributor Author

@friendtocephalopods friendtocephalopods Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • amundsen_common/tests/fixtures.py

@@ -61,26 +61,27 @@ class Meta:


@attr.s(auto_attribs=True, kw_only=True)
class Statistics:
class Stat:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, this will break all the other services (https://github.com/amundsen-io/amundsenmetadatalibrary/search?q=Statistics&unscoped_q=Statistics) . I know this is not perfect. But let's keep it as it is and do it differently (e.g add another class named Stat and mark Statistics deprecated etc)

Copy link
Contributor Author

@friendtocephalopods friendtocephalopods Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the suggestion is to change it all over to a new Stat class here and only remove Statistics after references are gone in the other repos? Sure, that works!

  • new class, not rename

stat_type: str
stat_val: Optional[str] = None
start_epoch: Optional[int] = None
end_epoch: Optional[int] = None


class StatisticsSchema(AttrsSchema):
class StatSchema(AttrsSchema):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for the rename

@@ -151,6 +153,7 @@ class Table:
cluster: str
schema: str
name: str
key: Optional[str] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you share more what the key is for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! This is the table_uri. We store it on the object in the graph for indexing anyhow, so this saves us from having to recompute it everywhere each time we fetch it.

If we fully roll this out and make it non-optional, it also lets us not recompute the key downstream in the frontend as well: https://github.com/amundsen-io/amundsenfrontendlibrary/blob/ee0474fdcce6c8c1b01c9032b5cd2750b1d0d92c/amundsen_application/api/utils/metadata_utils.py#L110

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

return 1


class PeekingIterator(Iterator[V]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I don't have much context for all the code. If it is for gremlin lib, could we move it to https://github.com/amundsen-io/amundsengremlin ? The common lib is only used to store common model for all the repos. Let me know what you thinks.

Copy link
Contributor Author

@friendtocephalopods friendtocephalopods Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah...I was wondering myself if this is useful in anyone else's databuilder. We use this in the gremlin code, but also for some of our databuilder jobs to chunk the results (e.g. groups of 1,000 or 10,000. Unlimited size batches can be problematic in some circumstances). Do y'all ever use airflow to explicitly set batch sizes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We define the batch / transaction size through config during neo4j ingestion (https://github.com/amundsen-io/amundsendatabuilder/blob/master/databuilder/publisher/neo4j_csv_publisher.py#L95). Otherwise the neo4j will hang without batch size setting. So given we haven't had a clear use case, I would suggest moving the code to gremlin repo? we could always use gremlin as an extra library deps in databuilder if we plan to enhance and build neptune publisher there. cc @AndrewCiambrone

Copy link
Contributor Author

@friendtocephalopods friendtocephalopods Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, sounds good to me! Thanks for the colour, Tao!

  • streams to gremlin repo

Remove streams
Move fixtures to amundsen_common/tests
Stat as additional class, deprecate Statistics

Signed-off-by: Joshua Hoskins <hoskins@squareup.com>
Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm mostly, a few more nits, let me know what you think

setup.cfg Outdated
@@ -35,6 +35,7 @@ check_untyped_defs = true
disallow_any_generics = true
disallow_incomplete_defs = true
disallow_untyped_defs = true
ignore_missing_imports = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you share more on why we set this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course! When I run mypy on the entire repo, I get it complaining at me about three libraries missing stubs.

e.g.

setup.py:4: error: No library stub file for module 'setuptools'
amundsen_common/models/user.py:7: error: Cannot find implementation or library stub for module named 'marshmallow'
amundsen_common/models/user.py:8: error: Cannot find implementation or library stub for module named 'marshmallow_annotations.ext.attrs'

I can set it specifically for those libraries in the cfg though, which is a less stringent option.

# Copyright Contributors to the Amundsen project.
# SPDX-License-Identifier: Apache-2.0

from typing import Optional, TypeVar
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we move the utils folder to gremlin as well ?

Copy link
Contributor Author

@friendtocephalopods friendtocephalopods Sep 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

  • move the utils folder to gremlin

@@ -0,0 +1,2 @@
# Copyright Contributors to the Amundsen project.
# SPDX-License-Identifier: Apache-2.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -151,6 +153,7 @@ class Table:
cluster: str
schema: str
name: str
key: Optional[str] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

remove utils, make mypy ignore missing imports module-specific

Signed-off-by: Joshua Hoskins <hoskins@squareup.com>
@feng-tao
Copy link
Member

thanks

@feng-tao feng-tao merged commit 1a2733b into amundsen-io:master Sep 17, 2020
@alran
Copy link

alran commented Sep 17, 2020

🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants