New helper module with some reusable functions #724

Closed
wants to merge 2 commits into from

Collaborator

@shaunagm shaunagm commented Jul 28, 2022

I would like for Parsons to include more reusable code which helps users with transformations and analysis. I have placed some examples in a new etl_helpers folder.

I am looking for feedback on:

  1. the overall goal of providing more helper code for users
  2. the architecture here. Is etl_helpers a decent name? I wanted to distinguish it from the existing utilities, which is primarily for code that helps Parsons contributors enhance Parsons connectors.
  3. the specific helper functions I added

Once the overall approach is approved, and I have feedback on the specific helper functions, I will write some tests for them as well as incorporate them into the documentation.

Note: I copied two datetime functions from utilities/datetime.py which I didn't even know were there before. Eventually with enough notice we should remove the utilities/datetime.py file.

Checklist:

  • initial PR for review
  • confirm architecture
  • confirm specific functions
  • add tests
  • add docs

@shaunagm shaunagm changed the title from "First stab at helper module with some reusable functions" to "New helper module with some reusable functions" Jul 28, 2022
Contributor

@alxmrs alxmrs left a comment

Questions + feedback.

# Datetime conversion functions


def date_to_timestamp(value, tzinfo=datetime.timezone.utc):
Contributor

Please add type annotations. For example:

import typing as t
def date_to_timestamp(
    value: t.Union[int, str, datetime.datetime],
    tzinfo: datetime.tzinfo = datetime.timezone.utc) -> t.Optional[int]:


    parsed_date = parse_date(value)

    if not parsed_date:
        return None
Contributor

Either the function is misdocumented, or this is incorrect. What if you used -1 as a sentinel value instead of None?
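
To make the two options in this comment concrete, here is a minimal sketch of an annotated `date_to_timestamp` that keeps `None` as the "could not parse" result. `_parse_date_sketch` is a hypothetical stand-in for the PR's `parse_date`, whose body is not shown in this thread; it is an assumption for illustration and only handles ISO 8601 strings, integer Unix timestamps, and `datetime` objects.

```python
import datetime
import typing as t

DateLike = t.Union[int, str, datetime.datetime]


def _parse_date_sketch(
    value: t.Optional[DateLike],
    tzinfo: datetime.tzinfo = datetime.timezone.utc,
) -> t.Optional[datetime.datetime]:
    """Hypothetical stand-in for parse_date (an assumption, not the PR's code)."""
    if value is None or value == "":
        return None
    if isinstance(value, int):
        # Treat integers as Unix timestamps (seconds since the epoch).
        return datetime.datetime.fromtimestamp(value, tz=tzinfo)
    if isinstance(value, datetime.datetime):
        parsed = value
    else:
        # Only ISO 8601 strings are handled in this sketch.
        parsed = datetime.datetime.fromisoformat(value)
    # Attach the default timezone to naive datetimes.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=tzinfo)
    return parsed


def date_to_timestamp(
    value: t.Optional[DateLike],
    tzinfo: datetime.tzinfo = datetime.timezone.utc,
) -> t.Optional[int]:
    """Return a Unix timestamp, or None when there is nothing to parse."""
    parsed_date = _parse_date_sketch(value, tzinfo)
    if parsed_date is None:
        return None
    return int(parsed_date.timestamp())


just_before_epoch = datetime.datetime(
    1969, 12, 31, 23, 59, 59, tzinfo=datetime.timezone.utc
)
print(date_to_timestamp("1970-01-02"))       # 86400
print(date_to_timestamp(None))               # None
print(date_to_timestamp(just_before_epoch))  # -1
```

One trade-off worth weighing: -1 is itself a valid Unix timestamp (one second before the epoch), so a -1 sentinel would be ambiguous for pre-1970 dates, whereas `t.Optional[int]` keeps "no value" distinct from every real timestamp.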

return int(parsed_date.timestamp())


def parse_date(value, tzinfo=datetime.timezone.utc):
Contributor

Add type annotations.

return parsed


def timestamp_to_readable(value, tzinfo=datetime.timezone.utc, format_as='%Y-%m-%d %H:%M:%S'):
Contributor

Why not use the ISO format as the default format?

Contributor

i.e. the built-in one?

Collaborator Author

@shaunagm shaunagm Sep 22, 2022

ISO is a little less readable, though certainly more readable than unix timestamps! But it might be worth the tradeoff to use a more standardized format.
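
For comparison, a rough sketch of what the ISO default could look like next to the current format string. The `format_as=None` default is an assumption made for this illustration (the PR's signature defaults to `'%Y-%m-%d %H:%M:%S'`), and the function body is a guess at the behavior rather than the PR's implementation.

```python
import datetime


def timestamp_to_readable(value, tzinfo=datetime.timezone.utc, format_as=None):
    """Sketch: ISO 8601 output by default, strftime when format_as is given."""
    moment = datetime.datetime.fromtimestamp(value, tz=tzinfo)
    if format_as is None:
        # Built-in ISO 8601 formatting, including the UTC offset.
        return moment.isoformat()
    return moment.strftime(format_as)


print(timestamp_to_readable(0))
# 1970-01-01T00:00:00+00:00
print(timestamp_to_readable(0, format_as="%Y-%m-%d %H:%M:%S"))
# 1970-01-01 00:00:00
```

The ISO string carries the offset, so it can be parsed back without guessing the timezone; the custom format drops that information in exchange for readability.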

# Flatten contacts

def get_primary_contact_from_nested(contact_list, get_first=True, selector=None):
"""Extracts single contact value from list of dictionaries.
Contributor

I see the docstring style shifting between this style (which I prefer) and putting the description on a new line, e.g.:

"""
Extract single contact ...

Which is the correct style?

Collaborator Author

I don't think we have a preference. I was going to say "let's go with what Python prefers" but it looks like Python itself doesn't have a preference. I know it's a little inconsistent but I don't think it's a big deal to switch back and forth within parsons, and that's one less thing that PR contributors have to worry about.

Comment on lines +101 to +102
This helper method helps "flatten" the dictionary by returning the primary number (or,
if no primary number is found and get_first is True, the first number found).
Contributor

Idea: what if we made the keys of the dictionary a tuple with all the numbers? e.g.

result = {
   ('5554444444',): {'foo': 'bar'},
   ('4441234567', '7771234567'): {'bin': 'ban'},
   # ...
}

I bring this up because there would be no exceptional cases here. If users wanted to get the first number, they would simply access the tuple and get the first element.
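
A small sketch of the tuple-key idea, to show what callers would write. The record layout (a `phones` list of dicts with `number` and `primary` keys) and the helper name `key_by_numbers` are assumptions for illustration; the thread does not show the real input shape. Note that a one-element tuple needs a trailing comma, since `('5554444444')` is just a parenthesized string.

```python
def key_by_numbers(records):
    """Key each record by a tuple of all of its phone numbers."""
    result = {}
    for record in records:
        numbers = tuple(entry["number"] for entry in record["phones"])
        # Keep every field except the nested phone list as the value.
        result[numbers] = {k: v for k, v in record.items() if k != "phones"}
    return result


# Hypothetical sample data mirroring the example in the comment above.
records = [
    {"phones": [{"number": "5554444444", "primary": True}], "foo": "bar"},
    {
        "phones": [
            {"number": "4441234567", "primary": False},
            {"number": "7771234567", "primary": False},
        ],
        "bin": "ban",
    },
]

result = key_by_numbers(records)
for numbers, fields in result.items():
    # "Get the first number" becomes plain tuple indexing.
    first_number = numbers[0] if numbers else None
    print(first_number, fields)
```

One caveat of this layout: two records with identical number tuples would overwrite each other, so it presumably only fits data where the set of numbers identifies a record.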

    if not selector:  # if selector still not found, look in dict
        dict_keys = list(contact_list[0].keys())
        dict_keys.pop("primary")
        selector = dict_keys[0]  # NOTE: this will break on dicts that have additional keys
Contributor

This needs fixing / addressing.

Contributor

E.g., it could be addressed with a try/except block, or by raising an error with a clear message.
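
One possible shape for the "raise an error" option, as a sketch only. `infer_selector` is a hypothetical helper name, not something in the PR. It also sidesteps a second issue in the quoted lines: `dict_keys` is a list, and `list.pop` expects an integer index, so `dict_keys.pop("primary")` would raise a `TypeError`.

```python
def infer_selector(contact_list):
    """Infer the value key from the first contact dict, or fail loudly."""
    if not contact_list:
        raise ValueError("contact_list is empty; cannot infer a selector")
    # Every key other than "primary" is a candidate for the value field.
    candidates = [key for key in contact_list[0] if key != "primary"]
    if len(candidates) != 1:
        raise ValueError(
            f"Cannot infer selector from keys {sorted(contact_list[0])}; "
            "pass selector explicitly"
        )
    return candidates[0]


print(infer_selector([{"primary": True, "phone": "5554444444"}]))  # phone

try:
    infer_selector([{"primary": True, "phone": "5554444444", "type": "cell"}])
except ValueError as error:
    print(error)
```

Raising here asks the caller to pass `selector` explicitly instead of silently picking whichever key happens to come first.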


# Flatten contacts

def get_primary_contact_from_nested(contact_list, get_first=True, selector=None):
Contributor

How is this function used? What problem does it solve?

Comment on lines +176 to +177
parsed_value = re.compile(r'\d+(?:\.\d+)?').findall(value) # extracts digits only
value = "".join(parsed_value)
Contributor

I find this a bit more readable / easier to maintain (regexes, in my view, should be avoided at all costs):

Suggested change
- parsed_value = re.compile(r'\d+(?:\.\d+)?').findall(value) # extracts digits only
- value = "".join(parsed_value)
+ value = "".join([v for v in value if v.isdigit()])
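
A quick side-by-side of the two versions, since they are not strictly equivalent. The function names and sample inputs below are made up for illustration. The regex keeps a period when it sits between two digit runs, so the "extracts digits only" comment holds for the `isdigit` version but not always for the regex one.

```python
import re


def digits_via_regex(value):
    """The original lines from the diff, wrapped in a function."""
    parsed_value = re.compile(r'\d+(?:\.\d+)?').findall(value)
    return "".join(parsed_value)


def digits_via_isdigit(value):
    """The suggested replacement, wrapped in a function."""
    return "".join([v for v in value if v.isdigit()])


print(digits_via_regex("(555) 123-4567"))    # 5551234567
print(digits_via_isdigit("(555) 123-4567"))  # 5551234567

# Dot-separated input is where the two diverge.
print(digits_via_regex("555.123.4567"))      # 555.1234567
print(digits_via_isdigit("555.123.4567"))    # 5551234567
```

If dot-separated phone numbers are expected in the data, the `isdigit` version appears to be the safer of the two.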

Contributor

alxmrs commented Sep 17, 2022

the overall goal of providing more helper code for users

What problems are you trying to solve?

Collaborator Author

shaunagm commented Sep 17, 2022 via email

Collaborator Author

shaunagm commented Jun 6, 2023

Closing this but linking it from #836 so that the code/discussion here isn't lost.

@shaunagm shaunagm closed this Jun 6, 2023