get_tld(), get_components(), and other #17

Open
wants to merge 6 commits into
base: master
106 changes: 106 additions & 0 deletions UseCases.md
@@ -0,0 +1,106 @@
This file describes the exact behavior of the methods in various edge cases and
explains the general logic. It covers the behavior of get_tld,
get_tld_unsafe, get_sld, get_sld_unsafe, split_domain, and split_domain_unsafe.

Contributor

@vadym-t suggest to add here a why sentence:
Unsafe versions of the methods will significantly save resources on large-scale applications of the library where the data has already been converted to lowercase and missing data has a None value. This can be done in Spark/Dask, for example, and result in a significant reduction in computational resources. For adhoc usage, the original functions are sufficient.

Author

Added

The unsafe versions of the methods save significant resources in large-scale
applications of the library where the data has already been converted to
lowercase and missing data is represented as None. This preprocessing can be
done in Spark/Dask, for example, and it reduces the computational cost of each
call. For ad hoc usage, the original functions are sufficient.
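
As a rough illustration of the preprocessing described above, the sketch below normalizes a batch once and then calls an unchecked function on the result. Both `normalize` and `last_label_unsafe` are hypothetical names used only for this example; `last_label_unsafe` is a toy stand-in and not one of the library's get_*_unsafe() methods. The sketch also assumes that the caller is the one who skips None entries.

```python
def normalize(values):
    """Lowercase every present value once; keep missing values as None."""
    return [None if value is None else value.lower() for value in values]


def last_label_unsafe(domain):
    # Hypothetical stand-in for an unsafe method: it assumes `domain` is a
    # lowercase str and performs no None check and no case conversion.
    return domain.rsplit(".", 1)[-1]


raw = ["WWW.Example.COM", None, "mail.example.org"]
clean = normalize(raw)

# Assumption: the caller filters out None before calling the unsafe function.
results = [None if d is None else last_label_unsafe(d) for d in clean]
```

In Spark or Dask the same normalization would typically be expressed as a single column-wise transformation, so the per-row calls no longer repeat the None check and the lowercase conversion.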

1. General difference between the get_*() and get_*_unsafe() methods:
get_*_unsafe() does not check whether the input string is None and does not
convert it to lower case.

2. The methods listed above work only with non-canonical FQDN strings -
Contributor

listed above means all or just the unsafe methods?

the trailing dot must be removed before calling the method. This restriction
eliminates fuzzy logic in edge cases.

3. DNS does not support empty labels - if a label is detected to be empty,
None is returned.

4. Every method processes the provided FQDN in reverse order, from the last
label towards the start of the string. It stops as soon as its specific task is
completed. Therefore no validation occurs outside of this scope. For example,
```
get_tld('......com') -> 'com'
```
because the leading dots are never processed.
The split_domain method is based on the get_sld method - it returns everything
in front of the get_sld() result as a prefix.
For the example above:
Contributor

@vadym-t you might also add here a non-edge case example. the split_domain() method offers a new capability to the library-- one that folks might get from other libraries -- but your only example is the edge. suggest something like:
split_domain allows you to recover the host, or prefix, of an SLD, for use in aggregation or analysis based on the labels. e.g., split_domain('www.googl.com')

```
split_domain('......com') -> ('....',None,'com')
```
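
The right-to-left logic in points 2-4 can be sketched with a toy model. The functions `toy_tld` and `toy_split` below are hypothetical and are not the library's implementation: they treat the last label as the TLD instead of consulting the public suffix list, and the shape of the returned tuple for ordinary names is an assumption. Only the `'......com'` case is taken from the examples above.

```python
def toy_tld(fqdn):
    """Return the last label, or None if it is empty (toy model only)."""
    tld = fqdn.rpartition(".")[2]
    return tld or None


def toy_split(fqdn):
    """Split into (prefix, sld, tld), scanning from the right (toy model only)."""
    head, sep, tld = fqdn.rpartition(".")
    if not tld:
        # Empty last label, e.g. a trailing dot that was not removed.
        return (None, None, None)
    if not sep:
        # A single label: there is nothing in front of the TLD.
        return (None, None, tld)
    prefix, sep2, label = head.rpartition(".")
    # An empty label directly in front of the TLD gives no SLD.
    sld = f"{label}.{tld}" if label else None
    # Everything further left is returned as-is and never validated.
    return (prefix if sep2 else None, sld, tld)


print(toy_tld("......com"))
print(toy_split("......com"))
print(toy_split("www.example.com"))
print(toy_split("example.com."))
```

Because the scan stops once the labels it needs are found, the four leading dots in `'......com'` end up in the prefix untouched, while the empty label next to `com` is the only one that is actually inspected.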
Edge cases and expected behavior
The behavior of the library is best illustrated with small examples
(boolean arguments are omitted when they do not affect the behavior):

Contributor