-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ArraySubset defined, first draft #2383
ArraySubset defined, first draft #2383
Conversation
Hello! I'm trying to figure out a couple of things with this PR:
Thanks and have a great day! |
Yeah, happy to accept an class ArraySubsetLevel(ComparisonLevelCreator):
def __init__(self, col_name: str | ColumnExpression):
"""Represents a comparison level where the smaller array is an
exact subset of the larger array.
"""
self.col_expression = ColumnExpression.instantiate_if_str(col_name)
@unsupported_splink_dialects(["sqlite"])
def create_sql(self, sql_dialect: SplinkDialect) -> str:
sqlglot_dialect_name = sql_dialect.sqlglot_name
sqlglot_base_dialect_sql = """
ARRAY_SIZE(
ARRAY_INTERSECT(___col____l, ___col____r)) /
LEAST(ARRAY_SIZE(___col____l), ARRAY_SIZE(___col____r))
== 1
"""
translated = _translate_sql_string(
sqlglot_base_dialect_sql, sqlglot_dialect_name
)
self.col_expression.sql_dialect = sql_dialect
col = self.col_expression
col = self.col_expression
translated = translated.replace("___col____l", col.name_l)
translated = translated.replace("___col____r", col.name_r)
return translated
def create_label_for_charts(self) -> str:
return "Array subset" You then need to expose here for the user-facing API In tersm of testing, you should be able to use an approach like this: i.e. add tests to the I've checked the above suggestion against duckdb but not spark, so not 100% it'll work there as well If you're fighting with ruff don't worry too much, i can add a commit once everything else is ready to fix. But yeah, you shouldn't need to change the |
Hey, Robin! Thanks for your help with where to place the tests and the corrections you mentioned. I've made the necessary amendments, and added some edgecase handling for cases where one array of the two was empty (we had a bit of a division-by-zero problem but it now runs through a preliminary check to make sure that the smaller of the two arrays is not empty). A couple of questions here:
|
@mihir-packmoose sorry for the delay. Thanks - yes I'm happy with this and agree re: arrays with zero elements. Are you happy for me to add some commits for tests, and go ahead and merge? |
Sounds good! Thanks again for your help, Robin! |
P.S. I see what you mean by adding commits for tests now, I tried to resolve those conflicts on |
I've moved this changeset over to #2416 to simplify the merge. I cherry picked your commits so you'll still appear in the contributors :-) |
Sounds good! Thanks for helping me through the process, @RobinL ! |
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
Closes #2382
Give a brief description for the solution you have provided
Defines a new class in splink.internals.comparison_level_library called ArrayIntersectPercentage. This class takes two arguments, col_name and percentage_threshold. The functionality here has been tested locally.
No testing yet, will continue adding components as I go along.
PR Checklist