Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(snowflake): implement Table.sample #9071

Merged
merged 4 commits into from
May 6, 2024

Conversation

IndexSeek
Copy link
Member

This PR aims to add support for SAMPLE/TABLESAMPLE for Snowflake to build on the work done in #7377.

As the SAMPLE/TABLESAMPLE clause seems to be written the same way between Snowflake and DuckDB in SQL, I was able reuse code from:

# TODO(kszucs): this could be moved to the base SQLGlotCompiler
def visit_Sample(
self, op, *, parent, fraction: float, method: str, seed: int | None, **_
):
sample = sge.TableSample(
this=parent,
method="bernoulli" if method == "row" else "system",
percent=sge.convert(fraction * 100.0),
seed=None if seed is None else sge.convert(seed),
)
return sg.select(STAR).from_(sample)

Snowflake Docs: SAMPLE/TABLESAMPLE

Code Snippet I used to validate it was working:

import ibis

con = ibis.snowflake.connect(create_object_udfs=False)

t = con.table("WOWAH_DATA_RAW", database=("WOWAH", "PUBLIC"))

print(t.count().execute())

for expression in (
    t.sample(fraction=0.01),
    t.sample(fraction=0.01, seed=42),
    t.sample(fraction=0.01, method="block", seed=42),
):
    print(ibis.to_sql(expression, dialect="snowflake"))
    print(expression.count().execute())

Results:

10826734
SELECT
  *
FROM "WOWAH"."PUBLIC"."WOWAH_DATA_RAW" AS "t0" TABLESAMPLE bernoulli (1.0)
107897
SELECT
  *
FROM "WOWAH"."PUBLIC"."WOWAH_DATA_RAW" AS "t0" TABLESAMPLE bernoulli (1.0) SEED (42)
107953
SELECT
  *
FROM "WOWAH"."PUBLIC"."WOWAH_DATA_RAW" AS "t0" TABLESAMPLE system (1.0) SEED (42)
107953

@cpcloud cpcloud added the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Apr 28, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Apr 28, 2024
@IndexSeek
Copy link
Member Author

A few thoughts/questions I have on this:

Should we wait to move this logic to the SQLGlotCompiler?

Should we break apart the first sample test? It made it past the first assertion but failed on df = t.sample(0.1, method="block").execute() due to this error: SAMPLE clause on views only supports row wise sampling without seed. I marked that to be skipped in the interim, but thinking we would want to cover the first scenario at least or test it without a view.

CC: @cromano8 as we were discussing this behavior.

@cpcloud cpcloud added feature Features or general enhancements snowflake The Snowflake backend ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI labels Apr 29, 2024
@ibis-docs-bot ibis-docs-bot bot removed the ci-run-cloud Add this label to trigger a run of BigQuery, Snowflake, and Databricks backends in CI label Apr 29, 2024
@cromano8
Copy link

Definitely going to check this out tomorrow/wednesday

@IndexSeek IndexSeek marked this pull request as draft May 5, 2024 18:11
@IndexSeek
Copy link
Member Author

I am marking this as a draft for the time being; I haven't been able to dig back in yet.

I'm currently working on setting things up to properly test locally against this backend rather than relying on the cloud-ci-runs. This should make future PRs a little smoother for me as well.

@cpcloud
Copy link
Member

cpcloud commented May 6, 2024

@IndexSeek I can help here, there's only two kinds of failures and I don't think there's anything else to do here:

  1. XPASS: these are tests that fail when they start succeeding. Have xfails in strict mode so that when tests that were previously failing start succeeding we know.
  2. Transient BigQuery errors: nothing to do here for the moment except restart the test, which I can do.

@cpcloud cpcloud force-pushed the feat/snowflake-sample branch from 036b683 to 6bd9e12 Compare May 6, 2024 13:35
@cpcloud cpcloud force-pushed the feat/snowflake-sample branch from 6bd9e12 to c795fc5 Compare May 6, 2024 13:35
@cpcloud cpcloud marked this pull request as ready for review May 6, 2024 13:35
@cpcloud cpcloud added this to the 9.1 milestone May 6, 2024
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@cpcloud
Copy link
Member

cpcloud commented May 6, 2024

Snowflake sample tests are passing:

cloud in 🌐 falcon in …/ibis on  feat/snowflake-sample is 📦 v9.0.0 via 🐍 v3.12.2 via ❄️   impure (ibis-3.12.2-env) took 20s
❯ pytest 'ibis/backends/tests/test_generic.py' -k sample -m snowflake -vv
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.12.2, pytest-8.2.0, pluggy-1.5.0 -- /nix/store/cdmvsa912j1xj2d6kfsv0b63ffnylkvv-python3-3.12.2-env/bin/python3.12
cachedir: .pytest_cache
hypothesis profile 'dev' -> deadline=None, max_examples=50, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase(PosixPath('/home/cloud/ibis/.hypothesis/examples'))
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=1209062198
rootdir: /home/cloud/ibis
configfile: pyproject.toml
plugins: hypothesis-6.100.2, benchmark-4.0.0, pytest_httpserver-1.0.10, clarity-1.0.1, randomly-3.15.0, repeat-0.9.3, timeout-2.3.1, xdist-3.6.1, cov-5.0.0, anyio-4.3.0, mock-3.14.0, snapshot-0.9.0
collected 3660 items / 3656 deselected / 4 selected

ibis/backends/tests/test_generic.py::test_sample[snowflake-block] XFAIL (SAMPLE clause on views only supports row wise sampling without seed.) [ 25%]
ibis/backends/tests/test_generic.py::test_sample_memtable[snowflake] PASSED [ 50%]
ibis/backends/tests/test_generic.py::test_sample[snowflake-row] PASSED   [ 75%]
ibis/backends/tests/test_generic.py::test_sample_with_seed[snowflake] PASSED [100%]

@cpcloud cpcloud enabled auto-merge (squash) May 6, 2024 13:39
@cpcloud cpcloud merged commit 307334b into ibis-project:main May 6, 2024
81 checks passed
@IndexSeek
Copy link
Member Author

That's awesome; thank you so much for helping me get this across the finish line!

  1. XPASS: these are tests that fail when they start succeeding. Have xfails in strict mode so that when tests that were previously failing start succeeding we know.

Ahh, that makes sense! I was digging around to see if I had forgotten to remove it/add it from/to an iterable somewhere.

@IndexSeek IndexSeek deleted the feat/snowflake-sample branch October 30, 2024 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements snowflake The Snowflake backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants