[data] Cache Dataset.schema #37103

Merged 1 commit on Jul 5, 2023

stephanie-wang
Contributor

Why are these changes needed?

Cache the computed schema to avoid re-executing.
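
The idea can be sketched as follows. This is a minimal illustration of the caching pattern, not the code changed in this PR: `LazyPlan`, `compute_schema`, and the `executions` counter are hypothetical names introduced only for this example, and the schema is modeled as a plain dict.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a schema: column name -> type name.
Schema = Dict[str, str]


class LazyPlan:
    """Hypothetical lazy plan that caches its schema (not Ray's implementation)."""

    def __init__(self, compute_schema: Callable[[], Schema]) -> None:
        # Assumed to be expensive, e.g. it may trigger execution of the plan.
        self._compute_schema = compute_schema
        self._cached_schema: Optional[Schema] = None
        # Counts how often the expensive computation actually ran.
        self.executions = 0

    def schema(self) -> Schema:
        # Run the expensive computation only if nothing is cached yet.
        if self._cached_schema is None:
            self.executions += 1
            self._cached_schema = self._compute_schema()
        return self._cached_schema


plan = LazyPlan(lambda: {"id": "int64"})
first = plan.schema()   # computes and stores the schema
second = plan.schema()  # served from the cache
```

One limitation of this sketch: it uses `None` as the "not cached" marker, so a computation that legitimately returns `None` would run again on every call. A dedicated sentinel object would avoid that.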

Related issue number

Closes #37077.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

if self._stages_after_snapshot:
    # TODO(swang): There are several other stage types that could
Contributor

Yeah, was wondering if this should be a feature of the LogicalPlan API (cc @raulchen).

@@ -241,6 +241,22 @@ def test_schema_lazy(ray_start_regular_shared):
    assert ds._plan.execute()._num_computed() == 0


def test_schema_cached(ray_start_regular_shared):
    def check_schema_cached(ds):
        schema = ds.schema()
Contributor

Should we also check that repeated calls to schema() are cached?

Contributor Author

Ah, it is checked under this.
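
For illustration, a check that repeated schema() calls hit the cache could look roughly like the sketch below. It does not reproduce the test in this PR (the rest of that test is not shown in the excerpt above): `FakeDataset`, `_execute`, and `count_executions_for_repeated_calls` are hypothetical names, and the dataset is a stand-in rather than a real Ray Dataset.

```python
from unittest import mock


class FakeDataset:
    """Hypothetical stand-in for a dataset whose schema() caches its result."""

    def __init__(self):
        self._schema = None

    def _execute(self):
        # Placeholder for the expensive work the cache is meant to avoid.
        return {"id": "int64"}

    def schema(self):
        if self._schema is None:
            self._schema = self._execute()
        return self._schema


def count_executions_for_repeated_calls(ds):
    # Wrap _execute in a spy so calls are counted but still run for real.
    with mock.patch.object(ds, "_execute", wraps=ds._execute) as spy:
        first = ds.schema()
        second = ds.schema()
    # Both calls must agree on the schema.
    assert first == second
    return spy.call_count


calls = count_executions_for_repeated_calls(FakeDataset())
```

If the schema is cached, the spy records exactly one call no matter how many times schema() is invoked.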

@ericl added the @author-action-required label on Jul 5, 2023
@stephanie-wang merged commit 41c4b9d into ray-project:master on Jul 5, 2023
@stephanie-wang deleted the cache-schema branch on July 5, 2023, 23:22
stephanie-wang added a commit to stephanie-wang/ray that referenced this pull request Jul 7, 2023
bveeramani pushed a commit that referenced this pull request Jul 7, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Successfully merging this pull request may close these issues.

[data] Dataset.schema() may get recomputed each time