# Redesign base path file structure for caching and data exploration #368
I don't see why we can't solve one of these requirements using a glob pattern. I agree that the structure can be improved, but I would keep it simple:

Then you can find all the components for a given run of a pipeline (1) as
I guess I just wasn't aware of glob patterns 😂
PR that creates a metadata class; this will make it easier to implement #368 (it was originally part of #325, but I decided to break it down to make it easier to review). A few other notable changes:
- The `run_id` now has an identical format between both runners (`name_timestamp`); we no longer need the kfp uid since it's only used to store the native output artifacts
- The `safe_component_name` has been moved from the local runner to the component spec to avoid having to plug it in everywhere

Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
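As a rough sketch of the unified `run_id` format mentioned above (the helper name is hypothetical; only the `name_timestamp` shape comes from the PR description):

```python
from datetime import datetime
from typing import Optional

def create_run_id(pipeline_name: str, timestamp: Optional[datetime] = None) -> str:
    """Build a run id with the shared `name_timestamp` format used by both runners."""
    timestamp = timestamp or datetime.now()
    return f"{pipeline_name}_{timestamp.strftime('%Y%m%d%H%M%S')}"

run_id = create_run_id("commoncrawl-pipeline", datetime(2023, 8, 29, 15, 9, 55))
print(run_id)  # commoncrawl-pipeline_20230829150955
```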
@RobbeSneyders I tried the following two glob patterns:

```python
from fondant.filesystem import get_filesystem

glob_path = "gs://soy-audio-379412_kfp-artifacts/custom_artifact/commoncrawl-pipeline/*/common_crawl_download_component/manifest_6de9482e66e860779942466ba7b0d215.json"
glob_path = "gs://soy-audio-379412_kfp-artifacts/custom_artifact/commoncrawl-pipeline/commoncrawl-pipeline-20230829150955/*/manifest_61c839281b6c80e65c68f414b09dd260.json"
fs = get_filesystem(glob_path)
print(fs.glob(glob_path))
```

And they both seem to run quite slow. For reference, this is the pipeline where you created almost 10k partitions for the subsets. I would expect the search to stop at the manifest level, but it seems to also be searching through the subsets directories. I think it's related to this:
Might be better in that case to implement the initial suggestion above. Wdyt?
That indeed seems slow 🐌 I would propose this as an iteration on your proposal above:

Since we cache per pipeline name and the hash key should be unique across components, I think this will do, and it's very explicit.
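A minimal sketch of why a per-pipeline `cache/` directory keyed by unique hashes avoids the slow glob: a cache hit becomes a single existence check on a fully known path instead of a recursive directory scan. All names here are illustrative, not the actual Fondant implementation, and a local directory stands in for the remote base path:

```python
import tempfile
from pathlib import Path

def is_cached(base_path: Path, pipeline_name: str, hash_key: str) -> bool:
    """O(1) lookup: one stat call on an exact path, no directory traversal."""
    return (base_path / "pipelines" / pipeline_name / "cache" / hash_key).exists()

# Demo: build a tiny fake base path locally.
base = Path(tempfile.mkdtemp())
cache_dir = base / "pipelines" / "my-pipeline" / "cache"
cache_dir.mkdir(parents=True)
(cache_dir / "6de9482e").touch()

print(is_cached(base, "my-pipeline", "6de9482e"))  # True
print(is_cached(base, "my-pipeline", "deadbeef"))  # False
```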
Alright, one other important consideration concerns the way we currently work with the
Yes, agree. The
PR that standardizes fsspec file access to use the `AbstractFileSystem` class for all filesystem-related functionality. Previously we were using the `AbstractFileSystem` together with custom functions like `fs_open()` to read/write/list the manifest(s). This caused some issues, since `fs_open` expects the protocol of the filesystem to be present in the URL, whereas files returned by `AbstractFileSystem().ls` come back as plain filesystem URLs; when such a URL is passed to `fs_open()`, it fails to read the file properly. There are still some issues with using glob patterns. More details here: #368
PR that updates the way we check for cached executions. The previous proposal was to use glob patterns to check for a manifest under a certain directory, but this was too slow, since glob patterns behave differently on remote filesystems, where they scan all nested directories. More details about the issue [here](#368 (comment)). This PR changes this by writing a reference to a manifest file in a folder separate from the subsets:

```
base_path/
├── pipelines/
│   └── <pipeline_name>/
│       ├── <pipeline_id>/
│       │   ├── component_1/
│       │   │   ├── manifest.json
│       │   │   └── subsets/
│       │   │       ├── <subset_1>/
│       │   │       │   └── part.0.parquet
│       │   │       └── <subset_2>/
│       │   │           └── part.0.parquet
│       │   └── component_2/
│       │       └── manifest.json
│       └── cache/
│           ├── <hash_key1>
│           ├── <hash_key2>
│           └── <hash_key3>
```

As a consequence, we also reverted back to using basic fsspec functionality, which makes it easier to resolve files without having to specify explicitly which protocol they belong to. Closing [this ticket](#402) in favor of this one. Tested with all different combinations (remote/local runner/base path).
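To illustrate the reference mechanism under stated assumptions (the entry format and helper names are hypothetical, not Fondant's actual code): the cache entry stores only the manifest location, so a later run resolves a cached manifest with a single read instead of globbing the subsets tree.

```python
import json
import tempfile
from pathlib import Path

def write_cache_reference(base: Path, pipeline: str, hash_key: str, manifest_path: str) -> None:
    cache_dir = base / "pipelines" / pipeline / "cache"
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Store just a pointer to the manifest, not the manifest itself.
    (cache_dir / hash_key).write_text(json.dumps({"manifest": manifest_path}))

def read_cache_reference(base: Path, pipeline: str, hash_key: str) -> str:
    entry = base / "pipelines" / pipeline / "cache" / hash_key
    return json.loads(entry.read_text())["manifest"]

base = Path(tempfile.mkdtemp())
manifest = "pipelines/my-pipeline/run-1/component_1/manifest.json"
write_cache_reference(base, "my-pipeline", "abc123", manifest)
print(read_cache_reference(base, "my-pipeline", "abc123"))
```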
First review #427. PR that updates the interface of the data explorer. ![image](https://github.com/ml6team/fondant/assets/47530815/964f640a-af24-46a2-b595-e1a24f168ab1)

Notes:
* Right now the base path that we currently have has a mix of subsets and fields. Related to #244
* The current base path contains a mix of subsets and pipelines due to the old way of organizing. In the future it should only include pipeline names. #368 (comment)

Perhaps we can do some cleanup on our current base path and remove unnecessary subsets:

```
gs://soy-audio-379412_kfp-artifacts/custom_artifact
```
The current structure of the base path is as follows:
Note:
We need to have a required file structure that makes it easy to, given the `pipeline_name` and `pipeline_id`:
The existing file structure only enables us to accomplish (2). For the data explorer, the user now has to manually provide the manifest path, which is not very intuitive. It can also get very cluttered with an increasing number of subsets and components.
In order to enable both, I would propose moving to the following file structure:
The `pipelines` directory enables (1) and the `manifest_references` directory enables (2). `references.json` is a file that contains references to all the manifests of a certain component, starting from the base path, and it can be updated (e.g. `manifest_references/component_1/references.json`).
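A sketch of how such a `references.json` file could be kept up to date; the schema (a flat JSON list of manifest paths relative to the base path) is an assumption for illustration, not Fondant's actual format:

```python
import json
import tempfile
from pathlib import Path

def add_manifest_reference(references_file: Path, manifest_path: str) -> None:
    """Append a manifest path (relative to the base path) to references.json."""
    # Assumed schema: a JSON list of path strings.
    refs = json.loads(references_file.read_text()) if references_file.exists() else []
    if manifest_path not in refs:
        refs.append(manifest_path)
    references_file.parent.mkdir(parents=True, exist_ok=True)
    references_file.write_text(json.dumps(refs, indent=2))

base = Path(tempfile.mkdtemp())
refs_file = base / "manifest_references" / "component_1" / "references.json"
add_manifest_reference(refs_file, "pipelines/run-1/component_1/manifest.json")
add_manifest_reference(refs_file, "pipelines/run-2/component_1/manifest.json")
print(json.loads(refs_file.read_text()))
```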
An alternative solution could be to also write all the manifests under `manifest_references`, but that would lead to duplicated manifests. The subset directory is also restructured to avoid storing subsets across different pipelines in the same directory, which makes the base path less cluttered.