
Redesign base path file structure for caching and data exploration #368

Closed

PhilippeMoussalli opened this issue Aug 17, 2023 · 6 comments
Labels: Core (Core framework)


PhilippeMoussalli commented Aug 17, 2023

The current structure of the base path is as follows:

base_path
├── subset_1
│   └── <pipeline_name>-<pipeline_id>
│       └── data.0.parquet
├── subset_2
│   └── <pipeline_name>-<pipeline_id>
│       └── data.0.parquet
├── component_1
│   └── manifest_<hash_key_1>.json
└── component_2
    └── manifest_<hash_key_2>.json

Note:

  • The hash key is not yet included in the main branch
  • Keep in mind that we now have a unified structure between the local and remote runners

We need to have a required file structure that makes it easy to:

  1. Find all the components corresponding to a given pipeline_name and pipeline_id
    • This is needed for the data explorer, where the user selects a pipeline name -> pipeline_id (dropdown based on pipeline name) -> component, and gets a list of all the subsets that can be explored given the component's manifest
  2. Find all the manifests for a certain component name (see Implement caching workflow #325)
    • This is needed for the caching mechanism, where we check for a cached pipeline run within a certain component folder based on the hash key appended to the manifest name.

The existing file structure only enables us to accomplish (2). For the data explorer, the user currently has to provide the manifest path manually, which is not very intuitive. It can also get very cluttered as the number of subsets and components increases.

To enable both, I would propose moving to the following file structure:

base_path/
├── pipelines/
│   └── <pipeline_name>/
│       └── <pipeline_id>/
│           ├── component_1/
│           │   ├── manifest_<hash_key_1>.json
│           │   └── subsets/
│           │       ├── <subset_1>/
│           │       │   └── part.0.parquet
│           │       └── <subset_2>/
│           │           └── part.0.parquet
│           └── component_2/
│               └── manifest_<hash_key_2>.json
└── manifest_references/
    ├── component_1/
    │   └── references.json
    └── component_2/
        └── references.json

The pipelines directory enables (1) and the manifest_references directory enables (2). references.json is a file that contains references (relative to the base path) to all the manifests of a certain component; it can be updated as new runs are written.

manifest_references/component_1/references.json

{ "manifest_<hash_key_1>": "pipelines/pipeline_name/pipeline_id/component_1/manifest_<hash_key_1>.json", 
 "manifest_<hash_key_2>": "pipelines/pipeline_name/pipeline_id/component_2/manifest_<hash_key_2>.json"
}
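As a sketch of how the cache lookup against such a references.json file could work (a minimal illustration, not the actual Fondant API; `find_cached_manifest`, the `fs` parameter, and the path layout are assumptions based on the structure above):

```python
import json


def find_cached_manifest(fs, base_path, component_name, hash_key):
    """Look up a previously written manifest for a component by its hash key.

    `fs` is an fsspec-style filesystem; returns the manifest path (relative
    to the base path) or None if no cached execution exists.
    """
    references_path = f"{base_path}/manifest_references/{component_name}/references.json"
    if not fs.exists(references_path):
        return None
    with fs.open(references_path) as f:
        references = json.load(f)
    # Keys follow the "manifest_<hash_key>" convention shown above
    return references.get(f"manifest_{hash_key}")
```

A cache hit means the component can be skipped and the referenced manifest reused; a miss means the component runs and appends its own entry to references.json.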

An alternative solution could be to just write all the manifests as well under manifest_references but that would lead to duplicated manifests.

The subset directory is also restructured so that subsets from different pipelines are no longer stored in the same directory, which makes the base path less cluttered.

@PhilippeMoussalli PhilippeMoussalli converted this from a draft issue Aug 17, 2023
RobbeSneyders (Member) commented:

I don't see why we can't solve one of these requirements using a glob pattern?

I agree that the structure can be improved, but I would keep it simple:

- pipeline name
  |- pipeline run
     |- component 1
     |  |- manifest.json
     |  |- subset 1
     |     |- *.parquet
     |  |- subset 2
     |     |- *.parquet
     |- component 2
        |- ...

Then you can find all the components for a given run of a pipeline (1) as `pipeline_name/pipeline_run/*`, and all the manifests of a component of a pipeline (2) as `pipeline_name/*/component_name/manifest.json`.
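A minimal local-filesystem sketch of these two lookups using stdlib globbing (illustrative only; against a remote base path the same patterns would go through an fsspec filesystem's `glob` instead):

```python
import glob
import os


def components_for_run(base_path, pipeline_name, pipeline_run):
    """Requirement (1): all component directories for a given pipeline run."""
    pattern = os.path.join(base_path, pipeline_name, pipeline_run, "*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def manifests_for_component(base_path, pipeline_name, component_name):
    """Requirement (2): all manifests of a component across runs of a pipeline."""
    pattern = os.path.join(base_path, pipeline_name, "*", component_name, "manifest.json")
    return sorted(glob.glob(pattern))
```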


PhilippeMoussalli commented Aug 17, 2023

> I don't see why we can't solve one of these requirements using a glob pattern?
>
> I agree that the structure can be improved, but I would keep it simple:
>
> - pipeline name
>   |- pipeline run
>      |- component 1
>      |  |- manifest.json
>      |  |- subset 1
>      |     |- *.parquet
>      |  |- subset 2
>      |     |- *.parquet
>      |- component 2
>         |- ...
>
> Then you can find all the components for a given run of a pipeline (1) as pipeline_name/pipeline_run/* and all the manifests of a component of a pipeline (2) as pipeline_name/*/component_name/manifest.json

I guess I just wasn't aware of glob patterns 😂
That simplifies things quite a bit indeed

RobbeSneyders pushed a commit that referenced this issue Aug 22, 2023
PR that creates a metadata class; this will make it easier to implement
#368 (it was originally part of #325, but we decided to break it down to
make it easier to review).

A few other notable changes:

- The `run_id` now has an identical format between both runners
(name_timestamp); we no longer need the kfp uid since it is just used
to store the native output artifacts
- The `safe_component_name` has been moved from the local runner to the
component spec to avoid having to plug it in everywhere

---------

Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
RobbeSneyders pushed a commit that referenced this issue Aug 23, 2023
@github-project-automation github-project-automation bot moved this from Ready for development to Done in Fondant development Aug 28, 2023
PhilippeMoussalli commented:

@RobbeSneyders It seems like fs.glob() is very slow when working with blobs.

I tried the following two glob patterns:

```python
from fondant.filesystem import get_filesystem

# Wildcard over pipeline runs:
glob_path = "gs://soy-audio-379412_kfp-artifacts/custom_artifact/commoncrawl-pipeline/*/common_crawl_download_component/manifest_6de9482e66e860779942466ba7b0d215.json"

# Wildcard over components within a run:
glob_path = "gs://soy-audio-379412_kfp-artifacts/custom_artifact/commoncrawl-pipeline/commoncrawl-pipeline-20230829150955/*/manifest_61c839281b6c80e65c68f414b09dd260.json"

fs = get_filesystem(glob_path)
print(fs.glob(glob_path))
```

Both seem to run quite slowly. For reference, this is the pipeline where you created almost 10k partitions for the subsets. I would expect the search to stop at the manifest level, but it seems to also be searching through the subset directories.

I think it's related to this:

> In contrast, gs://bucket/*abc.txt asks the server for the complete list of objects in the bucket root, and then filters for those objects whose name ends with abc.txt.

Might be better in that case to implement the initial suggestion above. Wdyt?

RobbeSneyders commented:

That indeed seems slow 🐌

I would propose this as an iteration on your proposal above:

base_path/
└── pipelines/
    └── <pipeline_name>/
        ├── <pipeline_id>/
        │   ├── component_1/
        │   │   ├── manifest.json
        │   │   └── subsets/
        │   │       ├── <subset_1>/
        │   │       │   └── part.0.parquet
        │   │       └── <subset_2>/
        │   │           └── part.0.parquet
        │   └── component_2/
        │       └── manifest.json
        └── cache/
            ├── <hash_key1>
            ├── <hash_key2>
            └── <hash_key3>

Where each <hash_keyx> file just contains the path to the relevant manifest.

Since we cache per pipeline name and the hash key should be unique across components, I think this will do, and it's very explicit.
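With this layout, the cache check reduces to a single existence test plus a small file read, with no globbing at all. A minimal sketch (function names and the exact cache path are assumptions, not the Fondant implementation; `fs` stands for any fsspec-style filesystem):

```python
def read_cached_manifest_path(fs, base_path, pipeline_name, hash_key):
    """Return the manifest path stored in cache/<hash_key>, or None if absent."""
    cache_file = f"{base_path}/pipelines/{pipeline_name}/cache/{hash_key}"
    if not fs.exists(cache_file):
        return None
    with fs.open(cache_file, "r") as f:
        return f.read().strip()


def write_cache_entry(fs, base_path, pipeline_name, hash_key, manifest_path):
    """Record the manifest path for this hash key, overwriting any old entry."""
    cache_file = f"{base_path}/pipelines/{pipeline_name}/cache/{hash_key}"
    with fs.open(cache_file, "w") as f:
        f.write(manifest_path)
```

On object stores this is a direct key lookup rather than a listing, which is why it avoids the slow glob behaviour described above.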

PhilippeMoussalli commented:

Alright, one other important consideration concerns the way we currently work with the dev tag. I think we should still overwrite the manifest reference in cache/hash_key in case it already exists and caching is set to False, mainly to circumvent the way we work with dev tags. This explains it a bit better:

[image attachment]

RobbeSneyders commented:

Yes, agreed. The cache parameter should only define whether the component checks if it can reuse a cached execution, not whether it writes cache information for the future.
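Putting the two comments above together, the control flow could look roughly like this (an illustrative sketch, not the actual Fondant code; `run_component` stands for whatever executes the component and returns its manifest path):

```python
def resolve_execution(fs, base_path, pipeline_name, hash_key, cache_enabled, run_component):
    """Run a component, or reuse a cached execution.

    The cache flag only gates the read side; the cache entry is always
    (over)written so future runs can pick it up.
    """
    cache_file = f"{base_path}/pipelines/{pipeline_name}/cache/{hash_key}"
    if cache_enabled and fs.exists(cache_file):
        with fs.open(cache_file, "r") as f:
            return f.read()  # reuse the manifest of the previous execution
    manifest_path = run_component()  # execute the component
    with fs.open(cache_file, "w") as f:
        f.write(manifest_path)  # always record the new manifest, even with caching off
    return manifest_path
```

Running with caching disabled thus refreshes the cache entry, which handles the dev-tag case where a re-pushed image should replace the stale reference.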

PhilippeMoussalli added a commit that referenced this issue Aug 31, 2023
PR that standardizes fsspec file access to use the `AbstractFileSystem`
class for all filesystem-related functionality.

Previously we were using the `AbstractFileSystem` and custom functions like
`fs_open()` to read/write/list the manifest(s). This caused some issues,
since `fs_open` expects the protocol of the filesystem to be present in
the URL. However, files returned by `AbstractFileSystem().ls` use a
plain filesystem URL; when passed to `fs_open()`, this caused it to fail
to read the file properly.

Still some issues with using glob patterns. More details here:
#368
@RobbeSneyders RobbeSneyders added the Core Core framework label Aug 31, 2023
This was referenced Aug 31, 2023
PhilippeMoussalli added a commit that referenced this issue Sep 5, 2023
PR that updates the way we check for cached executions. The previous
proposal was to use glob patterns to check for a manifest under a
certain directory, but this was too slow, since glob patterns behave
differently on remote filesystems, where they scan all nested
directories. More details about the issue
[here](#368 (comment)).

This PR changes this by writing a reference to a manifest file in a
folder separate from the subsets.

```
base_path/
└── pipelines/
    └── <pipeline_name>/
        ├── <pipeline_id>/
        │   ├── component_1/
        │   │   ├── manifest.json
        │   │   └── subsets/
        │   │       ├── <subset_1>/
        │   │       │   └── part.0.parquet
        │   │       └── <subset_2>/
        │   │           └── part.0.parquet
        │   └── component_2/
        │       └── manifest.json
        └── cache/
            ├── <hash_key1>
            ├── <hash_key2>
            └── <hash_key3>
```

As a consequence, we also reverted to using basic fsspec functionality,
which makes it easier to resolve files without having to specify
explicitly which protocol they belong to. Closing [this
ticket](#402) in favor of this one.

Tested with all combinations (remote/local runner, base path).
This was referenced Sep 6, 2023
PhilippeMoussalli added a commit that referenced this issue Sep 15, 2023
First review  #427

PR that updates the interface of the data explorer 

![image](https://github.com/ml6team/fondant/assets/47530815/964f640a-af24-46a2-b595-e1a24f168ab1)

Notes: 
* Right now the base path has a mix of subsets and fields. Related to #244
* The current base path contains a mix of subsets and pipelines due to
the old way of organizing. In the future it should only include pipeline
names. #368 (comment)

Perhaps we can do some cleanup of our current base path and remove the
unnecessary subsets:

```
gs://soy-audio-379412_kfp-artifacts/custom_artifact
```
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023