
Redesign base path file structure for caching and data exploration #368

Closed

PhilippeMoussalli opened this issue Aug 17, 2023 · 6 comments
Labels: Core (Core framework)


PhilippeMoussalli commented Aug 17, 2023

The current structure of the base path is as follows:

base_path
├── subset_1
│   └── <pipeline_name>-<pipeline_id>
│       └── data.0.parquet
├── subset_2
│   └── <pipeline_name>-<pipeline_id>
│       └── data.0.parquet
├── component_1
│   └── manifest_<hash_key_1>.json
└── component_2
    └── manifest_<hash_key_2>.json

Note:

  • The hash key is not yet included in the main branch
  • Keep in mind that we now have a unified structure between the local and remote runners

We need to have a required file structure that makes it easy to:

  1. Find all the components corresponding to a given pipeline_name and pipeline_id
    • This is needed for the data explorer, where the user selects a pipeline name -> pipeline_id (dropdown based on pipeline name) -> component, and gets a list of all the subsets that can be explored given the component's manifest
  2. Find all the manifests for a certain component name (see Implement caching workflow #325)
    • This is needed for the caching mechanism, where we check for a cached pipeline run within a certain component folder based on the hash key appended to the manifest name.

The existing file structure only enables us to accomplish (2). For the data explorer, the user currently has to provide the manifest path manually, which is not very intuitive. It can also get very cluttered as the number of subsets and components increases.

To enable both, I would propose moving to the following file structure:

base_path/
├── pipelines/
│   └── <pipeline_name>/
│       └── <pipeline_id>/
│           ├── component_1/
│           │   ├── manifest_<hash_key_1>.json
│           │   └── subsets/
│           │       ├── <subset_1>/
│           │       │   └── part.0.parquet
│           │       └── <subset_2>/
│           │           └── part.0.parquet
│           └── component_2/
│               └── manifest_<hash_key_2>.json
└── manifest_references/
    ├── component_1/
    │   └── references.json
    └── component_2/
        └── references.json

The pipelines directory enables (1) and the manifest_references directory enables (2). references.json is a file that contains references (relative to the base path) to all the manifests of a certain component; it can be updated as new runs are written.

manifest_references/component_1/references.json

{ "manifest_<hash_key_1>": "pipelines/pipeline_name/pipeline_id/component_1/manifest_<hash_key_1>.json", 
 "manifest_<hash_key_2>": "pipelines/pipeline_name/pipeline_id/component_2/manifest_<hash_key_2>.json"
}
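As a sketch of how the cache lookup against such a references.json file could work (a minimal illustration, not the actual Fondant API; `find_cached_manifest`, the `fs` parameter, and the path layout are assumptions based on the structure above):

```python
import json


def find_cached_manifest(fs, base_path, component_name, hash_key):
    """Look up a previously written manifest for a component by its hash key.

    `fs` is an fsspec-style filesystem; returns the manifest path (relative
    to the base path) or None if no cached execution exists.
    """
    references_path = f"{base_path}/manifest_references/{component_name}/references.json"
    if not fs.exists(references_path):
        return None
    with fs.open(references_path) as f:
        references = json.load(f)
    # Keys follow the "manifest_<hash_key>" convention shown above
    return references.get(f"manifest_{hash_key}")
```

A cache hit means the component can be skipped and the referenced manifest reused; a miss means the component runs and appends its own entry to references.json.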

An alternative solution could be to just write all the manifests as well under manifest_references but that would lead to duplicated manifests.

The subset directory is also restructured so that subsets from different pipelines are no longer stored in the same directory, which makes the base path less cluttered.

@PhilippeMoussalli PhilippeMoussalli converted this from a draft issue Aug 17, 2023
RobbeSneyders (Member) commented:

I don't see why we can't solve one of these requirements using a glob pattern?

I agree that the structure can be improved, but I would keep it simple:

- pipeline name
  |- pipeline run
     |- component 1
     |  |- manifest.json
     |  |- subset 1
     |     |- *.parquet
     |  |- subset 2
     |     |- *.parquet
     |- component 2
        |- ...

Then you can find all the components for a given run of a pipeline (1) as `pipeline_name/pipeline_run/*`, and all the manifests of a component of a pipeline (2) as `pipeline_name/*/component_name/manifest.json`.
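A minimal local-filesystem sketch of these two lookups using stdlib globbing (illustrative only; against a remote base path the same patterns would go through an fsspec filesystem's `glob` instead):

```python
import glob
import os


def components_for_run(base_path, pipeline_name, pipeline_run):
    """Requirement (1): all component directories for a given pipeline run."""
    pattern = os.path.join(base_path, pipeline_name, pipeline_run, "*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def manifests_for_component(base_path, pipeline_name, component_name):
    """Requirement (2): all manifests of a component across runs of a pipeline."""
    pattern = os.path.join(base_path, pipeline_name, "*", component_name, "manifest.json")
    return sorted(glob.glob(pattern))
```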


PhilippeMoussalli commented Aug 17, 2023

> I don't see why we can't solve one of these requirements using a glob pattern?
>
> I agree that the structure can be improved, but I would keep it simple:
>
> - pipeline name
>   |- pipeline run
>      |- component 1
>      |  |- manifest.json
>      |  |- subset 1
>      |     |- *.parquet
>      |  |- subset 2
>      |     |- *.parquet
>      |- component 2
>         |- ...
>
> Then you can find all the components for a given run of a pipeline (1) as pipeline_name/pipeline_run/* and all the manifests of a component of a pipeline (2) as pipeline_name/*/component_name/manifest.json

I guess I just wasn't aware of glob patterns 😂
That simplifies things quite a bit indeed

RobbeSneyders pushed a commit that referenced this issue Aug 22, 2023
PR that creates a metadata class; this will make it easier to implement
#368 (it was originally part of #325, but we decided to break it down to
make it easier to review).

A few other notable changes:

- The `run_id` now has an identical format between both runners
(name_timestamp); we no longer need the kfp uid since it is just used
to store the native output artifacts
- The `safe_component_name` has been moved from the local runner to the
component spec to avoid having to plug it in everywhere

---------

Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
RobbeSneyders pushed a commit that referenced this issue Aug 23, 2023
@github-project-automation github-project-automation bot moved this from Ready for development to Done in Fondant development Aug 28, 2023
PhilippeMoussalli commented:

@RobbeSneyders It seems like fs.glob() is very slow when working with blobs.

I tried the following two glob patterns:

```python
from fondant.filesystem import get_filesystem

# Wildcard over pipeline runs:
glob_path = "gs://soy-audio-379412_kfp-artifacts/custom_artifact/commoncrawl-pipeline/*/common_crawl_download_component/manifest_6de9482e66e860779942466ba7b0d215.json"

# Wildcard over components within a run:
glob_path = "gs://soy-audio-379412_kfp-artifacts/custom_artifact/commoncrawl-pipeline/commoncrawl-pipeline-20230829150955/*/manifest_61c839281b6c80e65c68f414b09dd260.json"

fs = get_filesystem(glob_path)
print(fs.glob(glob_path))
```

Both seem to run quite slowly. For reference, this is the pipeline where you created almost 10k partitions for the subsets. I would expect the search to stop at the manifest level, but it seems to also be searching through the subset directories.

I think it's related to this:

> In contrast, gs://bucket/*abc.txt asks the server for the complete list of objects in the bucket root, and then filters for those objects whose name ends with abc.txt.

Might be better in that case to implement the initial suggestion above. Wdyt?

RobbeSneyders commented:

That indeed seems slow 🐌

I would propose this as an iteration on your proposal above:

base_path/
└── pipelines/
    └── <pipeline_name>/
        ├── <pipeline_id>/
        │   ├── component_1/
        │   │   ├── manifest.json
        │   │   └── subsets/
        │   │       ├── <subset_1>/
        │   │       │   └── part.0.parquet
        │   │       └── <subset_2>/
        │   │           └── part.0.parquet
        │   └── component_2/
        │       └── manifest.json
        └── cache/
            ├── <hash_key1>
            ├── <hash_key2>
            └── <hash_key3>

Where each <hash_keyx> file just contains the path to the relevant manifest.

Since we cache per pipeline name and the hash key should be unique across components, I think this will do, and it's very explicit.
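With this layout, the cache check reduces to a single existence test plus a small file read, with no globbing at all. A minimal sketch (function names and the exact cache path are assumptions, not the Fondant implementation; `fs` stands for any fsspec-style filesystem):

```python
def read_cached_manifest_path(fs, base_path, pipeline_name, hash_key):
    """Return the manifest path stored in cache/<hash_key>, or None if absent."""
    cache_file = f"{base_path}/pipelines/{pipeline_name}/cache/{hash_key}"
    if not fs.exists(cache_file):
        return None
    with fs.open(cache_file, "r") as f:
        return f.read().strip()


def write_cache_entry(fs, base_path, pipeline_name, hash_key, manifest_path):
    """Record the manifest path for this hash key, overwriting any old entry."""
    cache_file = f"{base_path}/pipelines/{pipeline_name}/cache/{hash_key}"
    with fs.open(cache_file, "w") as f:
        f.write(manifest_path)
```

On object stores this is a direct key lookup rather than a listing, which is why it avoids the slow glob behaviour described above.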

PhilippeMoussalli commented:

Alright, one other important consideration concerns the way we currently work with the dev tag. I think we should still overwrite the manifest reference in cache/hash_key in case it already exists and caching is set to False, mainly to circumvent the way we work with dev tags. This explains it a bit better:

[image attachment]

RobbeSneyders commented:

Yes, agreed. The cache parameter should only define whether the component checks if it can reuse a cached execution, not whether it writes cache information for the future.
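Putting the two comments above together, the control flow could look roughly like this (an illustrative sketch, not the actual Fondant code; `run_component` stands for whatever executes the component and returns its manifest path):

```python
def resolve_execution(fs, base_path, pipeline_name, hash_key, cache_enabled, run_component):
    """Run a component, or reuse a cached execution.

    The cache flag only gates the read side; the cache entry is always
    (over)written so future runs can pick it up.
    """
    cache_file = f"{base_path}/pipelines/{pipeline_name}/cache/{hash_key}"
    if cache_enabled and fs.exists(cache_file):
        with fs.open(cache_file, "r") as f:
            return f.read()  # reuse the manifest of the previous execution
    manifest_path = run_component()  # execute the component
    with fs.open(cache_file, "w") as f:
        f.write(manifest_path)  # always record the new manifest, even with caching off
    return manifest_path
```

Running with caching disabled thus refreshes the cache entry, which handles the dev-tag case where a re-pushed image should replace the stale reference.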

PhilippeMoussalli added a commit that referenced this issue Aug 31, 2023
PR that standardizes fsspec file access to use the `AbstractFileSystem`
class for all filesystem-related functionality.

Previously we were using the `AbstractFileSystem` and custom functions like
`fs_open()` to read/write/list the manifest(s). This caused some issues,
since `fs_open` expects the protocol of the filesystem to be present in
the URL. However, files returned by `AbstractFileSystem().ls` use a
plain filesystem URL; when passed to `fs_open()`, this caused it to fail
to read the file properly.

Still some issues with using glob patterns. More details here:
#368
@RobbeSneyders RobbeSneyders added the Core Core framework label Aug 31, 2023
This was referenced Aug 31, 2023
PhilippeMoussalli added a commit that referenced this issue Sep 5, 2023
PR that updates the way we check for cached executions. The previous
proposal was to use glob patterns to check for a manifest under a
certain directory, but this was too slow, since glob patterns behave
differently on remote filesystems, where they scan all nested
directories. More details about the issue
[here](#368 (comment)).

This PR changes this by writing a reference to a manifest file in a
folder separate from the subsets.

```
base_path/
└── pipelines/
    └── <pipeline_name>/
        ├── <pipeline_id>/
        │   ├── component_1/
        │   │   ├── manifest.json
        │   │   └── subsets/
        │   │       ├── <subset_1>/
        │   │       │   └── part.0.parquet
        │   │       └── <subset_2>/
        │   │           └── part.0.parquet
        │   └── component_2/
        │       └── manifest.json
        └── cache/
            ├── <hash_key1>
            ├── <hash_key2>
            └── <hash_key3>
```

As a consequence, we also reverted to using basic fsspec functionality,
which makes it easier to resolve files without having to specify
explicitly which protocol they belong to. Closing [this
ticket](#402) in favor of this one.

Tested with all combinations (remote/local runner, base path).
This was referenced Sep 6, 2023
PhilippeMoussalli added a commit that referenced this issue Sep 15, 2023
First review  #427

PR that updates the interface of the data explorer 

![image](https://github.com/ml6team/fondant/assets/47530815/964f640a-af24-46a2-b595-e1a24f168ab1)

Notes: 
* Right now the base path has a mix of subsets and fields. Related to #244
* The current base path contains a mix of subsets and pipelines due to
the old way of organizing. In the future it should only include pipeline
names. #368 (comment)

Perhaps we can do some cleanup of our current base path and remove the
unnecessary subsets:

```
gs://soy-audio-379412_kfp-artifacts/custom_artifact
```
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023
Hakimovich99 pushed a commit that referenced this issue Oct 16, 2023