Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add code for reusable load from files component #290 #296

Merged
merged 29 commits into from
Jul 26, 2023

Conversation

satishjasthi
Copy link
Contributor

This PR contains code for load from files component related to #290

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @satishjasthi! Great to see you already tackled some different compression formats as well.

Some high level comments:

  • Can you convert every docstring to the google format?
  • Could you add typing information to the function signature instead of the docstring
  • Could you add some unit tests for this component? You can create a tests subdirectory in your component directory.

I left some more detailed comments inline.

Did you already test the component, and if so, how did you test it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These dependencies are installed with fondant, so we don't need to include them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, then ill remove the requirements.txt file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RobbeSneyders For the test cases in the new commits I have used pytest and purest-mock should I add them to the requirements.txt?

logger = logging.getLogger(__name__)


def decode_bytes(bytes):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file contents are produced as bytes according to the component specification, so I don't think we need to decode the file contents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya that's better.

directory_path:
description: Local or remote path to the directory containing the files
type: str
fs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to split this instead of including this in the directory_path? I think it can be quite elegant for the user to just provide the gs:// or s3:// urls as path. I would then propose to rename directory_path to directory_uri though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that makes sense. that is much better and one single argument can capture both information

Comment on lines 202 to 207
if filepath.endswith(".gz"):
return GzipFileHandler(filepath=filepath, fs=fs)
if filepath.endswith(".zip"):
return ZipFileHandler(filepath=filepath, fs=fs)
if filepath.endswith(".tar"):
return TarFileHandler(filepath=filepath, fs=fs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will process .tar.gz files using the GzipFileHandler, while I think it should be the TarFileHandler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's right I need to update the pattern with .tar and .tar.gz

filenames = self.fs.glob(os.path.join(self.filepath, "*"))
for filename in filenames:
with self.fs.open(filename, "r") as f:
yield filename.split("/")[-1], decode_bytes(f.read())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check the extensions again for each file to see if we should extract / decompress them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be safer, because if a zip file has zip files inside then this would break. Ill update the code to check at file level too

Comment on lines 261 to 262
if record["Content"].dtype != "object":
record["Content"] = record["Content"].astype("str")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you testing for object here? and casting to str? I think we want bytes here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because I was storing them as string in the dask data frame. But this will be changed as Ill be converting the file content to bytes format itself

content:
description: content of the file in binary form
type: binary
id:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to include the id field here, the index is handled separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok Cool, Ill remove that


# Use the delayed objects to create a Dask DataFrame. Dask knows how to handle these and
# will efficiently compute them when needed.
dataframe = dd.from_delayed(records, meta=metadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this creates a partition for each record (docs). So we should probably repartition the resulting dataframe.

CC: @PhilippeMoussalli

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh is it, Ill look into it and refractor it

@satishjasthi
Copy link
Contributor Author

satishjasthi commented Jul 17, 2023

Thanks for patience and such detailed reviews. I'll do those changes and push. I couldn't test the component locally but I tested its core components like get_file_handler and FilesToDaskConverter on a local zip file. In fact, I wanted to ask you how can I test the component using docker on my local machine

@RobbeSneyders
Copy link
Member

Have you followed our Getting Started tutorial? It should help you run a pipeline locally. You can then create a pipeline with your component to test it the same way.

RobbeSneyders and others added 13 commits July 22, 2023 00:47
This PR adds a `post-build.sh` script which reverts the actions of the
`pre-build.sh` script and runs it after testing with tox to revert the
original state of the src/fondant/components symlink.
Fixes ml6team#251, ml6team#252, ml6team#253

Before this PR, the operation for a generic component (a reusable
component that dynamically takes a component specification), needed to
be created as follows:

```python
component_op = ComponentOp.from_registry(
    name="generic_component",
    component_spec_path="components/custom_generic_component/fondant_component.yaml",
)
```

But the provided name wasn't actually used, since the component
specification already contains a reference to the reusable image that it
should use. Now we can define both custom and generic components as
follows:

```python
component_op = ComponentOp(
    component_dir="components/custom_generic_component",
)
```

There is still a difference in how we want to handle custom and generic
components though. Custom components should be built by the local
runner, while generic components should not, since they use a reusable
image. To make this differentiation, we now simply check if a
`Dockerfile` is present in the provided `Component_dir`. This will be
the case for a custom component, but not for a generic component.

---

This has a nice implicit result for reusable components as well, which
can still be defined as:

```python
component_op = ComponentOp.from_registry(
    name="reusable_component",
)
```

Which `fondant` now resolves to:
```python
component_op = ComponentOp(
    component_dir="{fondant_install_dir}/components/custom_generic_component",
)
```

I added a change to this PR which no longer packages the reusable
component code with the `fondant` package, but only the component
specifications, as those are all that is needed since they contain a
reference to the reusable image on the registry. This means that the
`component_dir` above doesn't contain a `Dockerfile` when `fondant` is
installed from PyPI, but does when you locally install `fondant` using
`poetry install`. So the local runner doesn't build reusable components
when users install `fondant` from PyPI, but it does when you're working
on the `fondant` repo, which is useful for us `fondant` developers.

---

The only thing we still need is an option on the runner to provide
`build_args`, so we can pass in the `fondant` version to build into the
image. But I'll open a separate PR for that.
This component generates MinHashes of text. 
The MinHash similarity will be used to determine duplicated text
passages.

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
This PR follows up on the PoC presented in ml6team#268

---

Fixes ml6team#257 

It splits the implementation and execution of components, this has some
advantages:

- Pandas components can use `__init__` instead of setup, which is
probably more familiar to users
- Other components can use `__init__` as well instead of receiving all
arguments to their transform or equivalent method, aligning
implementation of different component types
- Component implementation and execution should be easier to test
separately

I borrowed the executor terminology from KfP.

---

Fixes ml6team#203 

Since I had to update all the components, I also switched some of them
to subclass `PandasTransformComponent` instead of
`DaskTransformComponent`.

---

These changes open some opportunities for additional improvements, but I
propose to tackle those as separate PRs as this PR is already quite huge
due to all the changes to the components.

- [ ] ml6team#300
- [ ] ml6team#301
Fixes default values set at `0` that were wrongly evaluated to `None`
I have tested `fondant explore` by using local manifest on a M1.
Therefore, I ran into some issues.

- I updated the data explorer Dockerfile according to this solution
proposal
(streamlit/streamlit#4842 (comment))
which allows the execution of the streamlit container on a M1

- If a user has not named the local data directory as `artifacts`, the
local path can not be resolved, and the data explorer can not load the
parquet files. Therefore, I propose keeping the original name of the
data directory when mounting it to the Docker container.

- Some smaller typo fixes of the data explorer documentation

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Fixes ml6team#305 

Can be used to test a pipeline with fondant installed from a specific
git hash available on github:
```bash
fondant run pipeline:pipeline --local --build-arg FONDANT_VERSION=551f7b1
```
During the last fix I have incorporate a new bug in mounting the correct
path to the docker container.
I have adapted the implementation. The path in `fondant explore
--data-directory <path>` can also be an relative path now.
The path will be resolved to an absolute path. Inside the docker
container the folder will be located in the root folder.

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
…eam#282)

This 3rd component extracts the image url, alt text and license metadata
from the webpage url and html code.
@satishjasthi
Copy link
Contributor Author

@RobbeSneyders, I have update the component with all suggested changes. Please review it

@GeorgesLorre
Copy link
Collaborator

Did you manage to run your component in a pipeline locally? (using the local runner)

@satishjasthi
Copy link
Contributor Author

Do you have a Dockerfile in ./components/load_from_files ?

No Should we create one? Because it is not mentioned in Docs

@GeorgesLorre
Copy link
Collaborator

Do you have a Dockerfile in ./components/load_from_files ?

No Should we create one? Because it is not mentioned in Docs

So normally the images already exist for reusable components (in your case ghcr.io/ml6team/load_from_files:dev) but this is not yet the case so for easy testing of your component I would treat it as a custom one:

So something like this: (you can

.
├── components
│   ├── load_from_files
│   │   ├── Dockerfile 
│   │   ├── fondant_component.yaml
│   │   ├── requirements.txt
│   │   └── src
│   │       └── main.py
└── pipeline.py

The local runner will then build the component locally and run it which allows you to test it once everything works we can abstract it to a reusable component.

@GeorgesLorre
Copy link
Collaborator

I'm also having a bit of difficulties reviewing your code since the files changed section contains a lot of changes that are already in our main branch. I'm not 100% sure why these changes are also displayed. Can you try updating your branch so only the real changes are visible?

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
https://stackoverflow.com/questions/16306012/github-pull-request-showing-commits-that-are-already-in-target-branch

Again I'm also unsure why this happens but I'm sure we can figure this out together.

@satishjasthi
Copy link
Contributor Author

I'm also having a bit of difficulties reviewing your code since the files changed section contains a lot of changes that are already in our main branch. I'm not 100% sure why these changes are also displayed. Can you try updating your branch so only the real changes are visible?

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork https://stackoverflow.com/questions/16306012/github-pull-request-showing-commits-that-are-already-in-target-branch

Again I'm also unsure why this happens but I'm sure we can figure this out together.

That's because I have been taking pull from main branch before I pushed the code, so that during merge it will easier without any issues since my branch will be uptodate with the new changes in main

@satishjasthi
Copy link
Contributor Author

I'm also having a bit of difficulties reviewing your code since the files changed section contains a lot of changes that are already in our main branch. I'm not 100% sure why these changes are also displayed. Can you try updating your branch so only the real changes are visible?
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork https://stackoverflow.com/questions/16306012/github-pull-request-showing-commits-that-are-already-in-target-branch
Again I'm also unsure why this happens but I'm sure we can figure this out together.

That's because I have been taking pull from main branch before I pushed the code, so that during merge it will easier without any issues since my branch will be uptodate with the new changes in main

Actual the last commit I pushed has been synched with upstream
image

@satishjasthi
Copy link
Contributor Author

Or Shall I do one thing I'll delete this PR and re-fork the repo and I'll submit the code related to this issue. That would make it easier for you to review the code

@satishjasthi
Copy link
Contributor Author

Do you have a Dockerfile in ./components/load_from_files ?

No Should we create one? Because it is not mentioned in Docs

So normally the images already exist for reusable components (in your case ghcr.io/ml6team/load_from_files:dev) but this is not yet the case so for easy testing of your component I would treat it as a custom one:

So something like this: (you can

.
├── components
│   ├── load_from_files
│   │   ├── Dockerfile 
│   │   ├── fondant_component.yaml
│   │   ├── requirements.txt
│   │   └── src
│   │       └── main.py
└── pipeline.py

The local runner will then build the component locally and run it which allows you to test it once everything works we can abstract it to a reusable component.

Is there a template to create docker or can I just use a simple Docker with python3.8-slim buster with fondant installed in it

@GeorgesLorre
Copy link
Collaborator

Is there a template to create docker or can I just use a simple Docker with python3.8-slim buster with fondant installed in it

You can use the files you already created from fondant/components/load_from_file

@satishjasthi
Copy link
Contributor Author

Hi @GeorgesLorre, I have created docker file and run it locally, Im getting this error
But as per the main branch there is DaskLoadComponent in fondant.component. Doesn't the latest fondant package have these changes?

                        Created0.1s 
 ! load_from_files The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s 
Attaching to my_pipeline-load_from_files-1
my_pipeline-load_from_files-1  | <jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
my_pipeline-load_from_files-1  | <jemalloc>: (This is the expected behaviour if you are running under QEMU)
my_pipeline-load_from_files-1  | Traceback (most recent call last):
my_pipeline-load_from_files-1  |   File "main.py", line 21, in <module>
my_pipeline-load_from_files-1  |     from fondant.component import DaskLoadComponent
my_pipeline-load_from_files-1  | ImportError: cannot import name 'DaskLoadComponent' from 'fondant.component' (/usr/local/lib/python3.8/site-packages/fondant/component.py)
my_pipeline-load_from_files-1 exited with code 1

@GeorgesLorre
Copy link
Collaborator

Hi @GeorgesLorre, I have created docker file and run it locally, Im getting this error But as per the main branch there is DaskLoadComponent in fondant.component. Doesn't the latest fondant package have these changes?

                        Created0.1s 
 ! load_from_files The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested 0.0s 
Attaching to my_pipeline-load_from_files-1
my_pipeline-load_from_files-1  | <jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
my_pipeline-load_from_files-1  | <jemalloc>: (This is the expected behaviour if you are running under QEMU)
my_pipeline-load_from_files-1  | Traceback (most recent call last):
my_pipeline-load_from_files-1  |   File "main.py", line 21, in <module>
my_pipeline-load_from_files-1  |     from fondant.component import DaskLoadComponent
my_pipeline-load_from_files-1  | ImportError: cannot import name 'DaskLoadComponent' from 'fondant.component' (/usr/local/lib/python3.8/site-packages/fondant/component.py)
my_pipeline-load_from_files-1 exited with code 1

Hmm strange, can you check which Fondant version gets installed? You should have this in your Dockerfile:

ARG FONDANT_VERSION=main
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

Maybe there is some caching issue ?

@satishjasthi
Copy link
Contributor Author

satishjasthi commented Jul 25, 2023

@GeorgesLorre , I tried removing the Docker Image and container and re-ran
fondant run pipeline:my_pipeline --local but still getting the same error. And my docker file does point to main branch to pull latest version of fondant

fondant version in container
fondant @ git+https://github.com/ml6team/fondant@79df895e9d62d2010ccb8d40ee7e4fd4c68f117d
Which is the latest commit in main branch

Dockerfile

FROM --platform=linux/amd64 python:3.8-slim

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]

@GeorgesLorre
Copy link
Collaborator

Hmm, I'm not sure what the problem is then. Could you share the code you have for running it locally ?

@GeorgesLorre GeorgesLorre changed the base branch from main to add_dedup_component_bis July 25, 2023 08:51
@GeorgesLorre GeorgesLorre changed the base branch from add_dedup_component_bis to main July 25, 2023 08:51
@satishjasthi
Copy link
Contributor Author

satishjasthi commented Jul 25, 2023

Hmm, I'm not sure what the problem is then. Could you share the code you have for running it locally ?

@GeorgesLorre ,Sure, are you on fondant's discord?

@GeorgesLorre
Copy link
Collaborator

BTW: I changed the base of this PR to something random and then back to main which seems to have removed all the already merged changes

@GeorgesLorre
Copy link
Collaborator

Hmm, I'm not sure what the problem is then. Could you share the code you have for running it locally ?

@GeorgesLorre ,Sure, are you on fondant's discord?

Yes I am !

@satishjasthi
Copy link
Contributor Author

BTW: I changed the base of this PR to something random and then back to main which seems to have removed all the already merged changes

@GeorgesLorre, But usually fetching upstream and merging it with the working branch before raising PR is good practice right? So that PR code is in sync with main branch

@satishjasthi
Copy link
Contributor Author

Hmm, I'm not sure what the problem is then. Could you share the code you have for running it locally ?

@GeorgesLorre ,Sure, are you on fondant's discord?

Yes I am !

@GeorgesLorre, Cool Ill share you my local code there

@satishjasthi
Copy link
Contributor Author

satishjasthi commented Jul 25, 2023 via email

@@ -0,0 +1,295 @@
import gzip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that you have tests. They are however not ran in the current CI/CD pipeline since we only call pytest on the fondant/tests/ folder. This is something you could solve in your other PR.

I'm assuming you ran these manually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes , I ran it manually and @RobbeSneyders asked me to create tests within the components itself so I created within that same folder

@@ -0,0 +1,23 @@
FROM --platform=linux/amd64 python:3.8-slim
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a readme that detailes what this component does, what it supports and how to use it ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, Ill add Readme along with these details

Copy link
Contributor Author

@satishjasthi satishjasthi Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GeorgesLorre ,I have pushed README.md, please review it.

"directory_uri": "./data.zip", # change this to your
# directory_uri, remote or local
},
output_partition_size="10MB",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed, it was fixed in a recent bugfix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok Done

@GeorgesLorre GeorgesLorre merged commit 0abc597 into ml6team:main Jul 26, 2023
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR contains code for load from files component related to #290

---------

Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
Co-authored-by: Sharon Grundmann <sharon.grundmann@ml6.eu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants