Add code for reusable load from files component #290 #296
Conversation
Thanks @satishjasthi! Great to see you already tackled some different compression formats as well.
Some high level comments:
- Can you convert every docstring to the Google format? (see the short sketch after this comment)
- Could you add typing information to the function signatures instead of the docstrings?
- Could you add some unit tests for this component? You can create a `tests` subdirectory in your component directory.
I left some more detailed comments inline.
Did you already test the component, and if so, how did you test it?
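A minimal sketch of the requested style, using a hypothetical `load_files` helper just to illustrate a Google-format docstring combined with a typed signature:

```python
from typing import Iterator, Tuple


def load_files(directory_uri: str) -> Iterator[Tuple[str, bytes]]:
    """Load all files under a directory and yield their contents.

    Args:
        directory_uri: Local or remote URI of the directory containing the files.

    Yields:
        Tuples of (file name, file content as raw bytes).
    """
    ...
```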
These dependencies are installed with fondant, so we don't need to include them here.
OK, then I'll remove the requirements.txt file.
@RobbeSneyders For the test cases in the new commits I have used pytest and pytest-mock. Should I add them to requirements.txt?
```python
logger = logging.getLogger(__name__)


def decode_bytes(bytes):
```
The file contents are produced as bytes according to the component specification, so I don't think we need to decode the file contents?
Yeah, that's better.
```yaml
directory_path:
  description: Local or remote path to the directory containing the files
  type: str
fs:
```
Is there a reason to split this instead of including it in the `directory_path`? I think it can be quite elegant for the user to just provide the `gs://` or `s3://` URLs as the path. I would then propose to rename `directory_path` to `directory_uri` though.
Yup, that makes sense. That is much better, and a single argument can capture both pieces of information.
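A rough sketch of how a single `directory_uri` argument could cover both, assuming `fsspec` is used to resolve the filesystem from the protocol prefix (the helper name `resolve_directory` is illustrative, not part of the PR):

```python
from fsspec.core import url_to_fs


def resolve_directory(directory_uri: str):
    """Resolve an fsspec filesystem and a plain path from a single URI.

    Works for local paths as well as gs:// or s3:// URIs, because fsspec
    picks the filesystem implementation from the protocol prefix
    (cloud protocols need the matching extra installed, e.g. gcsfs or s3fs).
    """
    fs, path = url_to_fs(directory_uri)
    return fs, path


# e.g. resolve_directory("gs://my-bucket/raw-data/")
#      resolve_directory("./local-data")
```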
```python
if filepath.endswith(".gz"):
    return GzipFileHandler(filepath=filepath, fs=fs)
if filepath.endswith(".zip"):
    return ZipFileHandler(filepath=filepath, fs=fs)
if filepath.endswith(".tar"):
    return TarFileHandler(filepath=filepath, fs=fs)
```
This code will process `.tar.gz` files using the `GzipFileHandler`, while I think it should be the `TarFileHandler`.
Yes, that's right. I need to update the pattern to handle `.tar` and `.tar.gz`.
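A possible fix is to check the more specific suffix first; this is only a sketch, reusing the handler classes from this PR:

```python
def get_file_handler(filepath: str, fs):
    """Pick an archive handler based on the file extension.

    `.tar.gz` (and plain `.tar`) must be matched before the generic `.gz`
    check, otherwise gzipped tarballs end up in the gzip handler.
    """
    if filepath.endswith((".tar", ".tar.gz")):
        return TarFileHandler(filepath=filepath, fs=fs)
    if filepath.endswith(".gz"):
        return GzipFileHandler(filepath=filepath, fs=fs)
    if filepath.endswith(".zip"):
        return ZipFileHandler(filepath=filepath, fs=fs)
    raise ValueError(f"Unsupported file format: {filepath}")
```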
```python
filenames = self.fs.glob(os.path.join(self.filepath, "*"))
for filename in filenames:
    with self.fs.open(filename, "r") as f:
        yield filename.split("/")[-1], decode_bytes(f.read())
```
Should we check the extensions again for each file to see if we should extract / decompress them?
That would be safer, because if a zip file has zip files inside, this would break. I'll update the code to check at the file level too.
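A sketch of what a file-level check could look like inside the iteration; `get_file_handler` is the dispatch function from this PR, while the `read()` method on the returned handler and the suffix tuple are assumptions for illustration:

```python
import os

ARCHIVE_SUFFIXES = (".zip", ".gz", ".tar", ".tar.gz")


def iterate_entries(fs, filepath):
    """Yield (file name, raw bytes), recursing into nested archives."""
    for filename in fs.glob(os.path.join(filepath, "*")):
        if filename.endswith(ARCHIVE_SUFFIXES):
            # Nested archive: delegate to the matching handler instead of
            # yielding the compressed bytes as-is.
            yield from get_file_handler(filename, fs=fs).read()
        else:
            with fs.open(filename, "rb") as f:
                yield filename.split("/")[-1], f.read()
```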
```python
if record["Content"].dtype != "object":
    record["Content"] = record["Content"].astype("str")
```
Why are you testing for `object` here, and casting to `str`? I think we want `bytes` here.
That's because I was storing them as strings in the Dask dataframe. But this will change, as I'll be converting the file content to bytes itself.
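A small sketch of keeping the content as raw bytes end to end, which is what the `binary` type in the component spec maps to (column names are illustrative; bytes values simply live in an object-dtype column, so no cast is needed):

```python
import pandas as pd


def to_dataframe(entries):
    """Build a pandas dataframe with raw bytes in the Content column."""
    names, contents = zip(*entries)
    return pd.DataFrame({"filename": list(names), "Content": list(contents)})


df = to_dataframe([("a.txt", b"hello"), ("b.txt", b"world")])
assert df["Content"].dtype == "object"
assert isinstance(df["Content"].iloc[0], bytes)
```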
```yaml
content:
  description: content of the file in binary form
  type: binary
id:
```
We don't need to include the `id` field here; the index is handled separately.
OK, cool. I'll remove that.
```python
# Use the delayed objects to create a Dask DataFrame. Dask knows how to handle these and
# will efficiently compute them when needed.
dataframe = dd.from_delayed(records, meta=metadata)
```
I think this creates a partition for each record (docs). So we should probably repartition the resulting dataframe.
Oh, is it? I'll look into it and refactor it.
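A sketch of the repartitioning step; `records` and `metadata` are the variables from the snippet above, and the `"250MB"` target is just an example value:

```python
import dask.dataframe as dd

# from_delayed creates one partition per delayed object, i.e. one per record,
# so collapse them into reasonably sized partitions afterwards.
dataframe = dd.from_delayed(records, meta=metadata)
dataframe = dataframe.repartition(partition_size="250MB")
```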
Thanks for your patience and such detailed reviews. I'll make those changes and push. I couldn't test the full component locally, but I tested its core pieces like `get_file_handler` and `FilesToDaskConverter` on a local zip file. In fact, I wanted to ask you: how can I test the component using Docker on my local machine?
Have you followed our Getting Started tutorial? It should help you run a pipeline locally. You can then create a pipeline with your component to test it the same way.
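Not an exact recipe, but roughly the shape of a local test pipeline. The `ComponentOp` usage matches snippets elsewhere in this thread; the import path, the `Pipeline` constructor arguments, and `add_op` are assumptions, so check the Getting Started guide for the exact API of your fondant version:

```python
from fondant.pipeline import ComponentOp, Pipeline

# Hypothetical wiring of the component under review into a local pipeline.
pipeline = Pipeline(
    pipeline_name="load-from-files-test",
    base_path="./artifacts",
)

load_op = ComponentOp(
    component_dir="components/load_from_files",
    arguments={"directory_uri": "./data.zip"},
)

pipeline.add_op(load_op)
```

The pipeline can then be run with the local runner, e.g. `fondant run pipeline:pipeline --local`, as shown further down in this thread.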
We use `fsspec` [directly](https://github.com/ml6team/fondant/blob/main/src/fondant/manifest.py#L10) in fondant, so we should add it as a dependency.
This PR adds a `post-build.sh` script which reverts the actions of the `pre-build.sh` script and runs it after testing with tox to revert the original state of the src/fondant/components symlink.
Fixes ml6team#251, ml6team#252, ml6team#253

Before this PR, the operation for a generic component (a reusable component that dynamically takes a component specification) needed to be created as follows:

```python
component_op = ComponentOp.from_registry(
    name="generic_component",
    component_spec_path="components/custom_generic_component/fondant_component.yaml",
)
```

But the provided name wasn't actually used, since the component specification already contains a reference to the reusable image that it should use. Now we can define both custom and generic components as follows:

```python
component_op = ComponentOp(
    component_dir="components/custom_generic_component",
)
```

There is still a difference in how we want to handle custom and generic components though. Custom components should be built by the local runner, while generic components should not, since they use a reusable image. To make this differentiation, we now simply check if a `Dockerfile` is present in the provided `component_dir`. This will be the case for a custom component, but not for a generic component.

---

This has a nice implicit result for reusable components as well, which can still be defined as:

```python
component_op = ComponentOp.from_registry(
    name="reusable_component",
)
```

Which `fondant` now resolves to:

```python
component_op = ComponentOp(
    component_dir="{fondant_install_dir}/components/custom_generic_component",
)
```

I added a change to this PR which no longer packages the reusable component code with the `fondant` package, but only the component specifications, as those are all that is needed since they contain a reference to the reusable image on the registry. This means that the `component_dir` above doesn't contain a `Dockerfile` when `fondant` is installed from PyPI, but does when you locally install `fondant` using `poetry install`. So the local runner doesn't build reusable components when users install `fondant` from PyPI, but it does when you're working on the `fondant` repo, which is useful for us `fondant` developers.

---

The only thing we still need is an option on the runner to provide `build_args`, so we can pass in the `fondant` version to build into the image. But I'll open a separate PR for that.
This component generates MinHashes of text. The MinHash similarity will be used to determine duplicated text passages. --------- Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
This PR follows up on the PoC presented in ml6team#268

---

Fixes ml6team#257

It splits the implementation and execution of components, which has some advantages:
- Pandas components can use `__init__` instead of setup, which is probably more familiar to users
- Other components can use `__init__` as well instead of receiving all arguments to their transform or equivalent method, aligning the implementation of different component types
- Component implementation and execution should be easier to test separately

I borrowed the executor terminology from KfP.

---

Fixes ml6team#203

Since I had to update all the components, I also switched some of them to subclass `PandasTransformComponent` instead of `DaskTransformComponent`.

---

These changes open some opportunities for additional improvements, but I propose to tackle those as separate PRs as this PR is already quite huge due to all the changes to the components.
- [ ] ml6team#300
- [ ] ml6team#301
Fixes default values set at `0` that were wrongly evaluated to `None`
I have tested `fondant explore` using a local manifest on an M1 and ran into some issues.
- I updated the data explorer Dockerfile according to this solution proposal (streamlit/streamlit#4842 (comment)), which allows the streamlit container to run on an M1.
- If a user has not named the local data directory `artifacts`, the local path can not be resolved and the data explorer can not load the parquet files. Therefore, I propose keeping the original name of the data directory when mounting it to the Docker container.
- Some smaller typo fixes of the data explorer documentation.

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Fixes ml6team#305

Can be used to test a pipeline with fondant installed from a specific git hash available on GitHub:

```bash
fondant run pipeline:pipeline --local --build-arg FONDANT_VERSION=551f7b1
```
During the last fix I introduced a new bug in mounting the correct path to the Docker container. I have adapted the implementation. The path in `fondant explore --data-directory <path>` can now also be a relative path; it will be resolved to an absolute path. Inside the Docker container the folder will be located in the root folder.

---------

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
…eam#282) This third component extracts the image URL, alt text, and license metadata from the webpage URL and HTML code.
@RobbeSneyders, I have updated the component with all suggested changes. Please review it.
Did you manage to run your component in a pipeline locally (using the local runner)?
No. Should we create one? Because it is not mentioned in the docs.
So normally the images already exist for reusable components (in your case […]). So something like this (you can […]):
The local runner will then build the component locally and run it, which allows you to test it. Once everything works, we can abstract it to a reusable component.
I'm also having a bit of difficulty reviewing your code, since the […]. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork Again, I'm unsure why this happens, but I'm sure we can figure this out together.
That's because I have been pulling from the main branch before pushing the code, so that the merge will be easier, since my branch is up to date with the new changes in main.
Actually, the last commit I pushed has been synced with upstream.
Or shall I do one thing: I'll delete this PR, re-fork the repo, and submit the code related to this issue again. That would make it easier for you to review the code.
Is there a template for creating the Dockerfile, or can I just use a simple Docker image based on python:3.8-slim-buster with fondant installed in it?
You can use the files you already created from […].
Hi @GeorgesLorre, I have created the Dockerfile and ran it locally. I'm getting this error: […]
Hmm, strange. Can you check which fondant version gets installed? You should have this in your Dockerfile:
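Presumably the snippet meant here is the fondant install pair from the Dockerfile shared below:

```dockerfile
ARG FONDANT_VERSION=main
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}
```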
Maybe there is some caching issue?
@GeorgesLorre, I tried removing the Docker image and container and re-ran the fondant version check in the container.

Dockerfile:

```dockerfile
FROM --platform=linux/amd64 python:3.8-slim

# System dependencies
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install git -y

# Install requirements
COPY requirements.txt /
RUN pip3 install --no-cache-dir -r requirements.txt

# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=main
RUN pip3 install fondant[aws,azure,gcp]@git+https://github.com/ml6team/fondant@${FONDANT_VERSION}

# Set the working directory to the component folder
WORKDIR /component/src

# Copy over src-files
COPY src/ .

ENTRYPOINT ["python", "main.py"]
```
Hmm, I'm not sure what the problem is then. Could you share the code you have for running it locally?
@GeorgesLorre, sure. Are you on fondant's Discord?
BTW: I changed the base of this PR to something random and then back to main, which seems to have removed all the already-merged changes.
Yes I am!
@GeorgesLorre, but fetching upstream and merging it into the working branch before raising a PR is usually good practice, right? So that the PR code is in sync with the main branch.
@GeorgesLorre, cool, I'll share my local code with you there.
I have shared the code with you on Discord.
```python
import gzip
```
Nice that you have tests. They are however not run in the current CI/CD pipeline, since we only call pytest on the `fondant/tests/` folder. This is something you could solve in your other PR.
I'm assuming you ran these manually.
Yes, I ran them manually. @RobbeSneyders asked me to create the tests within the component itself, so I created them in that same folder.
```dockerfile
FROM --platform=linux/amd64 python:3.8-slim
```
Could you add a README that details what this component does, what it supports, and how to use it?
Sure, I'll add a README with these details.
@GeorgesLorre, I have pushed the README.md, please review it.
"directory_uri": "./data.zip", # change this to your | ||
# directory_uri, remote or local | ||
}, | ||
output_partition_size="10MB", |
This can be removed; it was fixed in a recent bugfix.
OK, done.
This PR contains code for load from files component related to #290

---------

Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
Co-authored-by: Georges Lorré <35808396+GeorgesLorre@users.noreply.github.com>
Co-authored-by: Sharon Grundmann <sharon.grundmann@ml6.eu>