[LLM pipeline] Language filter component #232

Merged
merged 15 commits into from
Jul 5, 2023

Contributor

@mrchtr mrchtr commented Jun 23, 2023

This PR adds the first component for the LLM dataset creation pipeline. The component is a language filter that removes rows from a provided dataframe when they do not match the specified language.
FastText is used for the language detection.

Changes

  • add component
  • add unit test to test the filter logic inside of the component

Note: Did not create a pipeline that uses this component yet.
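
As a rough illustration of the filter logic described above, the sketch below keeps only the rows whose detected language matches the requested one. It is not the component's actual code: `filter_rows`, the `text` key, and the `detect` callable are hypothetical names, and a toy detector stands in for FastText so the example runs without a model file.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def filter_rows(rows: List[Row], language: str, detect: Callable[[str], str]) -> List[Row]:
    """Keep rows whose detected language code equals `language`.

    `detect` maps a text to a language code; in the real component this
    role would presumably be played by a FastText model (assumption).
    """
    return [row for row in rows if detect(row["text"]) == language]


def toy_detect(text: str) -> str:
    """Stand-in detector for illustration only, not a real language model."""
    german_markers = {"der", "die", "das", "und", "ist"}
    words = set(text.lower().split())
    return "de" if words & german_markers else "en"


rows = [
    {"text": "Das ist ein Satz"},
    {"text": "This is a sentence"},
]
german_rows = filter_rows(rows, "de", toy_detect)
```

The same predicate could be applied to a dataframe column instead of a list of dictionaries; the list form is used here only to keep the sketch free of dependencies.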

mrchtr and others added 5 commits June 29, 2023 13:18
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
logger = logging.getLogger(__name__)


class LanguageIdentification:
Contributor

rather than including the ftz file, can we load from the hub since FastText is now hosted there?

just:

import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
model = fasttext.load_model(model_path)
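
Once loaded, the model could be queried roughly as follows. This is a sketch, not code from the PR: `predict_language` is a hypothetical helper, and it assumes that `predict` returns a tuple of labels and a sequence of probabilities, with labels prefixed by `__label__`. A stub model is used so the snippet runs without downloading anything.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"


def predict_language(model, text: str) -> Tuple[str, float]:
    """Return (label, probability) of the top prediction for `text`."""
    # fastText predicts one line at a time, so newlines are replaced first.
    single_line = text.replace("\n", " ")
    labels, probabilities = model.predict(single_line, k=1)
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(probabilities[0])


class StubModel:
    """Mimics the shape of fastText's predict() output for illustration."""

    def predict(self, text: str, k: int = 1):
        assert "\n" not in text
        return ("__label__eng_Latn",), [0.98]


label, probability = predict_language(StubModel(), "Hello\nworld")
```

With the real model, the object returned by `fasttext.load_model(model_path)` would be passed in place of `StubModel()`. The exact label format depends on the model that is loaded.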

Contributor Author

What speaks against including the ftz file in the repository? Alternatively, we could download the file during the image build process. I just want to avoid a situation where the component fails to execute because an external dependency cannot be reached.
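
One way to reconcile both positions would be to prefer a model file baked into the image and fall back to a download only when that file is missing. The sketch below is an assumption about how this could look, not what the PR does; `resolve_model_path`, `download_from_hub`, and the `model.ftz` file name are hypothetical.

```python
import os
import tempfile
from typing import Callable


def download_from_hub() -> str:
    """Fallback: fetch the model from the Hugging Face Hub (needs network)."""
    # Imported lazily so the local-file path works without this dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id="facebook/fasttext-language-identification",
        filename="model.bin",
    )


def resolve_model_path(local_path: str, fallback: Callable[[], str] = download_from_hub) -> str:
    """Return `local_path` if the file exists, else whatever `fallback` returns."""
    if os.path.isfile(local_path):
        return local_path
    return fallback()


# Demonstration with a temporary file standing in for a bundled model.
with tempfile.TemporaryDirectory() as tmp_dir:
    bundled = os.path.join(tmp_dir, "model.ftz")  # hypothetical file name
    missing = os.path.join(tmp_dir, "missing.ftz")
    open(bundled, "wb").close()
    found = resolve_model_path(bundled)
    fell_back = resolve_model_path(missing, fallback=lambda: "downloaded-path")
```

With this arrangement, the component does not touch the network at run time as long as the image build placed the file at the expected path.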

Member

@RobbeSneyders RobbeSneyders left a comment

Thanks @mrchtr! Some small comments.

components/language_filter/src/main.py (3 outdated review threads, resolved)
Member

Interesting to see these tests.

This could probably also be made easier by splitting the general component behavior and the user implementation into separate classes, as discussed in chat. We could then test the user implementation without having to provide dummy variables for all the general component behavior.
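
The split suggested here could look roughly like this. The class names and method signatures below are hypothetical and do not reflect the framework's actual API; the point is only that the user logic lives in a plain class that can be tested without any component plumbing.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


class LanguageFilter:
    """User implementation: pure filtering logic, no I/O or framework state."""

    def __init__(self, language: str, detect: Callable[[str], str]):
        self.language = language
        self.detect = detect

    def transform(self, rows: Iterable[Row]) -> List[Row]:
        return [row for row in rows if self.detect(row["text"]) == self.language]


class ComponentRunner:
    """General component behavior: loading input and writing output."""

    def __init__(self, load: Callable[[], Iterable[Row]], write: Callable[[List[Row]], None]):
        self.load = load
        self.write = write

    def run(self, implementation: LanguageFilter) -> None:
        self.write(implementation.transform(self.load()))


# The user logic is testable on its own, with a trivial stand-in detector.
language_filter = LanguageFilter("en", detect=lambda text: "en" if "the" in text.split() else "xx")
kept = language_filter.transform([{"text": "the cat"}, {"text": "le chat"}])

# The runner wires the same logic to (here, in-memory) input and output.
written: List[Row] = []
ComponentRunner(load=lambda: [{"text": "the dog"}], write=written.extend).run(language_filter)
```

A unit test for `LanguageFilter` then needs only a detector and a few rows, with no dummy arguments for input or output handling.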

mrchtr and others added 2 commits July 3, 2023 16:29
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
@PhilippeMoussalli PhilippeMoussalli added the Components Implementation of components label Jul 3, 2023
@PhilippeMoussalli PhilippeMoussalli self-assigned this Jul 3, 2023
@PhilippeMoussalli PhilippeMoussalli added this to the 0.2.0 milestone Jul 3, 2023
@PhilippeMoussalli PhilippeMoussalli linked an issue Jul 3, 2023 that may be closed by this pull request
@PhilippeMoussalli PhilippeMoussalli removed this from the 0.2.0 milestone Jul 3, 2023
@PhilippeMoussalli PhilippeMoussalli removed their assignment Jul 3, 2023
@PhilippeMoussalli PhilippeMoussalli removed the Components Implementation of components label Jul 3, 2023
@RobbeSneyders RobbeSneyders merged commit d06b9e0 into ml6team:main Jul 5, 2023
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Development

Successfully merging this pull request may close these issues.

Run Controlnet use case at scale with custom LAION backend
4 participants