[LLM pipeline] Language filter component #232
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
logger = logging.getLogger(__name__)


class LanguageIdentification:
Rather than including the ftz file, can we load the model from the Hub, since FastText is now hosted there?
Just:
import fasttext
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
model = fasttext.load_model(model_path)
What speaks against including the ftz file in the repository? Alternatively, we could download the file during the image build process. I just want to avoid the situation where the component fails at execution time because an external dependency cannot be reached.
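The two suggestions above could be combined: prefer a model file that is already present in the image, and only fall back to the Hub when it is missing. The following is a minimal sketch under assumptions, not the code in this PR. `resolve_model_path` and its `download` parameter are hypothetical names, and the bundled ftz file and the Hub's `model.bin` may be different models with different label sets.

```python
from pathlib import Path


def resolve_model_path(local_path, download=None):
    """Return the local model path if the file exists, else download it.

    `download` is a hypothetical zero-argument callable returning a path;
    it defaults to fetching the model from the Hugging Face Hub.
    """
    path = Path(local_path)
    if path.is_file():
        # Model was bundled in the repo or fetched at image build time:
        # no network access is needed at run time.
        return str(path)
    if download is None:
        # Imported lazily so the offline path has no hub dependency.
        from huggingface_hub import hf_hub_download

        def download():
            return hf_hub_download(
                repo_id="facebook/fasttext-language-identification",
                filename="model.bin",
            )
    return download()
```

The returned path can then be passed to `fasttext.load_model` as in the snippet above. With this layout, a network failure only affects runs where the file was not baked into the image.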
Thanks @mrchtr! Some small comments.
Interesting to see these tests.
This could probably also be easier if we split the general component behavior from the user implementation into separate classes, as discussed in chat. We could then test the user implementation without having to provide dummy variables for all the general component behavior.
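To illustrate the kind of split described here, the sketch below keeps the user-level logic in a small class that knows nothing about the general component machinery, so a test needs only a stub detector. The chat discussion is not part of this thread, so the class name `LanguageFilter`, its `detect` parameter, and the test are all hypothetical.

```python
class LanguageFilter:
    """User implementation only: no I/O, arguments parsing, or plumbing."""

    def __init__(self, detect, language):
        # `detect` is any callable mapping a text to a language code,
        # e.g. a thin wrapper around a fastText model in production.
        self.detect = detect
        self.language = language

    def keep(self, text):
        """Return True if the detected language matches the target."""
        return self.detect(text) == self.language


def test_keep_only_matching_language():
    # A dict lookup stands in for the real model; no dummy component
    # configuration is required.
    stub_detect = {"hello": "en", "hallo": "de"}.get
    language_filter = LanguageFilter(detect=stub_detect, language="de")
    assert language_filter.keep("hallo")
    assert not language_filter.keep("hello")
```

The general component class would then wrap an instance of this object and handle loading and writing data, which can be tested separately.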
Co-authored-by: Robbe Sneyders <robbe.sneyders@gmail.com>
This PR adds the first component for the LLM dataset creation pipeline. The component is a language filter that removes rows from a provided dataframe that do not match the provided language. FastText is used for language detection.
Changes
- add component
- add unit test for the filter logic inside the component
Note: Did not create a pipeline that uses this component yet.
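The filter logic described above might look roughly like the following. This is a sketch under assumptions rather than the component's actual code: `normalize_label`, `language_mask`, and the `predict_label` callable are hypothetical names. With a real fastText model, `predict_label` could be `lambda t: model.predict(t)[0][0]`, and the label format depends on the model used (for example `__label__en` or `__label__eng_Latn`).

```python
def normalize_label(label):
    """Strip fastText's "__label__" prefix, e.g. "__label__en" -> "en"."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def language_mask(texts, predict_label, language):
    """Return one boolean per text: True if it is in `language`.

    `predict_label` is a hypothetical callable that returns the top
    fastText label for a single line of text.
    """
    mask = []
    for text in texts:
        # fastText predicts one line at a time and rejects newlines,
        # so collapse all whitespace into single spaces first.
        cleaned = " ".join(str(text).split())
        mask.append(normalize_label(predict_label(cleaned)) == language)
    return mask
```

For a pandas dataframe with a text column, the resulting mask could be applied with something like `df.loc[language_mask(df["text"], predict_label, "en")]`, where the column name `text` is again an assumption.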