[LLM pipeline] Add normalize text component #246

mrchtr · 2023-06-29T11:37:04Z

Component that applies several text normalization steps (NFC normalization, lowercasing, and regex pattern replacements).

This component is needed for the LLM dataset creation pipeline.
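For readers unfamiliar with the three steps named in the description, here is a minimal sketch of how they might be chained using only the Python standard library. It is not the code from this PR: the function name `normalize_text`, the `lowercase` flag, and the whitespace-collapsing pattern are all assumptions chosen for illustration, since the actual patterns and component arguments are not shown in this thread.

```python
import re
import unicodedata

# Hypothetical replacement list for illustration only; the patterns used by
# the actual component are not shown in this conversation.
DEFAULT_REPLACEMENTS = [
    (re.compile(r"\s+"), " "),  # collapse any run of whitespace into one space
]


def normalize_text(text, lowercase=True, replacements=DEFAULT_REPLACEMENTS):
    """Apply NFC normalization, optional lowercasing, and regex replacements."""
    # NFC composes base characters with combining marks,
    # e.g. "e" + U+0301 becomes the single code point U+00E9.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    # Apply each (compiled pattern, replacement) pair in order.
    for pattern, repl in replacements:
        text = pattern.sub(repl, text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_text("Cafe\u0301  AU\tLait "))
```

In a dataframe-based component the same function could presumably be applied per row to the text column, but how this PR wires it into `PandasTransformComponent` is not visible here.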

NielsRogge · 2023-06-29T13:36:55Z

components/text_normalization/fondant_component.yaml

@@ -0,0 +1,20 @@
+name: Normalize text.
+description: A component that normalizes text.
+image: ghcr.io/ml6team/text_normalization:latest


Suggested change

-image: ghcr.io/ml6team/text_normalization:latest
+image: ghcr.io/ml6team/normalize_text:latest

As mentioned in chat, I agree with consistent naming, but I would actually prefer text_normalization here, since it groups all text components together alphabetically.

Hmm, I don't necessarily agree on that one. Can't we group them ourselves in the docs?

I would prefer verb + noun names for all our components.

components/text_normalization/requirements.txt

RobbeSneyders

Thanks @mrchtr! One small change is needed to the logging, otherwise LGTM.

RobbeSneyders · 2023-07-03T09:01:59Z

components/text_normalization/fondant_component.yaml

@@ -0,0 +1,20 @@
+name: Normalize text.
+description: A component that normalizes text.
+image: ghcr.io/ml6team/text_normalization:latest


As mentioned in chat, I agree with consistent naming, but I would actually prefer text_normalization here, since it groups all text components together alphabetically.

components/text_normalization/requirements.txt

RobbeSneyders · 2023-07-03T09:02:38Z

components/text_normalization/src/main.py

+from fondant.component import PandasTransformComponent
+from fondant.logger import configure_logging
+
+configure_logging()


This is outdated, please rebase on / merge with main.

RobbeSneyders

Thanks @mrchtr!


Add text normalization component

3623c3f

mrchtr requested a review from NielsRogge June 29, 2023 11:37

mrchtr added 2 commits June 29, 2023 13:37

Merge branch 'main' into feature/text_normalization_rebased

be9d1fc

Fixing ruff after merging main into feature branch

3fcd585

NielsRogge reviewed Jun 29, 2023

components/text_normalization/requirements.txt (resolved)

RobbeSneyders reviewed Jul 3, 2023


Removing configure_logger

794d7ae

RobbeSneyders approved these changes Jul 5, 2023


RobbeSneyders merged commit b544cd4 into ml6team:main Jul 5, 2023


mrchtr commented Jun 29, 2023

NielsRogge Jun 29, 2023

RobbeSneyders Jul 3, 2023

NielsRogge Jul 4, 2023

RobbeSneyders left a comment

RobbeSneyders Jul 3, 2023

RobbeSneyders Jul 3, 2023

RobbeSneyders left a comment
