Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Starcoder example pipeline + base components #175

Merged
merged 12 commits into from
Jun 14, 2023

Conversation

NielsRogge
Copy link
Contributor

This PR:

  • adds a set of components aimed to process text/code datasets to the component registry. These components can be used to 1) filter code based on code to comments ratio 2) filter code based on line length 3) detect and redact (replace) PII or personal identifiable information from code.
  • adds an example pipeline that adds one specific component called load_from_hub_stack that loads a code dataset from the hub. Next, it uses the 3 components in sequence to process and filter the code dataset.

components/filter_comments/src/utils/text_extraction.py Outdated Show resolved Hide resolved
components/filter_metadata/README.md Outdated Show resolved Hide resolved
components/filter_metadata/fondant_component.yaml Outdated Show resolved Hide resolved
components/pii_redaction/src/main.py Outdated Show resolved Hide resolved
components/pii_redaction/src/pii_redaction.py Show resolved Hide resolved
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niels!

Left a few comments, I think there are a few things still GCP specific that need to be removed.

I;m guessing this code and components will only work for our personally created ML6 version of the stack dataset (loading it from json and dumping it to a parquet under a new dataset repo). What if the user wants to work with the big stack datasets? Can we generalize well with those components?

components/filter_comments/fondant_component.yaml Outdated Show resolved Hide resolved
components/pii_redaction/build_image.sh Outdated Show resolved Hide resolved
from fondant.component import TransformComponent
from fondant.logger import configure_logging

from pii_detection import scan_pii
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from pii_detection import scan_pii
from pii_detection import scan_pii, redact_pii

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify this one? I'm importing redact_pii in the line below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a matter of personal preference but I think both are valid :)

examples/pipelines/starcoder/components/pii_redaction Outdated Show resolved Hide resolved
)
result.columns = ["code_secrets", "code_has_secrets", "code_number_secrets"]

dataframe = dataframe.merge(result, left_index=True, right_index=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we merging? can't you extend the original dataframe with some columns in the apply() method since we're working with the same subset

Copy link
Contributor Author

@NielsRogge NielsRogge Jun 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are merging since Dask does not support multiple columns assignment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, @RobbeSneyders is that something we could resolve when we move to the pandas interface?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in pandas this works.

replacements=replacements,
),
axis=1,
meta=(None, "str"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the first argument None? usually it's an index

IP addresses: replace with one of n synthetic private IP addresses (IPv4 or IPv6)
Keys: replace with one of n [sequence of 32 random characters/digits]

TODO: add IPv6 and IPv4 separation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that for us or part of the source code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was taken from the BigCode project

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not for this project but it would be interesting to have a way of working on updating components like this where the source code is taken from another repo. Right now it's just a snapshot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be ideal if this was a library.

examples/pipelines/starcoder/components/filter_comments Outdated Show resolved Hide resolved
examples/pipelines/starcoder/components/filter_line_length Outdated Show resolved Hide resolved
Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NielsRogge! Left some comments

In general:

  • Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldnt'.
  • I would remove anything related to the run_locally script and symlinks. This will be replaced by the local runner.
  • This PR has some conflicts with Custom component spec #191. We'll need to make sure everything is updated after merging both.

examples/pipelines/starcoder/components/filter_comments Outdated Show resolved Hide resolved
examples/pipelines/starcoder/run_locally.sh Outdated Show resolved Hide resolved
components/pii_redaction/src/main.py Outdated Show resolved Hide resolved
)
result.columns = ["code_secrets", "code_has_secrets", "code_number_secrets"]

dataframe = dataframe.merge(result, left_index=True, right_index=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in pandas this works.

IP addresses: replace with one of n synthetic private IP addresses (IPv4 or IPv6)
Keys: replace with one of n [sequence of 32 random characters/digits]

TODO: add IPv6 and IPv4 separation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be ideal if this was a library.

examples/pipelines/starcoder/README.md Outdated Show resolved Hide resolved
@RobbeSneyders
Copy link
Member

And can you also update the readme to include these components in the list?

@RobbeSneyders
Copy link
Member

#191 is merged, so we should make the necessary updates here.

@NielsRogge
Copy link
Contributor Author

@RobbeSneyders I've made the updates

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NielsRogge.

What do you think of my previous proposal:

Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldnt'.

And left one more comment.

components/pii_redaction/src/gibberish_data/big.txt Outdated Show resolved Hide resolved
@NielsRogge
Copy link
Contributor Author

Can we add a code_ prefix to the names of these components? Or would they work out of the box on any text? I guess at least the filter_comments component wouldnt'.

The pii redaction component would work on text as well, but the filter comments and filter metadata components are meant to be run on code.

@PhilippeMoussalli PhilippeMoussalli merged commit bcdea6a into main Jun 14, 2023
@PhilippeMoussalli PhilippeMoussalli deleted the add_text_components branch June 14, 2023 12:48
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR:

- adds a set of components aimed to process text/code datasets to the
component registry. These components can be used to 1) filter code based
on code to comments ratio 2) filter code based on line length 3) detect
and redact (replace) PII or personal identifiable information from code.
- adds an example pipeline that adds one specific component called
`load_from_hub_stack` that loads a code dataset from the hub. Next, it
uses the 3 components in sequence to process and filter the code
dataset.

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants