Draft for getting-started-preprocessing #183
base: master
Just found a few small typos. Overall, I really like the tutorial 👍 . Things I'd change:
- the "stemming" part at the bottom (I think it's not from you) is not that nice, I'd probably redo that (although this should probably be moved to nlp.py anyways soon so maybe it'll just be left out here)
- I would maybe include 2-3 more preprocessing functions that are characteristic of the module so users can familiarize themselves with our function style (e.g. remove_whitespace, remove_html_tags, and definitely tokenize)
- Maybe drop lines 11-13 and call the first section "Overview" (there's the "--" in line 11, I'm not sure why?)
- The "Custom Pipelines" section belongs to clean (it's also just calling clean), so I'd move it under the clean section
- I think something like the following structure would be nice:
  - Overview
  - Key Functions
    - Clean
      - Custom Pipelines
    - Tokenize
  - Preprocessing API
    - Add quick examples with remove_whitespace / remove_html_tags so users know how to use the module. Then just what you already have here.
  - Recap
    - 3-sentence summary with inline code.
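To make the "quick examples" suggestion concrete, here is a minimal sketch of what the three functions named above do conceptually. It uses only Python's standard library and works on a single string; the function bodies are illustrative stand-ins written for this comment, not Texthero's implementation, and the real functions may differ in details such as which characters count as tokens.

```python
import html
import re


def remove_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def remove_html_tags(text):
    """Drop anything that looks like an HTML tag, then unescape entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical input, chosen only to exercise all three steps.
raw = "<p>Flame   on!</p>\n"
tidy = remove_whitespace(remove_html_tags(raw))
tokens = tokenize(tidy)

print(tidy)    # Flame on!
print(tokens)  # ['Flame', 'on', '!']
```

Each function takes text in and returns text (or tokens) out, which is the function style the tutorial should let users get familiar with.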
Great comments Henri, and good catches on the typos.
Hi Guys! Thank you Giovanni for the great start and Henri for the comments! Sorry for reviewing this so late!
As a general comment, I think we need to make it more technical and concise. The end goal of the getting-started preprocessing tutorial is to teach how to use Texthero to actually do text preprocessing. As we want to guide the user through Texthero's preprocessing core, it's important to show them how to actually do the stuff.
Giovanni, do you think you can start from the comment below, test the code in a Jupyter Notebook, and then write a getting-started tutorial around it? I didn't go into the details, to give you more freedom; if you want more advice or something is unclear, just let me know!
Kind regards,
(overview + what's important to keep in mind)
Preprocessing
Overview
Introduction to this new "chapter": mention what we have seen before + an introductory sentence about preprocessing ... something like: "By now you should have a general overview of what Texthero is about; in the next sections we will dig a bit deeper into Texthero's core and see what we can get out of our beautiful text data."
Preprocessing API
Link + introduction
Doing it right
Standard vs. custom pipeline (old key function)
Mention there is the FAQ
FAQ
Questions, mostly to improve SEO.
Text preprocessing, from zero to hero
Preprocessing is about data cleaning. Let's assume we got some dirty data we want to clean; in particular, we want to keep only relevant and clean content.
df = pd.DataFrame(["I have the power! $$ (wow!)", "Flame on!",
Let's start by calling clean ... see what happens.
hero.preprocessing.clean(df['text'])
... comment ...
Now, assume we want to keep the punctuation marks but remove parentheses ... open the "Preprocessing API" page and look for "remove_brackets".
Show a custom pipeline and explain it:
df['clean'] = (
Going further
Two or three high-quality links to other pages about text preprocessing + a getting-started tutorial on regex with Python
Recap
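As a rough illustration of the "keep the punctuation but remove the parentheses" step, a custom pipeline can be thought of as an ordered list of text-to-text functions applied one after another. The sketch below uses only the standard library and the two sample strings visible in the outline; remove_round_brackets, remove_whitespace, and apply_pipeline are hypothetical helpers written for this comment, not Texthero functions.

```python
import re


def remove_round_brackets(text):
    """Delete every '(...)' span together with its content."""
    return re.sub(r"\([^()]*\)", "", text)


def remove_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def apply_pipeline(texts, pipeline):
    """Run each text through the pipeline functions, in order."""
    cleaned = []
    for text in texts:
        for step in pipeline:
            text = step(text)
        cleaned.append(text)
    return cleaned


texts = ["I have the power! $$ (wow!)", "Flame on!"]

# Order matters: brackets are removed before whitespace is tidied,
# so the gap left by "(wow!)" is cleaned up afterwards.
custom_pipeline = [str.lower, remove_round_brackets, remove_whitespace]

cleaned = apply_pipeline(texts, custom_pipeline)
print(cleaned)  # ['i have the power! $$', 'flame on!']
```

Note that the punctuation marks and the "$$" survive because no step in this pipeline removes them. In the tutorial itself the same idea would be shown with Texthero's own functions on a pandas Series, with the exact call checked against the Preprocessing API page.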
Sounds good, Jonathan! I reviewed your comments and suggestions; they are perfectly aligned with what we discussed on the call. Working on it!
Concise version. Structure should be final. More examples can be added.
To be completed. Need preliminary feedback on: