Draft for getting-started-preprocessing #183

Draft · wants to merge 9 commits into master
Conversation

@Iota87 commented Sep 11, 2020

To be completed. Need preliminary feedback on:

  • Structure (also keeping in mind rendering on website)
  • Tone (e.g. use of examples)
  • Length / level of detail

@henrifroese (Collaborator) left a comment


Just found a few small typos. Overall, I really like the tutorial 👍 . Things I'd change:

  • the "stemming" part at the bottom (I think it's not from you) is not that nice, I'd probably redo that (although this should probably be moved to nlp.py anyways soon so maybe it'll just be left out here)
  • I would maybe include 2-3 more preprocessing functions that are characteristic of the module so users can familiarize themselves with our function style (e.g. remove_whitespace, remove_html_tags, and definitely tokenize)
  • Maybe drop lines 11-13 and call the first section "Overview" (there's the "--" in line 11, I'm not sure why?)
  • The "Custom Pipelines" section belongs to clean (it's also just calling clean), so I'd move it under the clean section
  • I think something like the following structure would be nice:

  • Overview
  • Key Functions
      • Clean
          • Custom Pipelines
      • Tokenize
  • Preprocessing API — add quick examples with remove_whitespace / remove_html_tags so users know how to use the module (a sketch of what those could look like follows this outline), then just what you already have here.
  • Recap — a 3-sentence summary with inline code.
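For those quick examples, something along these lines might work (just a sketch: the sample sentences are made up, while remove_whitespace, remove_html_tags and tokenize are existing Texthero preprocessing functions):

import pandas as pd
import texthero as hero

s = pd.Series(["  Hello   world ", "<b>Texthero</b> is <i>great</i>!"])

# Collapse runs of whitespace into single spaces.
hero.remove_whitespace(s)

# Strip the HTML tags, keeping only the text content.
hero.remove_html_tags(s)

# Split each document into a list of tokens.
hero.tokenize(s)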

website/docs/getting-started-preprocessing.md — 5 review comments (outdated, resolved)
@Iota87 (Author) commented Oct 7, 2020

Great comments Henri, and good catches on the typos.
I added tokenize, references and adjusted the structure in line with your input.
Let me know what you think.
I am a bit hesitant to add "remove_html_tags" here because I'm not sure it can be explained in plain words, and succinctly, to a complete beginner. It could be explained in a separate section/tutorial, but I am not sure we want to get into HTML tags in the getting started. What do you think?

Giovanni Liotta and others added 2 commits October 7, 2020 16:11
@jbesomi (Owner) commented Oct 9, 2020

Hi Guys!

Thank you Giovanni for the great start and Henri for the comments!

Sorry for reviewing this so late!

As a general comment, I think we need to make it more technical and concise. The end goal of the getting started preprocessing tutorial is to teach how to use Texthero to actually do text preprocessing.

As we want to guide the user through Texthero preprocessing core, it's important to show them how to actually do the stuff.

Giovanni, do you think you can start from the comment below, test the code in a Jupyter Notebook, and then write the getting-started tutorial around it? I didn't go into the details, to give you more freedom; if you want more advice or something is unclear, just let me know!

Kind regards,
Jonathan


(overview + what's important to keep in mind)

  • one of Texthero's pillars is text preprocessing
  • need to mention the modular approach (one function for one task), and that the user can customize the pipeline
  • preprocessing is task- and domain-specific. Developers need to know what they want; Texthero provides a tool to experiment quickly. It's advised to start with the standard clean pipeline, see if that works, and otherwise iterate on the problem (a rough sketch of this workflow follows the list)
  • Texthero's preprocessing is mainly meant as a preparation step for bag-of-words models, where what matters is the content (not the grammar or punctuation). For bag-of-words models we want to get rid of punctuation and stopwords, and we want to normalize (stem). This is different from the more advanced and complex neural-network transformer architectures, where we might want to keep the punctuation as well as the stopwords; but if the text data are very dirty, a general cleaning might be useful anyway (for example, removing round brackets and their content generally helps, and replacing numbers like 12.3 with NUM might help as well)
  • Users come here after having read the "getting started" page; they already know about the clean function. Here we want to offer something more and explain how to clean some text data. It's important to give users examples and to guide them through the process
  • We want to teach users the preprocessing API, and we want to mention at least 50% of its functions
  • Tokenization part: hide for now, as we are making major changes there
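As a rough illustration of the "start with the standard clean pipeline, then iterate" advice above (a sketch only: the sample sentence is made up, and the custom steps are just an arbitrary example of what a user might pick):

import pandas as pd
import texthero as hero
from texthero import preprocessing

s = pd.Series(["Thís (is) a   dirty   sentence!!!   "])

# First pass: the standard pipeline.
clean_s = hero.clean(s)

# Inspect which steps the standard pipeline actually runs.
print(preprocessing.get_default_pipeline())

# If the result does not fit the task, pass a custom list of steps instead.
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
clean_s = hero.clean(s, pipeline=custom_pipeline)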

Preprocessing

Overview

Introduction to this new "chapter": mention what we have seen before, plus an introductory sentence about preprocessing ... something like: "By now you should have a general overview of what Texthero is about; in the next sections we will dig a bit deeper into Texthero's core and see what we can get out of our beautiful text data."

Preprocessing API

Link + introduction

Doing it right

  • There is no magic formula that works in every situation; Texthero provides a modular approach to deal with data preprocessing
  • The user needs to understand what they actually require.
  • Texthero is mostly used to get a first feeling for the data with bag-of-words approaches; in that case, the goal is to keep only relevant, clean content
  • Mention the bag-of-words approach and explain the difference from transformers. Here we really go from raw data (maybe coming from OCR or scraped from a website) to something cleaner.

Standard vs. custom pipeline (the old "Key Functions" section)

Mention that there is the standard clean function and that the pipeline can be customized. Mention chaining: all preprocessing functions receive a Pandas Series as input and return a Pandas Series, which allows chaining multiple functions in a pandas-pythonic fashion.
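A minimal chaining sketch along these lines could accompany that paragraph (the input text and the choice of steps are only illustrative):

import pandas as pd
from texthero import preprocessing

s = pd.Series(["<p>A   Noisy   paragraph!</p>"])

# Every preprocessing function maps a Pandas Series to a Pandas Series,
# so steps can be chained with .pipe() in a pandas-pythonic way.
clean_s = (
    s
    .pipe(preprocessing.remove_html_tags)
    .pipe(preprocessing.lowercase)
    .pipe(preprocessing.remove_whitespace)
)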

FAQ

FAQ questions, mostly to improve SEO.

Text preprocessing, From zero to hero

Preprocessing is about data cleaning. Let's assume we got some dirty data we want to clean; in particular, we want to keep only relevant and clean content.

import pandas as pd
import texthero as hero

df = pd.DataFrame(["I have the power! $$ (wow!)",
                   "Flame on!",
                   "HULK SMASH!",
                   "Holy ____ Batman!",
                   "I am the vengeance, I am the night, I am BATMAN!",
                   "I am GROOT.",
                   "I'm going ghost!",
                   "I am the law!",
                   "SPOOOON!!!"], columns=['text'])

Let's start by calling clean ... see what happens.

hero.preprocessing.clean(df['text'])

...

comment ...

Now, assume we want to keep the punctuation marks but remove the parentheses ... open the "preprocessing API" page and look for "remove_brackets".

Show a custom pipeline and explain it:

df['clean'] = (
    df['text']
    .pipe(p.function1)
    .pipe(p.function2)
    .pipe(p.function3)
)
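One possible concrete instantiation of those placeholders, for the "keep the punctuation but drop the brackets" example above (the choice of steps here is only an assumption):

df['clean'] = (
    df['text']
    .pipe(hero.remove_brackets)     # drop (round), [square], {curly} and <angle> brackets together with their content
    .pipe(hero.remove_whitespace)   # tidy up the spacing left behind
)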

Going further

Two or three high-quality links to other pages about text preprocessing, plus a getting-started tutorial on regex with Python.

Recap

@Iota87 (Author) commented Oct 14, 2020

Sounds good, Jonathan! I reviewed your comments and suggestions; they are perfectly aligned with what we discussed on the call. Working on it!
Thanks,
Giovanni

Concise version. Structure should be final. More examples can be added.