Draft for getting-started-preprocessing #183

Draft
wants to merge 9 commits into
base: master
Changes from 3 commits
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
"python.pythonPath": "/Users/giovanniliotta/opt/anaconda3/envs/texthero/bin/python"
}
7 changes: 7 additions & 0 deletions texthero.code-workspace
@@ -0,0 +1,7 @@
{
"folders": [
{
"path": "."
}
]
}
96 changes: 95 additions & 1 deletion website/docs/getting-started-preprocessing.md
@@ -2,7 +2,101 @@
id: getting-started-preprocessing
---

## Getting started with pre-processing
## Getting started with <span style="color: #ff8c42">pre-processing</span>

Pre-processing is a fundamental step in text analysis. Consistent, methodical and reproducible pre-processing is a prerequisite for the success of any text-based analysis.

## Overview

--

## Intro

When we humans read text from a book or a newspaper, the _input_ our brain receives comes as individual letters, which are then combined into words, sentences, paragraphs and so on.
A machine faces a simple problem: it does not know how to read letters, words or paragraphs. What it can read are _numerical vectors_.
Fortunately, text can be converted into a numerical representation. Several sophisticated methods exist for this conversion but, to perform well, all of them need input text that is as clean and simple as possible, in other words **pre-processed**.
Pre-processing text essentially means removing unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and resolving as many ambiguities as possible (so that, for instance, the verb "run" and its forms "ran", "runs" and "running" all refer to the same concept).
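
To see why this matters, here is a minimal sketch that uses only the Python standard library. It is an illustration rather than Texthero code: the `normalize` helper and the sample sentence are made up for the example. Counting words in raw text treats "Run,", "run!" and "run" as different tokens, while a light normalization merges them:

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase the text and strip ASCII punctuation (illustrative helper)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


raw = "Run, Forrest, run! I run every day."

# Word counts before and after normalization
raw_counts = Counter(raw.split())
clean_counts = Counter(normalize(raw).split())

print(raw_counts["run"])    # only the exact token "run" is counted
print(clean_counts["run"])  # "Run,", "run!" and "run" are counted together
```

Merging inflected forms such as "ran" or "running" would additionally require stemming or lemmatization, which this sketch does not attempt.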

How useful is this step?
Have you ever heard that data scientists typically spend ~80% of their time obtaining a proper dataset and only the remaining ~20% actually analyzing it? For text, it is much the same. Pre-processing is a **fundamental step** in text analysis, and implementing it properly and unambiguously usually takes time.

With Texthero, it only takes one command!
To clean text data reliably, all we have to do is:

```python
import texthero as hero
df['clean_text'] = hero.clean(df['text'])
```

> NOTE. This section uses the same [BBC Sport Dataset](http://mlg.ucd.ie/datasets/bbc.html) as **Getting Started**. To load the `bbc sport` dataset into a pandas DataFrame, run:
```python
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
```

## Clean

Texthero's `clean` method provides a quick way to apply key cleaning steps that are:

- Derived from a review of the relevant academic literature (#include citations)
- Validated by a group of NLP enthusiasts with applied experience in different contexts
- Accepted by the NLP community as standard and essential

The default steps do the following:

| Step                 | Description                                                           |
|----------------------|-----------------------------------------------------------------------|
|`fillna()`            |Replace missing values with empty strings                              |
|`lowercase()`         |Lowercase all text to make the analysis case-insensitive               |
|`remove_digits()`     |Remove numbers                                                         |
|`remove_punctuation()`|Remove punctuation symbols (``!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~``)     |
|`remove_diacritics()` |Remove accents                                                         |
|`remove_stopwords()`  |Remove the most common words ("i", "me", "myself", "we", "our", etc.)  |
|`remove_whitespace()` |Remove extra whitespace between words                                  |



All of these steps run in just one command!

```python
df['clean_text'] = hero.clean(df['text'])
```
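
As a rough illustration of what the default steps in the table amount to, the sketch below re-creates them on a single string with the standard library only. It is not Texthero's implementation: the function `clean_like_default`, the tiny `STOPWORDS` set and the sample sentence are assumptions made for the example, and Texthero itself operates on whole pandas Series.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word list (an assumption for this sketch);
# a real stop-word list is much longer.
STOPWORDS = {"i", "me", "myself", "we", "our", "the", "a", "in"}

# Translation table mapping every ASCII punctuation symbol to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_like_default(text):
    """Approximate the default cleaning steps on one string (sketch only)."""
    if text is None:                  # fillna: missing value -> empty string
        text = ""
    text = text.lower()               # lowercase
    text = re.sub(r"\d+", " ", text)  # remove_digits
    text = text.translate(PUNCT_TO_SPACE)  # remove_punctuation
    # remove_diacritics: decompose characters, then drop the combining marks
    text = "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    # remove_stopwords
    words = [w for w in text.split() if w not in STOPWORDS]
    # remove_whitespace: joining on single spaces collapses any extra spacing
    return " ".join(words)


print(clean_like_default("We scored 3 goals in the café!!"))
```

Each line of the function corresponds to one row of the table, applied in the same order.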

## Custom Pipeline

Sometimes, the specifics of a project call for a different approach to pre-processing. For instance, you might decide that digits matter to your analysis if you are analyzing movies and one of them is "007-James Bond". Or you might decide that, in your setting, stopwords carry relevant information (e.g. if your data is about music bands and contains "The Who" or "Take That").
If this is the case, you can easily customize the pre-processing pipeline by including only the cleaning steps you need:

```python
import texthero as hero
from texthero import preprocessing

# Keep only the steps we need: digits, accents and stop words are left untouched
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)
```

or alternatively

```python
df['clean_text'] = df['text'].pipe(hero.clean, custom_pipeline)
```

The pipeline in the example above cleans the text while keeping accents, digits and stop words.
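
Conceptually, a pipeline is just a list of functions applied one after the other. The following sketch illustrates that idea on a plain string with standard-library code. `apply_pipeline` and the three step functions are hypothetical stand-ins, not Texthero's API; they are only meant to show why leaving out the digit and stop-word steps preserves titles like "007-James Bond" or "The Who".

```python
import re
import string
from functools import reduce

# Translation table mapping every ASCII punctuation symbol to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    return text.translate(PUNCT_TO_SPACE)


def remove_whitespace(text):
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()


def apply_pipeline(text, pipeline):
    """Feed the output of each step into the next one (hypothetical helper)."""
    return reduce(lambda current, step: step(current), pipeline, text)


# No digit or stop-word removal, so "007" and "the" survive
custom_pipeline = [lowercase, remove_punctuation, remove_whitespace]

print(apply_pipeline("007-James Bond  meets The Who!", custom_pipeline))
```

Because each step is an ordinary function, adding, removing or reordering steps only means editing the list.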

##### Preprocessing API

Check out the complete [preprocessing API](/docs/api-preprocessing) to discover how to customize the preprocessing steps according to your specific needs.


If you are interested in learning more about text cleaning, check out these resources:
(#Links list)






