Update documentation docstrings etc #107

Closed
109 changes: 62 additions & 47 deletions README.md
@@ -44,7 +44,7 @@

<h2 align="center">From zero to hero</h2>

Texthero is a Python toolkit for working with text-based datasets quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power as Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020s with little if any background in linguistics.

You can think of Texthero as a tool to help you _understand_ and work with text-based datasets. Given a tabular dataset, it's easy to _grasp the main concepts_. With a text dataset, however, it's harder to get quick insights into the underlying data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines.

@@ -55,15 +55,15 @@ Texthero includes tools for:
* Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.
* Text visualization: vector space visualization, place localization on maps (wip).

Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!).

We hope you will find as much pleasure working with Texthero as we have had during its development.

<h2 align="center">Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?</h2>

Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.

Now, the next main milestone is to provide *multilingual support*, and for this big step, we need the help of all of you. ¿Hablas español? Sprechen Sie Deutsch? 你会说中文?日本語が話せるのか?Fala português? Parli Italiano? Вы говорите по-русски? If yes, or if you speak another language not mentioned here, you can help us develop multilingual support! Even if you haven't contributed before or are just starting out with NLP, contact us or open a GitHub issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP developer!

To improve the Python toolkit and provide an even better experience, your aid and feedback are crucial. If you have any problem or suggestion, please open a GitHub [issue](https://github.com/jbesomi/texthero/issues); we will be glad to support and help you.

@@ -92,7 +92,7 @@ pip install texthero

<h2 align="center">Getting started</h2>

The best way to learn Texthero is through the <a href="https://texthero.org/docs/getting-started">Getting Started</a> docs.

If you are an advanced Python user, then `help(texthero)` should do the trick.

@@ -102,20 +102,21 @@ In case you are an advanced python user, then `help(texthero)` should do the wor


```python
>>> import texthero as hero
>>> import pandas as pd
>>>
>>> df = pd.read_csv(
...     "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
... )
>>>
>>> df['pca'] = (
...     df['text']
...     .pipe(hero.clean)
...     .pipe(hero.tokenize)
...     .pipe(hero.tfidf)
...     .pipe(hero.pca)
... )
>>> hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")
```

<p align="center">
@@ -125,28 +126,29 @@
<h3>2. Text preprocessing, TF-IDF, K-means and Visualization</h3>

```python
>>> import texthero as hero
>>> import pandas as pd
>>>
>>> df = pd.read_csv(
...     "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
... )
>>>
>>> df['tfidf'] = (
...     df['text']
...     .pipe(hero.clean)
...     .pipe(hero.tokenize)
...     .pipe(hero.tfidf)
... )
>>>
>>> df['kmeans_labels'] = (
...     df['tfidf']
...     .pipe(hero.kmeans, n_clusters=5)
...     .astype(str)
... )
>>>
>>> df['pca'] = df['tfidf'].pipe(hero.pca)
>>>
>>> hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")
```
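Both examples lean on pandas' `.pipe`, which threads a Series through a chain of functions. A minimal sketch of the pattern below uses trivial stand-in transformations (not texthero's own functions) to show the Series-in, Series-out contract that makes the chaining work:

```python
import pandas as pd

# Stand-in transformations; hero.clean, hero.tokenize, etc. follow the
# same Series-in, Series-out contract that makes .pipe chaining possible.
def strip_whitespace(s: pd.Series) -> pd.Series:
    return s.str.strip()

def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()

s = pd.Series(["  Football is GREAT  ", "Tennis is fun"])
result = s.pipe(strip_whitespace).pipe(lowercase)
print(result.tolist())  # -> ['football is great', 'tennis is fun']
```

Any function that takes a Series as its first argument slots into such a chain, which is why texthero pipelines compose so naturally with plain pandas.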

<p align="center">
@@ -180,7 +182,7 @@ Remove all types of brackets and their content.

```python
>>> s = hero.remove_brackets(s)
>>> s
0 This sèntencé needs to be cleaned!
dtype: object
```
@@ -189,7 +191,7 @@ Remove diacritics.

```python
>>> s = hero.remove_diacritics(s)
>>> s
0 This sentence needs to be cleaned!
dtype: object
```
@@ -198,7 +200,7 @@ Remove punctuation.

```python
>>> s = hero.remove_punctuation(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
```
@@ -207,7 +209,7 @@ Remove extra white-spaces.

```python
>>> s = hero.remove_whitespace(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
```
@@ -217,7 +219,16 @@ Sometimes we also want to get rid of stop-words.
```python
>>> s = hero.remove_stopwords(s)
>>> s
0    This sentence needs cleaned
dtype: object
```

There is also the option to clean the text in one go by calling the `clean` function instead of doing it step by step.
```python
>>> text = "This sèntencé (123 /) needs to [OK!] be cleaned! "
>>> s = pd.Series(text)
>>> hero.clean(s)
0 sentence needs cleaned
dtype: object
```
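To build intuition for what such a pipeline does, here is a toy, regex-based sketch of a few of the cleaning steps shown above. This is NOT texthero's implementation (texthero also handles diacritics, digits, and more); every function and the tiny stop-word list below are illustrative assumptions:

```python
import re

STOPWORDS = {"to", "be", "a", "the"}  # tiny illustrative stop-word list

def remove_brackets(text):
    # drop round/square/curly brackets and their content
    return re.sub(r"[\(\[\{].*?[\)\]\}]", "", text)

def remove_punctuation(text):
    return re.sub(r"[^\w\s]", "", text)

def remove_whitespace(text):
    # collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

text = "This sentence (123 /) needs to [OK!] be cleaned! "
for step in (remove_brackets, remove_punctuation, remove_whitespace, remove_stopwords):
    text = step(text)
print(text)  # -> This sentence needs cleaned
```

Each step is a plain string-to-string function, which is exactly the shape that composes into a pandas pipeline.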

@@ -243,9 +254,11 @@ Full documentation: [nlp](https://texthero.org/docs/api-nlp)
**Scope:** map text data into vectors and do dimensionality reduction.

Supported **representation** algorithms:
1. Term frequency (`term_frequency`)
1. Term frequency-inverse document frequency (`tfidf`)

For the representation functions, it is strongly recommended to first tokenize the input Series with `hero.tokenize(s)`.
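To see why tokenized input matters, here is a toy term-frequency / tf-idf computation over pre-tokenized documents. This is a hedged sketch of the general technique, not texthero's implementation (texthero delegates to scikit-learn under the hood):

```python
import math
from collections import Counter

# Pre-tokenized documents: the representation step consumes lists of
# tokens, not raw strings -- hence the tokenize-first recommendation.
docs = [["football", "is", "great"], ["tennis", "is", "fun"]]

def term_frequency(doc):
    # relative frequency of each token within one document
    counts = Counter(doc)
    total = len(doc)
    return {term: c / total for term, c in counts.items()}

def tfidf(docs):
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term])
         for term, tf in term_frequency(doc).items()}
        for doc in docs
    ]

vectors = tfidf(docs)
# "is" appears in every document, so its idf term log(2/2) is 0
```

Terms shared by every document get zero weight, while distinctive terms like "football" keep a positive score, which is what makes tf-idf useful before clustering.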

Supported **clustering** algorithms:
1. K-means (`kmeans`)
1. Density-Based Spatial Clustering of Applications with Noise (`dbscan`)
Expand Down Expand Up @@ -295,7 +308,7 @@ The website will be soon moved from Docusaurus to Sphinx: read the [open issue t

**Are you good at writing?**

This is probably the most important piece missing from Texthero right now: more tutorials and more "Getting Started" guides.

If you are good at writing you can help us! Why don't you start by [Adding a FAQ page to the website](https://github.com/jbesomi/texthero/issues/41) or explain how to [create a custom pipeline](https://github.com/jbesomi/texthero/issues/38)? Need help? We are there for you.

@@ -314,6 +327,8 @@ If you have any other questions or inquiries, drop me a line at jonathanbesomi__AT
- [bobfang1992](https://github.com/bobfang1992)
- [Ishan Arora](https://github.com/ishanarora04)
- [Vidya P](https://github.com/vidyap-xgboost)
- [Henri Froese](https://github.com/henrifroese)
- [Maximilian Krahn](https://github.com/mk2510)


<h2 align="center"><a href="./LICENSE">License</a></h2>
34 changes: 27 additions & 7 deletions texthero/nlp.py
Original file line number Diff line number Diff line change
@@ -1,20 +1,20 @@
"""
The texthero.nlp module supports common NLP tasks such as named_entities, noun_chunks, ... on Pandas Series and DataFrame.
"""

import spacy
import pandas as pd


def named_entities(s: pd.Series, package="spacy") -> pd.Series:
"""
Return named-entities.

Return a Pandas Series where each row contains a list of tuples with information about the named entities found.

Tuple: (`entity's name`, `entity's label`, `starting character`, `ending character`)

Under the hood, `named_entities` makes use of `Spacy named entity recognition <https://spacy.io/usage/linguistic-features#named-entities>`_.

List of labels:
- `PERSON`: People, including fictional.
@@ -36,6 +36,14 @@ def named_entities(s, package="spacy"):
- `ORDINAL`: “first”, “second”, etc.
- `CARDINAL`: Numerals that do not fall under another type.

Parameters
----------
s : Pandas Series

Returns
-------
Pandas Series, where each row contains a list of tuples with information about the named entities found.

Examples
--------
>>> import texthero as hero
@@ -57,7 +65,7 @@ def named_entities(s, package="spacy"):
return pd.Series(entities, index=s.index)


def noun_chunks(s: pd.Series) -> pd.Series:
"""
Return noun chunks (noun phrases).

@@ -73,8 +81,12 @@ def noun_chunks(s):

Parameters
----------
s : Pandas Series

Returns
-------
Pandas Series, where each row contains a tuple that has information regarding the noun chunk.

Examples
--------
>>> import texthero as hero
@@ -107,7 +119,15 @@ def count_sentences(s: pd.Series) -> pd.Series:

Return a new Pandas Series with the number of sentences per cell.

This makes use of the SpaCy `sentencizer <https://spacy.io/api/sentencizer>`_.

Parameters
----------
s : Pandas Series

Returns
-------
Pandas Series, with the number of sentences per document in every cell.

Examples
--------