From 8a3c8d46fae1845bc36675d39f44bb91a4171efe Mon Sep 17 00:00:00 2001 From: Giovanni Liotta Date: Fri, 11 Sep 2020 14:32:43 -0500 Subject: [PATCH 1/8] pre-processing schema draft --- website/docs/getting-started-preprocessing.md | 62 ++++++++++++++++++- 1 file changed, 61 insertions(+), 1 deletion(-) diff --git a/website/docs/getting-started-preprocessing.md b/website/docs/getting-started-preprocessing.md index 7da3bd44..9c52215d 100644 --- a/website/docs/getting-started-preprocessing.md +++ b/website/docs/getting-started-preprocessing.md @@ -2,8 +2,68 @@ id: getting-started-preprocessing --- -## Getting started with pre-processing +## Getting started with pre-processing +Pre-processing is a fundamental step in text analysis. Being consistent and methodical in pre-processing operations is a necessary condition for the success of text-based analysis. + +## Overview + +-- + +## Intro + +When we (as humans) read text from a book or a newspaper, the _input_ that our brain gets to allow us to understand that text is in the form of individual letters, that are then combined into words, sentences, paragraphs, etc... you get it! +The problem with having a machine reading text is simple: the machine doesn't know how to read letters, words, paragraphs, etc. +The machine however knows how to read numerical vectors and text has good properties that easily allow its conversion into a numerical representation. There are several sophisticated methods to make this conversion but, in order to perform well, all of them require that the text given as input is as clean and simple as possible, in other words **pre-processed**. +Clean and simple basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing) and solving as many ambiguities as possibe (so that, for instance, the verb "run" and its forms "ran", "runs", "running" will all refer to the same concept). + +How useful is this step? +Have you ever heard the story that Data Scientists typically spend ~80% of their time to obtain a proper dataset and the remaining ~20% for actually using it? Well, for text is kind of the same thing. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to be properly implemented. + +In text hero it only takes one command: +To clean text data in a reliable way all we have to do is: +#Note for this section we use the same dataset as in **Getting Started** + +```python +df['clean_text'] = hero.clean(df['text']) +``` +or ... +[Pipeline explanation] + +## Clean + +Texthero clean method allows a rapid implementation of key cleaning steps that are: +- Derived from survey of relevant academic literature #cite +- Validated by a group of NLP enthusiasts with experience in applying these methods in different contexts #background +- Accepted by the NLP community as inescapable and standard + +The default steps do the following: + +#[TABLE] + +in just one command: + +```python +df['clean_text'] = hero.clean(df['text']) +``` +## Custom Pipeline + +Sometimes, project specificities might require different approach to pre-processing. For instance, you might decide that digits are important to your analyses if you are analyzing movies and one of them is "007-James Bond" or if you think that stopwords contain relevant information for your analysis setting. 
+If this is the case, you can easily edit the pre-processing pipeline by: +```python + +```#Comment/explain what it does + +If you are interested in learning more about text cleaning, check out these resources: +#Links list + + + + + +## Customize it + +Let's see how texthero STANDARDIZE this step... ### Stemming From f303f0bf8792280353e892e3f80cd395e46bebb5 Mon Sep 17 00:00:00 2001 From: Giovanni Liotta Date: Fri, 11 Sep 2020 14:33:51 -0500 Subject: [PATCH 2/8] pre-processing draft schema --- .vscode/settings.json | 3 + test_gio/FirstTest.ipynb | 251 ++++++++++++++++++ texthero.code-workspace | 7 + website/docs/getting-started-preprocessing.md | 78 ++++-- 4 files changed, 317 insertions(+), 22 deletions(-) create mode 100644 .vscode/settings.json create mode 100644 test_gio/FirstTest.ipynb create mode 100644 texthero.code-workspace diff --git a/.vscode/settings.json b/.vscode/settings.json new file mode 100644 index 00000000..dfb2c1e0 --- /dev/null +++ b/.vscode/settings.json @@ -0,0 +1,3 @@ +{ + "python.pythonPath": "/Users/giovanniliotta/opt/anaconda3/envs/texthero/bin/python" +} \ No newline at end of file diff --git a/test_gio/FirstTest.ipynb b/test_gio/FirstTest.ipynb new file mode 100644 index 00000000..a2aed8a7 --- /dev/null +++ b/test_gio/FirstTest.ipynb @@ -0,0 +1,251 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import texthero as hero\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": "/Users/giovanniliotta/Dev/texthero/test_gio\n" + } + ], + "source": [ + "!pwd" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\n", + " \"../dataset/bbcsport.csv\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": " text topic\n0 Claxton hunting first major medal\\n\\nBritish h... athletics\n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics", + "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
texttopic
0Claxton hunting first major medal\\n\\nBritish h...athletics
1O'Sullivan could run in Worlds\\n\\nSonia O'Sull...athletics
\n
" + }, + "metadata": {}, + "execution_count": 4 + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "df['clean_text'] = hero.clean(df['text'])" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": " text topic \\\n0 Claxton hunting first major medal\\n\\nBritish h... athletics \n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics \n\n clean_text \n0 claxton hunting first major medal british hurd... \n1 sullivan could run worlds sonia sullivan indic... ", + "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
texttopicclean_text
0Claxton hunting first major medal\\n\\nBritish h...athleticsclaxton hunting first major medal british hurd...
1O'Sullivan could run in Worlds\\n\\nSonia O'Sull...athleticssullivan could run worlds sonia sullivan indic...
\n
" + }, + "metadata": {}, + "execution_count": 6 + } + ], + "source": [ + "df.head(2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "class Category:\n", + " BOOKS= \"BOOKS\"\n", + " CLOTHING = \"CLOTHING\"\n", + " \n", + "train_x = [\"I love the book\", \"this is a great book\", \"the fit is great\", \"I love the shoes\"]\n", + "train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.feature_extraction.text import CountVectorizer" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[1 0 0 0 1 0 1 0]\n", + " [1 0 1 1 0 0 0 1]\n", + " [0 1 1 1 0 0 1 0]\n", + " [0 0 0 0 1 1 1 0]]\n", + "['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']\n" + ] + } + ], + "source": [ + "vectorizer = CountVectorizer()\n", + "train_x_vectors = vectorizer.fit_transform(train_x)\n", + "print(vectors.toarray())\n", + "print(vectorizer.get_feature_names())" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn import svm" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SVC(kernel='linear')" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "clf_svm = svm.SVC(kernel='linear')\n", + "clf_svm.fit(train_x_vectors, train_y)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['BOOKS'], dtype='pre-processing -Pre-processing is a fundamental step in text analysis. Being consistent and methodical in pre-processing operations is a necessary condition for the success of text-based analysis. +Pre-processing is a fundamental step in text analysis. Consistent, methodical and reproducible pre-processing operations are a necessary pre-requisite for success of any type of text-based analysis. ## Overview @@ -12,58 +12,92 @@ Pre-processing is a fundamental step in text analysis. Being consistent and meth ## Intro -When we (as humans) read text from a book or a newspaper, the _input_ that our brain gets to allow us to understand that text is in the form of individual letters, that are then combined into words, sentences, paragraphs, etc... you get it! -The problem with having a machine reading text is simple: the machine doesn't know how to read letters, words, paragraphs, etc. -The machine however knows how to read numerical vectors and text has good properties that easily allow its conversion into a numerical representation. 
There are several sophisticated methods to make this conversion but, in order to perform well, all of them require that the text given as input is as clean and simple as possible, in other words **pre-processed**. -Clean and simple basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing) and solving as many ambiguities as possibe (so that, for instance, the verb "run" and its forms "ran", "runs", "running" will all refer to the same concept). +When we (as humans) read text from a book or a newspaper, the _input_ that our brain gets to understand that text is in the form of individual letters, that are then combined into words, sentences, paragraphs, etc. +The problem with having a machine reading text is simple: the machine doesn't know how to read letters, words or paragraphs. The machine knows instead how to read _numerical vectors_. +Text data has good properties that allow its conversion into a numerical representation. There are several sophisticated methods to make this conversion but, in order to perform well, all of them require the input text in a form that is as clean and simple as possible, in other words **pre-processed**. +Pre-processing text basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and solving as many ambiguities as possibe (so that, for instance, the verb "run" and its forms "ran", "runs", "running" will all refer to the same concept). How useful is this step? -Have you ever heard the story that Data Scientists typically spend ~80% of their time to obtain a proper dataset and the remaining ~20% for actually using it? Well, for text is kind of the same thing. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to be properly implemented. +Have you ever heard the story that Data Scientists typically spend ~80% of their time to obtain a proper dataset and the remaining ~20% to actually analyze it? Well, for text is kind of the same thing. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to be properly and unambiguously implemented. -In text hero it only takes one command: +With text hero it only takes one command! To clean text data in a reliable way all we have to do is: -#Note for this section we use the same dataset as in **Getting Started** ```python df['clean_text'] = hero.clean(df['text']) ``` -or ... -[Pipeline explanation] + +> NOTE. In this section we use the same [BBC Sport Dataset](http://mlg.ucd.ie/datasets/bbc.html) as in **Getting Started**. 
To load the `bbc sport` dataset in a Pandas DataFrame run: +```python +df = pd.read_csv( + "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +) +``` ## Clean Texthero clean method allows a rapid implementation of key cleaning steps that are: -- Derived from survey of relevant academic literature #cite -- Validated by a group of NLP enthusiasts with experience in applying these methods in different contexts #background -- Accepted by the NLP community as inescapable and standard + +- Derived from review of relevant academic literature (#include citations) +- Validated by a group of NLP enthusiasts with applied experience in different contexts +- Accepted by the NLP community as standard and inescapable The default steps do the following: -#[TABLE] +| Step | Description | +|----------------------|--------------------------------------------------------| +|`fillna()` |Replace missing values with empty spaces | +|`lowercase()` |Lowercase all text to make the analysis case-insensitive| +|`remove_digits()` |Remove numbers | +|`remove_punctuation()`|Remove punctuation symbols (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) | +|`remove_diacritics()` |Remove accents +|`remove_stopwords()` |Remove the most common words ("i", "me", "myself", "we", "our", etc.) | + +|`remove_whitespace()` |Remove spaces between words| + -in just one command: + +in just one command! ```python df['clean_text'] = hero.clean(df['text']) ``` + ## Custom Pipeline -Sometimes, project specificities might require different approach to pre-processing. For instance, you might decide that digits are important to your analyses if you are analyzing movies and one of them is "007-James Bond" or if you think that stopwords contain relevant information for your analysis setting. -If this is the case, you can easily edit the pre-processing pipeline by: +Sometimes, project specificities might require different approach to pre-processing. For instance, you might decide that digits are important to your analyses if you are analyzing movies and one of them is "007-James Bond". Or, you might decide that in your specific setting stopwords contain relevant information (e.g. if your data is about music bands and contains "The Who" or "Take That"). +If this is the case, you can easily customize the pre-processing pipeline by implementing only specifics cleaning steps: + +```python +from texthero import preprocessing + +custom_pipeline = [preprocessing.fillna, + preprocessing.lowercase, + preprocessing.remove_punctuation + preprocessing.remove_whitespace] +df['clean_text'] = hero.clean(df['text'], custom_pipeline) +``` + +or alternatively + ```python +df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline) +``` -```#Comment/explain what it does +In the above example we want to pre-process the text despite keeping accents, digits and stop words. -If you are interested in learning more about text cleaning, check out these resources: -#Links list +##### Preprocessing API + +Check-out the complete [preprocessing API](/docs/api-preprocessing) to discover how to customize the preprocessing steps according to your specific needs. +If you are interested in learning more about text cleaning, check out these resources: +(#Links list) + -## Customize it -Let's see how texthero STANDARDIZE this step... 
### Stemming From f6c45343cb6161dfaf90e049f6e2d2a2b49f6137 Mon Sep 17 00:00:00 2001 From: Giovanni Liotta Date: Fri, 11 Sep 2020 14:46:34 -0500 Subject: [PATCH 3/8] Draft schema pre-processing --- test_gio/FirstTest.ipynb | 251 --------------------------------------- 1 file changed, 251 deletions(-) delete mode 100644 test_gio/FirstTest.ipynb diff --git a/test_gio/FirstTest.ipynb b/test_gio/FirstTest.ipynb deleted file mode 100644 index a2aed8a7..00000000 --- a/test_gio/FirstTest.ipynb +++ /dev/null @@ -1,251 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import texthero as hero\n", - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": "/Users/giovanniliotta/Dev/texthero/test_gio\n" - } - ], - "source": [ - "!pwd" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.read_csv(\n", - " \"../dataset/bbcsport.csv\"\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": " text topic\n0 Claxton hunting first major medal\\n\\nBritish h... athletics\n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics", - "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
texttopic
0Claxton hunting first major medal\\n\\nBritish h...athletics
1O'Sullivan could run in Worlds\\n\\nSonia O'Sull...athletics
\n
" - }, - "metadata": {}, - "execution_count": 4 - } - ], - "source": [ - "df.head(2)" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "df['clean_text'] = hero.clean(df['text'])" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": " text topic \\\n0 Claxton hunting first major medal\\n\\nBritish h... athletics \n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics \n\n clean_text \n0 claxton hunting first major medal british hurd... \n1 sullivan could run worlds sonia sullivan indic... ", - "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
texttopicclean_text
0Claxton hunting first major medal\\n\\nBritish h...athleticsclaxton hunting first major medal british hurd...
1O'Sullivan could run in Worlds\\n\\nSonia O'Sull...athleticssullivan could run worlds sonia sullivan indic...
\n
" - }, - "metadata": {}, - "execution_count": 6 - } - ], - "source": [ - "df.head(2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "class Category:\n", - " BOOKS= \"BOOKS\"\n", - " CLOTHING = \"CLOTHING\"\n", - " \n", - "train_x = [\"I love the book\", \"this is a great book\", \"the fit is great\", \"I love the shoes\"]\n", - "train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.feature_extraction.text import CountVectorizer" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[[1 0 0 0 1 0 1 0]\n", - " [1 0 1 1 0 0 0 1]\n", - " [0 1 1 1 0 0 1 0]\n", - " [0 0 0 0 1 1 1 0]]\n", - "['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']\n" - ] - } - ], - "source": [ - "vectorizer = CountVectorizer()\n", - "train_x_vectors = vectorizer.fit_transform(train_x)\n", - "print(vectors.toarray())\n", - "print(vectorizer.get_feature_names())" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn import svm" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "SVC(kernel='linear')" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "clf_svm = svm.SVC(kernel='linear')\n", - "clf_svm.fit(train_x_vectors, train_y)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array(['BOOKS'], dtype=' Date: Wed, 7 Oct 2020 15:57:26 -0500 Subject: [PATCH 4/8] Added Tokenize function --- website/docs/getting-started-preprocessing.md | 61 +++++++++---------- 1 file changed, 29 insertions(+), 32 deletions(-) diff --git a/website/docs/getting-started-preprocessing.md b/website/docs/getting-started-preprocessing.md index 46c7abd2..67cd4f31 100644 --- a/website/docs/getting-started-preprocessing.md +++ b/website/docs/getting-started-preprocessing.md @@ -6,16 +6,13 @@ id: getting-started-preprocessing Pre-processing is a fundamental step in text analysis. Consistent, methodical and reproducible pre-processing operations are a necessary pre-requisite for success of any type of text-based analysis. -## Overview - --- -## Intro +## Overview When we (as humans) read text from a book or a newspaper, the _input_ that our brain gets to understand that text is in the form of individual letters, that are then combined into words, sentences, paragraphs, etc. The problem with having a machine reading text is simple: the machine doesn't know how to read letters, words or paragraphs. The machine knows instead how to read _numerical vectors_. 
Text data has good properties that allow its conversion into a numerical representation. There are several sophisticated methods to make this conversion but, in order to perform well, all of them require the input text in a form that is as clean and simple as possible, in other words **pre-processed**. -Pre-processing text basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and solving as many ambiguities as possibe (so that, for instance, the verb "run" and its forms "ran", "runs", "running" will all refer to the same concept). +Pre-processing text basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and solving as many ambiguities as possible (so that, for instance, the verb "run" and its forms "ran", "runs", "running" will all refer to the same concept). How useful is this step? Have you ever heard the story that Data Scientists typically spend ~80% of their time to obtain a proper dataset and the remaining ~20% to actually analyze it? Well, for text is kind of the same thing. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to be properly and unambiguously implemented. @@ -34,9 +31,11 @@ df = pd.read_csv( ) ``` -## Clean +## Key Functions + +### Clean -Texthero clean method allows a rapid implementation of key cleaning steps that are: +Texthero's clean method allows a rapid implementation of key cleaning steps that are: - Derived from review of relevant academic literature (#include citations) - Validated by a group of NLP enthusiasts with applied experience in different contexts @@ -63,10 +62,10 @@ in just one command! df['clean_text'] = hero.clean(df['text']) ``` -## Custom Pipeline +##### Custom Pipelines -Sometimes, project specificities might require different approach to pre-processing. For instance, you might decide that digits are important to your analyses if you are analyzing movies and one of them is "007-James Bond". Or, you might decide that in your specific setting stopwords contain relevant information (e.g. if your data is about music bands and contains "The Who" or "Take That"). -If this is the case, you can easily customize the pre-processing pipeline by implementing only specifics cleaning steps: +Sometimes, project specificities might require different approaches to pre-processing. For instance, you might decide that digits are important to your analyses if you are analyzing movies and one of them is "007-James Bond". Or, you might decide that in your specific setting stopwords contain relevant information (e.g. if your data is about music bands and contains "The Who" or "Take That"). +If this is the case, you can easily customize the pre-processing pipeline by implementing only specific cleaning steps: ```python from texthero import preprocessing @@ -86,40 +85,38 @@ df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline) In the above example we want to pre-process the text despite keeping accents, digits and stop words. -##### Preprocessing API - -Check-out the complete [preprocessing API](/docs/api-preprocessing) to discover how to customize the preprocessing steps according to your specific needs. 
- +### Tokenize -If you are interested in learning more about text cleaning, check out these resources: -(#Links list) +Given a character sequence, tokenization is the task of chopping it up into pieces, called tokens. Here is an example of tokenization: +Text: "Hulk is the greenest superhero!" +Tokens: "hulk", "is", "the", "greenest", "superhero", "!" +A token is a sequence of character grouped together as a useful semantic unit for processing. The major question of the tokenization step is how to make the split. In the example above it was quite straightforward: we chopped up the sentence on white spaces. But what would you do if the input text was: +"Hulk isn't the greenest superhero, Green Lantern is!" +Notice that the "isn't" contraction could lead to any of the following tokens: +"isnt", "isn't", "is" + "n't", "isn" + "t" +Tokenization issues are language specific and the process can involve ambiguity if tokens such as monetary amounts, numbers, hyphen-separated words or URLs are involved. +Texthero takes care of making the best set of choices based on the most reasonable assumptions...in just one command! +```python +from texthero import tokenize -### Stemming +s = pd.Series(["Hulk is the greenest superhero!"]) +tokenize(s) +``` -`do_stem` returns better results when used after `remove_punctuation`. +## Preprocessing API -Example: +Check-out the complete [preprocessing API](/docs/api-preprocessing) to discover how to customize the preprocessing steps according to your specific needs. -```python ->>> text = "I love climbing and running." ->>> hero .stem(pd.Series(text), stem="snowball") - 0 i love climb and running. - dtype: object -``` +If you are interested in learning more about text cleaning or NLP in general, check out these resources: -Whereas +- Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall. -```python +- Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ->>> text = "I love climbing and running" ->>> hero .stem(pd.Series(text), stem="snowball") - 0 i love climb and run - dtype: object -``` From 6c9cd5d3d39df82a291f3d32206c434ea16d6bab Mon Sep 17 00:00:00 2001 From: Giovanni Liotta Date: Wed, 7 Oct 2020 16:11:17 -0500 Subject: [PATCH 5/8] pandas import --- website/docs/getting-started-preprocessing.md | 1 + 1 file changed, 1 insertion(+) diff --git a/website/docs/getting-started-preprocessing.md b/website/docs/getting-started-preprocessing.md index 67cd4f31..c68f5dcc 100644 --- a/website/docs/getting-started-preprocessing.md +++ b/website/docs/getting-started-preprocessing.md @@ -103,6 +103,7 @@ Tokenization issues are language specific and the process can involve ambiguity Texthero takes care of making the best set of choices based on the most reasonable assumptions...in just one command! 
```python +import pandas as pd from texthero import tokenize s = pd.Series(["Hulk is the greenest superhero!"]) From 7f8f614084263edb5f44d94f8b208a2a0836b7d4 Mon Sep 17 00:00:00 2001 From: Jonathan Besomi Date: Fri, 9 Oct 2020 18:53:20 +0200 Subject: [PATCH 6/8] Delete settings.json --- .vscode/settings.json | 3 --- 1 file changed, 3 deletions(-) delete mode 100644 .vscode/settings.json diff --git a/.vscode/settings.json b/.vscode/settings.json deleted file mode 100644 index dfb2c1e0..00000000 --- a/.vscode/settings.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "python.pythonPath": "/Users/giovanniliotta/opt/anaconda3/envs/texthero/bin/python" -} \ No newline at end of file From 4673b47dc0099158b20833f03122b5abac09dad5 Mon Sep 17 00:00:00 2001 From: Jonathan Besomi Date: Fri, 9 Oct 2020 18:53:36 +0200 Subject: [PATCH 7/8] Delete texthero.code-workspace --- texthero.code-workspace | 7 ------- 1 file changed, 7 deletions(-) delete mode 100644 texthero.code-workspace diff --git a/texthero.code-workspace b/texthero.code-workspace deleted file mode 100644 index 362d7c25..00000000 --- a/texthero.code-workspace +++ /dev/null @@ -1,7 +0,0 @@ -{ - "folders": [ - { - "path": "." - } - ] -} \ No newline at end of file From 44ab2e09ba98c0d04ddb7268dc0749f63294a501 Mon Sep 17 00:00:00 2001 From: Giovanni Liotta Date: Wed, 9 Dec 2020 10:40:24 -0600 Subject: [PATCH 8/8] Updated getting-started-preprocessing.md Concise version. Structure should be final. More examples can be added. --- website/docs/getting-started-preprocessing.md | 164 +++++++++--------- 1 file changed, 84 insertions(+), 80 deletions(-) diff --git a/website/docs/getting-started-preprocessing.md b/website/docs/getting-started-preprocessing.md index c68f5dcc..850d517b 100644 --- a/website/docs/getting-started-preprocessing.md +++ b/website/docs/getting-started-preprocessing.md @@ -1,123 +1,127 @@ --- id: getting-started-preprocessing +title: Getting started preprocessing --- -## Getting started with pre-processing - -Pre-processing is a fundamental step in text analysis. Consistent, methodical and reproducible pre-processing operations are a necessary pre-requisite for success of any type of text-based analysis. +## Getting started with preprocessing +By now you should have a general overview of what Texthero is about, in the next sections we will dig a bit deeper into Texthero's core to appreciate its super powers when it comes to text data. ## Overview -When we (as humans) read text from a book or a newspaper, the _input_ that our brain gets to understand that text is in the form of individual letters, that are then combined into words, sentences, paragraphs, etc. -The problem with having a machine reading text is simple: the machine doesn't know how to read letters, words or paragraphs. The machine knows instead how to read _numerical vectors_. -Text data has good properties that allow its conversion into a numerical representation. There are several sophisticated methods to make this conversion but, in order to perform well, all of them require the input text in a form that is as clean and simple as possible, in other words **pre-processed**. -Pre-processing text basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and solving as many ambiguities as possible (so that, for instance, the verb "run" and its forms "ran", "runs", "running" will all refer to the same concept). 
+Preprocessing is the stepping stone to any text analytics project, as well as one of Texthero's pillars. + +The Texthero's `clean` pipeline provides a great starting point to quickly implement standard preprocessing steps. If your project requires specific preprocessing steps, Texthero offers a `tool` to quickly experiment and find the best preprocessing solution. + +##### Preprocessing API + +Check-out the complete [preprocessing API](/docs/api-preprocessing) for a detailed overview of Texthero's preprocessing functions. Texthero's approach to preprocessing is modular, allowing you maximum flexibility in customizing the preprocessing steps for your project. -How useful is this step? -Have you ever heard the story that Data Scientists typically spend ~80% of their time to obtain a proper dataset and the remaining ~20% to actually analyze it? Well, for text is kind of the same thing. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to be properly and unambiguously implemented. +##### Doing it right -With text hero it only takes one command! -To clean text data in a reliable way all we have to do is: +There is no magic formula that fits all preprocessing needs. Texthero offers a modular and customizable approach ideal to preprocess data for bag-of-words models. +> What is Bag-of-Words? +A bag-of-words model is a popular, simple and flexible way of extracting features from text for use in modeling, such as with machine learning algorithms. Feature extraction consists in converting text into numbers, specifically vectors of numbers, that a machine learning algorithm can read. A bag-of-words representation describe the occurrence of words within a document resorting on two elements: +1. A vocabulary of known words +2. A measure of presence of known words +For example, given the following two text documents: ```python -df['clean_text'] = hero.clean(df['text']) +doc1 = "Hulk likes to eat avocados. Green Lantern likes avocados too." +doc2 = "Green Lantern also likes bonfires." ``` - -> NOTE. In this section we use the same [BBC Sport Dataset](http://mlg.ucd.ie/datasets/bbc.html) as in **Getting Started**. To load the `bbc sport` dataset in a Pandas DataFrame run: +We can use bag-of-words representation to generate two dictionaries as follws: ```python -df = pd.read_csv( - "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -) +BoW1 = {"Hulk":1, "likes":2, "to":1, "eat":1, "avocados":2, "Green":1, "Lantern":1, "too":1} +BoW2 = {"Green":1, "Lantern":1, "also":1, "likes":1, "bonfires":1} ``` +After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common type of characteristics, or features, calculated from the bag-of-words model relates to term frequency, namely, the number of times a term appears in the text. -## Key Functions - -### Clean - -Texthero's clean method allows a rapid implementation of key cleaning steps that are: +Texthero is a powerful tool to prepare data for bag-of-words modeling. 
+It enables:
+- Preliminary exploration of text data of any format and structure
+- Extraction of relevant and clean content for use in bag-of-words models
+- Flexibility in adapting to user-specific tasks and contexts
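+
+To make the bag-of-words idea above concrete, here is a minimal sketch of how the two example documents can be turned into term-frequency vectors. It uses scikit-learn's `CountVectorizer` (an assumption of this sketch: scikit-learn is installed; it is not part of Texthero):
+
+```python
+from sklearn.feature_extraction.text import CountVectorizer
+
+docs = ["Hulk likes to eat avocados. Green Lantern likes avocados too.",
+        "Green Lantern also likes bonfires."]
+
+vectorizer = CountVectorizer()          # lowercases and tokenizes by default
+X = vectorizer.fit_transform(docs)      # element 1: builds the vocabulary of known words
+print(vectorizer.get_feature_names())   # newer scikit-learn versions: get_feature_names_out()
+# ['also', 'avocados', 'bonfires', 'eat', 'green', 'hulk', 'lantern', 'likes', 'to', 'too']
+print(X.toarray())                      # element 2: term frequency of each word per document
+# [[0 2 0 1 1 1 1 2 1 1]
+#  [1 0 1 0 1 0 1 1 0 0]]
+```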
+
+### Text preprocessing, From zero to hero
+
+##### Standard pipeline
+
+Let's see how Texthero can help with cleaning messy text data and getting it ready for bag-of-words models.
+
+```python
+import texthero as hero
+import pandas as pd
+df = pd.DataFrame(
+    ["I have the power! $$ (wow!)",
+     "Flame on! <br> oh! <br> ",
+     "HULK SMASH!"], columns=['text'])
+>>> df.head()
+                            text
+0  "I have the power! $$ (wow!)"
+1    "Flame on! <br> oh! <br> "
+2                 "HULK SMASH!"
+```
+
+To implement Texthero's standard preprocessing pipeline, it only takes one command:
+
+```python
+hero.preprocessing.clean(df['text'])
+0        power wow
+1    flame br oh br
+2       hulk smash
+Name: text, dtype: object
+```
+
+Texthero's `clean` pipeline takes as input the DataFrame column containing the text to preprocess (`df['text']`) and returns a clean text series. For maximum compatibility with bag-of-words models, the standard cleaning process prioritizes pure text content over other aspects, such as grammar or punctuation. The text is cleaned of what is considered uninformative content, e.g. punctuation signs ("!", "()", ".", etc.), tags ("<br>", etc.) and stopwords ("the", "on", etc.).
+
+##### Custom pipeline
+
+Assume that our project requires keeping all punctuation marks, for instance because, instead of bag-of-words, we want to use a more advanced and complex neural network transformer where punctuation matters.
+We might still have specific preprocessing steps to implement, such as the removal of all stand-alone content within round brackets.
+Let's see how Texthero can help in this case...
+
+The first step would be to search for the specific function that "removes content in parenthesis" in the [preprocessing API](/docs/api-preprocessing).
+It turns out that "remove_round_brackets" is the function we are looking for, as it "removes content within brackets and the brackets itself".
+We now need to create a custom preprocessing pipeline where the only implemented step is "remove_round_brackets". In order to do this, we resort to the pandas `pipe` function as follows:
+
+```python
+df['clean'] = (
+    df['text']
+    .pipe(hero.preprocessing.remove_round_brackets)
+)
+>>> df['clean'].head(2)
+0       I have the power! $$ 
+1    Flame on! <br> oh! <br> 
+Name: clean, dtype: object
+```
+The part of the text within brackets, "(wow!)", has been successfully removed!
+
+If our project instead required the removal of HTML tags only, we would proceed in a similar way:
+```python
+df['clean'] = (
+    df['text']
+    .pipe(hero.preprocessing.remove_html_tags)
+)
+>>> df['clean'].head(2)
+0    I have the power! $$ (wow!)
+1               Flame on!  oh!  
+Name: clean, dtype: object
+```
+The "<br>" HTML tags have now been removed!
+
+If we were to apply both preprocessing steps above, the resulting custom pipeline would look like this:
+```python
+custom_pipeline = [hero.preprocessing.remove_round_brackets,
+                   hero.preprocessing.remove_html_tags]
+df['clean'] = hero.clean(df['text'], custom_pipeline)
+```
+
+##### Going further
+
+If you are interested in learning more about text cleaning or NLP in general, check out these resources:
+
+- Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Prentice-Hall.
+
+- Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
\ No newline at end of file