Takes care of all of this boring NLP stuff
An open-source NLP library: fast text cleaning and preprocessing.
This library provides quick, ready-to-use tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.
To install dobbi, either fork this GitHub repo or install it from PyPI with pip:
$ pip install dobbi
Import the library:
import dobbi
The library uses method chaining in order to simplify text processing:
import pandas as pd
d = {'text': ['#fun #lol Why @Alex33 is so funny here: https://some-url.com',
'#looool =) 😍 such lovely!?*!!!%&']}
df = pd.DataFrame(d)
cln_func = dobbi.clean() \
.hashtag() \
.nickname() \
.url() \
.function()
df['text'] = df['text'].map(cln_func)
repl_func = dobbi.replace() \
.emoji() \
.emoticon() \
.punctuation() \
.function()
df['text'] = df['text'].map(repl_func)
Result:
print(df['text'][0]) # 'Why is so funny here'
print(df['text'][1]) # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'
The process consists of three stages:
- Initialization methods: initialize a dobbi Work object
- Intermediate methods: chain patterns in the needed order
- Terminal methods: choose if you need a function or a result
Initialization functions:
dobbi.clean()
dobbi.collect()
dobbi.replace()
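The sketch below illustrates, with the standard `re` module only, what the three entry points appear to do, judging by their names and the examples in this README: remove matches, gather matches, or substitute a token for each match. It is not dobbi's implementation. The behaviour shown for `collect` and the token name `TOKEN_HASHTAG` are assumptions modelled on the `TOKEN_NICKNAME` output shown elsewhere in this document.

```python
import re

# Assumed hashtag pattern, for illustration only (not taken from dobbi).
HASHTAG = re.compile(r"#\w+")

text = "#fun #lol Why is this so funny?"

# "clean"-style: drop every match, then collapse the leftover whitespace.
removed = " ".join(HASHTAG.sub(" ", text).split())

# "collect"-style (assumed behaviour): return the matches themselves.
collected = HASHTAG.findall(text)

# "replace"-style: swap every match for a placeholder token (hypothetical name).
replaced = HASHTAG.sub("TOKEN_HASHTAG", text)

print(removed)
print(collected)
print(replaced)
```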
Intermediate methods (pattern processing choice):
- regexp() - custom regular expressions
- url() - URLs
- html() - HTML and "<...>" type markup
- punctuation() - punctuation
- hashtag() - hashtags
- emoji() - emoji
- emoticon() - emoticons
- whitespace() - any type of whitespace
- nickname() - nicknames starting with @
Terminal methods:
- execute(str) - executes the chosen methods on the provided string.
- function() - returns a function that combines the chosen methods.
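To make the three stages concrete, here is a minimal, self-contained sketch of the chaining pattern. It is not dobbi's source code: the class name `SketchWork`, the regular expressions, and the whitespace collapsing are all assumptions chosen for illustration. It only shows how an initialization function, chainable intermediate methods, and two terminal methods can fit together.

```python
import re
from typing import Callable


class SketchWork:
    """Hypothetical stand-in for a chainable work object (not dobbi's class)."""

    def __init__(self) -> None:
        # Compiled patterns, kept in the order in which they were chained.
        self._patterns = []

    # Intermediate methods: register a pattern and return self,
    # which is what makes the calls chainable.
    def regexp(self, pattern: str) -> "SketchWork":
        self._patterns.append(re.compile(pattern))
        return self

    def hashtag(self) -> "SketchWork":
        return self.regexp(r"#\w+")  # assumed pattern

    def nickname(self) -> "SketchWork":
        return self.regexp(r"@\w+")  # assumed pattern

    # Terminal method 1: run the chain on a string right away.
    def execute(self, text: str) -> str:
        for pattern in self._patterns:
            text = pattern.sub(" ", text)
        # Collapse the gaps left behind by the removed matches.
        return " ".join(text.split())

    # Terminal method 2: hand back a reusable callable instead of a result.
    def function(self) -> Callable[[str], str]:
        return self.execute


def clean() -> SketchWork:
    """Initialization: create an empty work object."""
    return SketchWork()


result = clean().hashtag().nickname().execute("#fun Why @Alex33 is so funny?")
cleaner = clean().hashtag().function()

print(result)
print(cleaner("#lol hello"))
```

Returning a callable from `function()` is what makes the chain convenient to pass to something like `Series.map`, while `execute()` suits one-off strings.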
dobbi.clean() \
.hashtag() \
.nickname() \
.url() \
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
'Why is so funny? Check here:'
dobbi.replace() \
.hashtag('') \
.nickname() \
.url('__CUSTOM_URL_TOKEN__') \
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'
func = dobbi.clean() \
.url() \
.hashtag() \
.punctuation() \
.whitespace() \
.html() \
.function()
func('\t #fun #lol Why @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')
Result:
'Why Alex33 is so funny Check here'
- Chain regexp methods
dobbi.clean() \
.regexp('#\w+') \
.regexp('@\w+') \
.regexp('https?://\S+') \
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
'Why is so funny? Check here:'
- Remove emoji and emoticons
em_func = dobbi.clean() \
.emoji() \
.emoticon() \
.punctuation() \
.function()
em_func('Great! =) :D 😍 😋such lovely!?*!!!%&')
Result:
'Great such lovely'
Note that the functions are applied in the order in which you chain them, so it is usually best to chain .punctuation() as one of the last functions.
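The effect of ordering can be reproduced with plain regular expressions. The sketch below does not use dobbi. The URL and punctuation patterns are assumptions, and the helper names are hypothetical. It shows that stripping punctuation first destroys the `://` and `.` that the URL pattern relies on, so the URL is no longer recognised.

```python
import re
import string

# Assumed patterns, for illustration only.
URL = re.compile(r"https?://\S+")
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_urls(text: str) -> str:
    return URL.sub(" ", text)


def strip_punct(text: str) -> str:
    return PUNCT.sub(" ", text)


def squeeze(text: str) -> str:
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


text = "Check here: https://some-url.com"

# URL removal first: the whole URL is matched and removed.
url_first = squeeze(strip_punct(strip_urls(text)))

# Punctuation first: the URL is broken into fragments that no longer match.
punct_first = squeeze(strip_urls(strip_punct(text)))

print(url_first)
print(punct_first)
```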
If you enjoyed the project, I would be grateful if you supported it :)
Below is a list of contributions I would welcome:
- Finding bugs
- Making code optimizations
- Writing tests
- Helping with new feature development