🌴 dobbi 🦕

Takes care of all of this boring NLP stuff

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides quick, ready-to-use tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

To get dobbi, either clone this GitHub repo or install it from PyPI with pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

import pandas as pd

d = {'text': ['#fun #lol   Why  @Alex33 is so funny here: https://some-url.com',
              '#looool     =)      😍 such lovely!?*!!!%&']}
df = pd.DataFrame(d)

cln_func = dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .function()
df['text'] = df['text'].map(cln_func)

repl_func = dobbi.replace() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
df['text'] = df['text'].map(repl_func)

Result:

print(df['text'][0])  # 'Why is so funny here'
print(df['text'][1])  # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'

Supported methods and patterns

The process consists of three stages:

Initialization methods: initialize a dobbi Work object
Intermediate methods: chain patterns in the needed order
Terminal methods: choose whether you need a function or a result

Initialization methods:

dobbi.clean()
dobbi.collect()
dobbi.replace()

Intermediate methods (pattern processing choice):

regexp() - custom regular expressions
url() - URLs
html() - HTML and "<...>" type markups
punctuation() - punctuation
hashtag() - hashtags
emoji() - emoji
emoticon() - emoticons
whitespace() - any type of whitespace
nickname() - @-starting nicknames

Terminal methods:

execute(str) - applies the chosen methods to the provided string and returns the result.
function() - returns a function which is a combination of the chosen methods.
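
The three stages above (initialize, chain, terminate) can be pictured with a small stand-alone sketch. This is not dobbi's implementation: the `ChainSketch` class, its two patterns, and the whitespace tidy-up are assumptions made purely for illustration, built on the standard `re` module.

```python
import re


class ChainSketch:
    """Hypothetical stand-in for a chainable cleaner (NOT dobbi's code)."""

    def __init__(self):
        # Patterns are stored in the order they are chained.
        self._patterns = []

    # Intermediate methods: record a pattern and return self so calls chain.
    def regexp(self, pattern):
        self._patterns.append(re.compile(pattern))
        return self

    def hashtag(self):
        return self.regexp(r'#\w+')

    def nickname(self):
        return self.regexp(r'@\w+')

    # Terminal method: build a reusable function from the chained patterns.
    def function(self):
        patterns = list(self._patterns)  # snapshot of the current chain

        def apply(text):
            for pattern in patterns:
                text = pattern.sub('', text)
            # Assumption: collapse the whitespace left behind by removals.
            return ' '.join(text.split())

        return apply

    # Terminal method: apply the chain to a single string right away.
    def execute(self, text):
        return self.function()(text)


def clean():
    """Initialization stage: start an empty chain."""
    return ChainSketch()


cleaner = clean().hashtag().nickname().function()
result = cleaner('#fun  Why @Alex33 is so funny?')
single = clean().hashtag().execute('#lol hello')
print(result)
print(single)
```

A function built once this way can be reused across many strings (for example with `map`), whereas `execute` suits one-off calls.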

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and URLs with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'
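
For readers who want to see what token replacement amounts to, here is a rough equivalent using only the standard `re` module. It is a sketch under assumptions, not how dobbi works internally: the helper `replace_pattern` is hypothetical, the whitespace tidy-up is assumed, and the regular expressions are simple illustrative patterns.

```python
import re


def replace_pattern(text, pattern, token):
    """Hypothetical helper: swap every match for `token`, then tidy whitespace."""
    return ' '.join(re.sub(pattern, token, text).split())


message = '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'

# An empty token removes the match; any other string is inserted in its place.
replaced = replace_pattern(message, r'#\w+', '')
replaced = replace_pattern(replaced, r'@\w+', 'TOKEN_NICKNAME')
replaced = replace_pattern(replaced, r'https?://\S+', '__CUSTOM_URL_TOKEN__')
print(replaced)
```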

3) Get the text cleanup function

func = dobbi.clean() \
    .url() \
    .hashtag() \
    .punctuation() \
    .whitespace() \
    .html() \
    .function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'

4) Chain regexp methods

dobbi.clean() \
    .regexp(r'#\w+') \
    .regexp(r'@\w+') \
    .regexp(r'https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

5) Remove emoji and emoticons

em_func = dobbi.clean() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
em_func('Great! =) :D  😍 😋such lovely!?*!!!%&')

Result:

'Great such lovely'

Additional

Note that the methods are applied in the order in which you chain them, so it is usually best to place .punctuation() near the end of the chain. Otherwise it may strip characters that patterns such as hashtags, nicknames, and URLs rely on.
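
The effect of ordering can be seen in a small stand-alone sketch. It uses the standard `re` and `string` modules rather than dobbi itself, and the `remove` helper and both patterns are assumptions chosen for illustration.

```python
import re
import string

# Illustrative patterns (assumptions, not dobbi's own definitions).
HASHTAG = re.compile(r'#\w+')
PUNCTUATION = re.compile('[' + re.escape(string.punctuation) + ']')


def remove(pattern, text):
    """Delete every match of `pattern`, then collapse leftover whitespace."""
    return ' '.join(pattern.sub('', text).split())


text = '#fun Why is this so funny?'

# Hashtags first: the '#' is still present, so the whole tag is removed.
hashtags_first = remove(PUNCTUATION, remove(HASHTAG, text))

# Punctuation first: the '#' is stripped early, so 'fun' survives as a word.
punctuation_first = remove(HASHTAG, remove(PUNCTUATION, text))

print(hashtags_first)
print(punctuation_first)
```

With punctuation removed first, the hashtag pattern has nothing left to match, which is why the word from the tag leaks into the cleaned text.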

Call for collaboration 🤗

If you enjoyed the project, I would be grateful if you supported it :)

Below is a list of contributions I would be happy to receive:

Finding bugs
Making code optimizations
Writing tests
Helping to develop new features

iaramer/dobbi
