Sentiment and language detection for text analytics.

TomBurdge/polari

Polari 🌈

Polari serves two purposes:

  1. Detect the language of natural language text.
  2. Detect the sentiment of English language text with a basic pre-trained algorithm.

Python's simplicity with Rust's speed and scale. 🚀

What's in a name?

"Polari (from Italian parlare 'to talk') is a form of slang or cant historically used in Britain." (Wikipedia)

Polari was spoken by "mostly camp gay men. They were a class of people who lived on the margins of society. Many of them broke the law - a law which is now seen... as being unfair and cruel - and so they were at risk of arrest, shaming, blackmail, and attack. They were not seen as important or interesting. Their stories were not told." (Fabulosa!: The Story of Polari, Britain's Secret Gay Language, pp. 10-11)

The polari library:

  • Performs language & sentiment detection.
  • Is a plugin for a library named polars.
  • Was, coincidentally, first released during Pride Month (June 2024).

If you have fun with this library, please consider donating to a charity which supports LGBTQIA+ folks.

Perhaps:

Pull requests with further charity & organisation suggestions are welcome. [1]

Language Detection 🔎

Load the data quickly with Hugging Face & DuckDB

For quick setup with sample data, install the requirements listed in examples/example_requirements.txt:

# Linux/MacOS
python -m venv .venv && source .venv/bin/activate && python -m pip install polari duckdb==0.10.3 polars==0.20.30 pyarrow==16.1.0

Load some sample data:

import polari
import duckdb
from time import time
from polars import Config, col

# With row limits below the millions, most of the run time goes to setting up the LazyFrame with duckdb.
rows = 5

# These are the languages that whichlang supports.
languages = (
    # polari uses less precise names for Modern Standard Arabic and Simplified Chinese.
    "Modern Standard Arabic",
    "Simplified Chinese",
    "German",
    "English",
    "French",
    "Hindi",
    "Italian",
    "Japanese",
    "Korean",
    "Dutch",
    "Portuguese",
    "Russian",
    "Spanish",
    "Swedish",
    "Turkish",
    "Vietnamese",
)
# set up the LazyFrame
lf = (
    duckdb.sql(
        f"SELECT inputs, language FROM 'hf://datasets/CohereForAI/aya_dataset/data/train-00000-of-00001.parquet' WHERE language in {languages} LIMIT {rows};"
    )
    .pl()
    .lazy()
)
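
The query above interpolates the Python tuple's repr straight into the SQL IN clause. That works here because a multi-element tuple of plain strings happens to print like an SQL list, but a one-element tuple prints with a trailing comma, which is not standard SQL, and a value containing an apostrophe is printed with double quotes. A minimal sketch of a more explicit alternative follows. `sql_in_list` is a hypothetical helper, not part of polari or duckdb, and it assumes standard SQL quoting, where a single quote is escaped by doubling it.

```python
def sql_in_list(values):
    """Render an iterable of strings as a parenthesised SQL list of quoted literals."""
    items = list(values)
    if not items:
        raise ValueError("an SQL IN list needs at least one value")
    # Escape embedded single quotes by doubling them (standard SQL).
    quoted = ("'" + v.replace("'", "''") + "'" for v in items)
    return "(" + ", ".join(quoted) + ")"


# A one-element tuple prints with a trailing comma; the helper avoids that.
print(repr(("English",)))                 # ('English',)
print(sql_in_list(("English",)))          # ('English')
print(sql_in_list(("German", "French")))  # ('German', 'French')
```

In the query, `{languages}` could then be swapped for `{sql_in_list(languages)}`.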

Detect the language 🌍 🔎

Config.set_tbl_hide_dataframe_shape(True)

df = lf.select(
    "inputs",
    polari.detect_lang("inputs", algorithm="which_lang").alias("detected_lang"),
    col("language").alias("true_lang"),
).collect()

print(df)
┌─────────────────────────────────┬───────────────┬────────────┐
│ inputs                          ┆ detected_lang ┆ true_lang  │
│ ---                             ┆ ---           ┆ ---        │
│ str                             ┆ str           ┆ str        │
╞═════════════════════════════════╪═══════════════╪════════════╡
│ Hãy tiếp tục đoạn văn sau:      ┆ Vietnamese    ┆ Vietnamese │
│ "T…                             ┆               ┆            │
│ Bu paragrafın devamını yazın: … ┆ Turkish       ┆ Turkish    │
│ ¿Cuál es la respuesta correcta… ┆ Spanish       ┆ Spanish    │
│ 中押(ちゅうお)し勝ちといえば、  ┆ Japanese      ┆ Japanese   │
│ どんなゲームの勝負の決まり方…   ┆               ┆            │
│ Em que ano os filmes deixaram … ┆ Portuguese    ┆ Portuguese │
└─────────────────────────────────┴───────────────┴────────────┘

Algorithms

The example above uses only the whichlang algorithm, the quickest and simplest of the three.

Two of the algorithms, what_lang and lingua, can also output a confidence score with detect_lang_confidence.

Supported algorithms: [2]

  • what_lang
  • lingua
  • whichlang

what_lang and lingua support language subsets and language exclusion. lingua supports high and low accuracy modes.
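
With a detected_lang column next to a true_lang column, as in the table above, the algorithms can be compared by per-language precision and recall. The sketch below is plain Python over (detected, true) pairs and does not call polari itself. `precision_recall` is a hypothetical helper, and the sample pairs are made up for illustration.

```python
from collections import Counter


def precision_recall(pairs):
    """Return {language: (precision, recall)} from (detected, true) label pairs."""
    tp, detected_n, true_n = Counter(), Counter(), Counter()
    for detected, true in pairs:
        detected_n[detected] += 1
        true_n[true] += 1
        if detected == true:
            tp[true] += 1
    scores = {}
    for lang in set(detected_n) | set(true_n):
        # Precision: share of rows detected as `lang` that really are `lang`.
        precision = tp[lang] / detected_n[lang] if detected_n[lang] else 0.0
        # Recall: share of true `lang` rows that were detected as `lang`.
        recall = tp[lang] / true_n[lang] if true_n[lang] else 0.0
        scores[lang] = (precision, recall)
    return scores


# Made-up pairs: one Portuguese row is mistaken for Spanish.
pairs = [
    ("Spanish", "Spanish"),
    ("Spanish", "Portuguese"),
    ("Portuguese", "Portuguese"),
    ("Turkish", "Turkish"),
]
for lang, (p, r) in sorted(precision_recall(pairs).items()):
    print(f"{lang}: precision={p:.2f} recall={r:.2f}")
```

On a real run, the pairs could come from `df.select("detected_lang", "true_lang").rows()`.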

Detect the script 📜

what_lang and lingua can also detect the script in which the text is written.

df = lf.select(
    "inputs",
    "language",
    polari.detect_script("inputs").alias("detected_script"),
).collect()

print(df.head(3))
┌────────────────────────────────────────────────────────┬─────────────────┬─────────────────┐
│ inputs                                                 ┆ language        ┆ detected_script │
│ ---                                                    ┆ ---             ┆ ---             │
│ str                                                    ┆ str             ┆ str             │
╞════════════════════════════════════════════════════════╪═════════════════╪═════════════════╡
│ Heestan waxaa qada Khalid Haref Ahmed                  ┆ Somali          ┆ Latin           │
│ OO ku Jiray Kooxdii Dur Dur!                           ┆                 ┆                 │
│ Quels président des États-Unis ne s’est jamais marié ? ┆ French          ┆ Latin           │
│ كم عدد الخلفاء الراشدين ؟ أجب على السؤال السابق.    ┆ Standard Arabic ┆ Arabic          │
└────────────────────────────────────────────────────────┴─────────────────┴─────────────────┘
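
To give a feel for what script detection means, here is a rough standard-library approximation. It is not how what_lang or lingua work internally. It simply takes the first word of each letter's Unicode character name and returns the most frequent one. `guess_script` is a hypothetical helper.

```python
import unicodedata
from collections import Counter


def guess_script(text):
    """Guess the dominant script from the first word of each letter's Unicode name."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "LATIN SMALL LETTER A" -> "Latin", "ARABIC LETTER KAF" -> "Arabic"
            counts[name.split()[0].title()] += 1
    return counts.most_common(1)[0][0] if counts else None


print(guess_script("Heestan waxaa qada Khalid Haref Ahmed"))  # Latin
print(guess_script("كم عدد الخلفاء الراشدين ؟"))  # Arabic
```

Character names are only a proxy for scripts, so labels for scripts outside this example may not match the names a real detector reports.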
    

Sentiment Detection 😀😠

polari can detect the sentiment of English language text via a Rust port of VADER.

The pre-trained model was originally trained for sentiment detection on social media posts, but it performs semi-decently on opinionated text in general. The example below analyses Amazon reviews.

Sample Data

import polari
import duckdb
from time import time
from polars import Config

# With row limits below the millions, most of the run time goes to setting up the LazyFrame with duckdb.
# This loads `rows` reviews for each of the 1-, 3-, and 5-star ratings.
rows = 1
subset = "Beauty_and_Personal_Care"
dataset = f"hf://datasets/McAuley-Lab/Amazon-Reviews-2023/raw/review_categories/{subset}.jsonl"
# set up the LazyFrame
lf = (
    duckdb.sql(
        f"""
        WITH positive AS (
            SELECT text, rating FROM '{dataset}' WHERE rating = 5 LIMIT {rows}
        ),
        neutral AS (
            SELECT text, rating FROM '{dataset}' WHERE rating = 3 LIMIT {rows}
        ),
        negative AS (
            SELECT text, rating FROM '{dataset}' WHERE rating = 1 LIMIT {rows}
        )
        SELECT * FROM positive
        UNION ALL
        SELECT * FROM negative
        UNION ALL
        SELECT * FROM neutral;
        """
    )
    .pl()
    .lazy()
)

Detect Sentiment 😀😠🔎

df = lf.select(
    "text",
    polari.get_sentiment("text", output_type="compound").alias("sentiment"),
    polari.get_sentiment("text", output_type="pos").alias("pos"),
    polari.get_sentiment("text", output_type="neu").alias("neu"),
    polari.get_sentiment("text", output_type="neg").alias("neg"),
    "rating",
).collect()

df.head()
┌──────────────────────────────────────────────────────┬───────────┬──────────┬──────────┬──────────┬────────┐
│ text                                                 ┆ sentiment ┆ pos      ┆ neu      ┆ neg      ┆ rating │
│ ---                                                  ┆ ---       ┆ ---      ┆ ---      ┆ ---      ┆ ---    │
│ str                                                  ┆ f64       ┆ f64      ┆ f64      ┆ f64      ┆ f64    │
╞══════════════════════════════════════════════════════╪═══════════╪══════════╪══════════╪══════════╪════════╡
│ Bought this for my granddaughter.  Her entire family…┆ 0.63695   ┆ 0.21875  ┆ 0.78125  ┆ 0.0      ┆ 5.0    │
│ This is a good product but it doesn't last very long…┆ 0.238227  ┆ 0.130435 ┆ 0.869565 ┆ 0.0      ┆ 3.0    │
│ Tops the list for worst purchase. Tried these for al…┆ -0.939365 ┆ 0.094854 ┆ 0.735183 ┆ 0.169963 ┆ 1.0    │
└──────────────────────────────────────────────────────┴───────────┴──────────┴──────────┴──────────┴────────┘

Output types, passed as output_type, include:

  • compound
  • neu (neutral)
  • pos (positive)
  • neg (negative)
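
A compound score is often reduced to a coarse label. The sketch below uses the ±0.05 cut-offs commonly suggested for VADER. The cut-offs are an assumption here, not a polari default, and `label_from_compound` is a hypothetical helper. The rows are copied from the output table above, and the loop also checks that pos, neu, and neg behave as proportions that sum to about 1.

```python
import math


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# (compound, pos, neu, neg, rating) copied from the output table above.
rows = [
    (0.63695, 0.21875, 0.78125, 0.0, 5.0),
    (0.238227, 0.130435, 0.869565, 0.0, 3.0),
    (-0.939365, 0.094854, 0.735183, 0.169963, 1.0),
]
for compound, pos, neu, neg, rating in rows:
    # pos, neu, and neg are proportions of the text, so they sum to about 1.
    assert math.isclose(pos + neu + neg, 1.0, abs_tol=1e-5)
    print(rating, label_from_compound(compound))
```

With these cut-offs the 3-star review is labelled positive, a reminder that the scores track the wording of a review rather than its star rating.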

Credits

Language detection:

Sentiment:

Polars:

Footnotes

  1. In the extremely unlikely scenario that this project becomes popular, and therefore a library that needs to sustain itself, users could also be invited to donate to the project in a separate section of the README.

  2. Benchmarking algorithm prediction precision/recall can be done with polari. Differences in detection speed between algorithms may be due to the implementation in polari, rather than the original Rust crate.
