Polari serves two purposes:
- Detect the language of natural language text.
- Detect the sentiment of English language text with a basic pre-trained algorithm.
Python's simplicity with Rust's speed and scale.
"Polari (from Italian parlare 'to talk') is a form of slang or cant historically used in Britain." (Wikipedia)
Polari was spoken by "mostly camp gay men. They were a class of people who lived on the margins of society. Many of them broke the law - a law which is now seen... as being unfair and cruel - and so they were at risk of arrest, shaming, blackmail, and attack. They were not seen as important or interesting. Their stories were not told." Fabulosa!: The Story of Polari, Britain's Secret Gay Language p. 10-11
The polari library:
- Performs language & sentiment detection.
- Is a plugin for a library named polars.
- Was, coincidentally, first released during Pride Month (June 2024).
If you have fun with this library, please consider donating to a charity which supports LGBTQIA+ folks.
Perhaps:
- Stonewall
- The Trevor Project
- Mermaids
- Gendered Intelligence
- An organisation which supports people close to wherever you are in the world.
Pull requests with further charity & organisation suggestions are welcome. [1]
For quick setup with sample data, install the requirements in examples/example_requirements.txt
# Linux/MacOS
python -m venv .venv && source .venv/bin/activate && python -m pip install polari duckdb==0.10.3 polars==0.20.30 pyarrow==16.1.0
Load some sample data:
import polari
import duckdb
from time import time
from polars import Config, col
# On row limits below the millions, the LazyFrame setup with duckdb will take most of the time.
rows = 5
# here are the languages that whichlang supports
languages = (
    # Modern Standard Arabic and Simplified Chinese have less precise names in polari.
    "Modern Standard Arabic",
    "Simplified Chinese",
    "German",
    "English",
    "French",
    "Hindi",
    "Italian",
    "Japanese",
    "Korean",
    "Dutch",
    "Portuguese",
    "Russian",
    "Spanish",
    "Swedish",
    "Turkish",
    "Vietnamese",
)
# set up the LazyFrame
lf = (
    duckdb.sql(
        f"SELECT inputs, language FROM 'hf://datasets/CohereForAI/aya_dataset/data/train-00000-of-00001.parquet' WHERE language in {languages} LIMIT {rows};"
    )
    .pl()
    .lazy()
)
Config.set_tbl_hide_dataframe_shape(True)
df = lf.select(
    "inputs",
    polari.detect_lang("inputs", algorithm="which_lang").alias("detected_lang"),
    col("language").alias("true_lang"),
).collect()
print(df)
┌─────────────────────────────────┬───────────────┬────────────┐
│ inputs                          ┆ detected_lang ┆ true_lang  │
│ ---                             ┆ ---           ┆ ---        │
│ str                             ┆ str           ┆ str        │
╞═════════════════════════════════╪═══════════════╪════════════╡
│ Hãy tiếp tục đoạn văn sau:      ┆ Vietnamese    ┆ Vietnamese │
│ "T…                             ┆               ┆            │
│ Bu paragrafın devamını yazın: … ┆ Turkish       ┆ Turkish    │
│ ¿Cuál es la respuesta correcta… ┆ Spanish       ┆ Spanish    │
│ 中押(ちゅうお)し勝ちといえば、                ┆ Japanese      ┆ Japanese   │
│ どんなゲームの勝負の決まり方…                 ┆               ┆            │
│ Em que ano os filmes deixaram … ┆ Portuguese    ┆ Portuguese │
└─────────────────────────────────┴───────────────┴────────────┘
The example above uses only the whichlang algorithm, the quickest and simplest of the supported algorithms.
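To see how well a detector agrees with the labelled language, the detected_lang and true_lang columns can be compared row by row. The following is a minimal stdlib-only sketch, not part of polari: the function name `score_detection` and the hand-written pairs are hypothetical, and in practice the pairs could come from something like `df.select("detected_lang", "true_lang").rows()`.

```python
from collections import Counter


def score_detection(pairs):
    """Return overall accuracy and per-language (precision, recall).

    `pairs` is an iterable of (detected, true) label tuples.
    """
    pairs = list(pairs)
    true_pos = Counter()  # rows where detected == true, keyed by language
    detected = Counter()  # how often each language was predicted
    actual = Counter()    # how often each language really occurred
    for det, true in pairs:
        detected[det] += 1
        actual[true] += 1
        if det == true:
            true_pos[det] += 1

    accuracy = sum(true_pos.values()) / len(pairs) if pairs else 0.0
    per_language = {}
    for lang in set(detected) | set(actual):
        precision = true_pos[lang] / detected[lang] if detected[lang] else 0.0
        recall = true_pos[lang] / actual[lang] if actual[lang] else 0.0
        per_language[lang] = (precision, recall)
    return accuracy, per_language


# Hypothetical (detected, true) pairs for illustration, not real benchmark results.
pairs = [
    ("Vietnamese", "Vietnamese"),
    ("Turkish", "Turkish"),
    ("Spanish", "Portuguese"),
    ("Portuguese", "Portuguese"),
]
accuracy, per_language = score_detection(pairs)
print(accuracy)
print(per_language["Portuguese"])
```

With only a handful of rows the numbers mean little; a larger `rows` limit in the query gives a more useful picture.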
Two of the algorithms, what_lang and lingua, can output a confidence score with detect_lang_confidence.
Supported algorithms: [2]
- what_lang
- lingua
- whichlang
what_lang and lingua support language subsets and language exclusion.
lingua supports high and low accuracy modes.
It is also possible to detect the script of the text with what_lang and lingua.
df = lf.select(
"inputs",
"language",
polari.detect_script("inputs").alias("detected_script"),
).collect()
print(df.head(3))
┌────────────────────────────────────────────────────────┬─────────────────┬─────────────────┐
│ inputs                                                 ┆ language        ┆ detected_script │
│ ---                                                    ┆ ---             ┆ ---             │
│ str                                                    ┆ str             ┆ str             │
╞════════════════════════════════════════════════════════╪═════════════════╪═════════════════╡
│ Heestan waxaa qada Khalid Haref Ahmed                  ┆ Somali          ┆ Latin           │
│ OO ku Jiray Kooxdii Dur Dur!                           ┆                 ┆                 │
│ Quels président des États-Unis ne s’est jamais marié ? ┆ French          ┆ Latin           │
│ كم عدد الخلفاء الراشدين ؟ أجب على السؤال السابق.       ┆ Standard Arabic ┆ Arabic          │
└────────────────────────────────────────────────────────┴─────────────────┴─────────────────┘
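For intuition about what script detection involves, here is a rough stdlib-only heuristic. It is an illustration only and makes no claim about how polari or the underlying Rust crates work: the function name `dominant_script` is hypothetical, and the first word of each character's Unicode name is used as a stand-in for its script.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the dominant script of `text` from Unicode character names.

    Illustrative heuristic only: the first word of a character's Unicode
    name (e.g. "LATIN", "ARABIC", "HIRAGANA") stands in for its script.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0].title()] += 1
    # Return the most frequent script label, or None if no letters were found.
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Quels président des États-Unis ne s'est jamais marié ?"))
print(dominant_script("كم عدد الخلفاء الراشدين"))
```

A real detector handles mixed-script text and edge cases far more carefully; this only shows why script is easier to determine than language.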
polari can detect the sentiment of English language text via a Rust port of VADER.
The pre-trained model was originally trained for sentiment detection on social media posts, but has semi-decent performance on opinionated text. The example below analyses Amazon reviews.
import polari
import duckdb
from time import time
from polars import Config
# On row limits below the millions, the LazyFrame setup with duckdb will take most of the time.
# This loads `rows` reviews for each of the 1-, 3-, and 5-star ratings.
rows = 1
subset = "Beauty_and_Personal_Care"
dataset = f"hf://datasets/McAuley-Lab/Amazon-Reviews-2023/raw/review_categories/{subset}.jsonl"
# set up the LazyFrame
lf = (
    duckdb.sql(
        f"""
        WITH positive as(
            SELECT text, rating FROM '{dataset}' WHERE rating = 5 LIMIT {rows}
        )
        , neutral as(
            SELECT text, rating FROM '{dataset}' WHERE rating = 3 LIMIT {rows}
        )
        , negative as(
            SELECT text, rating FROM '{dataset}' WHERE rating = 1 LIMIT {rows}
        )
        SELECT * FROM positive
        UNION ALL
        SELECT * FROM negative
        UNION ALL
        SELECT * FROM neutral;
        """
    )
    .pl()
    .lazy()
)
df = lf.select(
    "text",
    polari.get_sentiment("text", output_type="compound").alias("sentiment"),
    polari.get_sentiment("text", output_type="pos").alias("pos"),
    polari.get_sentiment("text", output_type="neu").alias("neu"),
    polari.get_sentiment("text", output_type="neg").alias("neg"),
    "rating",
).collect()
print(df.head())
┌───────────────────────────────────────────────────────┬───────────┬──────────┬──────────┬──────────┬────────┐
│ text                                                  ┆ sentiment ┆ pos      ┆ neu      ┆ neg      ┆ rating │
│ ---                                                   ┆ ---       ┆ ---      ┆ ---      ┆ ---      ┆ ---    │
│ str                                                   ┆ f64       ┆ f64      ┆ f64      ┆ f64      ┆ f64    │
╞═══════════════════════════════════════════════════════╪═══════════╪══════════╪══════════╪══════════╪════════╡
│ Bought this for my granddaughter. Her entire family…  ┆ 0.63695   ┆ 0.21875  ┆ 0.78125  ┆ 0.0      ┆ 5.0    │
│ This is a good product but it doesn't last very long… ┆ 0.238227  ┆ 0.130435 ┆ 0.869565 ┆ 0.0      ┆ 3.0    │
│ Tops the list for worst purchase. Tried these for al… ┆ -0.939365 ┆ 0.094854 ┆ 0.735183 ┆ 0.169963 ┆ 1.0    │
└───────────────────────────────────────────────────────┴───────────┴──────────┴──────────┴──────────┴────────┘
Output types include:
- compound
- neu (neutral)
- pos (positive)
- neg (negative)
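The compound score is a single value between -1 and 1, so it is often bucketed into labels. The sketch below uses the thresholds conventionally suggested for VADER (0.05 and -0.05); the function name `label_compound` is hypothetical and not part of polari, and the scores are copied from the sample output above.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER-style compound score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# (compound score, star rating) pairs copied from the sample output.
samples = [(0.63695, 5.0), (0.238227, 3.0), (-0.939365, 1.0)]
labels = [label_compound(score) for score, _ in samples]
print(labels)
```

Note that the 3-star review lands in the "positive" bucket under these thresholds, so star ratings and compound scores should not be expected to line up exactly.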
Language detection:
Sentiment:
Polars:
- The Polars DataFrame library.
- Marco Gorelli. In addition to being a stalwart in the open source DataFrame community, Marco has made an incredible tutorial for making polars plugins.
Footnotes
1. In the extremely unlikely scenario that this project becomes popular, and therefore a library that needs to sustain itself, users could also be invited to donate to the project in a separate section of the README.
2. Benchmarking algorithm prediction precision/recall can be done with polari. Differences in detection speed between algorithms may be due to the implementation in polari, rather than the original Rust crate.