Polari can perform two purposes:
- Detect the language of natural language text.
- Detect the sentiment of English language text with a basic pre-trained algorithm.
Python's simplicity with rust's speed and scale. 🚀
"Polari (from Italian parlare 'to talk') is a form of slang or cant historically used in Britain."Wikipedia
Polari was spoken by "mostly camp gay men. They were a class of people who lived on the margins of society. Many of them broke the law - a law which is now seen... as being unfair and cruel - and so they were at risk of arrest, shaming, blackmail, and attack. They were not seen as important or interesting. Their stories were not told." Fabulosa!: The Story of Polari, Britain's Secret Gay Language p. 10-11
The polari
library:
- Performs language & sentiment detection.
- Is a plugin for a library named polars.
- Was, coincidentally, first released during Pride Month (June 2024).
If you have fun with this library, please consider donating to a charity which supports LGBTQIA+ folks.
Perhaps:
- Stonewall
- The Trevor Project
- Mermaids
- Gendered Intelligence
- An organisation which supports people close to wherever you are in the world.
Pull requests with further charity & organisation suggestions are welcome.1
For quick setup with sample data, install the requirements in examples/example_requirements.txt
# Linux/MacOS
python -m venv .venv && source .venv/bin/activate && python -m pip install polari duckdb==0.10.3 polars==0.20.30 pyarrow==16.1.0
Load some sample data:
import polari
import duckdb
from time import time
from polars import Config, col
# On row limits below the millions, the LazyFrame setup with duckdb will take most of the time.
rows = 5
# here are the languages that whichlang supports
languages = (
# The MSA and Simplified Chinese less precise names in polari.
"Modern Standard Arabic",
"Simplified Chinese",
"German",
"English",
"French",
"Hindi",
"Italian",
"Japanese",
"Korean",
"Dutch",
"Portuguese",
"Russian",
"Spanish",
"Swedish",
"Turkish",
"Vietnamese",
)
# set up the LazyFrame
lf = (
duckdb.sql(
f"SELECT inputs, language FROM 'hf://datasets/CohereForAI/aya_dataset/data/train-00000-of-00001.parquet' WHERE language in {languages} LIMIT {str(rows)};"
)
.pl()
.lazy()
)
Config.set_tbl_hide_dataframe_shape(True)
df = lf.select(
"inputs",
polari.detect_lang("inputs", algorithm="which_lang").alias("detected_lang"),
col("language").alias("true_lang"),
).collect()
print(df)
┌─────────────────────────────────┬───────────────┬────────────┐
│ inputs ┆ detected_lang ┆ true_lang │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════════════════════════════╪═══════════════╪════════════╡
│ Hãy tiếp tục đoạn văn sau: ┆ Vietnamese ┆ Vietnamese │
│ "T… ┆ ┆ │
│ Bu paragrafın devamını yazın: … ┆ Turkish ┆ Turkish │
│ ¿Cuál es la respuesta correcta… ┆ Spanish ┆ Spanish │
│ 中押(ちゅうお)し勝ちといえば、 ┆ Japanese ┆ Japanese │
│ どんなゲームの勝負の決まり方… ┆ ┆ │
│ Em que ano os filmes deixaram … ┆ Portuguese ┆ Portuguese │
└─────────────────────────────────┴───────────────┴────────────┘
The above is only with the whichlang
algorithm, the quickest and simplest algorithm.
Two of the algorithms can output a confidence score with detect_lang_confidence
: what_lang, and lingua.
Supported algorithms:2
what_lang
lingua
whichlang
what_lang
and lingua
support language subsets and language exclusion.
lingua
supports high and low accuracy mode.
It is also possible to detect the script of the dataset with what_lang
and lingua
.
df = lf.select(
"inputs",
"language",
polari.detect_script("inputs").alias("detected_script"),
).collect()
print(df.head(3))
┌────────────────────────────────────────────────────────┬─────────────────┬─────────────────┐
│ inputs ┆ language ┆ detected_script │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════════════════════════════════════════════════════╪═════════════════╪═════════════════╡
│ Heestan waxaa qada Khalid Haref Ahmed ┆ Somali ┆ Latin │
│ OO ku Jiray Kooxdii Dur Dur! ┆ ┆ │
│ Quels président des États-Unis ne s’est jamais marié ? ┆ French ┆ Latin │
│ كم عدد الخلفاء الراشدين ؟ أجب على السؤال السابق. ┆ Standard Arabic ┆ Arabic │
└────────────────────────────────────────────────────────┴─────────────────┴─────────────────┘
polari
can detect the sentiment of English language text via a rust port of VADER.
The pre-trained model was originally trained for sentiment detection on social media posts, but has semi-decent performance on opinionated text. The below performs analysis on amazon reviews.
import polari
import duckdb
from time import time
from polars import Config
# On row limits below the millions, the LazyFrame setup with duckdb will take most of the time.
# This will load {rows} of 1*, 3*, and 5* reviews.
rows = 1
subset="Beauty_and_Personal_Care"
dataset = f"hf://datasets/McAuley-Lab/Amazon-Reviews-2023/raw/review_categories/{subset}.jsonl"
# set up the LazyFrame
lf = (
duckdb.sql(
f"""
WITH positive as(
SELECT text, rating FROM '{dataset}' WHERE rating = 5 LIMIT {rows}
)
, neutral as(
SELECT text, rating FROM '{dataset}' WHERE rating = 3 LIMIT {rows}
)
, negative as(
SELECT text, rating FROM '{dataset}' WHERE rating = 1 LIMIT {rows}
)
SELECT * FROM positive
UNION ALL
SELECT * FROM negative
UNION ALL
SELECT * FROM neutral;
"""
)
.pl()
.lazy()
)
df = lf.select(
"text",
polari.get_sentiment("text", output_type="compound").alias("sentiment"),
polari.get_sentiment("text", output_type="pos").alias("pos"),
polari.get_sentiment("text", output_type="neu").alias("neu"),
polari.get_sentiment("text", output_type="neg").alias("neg"),
"rating",
).collect()
df.head()
┌──────────────────────────────────────────────────────┬───────────┬──────────┬──────────┬──────────┬────────┐
│ text ┆ sentiment ┆ pos ┆ neu ┆ neg ┆ rating │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════════════════════════════════════════════════╪═══════════╪══════════╪══════════╪══════════╪════════╡
│ Bought this for my granddaughter. Her entire family…┆ 0.63695 ┆ 0.21875 ┆ 0.78125 ┆ 0.0 ┆ 5.0 │
│ This is a good product but it doesn't last very long…┆ 0.238227 ┆ 0.130435 ┆ 0.869565 ┆ 0.0 ┆ 3.0 │
│ Tops the list for worst purchase. Tried these for al…┆ -0.939365 ┆ 0.094854 ┆ 0.735183 ┆ 0.169963 ┆ 1.0 │
└──────────────────────────────────────────────────────┴───────────┴──────────┴──────────┴──────────┴────────┘
Output types include:
- compound
- neutral
- positive
- negative.
Language detection:
Sentiment:
Polars:
- The Polars DataFrame library.
- Marco Gorelli. In addition to being a stalwart in the open source DataFrame community, Marco has made an incredible tutorial for making polars plugins.
Footnotes
-
In the extremely unlikely scenario that this project becomes popular, and therefore a library that needs to sustain itself, users could also be invited to donate to the project in a separate section of the README. ↩
-
Benchmarking algorithm prediction precision/recall can be done with
polari
. Difference in detection speed by algorithm may be due to the implementation inpolari
, rather than the original rust crate. ↩