Semantics
The semantics functionality requires scikit-learn and NumPy, which you can install with:
pip install scikit-learn numpy
UralicNLP can cluster documents into semantically meaningful categories using LLM embeddings.
from uralicNLP.llm import get_llm
from uralicNLP import semantics
llm = get_llm("roneneldan/TinyStories-33M")
texts = ["dogs are funny", "cats play around", "cars go fast", "planes fly around", "parrots like to eat", "eagles soar in the skies", "moon is big", "saturn is a planet"]
semantics.cluster(texts, llm)
>>[['dogs are funny', 'parrots like to eat', 'moon is big'], ['cats play around', 'cars go fast', 'planes fly around', 'eagles soar in the skies'], ['saturn is a planet']]
This method groups the texts into semantically similar clusters. You can use whichever LLM you want (see the LLM documentation for more). Note: a bigger LLM will give better results than those seen in this example.
If you need to get the indices instead of the actual texts, you can pass return_ids=True.
semantics.cluster(texts, llm, return_ids=True)
>>[[0, 4, 6], [1, 2, 3, 5], [7]]
These indices are relative to the texts list that is passed to the method.
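Because the indices point back into the original list, they can be turned into texts again or into one cluster number per text. The following is a minimal sketch in plain Python that reuses the example output shown above; `ids_to_texts` and `cluster_of_each_text` are hypothetical helper names for illustration, not part of UralicNLP.

```python
# Sample data copied from the example above.
texts = ["dogs are funny", "cats play around", "cars go fast", "planes fly around",
         "parrots like to eat", "eagles soar in the skies", "moon is big",
         "saturn is a planet"]
# Output of semantics.cluster(texts, llm, return_ids=True) as shown above.
id_clusters = [[0, 4, 6], [1, 2, 3, 5], [7]]


def ids_to_texts(id_clusters, texts):
    """Hypothetical helper: replace every index with the text at that position."""
    return [[texts[i] for i in cluster] for cluster in id_clusters]


def cluster_of_each_text(id_clusters, n_texts):
    """Hypothetical helper: position i holds the cluster number of text i."""
    labels = [None] * n_texts
    for cluster_number, cluster in enumerate(id_clusters):
        for i in cluster:
            labels[i] = cluster_number
    return labels


text_clusters = ids_to_texts(id_clusters, texts)
labels = cluster_of_each_text(id_clusters, len(texts))
print(text_clusters[0])
print(labels)
```

The per-text label list is handy when the texts live in a table or data frame and the cluster number should be stored as an extra column.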
The clustering method uses Affinity Propagation by default, but it is also possible to use HDBSCAN by passing method="hdbscan". The cluster method accepts the same parameters as the corresponding scikit-learn clusterer.
semantics.cluster(texts, llm, method="hdbscan")
>>[['dogs are funny', 'moon is big'], ['cars go fast', 'parrots like to eat', 'eagles soar in the skies'], ['cats play around', 'planes fly around', 'saturn is a planet']]
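For background, scikit-learn clusterers report their result as one integer label per input, and HDBSCAN uses the label -1 for points it treats as noise. The sketch below shows how such a flat label list relates to the nested-list output format shown above. It is an illustration only: `group_by_label` is a hypothetical helper, the labels are made up, and how UralicNLP itself handles noise points is not stated here, so returning them separately is an assumption.

```python
from collections import defaultdict


def group_by_label(texts, labels, noise_label=-1):
    """Hypothetical helper: group texts by their cluster label.

    Returns (clusters, noise). Texts whose label equals noise_label are
    collected separately; this handling is an assumption of the sketch.
    """
    clusters = defaultdict(list)
    noise = []
    for text, label in zip(texts, labels):
        if label == noise_label:
            noise.append(text)
        else:
            clusters[label].append(text)
    # Order the clusters by their label so the result is deterministic.
    return [clusters[key] for key in sorted(clusters)], noise


sample_texts = ["a", "b", "c", "d", "e"]
sample_labels = [0, 1, 0, -1, 1]  # made-up labels for illustration
clusters, noise = group_by_label(sample_texts, sample_labels)
print(clusters)
print(noise)
```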
The method is described in the following paper, which can be cited:
Hämäläinen, M., Rueter, J., & Alnajjar, K. (2024). Analyzing Pokémon and Mario Streamers’ Twitch Chat with LLM-based User Embeddings. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities (pp. 499-503).
The parameters for Affinity Propagation, with their default values, are:
damping=0.5, max_iter=200, convergence_iter=15, copy=True, preference=None, affinity='euclidean', verbose=False, random_state=None
The parameters for HDBSCAN, with their default values, are:
min_cluster_size=2, min_samples=None, cluster_selection_epsilon=0.0, max_cluster_size=None, metric='euclidean', metric_params=None, alpha=1.0, algorithm='auto', leaf_size=40, n_jobs=None, cluster_selection_method='eom', allow_single_cluster=False, store_centers=None, copy=False
To do the clustering with embeddings from the embed_endangered method, call cluster_endangered as in the following code.
from uralicNLP.llm import get_llm
from uralicNLP import semantics
llm = get_llm("roneneldan/TinyStories-33M")
endangered_texts = ["Ёртозь ёртовсь кудостонть.", "Теке сялгонзояк те касовксонть арасть.", "Истяяк арсеват.", "Атякштне, кунсолан, сыргойсть омбоцеде.", "Вальмаванть неявить ульцява ардыцят.", "Морат эрзянь моро?"]
semantics.cluster_endangered(endangered_texts, llm, "myv", "fin")
>>[['Теке сялгонзояк те касовксонть арасть.'], ['Атякштне, кунсолан, сыргойсть омбоцеде.'], ['Ёртозь ёртовсь кудостонть.', 'Истяяк арсеват.', 'Вальмаванть неявить ульцява ардыцят.', 'Морат эрзянь моро?']]
UralicNLP is an open-source Python library by Mika Hämäläinen