KnowledgeCommunity: Content Bundle #7837

Closed
4 tasks
drew2a opened this issue Jan 18, 2024 · 14 comments · Fixed by #7953

Comments

@drew2a
Contributor

drew2a commented Jan 18, 2024

"Content Bundle" is a strategic feature in Tribler aimed at enhancing the organization and accessibility of digital content. It acts as an aggregation point for Content Items, bundling them together under a single, cohesive unit. This structure allows users to efficiently manage and access groups of related Content Items, simplifying navigation and retrieval. Ideal for categorizing content that shares common themes, attributes, or sources, the Content Bundle provides a streamlined way to handle complex sets of information, making it easier for users to find and interact with a rich array of content within the Tribler network.

The current representation of Content Items can be seen in the following picture:

[image: current representation of Content Items]

We want them to have another layer of grouping:

Ubuntu 20
|
├ Ubuntu 20.04
|  ├ infohash 1
|  ├ ...
|  └ infohash N
|
└ Ubuntu 20.10
   ├ infohash K
   ├ ...
   └ infohash M

Everything we need already exists in our Knowledge Database; we can reuse the existing CONTENT_ITEM as follows:

subject_type |   subject    | object_type  |    object    |
===========================================================
TORRENT      |  infohash 1  | CONTENT_ITEM | Ubuntu 20.04 |
TORRENT      |  infohash N  | CONTENT_ITEM | Ubuntu 20.04 |
TORRENT      |  infohash K  | CONTENT_ITEM | Ubuntu 20.10 |
TORRENT      |  infohash M  | CONTENT_ITEM | Ubuntu 20.10 |
CONTENT_ITEM | Ubuntu 20.04 | CONTENT_ITEM |   Ubuntu 20  |
CONTENT_ITEM | Ubuntu 20.10 | CONTENT_ITEM |   Ubuntu 20  |
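
As a rough illustration of how this first structure could be read back into the tree shown earlier, here is a minimal sketch. The rows are copied from the table above; the in-memory list and the `build_bundle` helper are hypothetical and are not part of the Knowledge Database API.

```python
from collections import defaultdict

# (subject_type, subject, object_type, object) rows copied from the table above.
STATEMENTS = [
    ("TORRENT", "infohash 1", "CONTENT_ITEM", "Ubuntu 20.04"),
    ("TORRENT", "infohash N", "CONTENT_ITEM", "Ubuntu 20.04"),
    ("TORRENT", "infohash K", "CONTENT_ITEM", "Ubuntu 20.10"),
    ("TORRENT", "infohash M", "CONTENT_ITEM", "Ubuntu 20.10"),
    ("CONTENT_ITEM", "Ubuntu 20.04", "CONTENT_ITEM", "Ubuntu 20"),
    ("CONTENT_ITEM", "Ubuntu 20.10", "CONTENT_ITEM", "Ubuntu 20"),
]


def build_bundle(statements, bundle_name):
    """Return {content_item: [infohash, ...]} for one bundle (hypothetical helper)."""
    # Index every subject by the CONTENT_ITEM it points to.
    children = defaultdict(list)
    for subject_type, subject, object_type, obj in statements:
        if object_type == "CONTENT_ITEM":
            children[obj].append((subject_type, subject))

    bundle = {}
    for subject_type, item in children[bundle_name]:
        # Only CONTENT_ITEM subjects form the second level of the tree.
        if subject_type == "CONTENT_ITEM":
            bundle[item] = [s for t, s in children[item] if t == "TORRENT"]
    return bundle


bundle = build_bundle(STATEMENTS, "Ubuntu 20")
for item, infohashes in bundle.items():
    print(item, infohashes)
```

With this layout a bundle is just a CONTENT_ITEM that other CONTENT_ITEMs point to, so no new object type is needed.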

Or another structure:


subject_type |         subject              | object_type  |    object    |
===========================================================================
TORRENT      |        infohash 1            | CONTENT_ITEM | Ubuntu 20    |
TORRENT      |        infohash N            | CONTENT_ITEM | Ubuntu 20    |
TORRENT      |        infohash K            | CONTENT_ITEM | Ubuntu 20    |
TORRENT      |        infohash M            | CONTENT_ITEM | Ubuntu 20    |
CONTENT_ITEM | HASH(infohash 1 + Ubuntu 20) | CONTENT_ITEM |      04      |
CONTENT_ITEM | HASH(infohash N + Ubuntu 20) | CONTENT_ITEM |      04      |
CONTENT_ITEM | HASH(infohash K + Ubuntu 20) | CONTENT_ITEM |      10      |
CONTENT_ITEM | HASH(infohash M + Ubuntu 20) | CONTENT_ITEM |      10      |
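
The `HASH(...)` function in this variant is not specified. The sketch below assumes, purely for illustration, a SHA-1 digest over the concatenated strings; `statement_key` and `rows_for` are hypothetical names.

```python
import hashlib


def statement_key(infohash: str, content_item: str) -> str:
    """Hypothetical HASH(infohash + content item): SHA-1 over the concatenation."""
    return hashlib.sha1((infohash + content_item).encode("utf-8")).hexdigest()


def rows_for(infohash: str, bundle: str, item: str):
    """Build the two rows this structure stores for a single torrent."""
    key = statement_key(infohash, bundle)
    return [
        ("TORRENT", infohash, "CONTENT_ITEM", bundle),
        ("CONTENT_ITEM", key, "CONTENT_ITEM", item),
    ]


rows = rows_for("infohash 1", "Ubuntu 20", "04")
for row in rows:
    print(row)
```

Because the key depends on both the infohash and the bundle name, the second-level label ("04") is attached to one specific torrent-to-bundle statement rather than to the bundle as a whole.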

Or

subject_type |         subject              | object_type  |    object    |
===========================================================================
TORRENT      |        infohash 1            | CONTENT_ITEM |      04      |
TORRENT      |        infohash N            | CONTENT_ITEM |      04      |
TORRENT      |        infohash K            | CONTENT_ITEM |      10      |
TORRENT      |        infohash M            | CONTENT_ITEM |      10      |
CONTENT_ITEM |    HASH(infohash 1 + 04)     | CONTENT_ITEM | Ubuntu 20    |
CONTENT_ITEM |    HASH(infohash N + 04)     | CONTENT_ITEM | Ubuntu 20    |
CONTENT_ITEM |    HASH(infohash K + 10)     | CONTENT_ITEM | Ubuntu 20    |
CONTENT_ITEM |    HASH(infohash M + 10)     | CONTENT_ITEM | Ubuntu 20    |

So it's an open question regarding the structure. Please suggest your ideas.

To complete this task, we need to:

  • Backend: Add hash (if needed) and implement tests.
  • REST API endpoint?
  • Frontend: Make changes to the Edit Metadata Dialog.
  • Frontend: Make modifications to the UI.

Related:

@synctext
Member

synctext commented Feb 7, 2024

We deployed Network Buzz within Tribler in 2010. This is what Reddit had to say then about it: Nothing has changed as of today. Thanks to the "network buzz" feature (which you can't turn off) it almost never goes below 10% CPU utilization and sometimes just sits at 50% maxing out one of my 2 cores.
Related prior work is the tag systems we have had since 2011; we lack the user community for this. MusicBrainz has volunteers with over 1 million edit/tagging contributions; that is our leading example.

Please do a 10-day prototype @drew2a. You have now done a top-down design exploration. Time for bottom-up "learn-by-doing". We do not know whether "content bundling" can be done exclusively and perfectly with local heuristics and zero database changes, or whether we need to store and gather rich metadata and offer content enrichment plus database changes. What about the near-duplicates we studied for years (and never could fix)?

Let's leave the anti-spam for future sprints ❗ No need to distract @grimadas from deployment and debugging the low-level rendezvous component. Only in 2025 will we revisit the "Justin Bieber is gay" tag-spam problem.

@drew2a
Contributor Author

drew2a commented Feb 13, 2024

The first attempt at trying to group search results locally doesn't offer much hope, as it tends to group quite random torrents together without organizing them into the same content group.

The developed script:

  1. Load Titles: Loads titles from a specified text file.
  2. Download NLTK Resources: Downloads necessary NLTK resources like stopwords and WordNet for lemmatization.
  3. Preprocess Text: Includes removing text inside parentheses and brackets, converting to lowercase, removing punctuation and stopwords, and lemmatizing.
  4. Vectorize Text: Uses TF-IDF to convert the preprocessed titles into numerical vectors.
  5. Cluster Titles: Applies K-means clustering to the vectorized titles.
  6. Output Results: Groups titles by their cluster and prints them.
from collections import defaultdict
from pathlib import Path

import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces (escape so ']' and '\' are literal)
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
)

print('Results:')
for r in results[:50]:
    print(f'\t{r}')
first_title = results[0]

# Preprocess each title
print('\nPreprocessed results:')
preprocessed_results = [preprocess_text(title) for title in results]
for r in preprocessed_results:
    print(f'\t{r}')

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_results)
print("Clustering...")
# Cluster using K-means (n_clusters is left at scikit-learn's default of 8)
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Output clustering results
labels = kmeans.labels_
clusters = defaultdict(list)

# Group titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])

# Print clustering results by cluster
print("Clustering results by cluster:")
for cluster, titles in clusters.items():
    print(f"\nCluster {cluster}:")
    for title in titles:
        print(f"- {title}")

Results for ubuntu:

Results:
	Ubuntu Linux основы администрирования
	ubuntu-14.04.5-server-amd64.iso
	ubuntu-mate-20.04.3-desktop-amd64.iso
	ubuntu-18.04-live-server-amd64.iso
	Ubuntu 20.04.2.0 Desktop (64-bit)
	ubuntu-22.04-live-server-amd64.iso
	Ubuntu 20.04.3 (AMD64) (Server)
	ubuntu-17.04-server-amd64.iso
	ubuntu-20.04.4-desktop-amd64.iso
	ubuntu-20.04.4-live-server-amd64.iso
	ubuntu-22.04.1-desktop-amd64.iso
	Ubuntu reducido
	ubuntu
	ubuntu-11.04-desktop-amd64.iso
	ubuntu-11.04-alternate-i386.iso
	Ubuntu 12.10 Desktop (i386)
	ubuntu-20.04.1-desktop-amd64.iso
	ubuntu-22.04-desktop-amd64.iso
	Ubuntu 20.04.1 Desktop.iso
	ubuntu-22.10-desktop-amd64.iso
	ubuntu-18.10-desktop-amd64.iso
	ubuntu-14.10-desktop-i386.iso
	[Ubuntu] Anonymous OS 0.1
	Ubuntu 9.10
	ubuntu-15.04-desktop-i386.iso
	ubuntu-14.04-desktop-i386.iso
	ubuntu-11.10-dvd-amd64.iso
	ubuntu-15.04-desktop-amd64.iso
	ubuntu-18.04.3-live-server-amd64.iso
	ubuntu-21.04-desktop-amd64.iso
	Ubuntu 16.10
	ubuntu-23.04-live-server-amd64.iso
	ubuntu-18.10-server-amd64.iso
	ubuntu-18.04.1-desktop-amd64.iso
	ubuntu-14.04.4-desktop-amd64.iso
	ubuntu-14.04-server-amd64.ova
	ubuntu-16.04-desktop-i386.iso
	ubuntu-16.04.6-server-amd64.iso
	ubuntu-10.10-xenon-beta5
	ubuntu-20.04.3-desktop-amd64.iso
	Ubuntu
	ubuntu-20.04.2-desktop-amd64.iso
	Ubuntu Facile 01 2014.pdf
	ubuntu-12.04.5-dvd-i386.iso
	ubuntu-17.10-desktop-amd64.iso
	ubuntu-mate-20.04.4-desktop-amd64.iso
	Ubuntu Unleashed 2019 Edition
	ubuntu-20.04-live-server-amd64.iso
	ubuntu-11.10-desktop-i386.iso
	Ubuntu Ultimate Edition 1.9

Preprocessed results:
	ubuntu linux основы администрирования
	ubuntu 14 04 5 server amd64 iso
	ubuntu mate 20 04 3 desktop amd64 iso
	ubuntu 18 04 live server amd64 iso
	ubuntu 20 04 2 0 desktop
	ubuntu 22 04 live server amd64 iso
	ubuntu 20 04 3
	ubuntu 17 04 server amd64 iso
	ubuntu 20 04 4 desktop amd64 iso
	ubuntu 20 04 4 live server amd64 iso
	ubuntu 22 04 1 desktop amd64 iso
	ubuntu reducido
	ubuntu
	ubuntu 11 04 desktop amd64 iso
	ubuntu 11 04 alternate i386 iso
	ubuntu 12 10 desktop
	ubuntu 20 04 1 desktop amd64 iso
	ubuntu 22 04 desktop amd64 iso
	ubuntu 20 04 1 desktop iso
	ubuntu 22 10 desktop amd64 iso
	ubuntu 18 10 desktop amd64 iso
	ubuntu 14 10 desktop i386 iso
	anonymous o 0 1
	ubuntu 9 10
	ubuntu 15 04 desktop i386 iso
	ubuntu 14 04 desktop i386 iso
	ubuntu 11 10 dvd amd64 iso
	ubuntu 15 04 desktop amd64 iso
	ubuntu 18 04 3 live server amd64 iso
	ubuntu 21 04 desktop amd64 iso
	ubuntu 16 10
	ubuntu 23 04 live server amd64 iso
	ubuntu 18 10 server amd64 iso
	ubuntu 18 04 1 desktop amd64 iso
	ubuntu 14 04 4 desktop amd64 iso
	ubuntu 14 04 server amd64 ovum
	ubuntu 16 04 desktop i386 iso
	ubuntu 16 04 6 server amd64 iso
	ubuntu 10 10 xenon beta5
	ubuntu 20 04 3 desktop amd64 iso
	ubuntu
	ubuntu 20 04 2 desktop amd64 iso
	ubuntu facile 01 2014 pdf
	ubuntu 12 04 5 dvd i386 iso
	ubuntu 17 10 desktop amd64 iso
	ubuntu mate 20 04 4 desktop amd64 iso
	ubuntu unleashed 2019 edition
	ubuntu 20 04 live server amd64 iso
	ubuntu 11 10 desktop i386 iso
	ubuntu ultimate edition 1 9
	ubuntu netbook remix
	ubuntu 21 10 desktop amd64 iso
	ubuntu budgie 22 04 3 desktop amd64 iso
	ubuntu 16 04 3 server amd64 iso
	ubuntu 16 04 5
	ubuntu 14 04 6 desktop amd64 mac iso
	ubuntu 12 10 desktop i386 iso
	ubuntu 9 10 пользовательская сборка
	ubuntu 20 04 2 0 desktop amd64 iso
	ubuntu 16 10 server arm64 iso
	ubuntu satanic edition 666 4
	ubuntu 23 10 beta desktop amd64 iso
	ubuntu mate 19 10 desktop amd64 iso
	ubuntu 21 04 live server amd64 iso
	ubuntu 21 10 beta pack
	ubuntu 20 04 desktop amd64 iso
	ubuntu 12 04 4 desktop amd64 mac iso
	ubuntu 16 10 desktop i386 iso
	ubuntu linux ebook pack
	ubuntu 14 10 desktop amd64 iso
	ubuntu 13 04 desktop i386 iso
	ubuntu 12 04 5 desktop i386 iso
	ubuntu 18 04
	ubuntu 16 04 7 server amd64 iso
	ubuntu 12 04 server i386 iso
	ubuntu 22 04 3 live server amd64 iso
	ubuntu server essential 6685
	ubuntu 12 04 5 desktop amd64 iso
	ubuntu 14 10 server amd64 iso
	ubuntu 19 04 desktop amd64 iso
	ubuntu book ru djvu
	ubuntu 16 04 5 desktop amd64 iso
	ubuntu 15 04 server amd64 iso
	ubuntu unity 22 10 desktop amd64 iso
	ubuntu 11 10 oneiric ocelot
	ubuntu mate 21 10 desktop amd64 iso
	ubuntu 18 04 6 desktop amd64 iso
	ubuntu 20 10 desktop amd64 iso
	ubuntu facile aprile 2015 pdf
	ubuntu 18 04 desktop amd64 iso
	ubuntu server 20 04 2 lts
	ubuntu 18 04 4 desktop amd64 iso
	ubuntu pack 16 04 unity
	ubuntu 10 04 netbook
	ubuntu 14 04 1 server amd64 iso
	ubuntu facile marzo 2015 pdf
	ubuntu ultimate 1 4 dvd
	ubuntu 14 04 server i386 iso
	ubuntu 14 04 6 desktop i386 iso
	ubuntu 20 04 x64 untouched david1893
	ubuntu 18 04 5 live server amd64 iso
	ubuntu 16 04 6 server i386 iso
	ubuntu facile 04 2014 pdf
	ubuntu 19 04 server amd64 iso
	ubuntu 16 04 7 desktop amd64 iso
	ubuntu 19 10 desktop amd64 iso
	ubuntu 22 04 2 desktop amd64 iso
	ubuntu 16 04 6 desktop i386 iso
	ubuntu 16 10 desktop amd64 iso
	ubuntu 19 10 live server amd64 iso
	ubuntu 14 04 desktop amd64 iso
Clustering...
Clustering results by cluster:

Cluster 0:
- Ubuntu Linux основы администрирования
- ubuntu-mate-20.04.3-desktop-amd64.iso
- Ubuntu 20.04.2.0 Desktop (64-bit)
- Ubuntu 20.04.3 (AMD64) (Server)
- ubuntu-20.04.4-desktop-amd64.iso
- ubuntu-20.04.4-live-server-amd64.iso
- ubuntu-22.04.1-desktop-amd64.iso
- Ubuntu reducido
- ubuntu
- ubuntu-20.04.1-desktop-amd64.iso
- ubuntu-22.04-desktop-amd64.iso
- Ubuntu 20.04.1 Desktop.iso
- ubuntu-15.04-desktop-amd64.iso
- ubuntu-21.04-desktop-amd64.iso
- ubuntu-20.04.3-desktop-amd64.iso
- Ubuntu
- ubuntu-20.04.2-desktop-amd64.iso
- ubuntu-mate-20.04.4-desktop-amd64.iso
- ubuntu-20.04-live-server-amd64.iso
- Ubuntu Netbook Remix
- ubuntu-budgie-22.04.3-desktop-amd64.iso
- ubuntu-20.04.2.0-desktop-amd64.iso
- ubuntu-20.04-desktop-amd64.iso
- ubuntu-12.04.4-desktop-amd64+mac.iso
- Ubuntu Linux ebook pack
- ubuntu-12.04.5-desktop-amd64.iso
- ubuntu-19.04-desktop-amd64.iso
- Ubuntu-Book_RU.djvu
- ubuntu-20.10-desktop-amd64.iso
- Ubuntu Server 20.04.2 LTS
- Ubuntu - 20.04 - X64 - UNTOUCHED - David1893
- ubuntu-22.04.2-desktop-amd64.iso

Cluster 1:
- ubuntu-14.04.5-server-amd64.iso
- ubuntu-14.10-desktop-i386.iso
- ubuntu-14.04-desktop-i386.iso
- ubuntu-14.04.4-desktop-amd64.iso
- ubuntu-14.04-server-amd64.ova
- ubuntu-14.04.6-desktop-amd64+mac.iso
- ubuntu-14.10-desktop-amd64.iso
- ubuntu-14.10-server-amd64.iso
- ubuntu-14.04.1-server-amd64.iso
- ubuntu-14.04-server-i386.iso
- ubuntu-14.04.6-desktop-i386.iso
- ubuntu-14.04-desktop-amd64.iso

Cluster 2:
- ubuntu-18.04-live-server-amd64.iso
- Ubuntu 12.10 Desktop (i386)
- ubuntu-22.10-desktop-amd64.iso
- ubuntu-18.10-desktop-amd64.iso
- Ubuntu 9.10
- ubuntu-18.04.3-live-server-amd64.iso
- ubuntu-18.10-server-amd64.iso
- ubuntu-18.04.1-desktop-amd64.iso
- ubuntu-10.10-xenon-beta5
- ubuntu-17.10-desktop-amd64.iso
- ubuntu-21.10-desktop-amd64.iso
- Ubuntu 9.10 Пользовательская сборка
- ubuntu-23.10-beta-desktop-amd64.iso
- ubuntu-mate-19.10-desktop-amd64.iso
- ubuntu-21.10-beta-pack
- Ubuntu-18.04
- ubuntu-unity-22.10-desktop-amd64.iso
- ubuntu-mate-21.10-desktop-amd64.iso
- ubuntu-18.04.6-desktop-amd64.iso
- ubuntu-18.04-desktop-amd64.iso
- ubuntu-18.04.4-desktop-amd64.iso
- Ubuntu 10.04 Netbook
- ubuntu-18.04.5-live-server-amd64.iso
- ubuntu-19.10-desktop-amd64.iso

Cluster 5:
- ubuntu-22.04-live-server-amd64.iso
- ubuntu-17.04-server-amd64.iso
- Ubuntu 16.10
- ubuntu-23.04-live-server-amd64.iso
- ubuntu-16.04.6-server-amd64.iso
- ubuntu-16.04.3-server-amd64.iso
- Ubuntu-16.04.5
- ubuntu-16.10-server-arm64.iso
- ubuntu-21.04-live-server-amd64.iso
- ubuntu-16.04.7-server-amd64.iso
- ubuntu-22.04.3-live-server-amd64.iso
- Ubuntu Server Essentials - 6685 [ECLiPSE]
- ubuntu-16.04.5-desktop-amd64.iso
- ubuntu-15.04-server-amd64.iso
- ubuntu-pack-16.04-unity
- ubuntu-19.04-server-amd64.iso
- ubuntu-16.04.7-desktop-amd64.iso
- ubuntu-16.10-desktop-amd64.iso
- ubuntu-19.10-live-server-amd64.iso

Cluster 4:
- ubuntu-11.04-desktop-amd64.iso
- ubuntu-11.04-alternate-i386.iso
- ubuntu-11.10-dvd-amd64.iso
- ubuntu-11.10-desktop-i386.iso
- Ubuntu 11.10 Oneiric Ocelot
- ubuntu-ultimate-1.4-dvd

Cluster 3:
- [Ubuntu] Anonymous OS 0.1
- ubuntu-15.04-desktop-i386.iso
- ubuntu-16.04-desktop-i386.iso
- ubuntu-12.04.5-dvd-i386.iso
- ubuntu-12.10-desktop-i386.iso
- ubuntu-16.10-desktop-i386.iso
- ubuntu-13.04-desktop-i386.iso
- ubuntu-12.04.5-desktop-i386.iso
- ubuntu-12.04-server-i386.iso
- ubuntu-16.04.6-server-i386.iso
- ubuntu-16.04.6-desktop-i386.iso

Cluster 7:
- Ubuntu Facile 01 2014.pdf
- Ubuntu Facile - Aprile 2015.pdf
- Ubuntu Facile Marzo 2015.pdf
- Ubuntu Facile 04 2014.pdf

Cluster 6:
- Ubuntu Unleashed 2019 Edition
- Ubuntu Ultimate Edition 1.9
- Ubuntu Satanic Edition 666.4

@drew2a
Contributor Author

drew2a commented Feb 14, 2024

Expanding on the same idea: what if instead of searching for all similar entries in the search results, we look for entries similar only to the first (most relevant) result?

The developed script:

  1. Load Titles: Reads titles from a specified text file, which are then prepared for processing. This step is essential for acquiring the raw data that will be analyzed and clustered based on similarity.
  2. Download NLTK Resources: Utilizes the Natural Language Toolkit (NLTK) to download necessary resources such as stopwords and the WordNet lemmatizer. These resources are critical for the text preprocessing stage, enabling the removal of common words that offer little value to the analysis and the conversion of words to their base or root form.
  3. Preprocess Text: Employs several preprocessing techniques to clean and standardize the titles. This includes removing text inside parentheses and brackets with regular expressions, converting all text to lowercase to ensure uniformity, eliminating punctuation to reduce noise, removing stopwords to focus on meaningful words, and lemmatizing words to their base form to consolidate different forms of the same word.
  4. Vectorize Text Using TF-IDF: Transforms the preprocessed text into numerical vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) method. This vectorization reflects the importance of words within the titles relative to the dataset, allowing for the quantitative comparison of text.
  5. Calculate Cosine Similarity: After vectorization, calculates the cosine similarity between the vector of the first title and the vectors of all subsequent titles. This similarity measurement is pivotal in identifying titles that are most similar to the first, presumed most relevant title.
  6. Cluster Based on Similarity: Instead of applying a traditional clustering algorithm like K-means, titles are grouped based on their cosine similarity to the first title. This method clusters titles by directly comparing their similarity scores, allowing for dynamic cluster formation based on a predefined similarity threshold.
  7. Output Results by Similarity: Outputs the titles, organized by their similarity to the first title. This step highlights the effectiveness of using the search engine's ranking to prioritize and group results, showcasing the clustered titles in a structured and understandable format.
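
Steps 5 and 6 can be checked by hand on a tiny input. The sketch below is a simplification under stated assumptions: it uses raw term counts instead of TF-IDF weights, three made-up preprocessed titles, and the same 0.4 threshold as the script; the `cosine` helper is hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative preprocessed titles; the first one plays the "most relevant" role.
titles = [
    "ubuntu 16 04 server amd64 iso",
    "ubuntu 16 04 desktop amd64 iso",
    "ubuntu netbook remix",
]
vectors = [Counter(t.split()) for t in titles]
scores = [cosine(vectors[0], v) for v in vectors]

# Keep only titles at or above the threshold, most similar first.
threshold = 0.4
similar = [(t, s) for t, s in zip(titles, scores) if s >= threshold]
similar.sort(key=lambda pair: -pair[1])

for title, score in similar:
    print(f"- {title} (Similarity: {score:.3f})")
```

The second title shares five of its six terms with the first and passes the threshold, while the third shares only "ubuntu" and is filtered out.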

But the results of the experiment are still far from ideal and even from a minimally viable product.

from pathlib import Path

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces (escape so ']' and '\' are literal)
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


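# Load titles from a text file; dict.fromkeys() removes duplicates while keeping
# the file order, so results[0] stays the top-ranked title (set() would not)
results = [
    r for r in dict.fromkeys(Path('/ubuntu.txt').read_text().split('\n')) if r
]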
print('Results:')
for r in results[:50]:
    print(f'\t{r}')
first_title = results[0]

# Preprocess each title
print('\nPreprocessed results:')
preprocessed_results = [preprocess_text(title) for title in results]
for r in preprocessed_results:
    print(f'\t{r}')

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_results)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(X[0:1], X)

# Define a similarity threshold
similarity_threshold = 0.4  # Adjust this threshold as needed

# Filter indices by similarity threshold and keep similarity values
filtered_indices_and_similarity = [(i, similarity_matrix[0, i]) for i in range(similarity_matrix.shape[1]) if
                                   similarity_matrix[0, i] >= similarity_threshold]

# Sort filtered indices by similarity with the first title, keeping similarity values
sorted_filtered_indices_and_similarity = sorted(filtered_indices_and_similarity, key=lambda x: -x[1])

# Gather similar titles with their similarity values
similar_titles_with_similarity = [(results[i], similarity) for i, similarity in sorted_filtered_indices_and_similarity]

# Print similar titles with similarity values
print(f"\nTitles similar to the first title ({first_title}):")
for title, similarity in similar_titles_with_similarity:
    print(f"- {title} (Similarity: {similarity:.3f})")

Output:

Results:
	ubuntu-16.04.6-server-amd64.iso
	ubuntu-20.04-live-server-amd64.iso
	ubuntu-18.04.3-live-server-amd64.iso
	ubuntu-19.10-desktop-amd64.iso
	ubuntu-18.04.6-desktop-amd64.iso
	ubuntu-23.10-beta-desktop-amd64.iso
	ubuntu-18.04.5-live-server-amd64.iso
	ubuntu-14.04-desktop-i386.iso
	Ubuntu 16.10
	ubuntu-18.10-server-amd64.iso
	ubuntu-16.10-desktop-i386.iso
	ubuntu-14.04.1-server-amd64.iso
	Ubuntu Satanic Edition 666.4
	ubuntu-mate-19.10-desktop-amd64.iso
	ubuntu-22.04.1-desktop-amd64.iso
	ubuntu-16.04.6-server-i386.iso
	ubuntu-16.04.3-server-amd64.iso
	ubuntu-mate-21.10-desktop-amd64.iso
	ubuntu-22.10-desktop-amd64.iso
	ubuntu-22.04.2-desktop-amd64.iso
	ubuntu-19.04-server-amd64.iso
	ubuntu-16.10-server-arm64.iso
	Ubuntu Netbook Remix
	ubuntu-14.04.6-desktop-amd64+mac.iso
	ubuntu-21.04-desktop-amd64.iso
	ubuntu-17.04-server-amd64.iso
	ubuntu-pack-16.04-unity
	Ubuntu 20.04.2.0 Desktop (64-bit)
	ubuntu-14.04-server-amd64.ova
	Ubuntu-16.04.5
	ubuntu-16.04.5-desktop-amd64.iso
	ubuntu-15.04-desktop-i386.iso
	Ubuntu 12.10 Desktop (i386)
	ubuntu-12.04.4-desktop-amd64+mac.iso
	ubuntu-11.04-desktop-amd64.iso
	Ubuntu Server 20.04.2 LTS
	ubuntu-20.04.1-desktop-amd64.iso
	ubuntu-15.04-server-amd64.iso
	ubuntu-16.04-desktop-i386.iso
	Ubuntu reducido
	ubuntu-14.04-server-i386.iso
	ubuntu-19.04-desktop-amd64.iso
	ubuntu-12.04.5-desktop-amd64.iso
	ubuntu-14.04.6-desktop-i386.iso
	ubuntu-12.10-desktop-i386.iso
	ubuntu-21.10-beta-pack
	ubuntu-ultimate-1.4-dvd
	ubuntu-mate-20.04.4-desktop-amd64.iso
	ubuntu-18.04-live-server-amd64.iso
	Ubuntu 10.04 Netbook

Preprocessed results:
	ubuntu 16 04 6 server amd64 iso
	ubuntu 20 04 live server amd64 iso
	ubuntu 18 04 3 live server amd64 iso
	ubuntu 19 10 desktop amd64 iso
	ubuntu 18 04 6 desktop amd64 iso
	ubuntu 23 10 beta desktop amd64 iso
	ubuntu 18 04 5 live server amd64 iso
	ubuntu 14 04 desktop i386 iso
	ubuntu 16 10
	ubuntu 18 10 server amd64 iso
	ubuntu 16 10 desktop i386 iso
	ubuntu 14 04 1 server amd64 iso
	ubuntu satanic edition 666 4
	ubuntu mate 19 10 desktop amd64 iso
	ubuntu 22 04 1 desktop amd64 iso
	ubuntu 16 04 6 server i386 iso
	ubuntu 16 04 3 server amd64 iso
	ubuntu mate 21 10 desktop amd64 iso
	ubuntu 22 10 desktop amd64 iso
	ubuntu 22 04 2 desktop amd64 iso
	ubuntu 19 04 server amd64 iso
	ubuntu 16 10 server arm64 iso
	ubuntu netbook remix
	ubuntu 14 04 6 desktop amd64 mac iso
	ubuntu 21 04 desktop amd64 iso
	ubuntu 17 04 server amd64 iso
	ubuntu pack 16 04 unity
	ubuntu 20 04 2 0 desktop
	ubuntu 14 04 server amd64 ovum
	ubuntu 16 04 5
	ubuntu 16 04 5 desktop amd64 iso
	ubuntu 15 04 desktop i386 iso
	ubuntu 12 10 desktop
	ubuntu 12 04 4 desktop amd64 mac iso
	ubuntu 11 04 desktop amd64 iso
	ubuntu server 20 04 2 lts
	ubuntu 20 04 1 desktop amd64 iso
	ubuntu 15 04 server amd64 iso
	ubuntu 16 04 desktop i386 iso
	ubuntu reducido
	ubuntu 14 04 server i386 iso
	ubuntu 19 04 desktop amd64 iso
	ubuntu 12 04 5 desktop amd64 iso
	ubuntu 14 04 6 desktop i386 iso
	ubuntu 12 10 desktop i386 iso
	ubuntu 21 10 beta pack
	ubuntu ultimate 1 4 dvd
	ubuntu mate 20 04 4 desktop amd64 iso
	ubuntu 18 04 live server amd64 iso
	ubuntu 10 04 netbook
	ubuntu 14 04 4 desktop amd64 iso
	ubuntu 20 04 1 desktop iso
	ubuntu 14 04 5 server amd64 iso
	ubuntu 12 04 5 desktop i386 iso
	ubuntu linux ebook pack
	anonymous o 0 1
	ubuntu 20 04 desktop amd64 iso
	ubuntu 11 10 desktop i386 iso
	ubuntu book ru djvu
	ubuntu 9 10
	ubuntu 23 04 live server amd64 iso
	ubuntu 17 10 desktop amd64 iso
	ubuntu 16 04 6 desktop i386 iso
	ubuntu mate 20 04 3 desktop amd64 iso
	ubuntu
	ubuntu linux основы администрирования
	ubuntu 14 04 desktop amd64 iso
	ubuntu 20 04 2 0 desktop amd64 iso
	ubuntu 21 10 desktop amd64 iso
	ubuntu 20 04 x64 untouched david1893
	ubuntu 12 04 5 dvd i386 iso
	ubuntu 12 04 server i386 iso
	ubuntu
	ubuntu 20 04 2 desktop amd64 iso
	ubuntu 14 10 desktop i386 iso
	ubuntu 16 04 7 server amd64 iso
	ubuntu 11 10 dvd amd64 iso
	ubuntu facile 04 2014 pdf
	ubuntu 14 10 desktop amd64 iso
	ubuntu 18 10 desktop amd64 iso
	ubuntu 18 04 1 desktop amd64 iso
	ubuntu 16 04 7 desktop amd64 iso
	ubuntu 15 04 desktop amd64 iso
	ubuntu 22 04 3 live server amd64 iso
	ubuntu 19 10 live server amd64 iso
	ubuntu unleashed 2019 edition
	ubuntu server essential 6685
	ubuntu 20 10 desktop amd64 iso
	ubuntu budgie 22 04 3 desktop amd64 iso
	ubuntu 18 04 desktop amd64 iso
	ubuntu 22 04 desktop amd64 iso
	ubuntu 14 10 server amd64 iso
	ubuntu 20 04 3
	ubuntu 13 04 desktop i386 iso
	ubuntu 20 04 4 live server amd64 iso
	ubuntu 18 04
	ubuntu 20 04 4 desktop amd64 iso
	ubuntu 18 04 4 desktop amd64 iso
	ubuntu 11 10 oneiric ocelot
	ubuntu 16 10 desktop amd64 iso
	ubuntu facile marzo 2015 pdf
	ubuntu 9 10 пользовательская сборка
	ubuntu 10 10 xenon beta5
	ubuntu 20 04 3 desktop amd64 iso
	ubuntu 11 04 alternate i386 iso
	ubuntu facile aprile 2015 pdf
	ubuntu unity 22 10 desktop amd64 iso
	ubuntu facile 01 2014 pdf
	ubuntu ultimate edition 1 9
	ubuntu 21 04 live server amd64 iso
	ubuntu 22 04 live server amd64 iso

Titles similar to the first title (ubuntu-16.04.6-server-amd64.iso):
	ubuntu-16.04.6-server-amd64.iso (Similarity: 1.000)
	ubuntu-16.04.3-server-amd64.iso (Similarity: 1.000)
	ubuntu-16.04.7-server-amd64.iso (Similarity: 1.000)
	ubuntu-16.04.5-desktop-amd64.iso (Similarity: 0.795)
	ubuntu-16.04.7-desktop-amd64.iso (Similarity: 0.795)
	ubuntu-16.04.6-server-i386.iso (Similarity: 0.790)
	Ubuntu-16.04.5 (Similarity: 0.742)
	ubuntu-16.10-desktop-amd64.iso (Similarity: 0.639)
	ubuntu-16.04-desktop-i386.iso (Similarity: 0.593)
	ubuntu-16.04.6-desktop-i386.iso (Similarity: 0.593)
	ubuntu-14.04.1-server-amd64.iso (Similarity: 0.584)
	ubuntu-14.04.5-server-amd64.iso (Similarity: 0.584)
	Ubuntu 16.10 (Similarity: 0.542)
	ubuntu-16.10-server-arm64.iso (Similarity: 0.536)
	ubuntu-19.04-server-amd64.iso (Similarity: 0.525)
	ubuntu-15.04-server-amd64.iso (Similarity: 0.497)
	ubuntu-20.04-live-server-amd64.iso (Similarity: 0.492)
	ubuntu-20.04.4-live-server-amd64.iso (Similarity: 0.492)
	ubuntu-17.04-server-amd64.iso (Similarity: 0.478)
	ubuntu-18.04.3-live-server-amd64.iso (Similarity: 0.473)
	ubuntu-18.04.5-live-server-amd64.iso (Similarity: 0.473)
	ubuntu-18.04-live-server-amd64.iso (Similarity: 0.473)
	ubuntu-16.10-desktop-i386.iso (Similarity: 0.471)
	ubuntu-22.04.3-live-server-amd64.iso (Similarity: 0.464)
	ubuntu-22.04-live-server-amd64.iso (Similarity: 0.464)
	ubuntu-14.10-server-amd64.iso (Similarity: 0.455)
	ubuntu-18.10-server-amd64.iso (Similarity: 0.446)
	ubuntu-21.04-live-server-amd64.iso (Similarity: 0.446)
	ubuntu-14.04-server-i386.iso (Similarity: 0.423)
	ubuntu-23.04-live-server-amd64.iso (Similarity: 0.416)
	ubuntu-12.04-server-i386.iso (Similarity: 0.401)

@synctext
Member

We learned something! I genuinely don't find it a bad start.

Can you create a tiny example with identical Ubuntu-server.iso filenames and try to get -numeric- clusters? With 18.04 and 18.10 together, plus 22.04 and 22.10. It seems the number signal is thrown away?
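
One way such a tiny numeric example might look is sketched below. It does not use TF-IDF or K-means at all; it simply groups titles by the major version number found by a regular expression. The four filenames and the `group_by_major` helper are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical titles that differ only in their version number.
TITLES = [
    "ubuntu-18.04-server.iso",
    "ubuntu-18.10-server.iso",
    "ubuntu-22.04-server.iso",
    "ubuntu-22.10-server.iso",
]

# First "major.minor" pair in the title, e.g. "18.04" -> ("18", "04").
VERSION_RE = re.compile(r"(\d+)\.(\d+)")


def group_by_major(titles):
    """Group titles by major version; titles without a version go under None."""
    groups = defaultdict(list)
    for title in titles:
        match = VERSION_RE.search(title)
        key = match.group(1) if match else None
        groups[key].append(title)
    return dict(groups)


groups = group_by_major(TITLES)
for major, members in groups.items():
    print(major, members)
```

This puts 18.04 with 18.10 and 22.04 with 22.10, but it only works when the version is written as `major.minor` in the title.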

@drew2a
Contributor Author

drew2a commented Feb 15, 2024

I've modified the original script to address the question "Seems number signal is thrown away?" and to print all TF-IDF values for each term.

Indeed, it was discovered that certain digits were being ignored, specifically those consisting of a single character. This occurred because the vectorizer, by default, disregards all terms that are composed of only one character.

In the example below, the number 5 is ignored:

	Original:     ubuntu-12.04.5-desktop-i386.iso
	Preprocessed: ubuntu 12 04 5 desktop i386 iso
	TF-IDF:
			12: 0.668
			i386: 0.530
			desktop: 0.318
			04: 0.275
			iso: 0.248
			ubuntu: 0.185
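
The effect can be reproduced with the standard library alone. scikit-learn's `TfidfVectorizer` uses the default `token_pattern` `(?u)\b\w\w+\b`, which requires at least two word characters per token. The second pattern below is only one possible alternative that also keeps single characters; the thread does not say which change was actually made.

```python
import re

# Default token pattern of scikit-learn's text vectorizers: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Assumed alternative that also keeps one-character tokens such as "5".
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

title = "ubuntu 12 04 5 desktop i386 iso"

default_tokens = re.findall(DEFAULT_PATTERN, title)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, title)

print(default_tokens)  # the "5" is missing
print(all_tokens)      # the "5" is kept
```

Passing such a pattern as `token_pattern=` to `TfidfVectorizer` is a common way to keep one-character tokens.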

I've fixed it and also added more output. To discern the terms around which titles were grouped into clusters, I analyzed the centroids of the clusters determined by the K-means algorithm. The centroids represent the "center" or "mean" vector of each cluster in the feature space, essentially capturing the average importance of each term within the cluster. By examining these centroids, we can identify which terms have the highest TF-IDF values across the documents in a cluster, giving us insight into the thematic essence of each cluster.

This information should provide us with a better understanding of the details involved in grouping items into clusters.
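
The centroid idea can be sketched in plain Python. The weights below are taken from two titles in the per-title listing in this comment; putting them in the same cluster is an assumption made only for illustration, and `centroid` and `top_terms` are hypothetical helpers. With scikit-learn the same vectors are available as `kmeans.cluster_centers_`.

```python
from collections import defaultdict

# TF-IDF weights for two titles; treating them as one cluster is an assumption.
cluster = [
    {"22": 0.599, "1": 0.581, "desktop": 0.294, "amd64": 0.271,
     "4": 0.250, "iso": 0.229, "ubuntu": 0.172},  # ubuntu-22.04.1-desktop-amd64.iso
    {"1": 0.593, "18": 0.576, "desktop": 0.300, "amd64": 0.276,
     "4": 0.255, "iso": 0.234, "ubuntu": 0.175},  # ubuntu-18.04.1-desktop-amd64.iso
]


def centroid(vectors):
    """Mean weight per term; a term absent from a vector counts as 0."""
    totals = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] += weight
    return {term: total / len(vectors) for term, total in totals.items()}


def top_terms(center, n=3):
    """Return the n terms with the highest centroid weight."""
    return sorted(center.items(), key=lambda kv: -kv[1])[:n]


top = top_terms(centroid(cluster))
for term, weight in top:
    print(f"{term}: {weight:.3f}")
```

Here the shared point-release token "1" dominates the centroid, while the release numbers "22" and "18" are each diluted because they appear in only one of the two titles.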

Features: 
['0' '1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '2014'
 '2015' '2019' '21' '22' '23' '3' '4' '5' '6' '666' '6685' '7' '9'
 'alternate' 'amd64' 'anonymous' 'aprile' 'arm64' 'beta' 'beta5' 'book'
 'budgie' 'david1893' 'desktop' 'djvu' 'dvd' 'ebook' 'edition' 'essential'
 'facile' 'i386' 'iso' 'linux' 'live' 'lts' 'mac' 'marzo' 'mate' 'netbook'
 'o' 'ocelot' 'oneiric' 'ovum' 'pack' 'pdf' 'reducido' 'remix' 'ru'
 'satanic' 'server' 'ubuntu' 'ultimate' 'unity' 'unleashed' 'untouched'
 'x64' 'xenon' 'администрирования' 'основы' 'пользовательская' 'сборка']

Original and preprocessed titles with their TF-IDF vectors:

	Original:     Ubuntu-16.04.5
	Preprocessed: ubuntu 16 4 5
	TF-IDF:
			5: 0.721
			16: 0.596
			4: 0.291
			ubuntu: 0.200

	Original:     ubuntu-mate-21.10-desktop-amd64.iso
	Preprocessed: ubuntu mate 21 10 desktop amd64 iso
	TF-IDF:
			mate: 0.606
			21: 0.579
			10: 0.342
			desktop: 0.255
			amd64: 0.235
			iso: 0.199
			ubuntu: 0.149

	Original:     Ubuntu 9.10 Пользовательская сборка
	Preprocessed: ubuntu 9 10 пользовательская сборка
	TF-IDF:
			сборка: 0.578
			пользовательская: 0.578
			9: 0.498
			10: 0.266
			ubuntu: 0.116

	Original:     Ubuntu Linux ebook pack
	Preprocessed: ubuntu linux ebook pack
	TF-IDF:
			ebook: 0.617
			linux: 0.567
			pack: 0.532
			ubuntu: 0.124

	Original:     ubuntu-23.04-live-server-amd64.iso
	Preprocessed: ubuntu 23 4 live server amd64 iso
	TF-IDF:
			23: 0.684
			live: 0.492
			server: 0.353
			amd64: 0.236
			4: 0.218
			iso: 0.200
			ubuntu: 0.149

	Original:     ubuntu-12.04.4-desktop-amd64+mac.iso
	Preprocessed: ubuntu 12 4 4 desktop amd64 mac iso
	TF-IDF:
			mac: 0.643
			12: 0.507
			4: 0.409
			desktop: 0.241
			amd64: 0.222
			iso: 0.188
			ubuntu: 0.140

	Original:     ubuntu-11.04-alternate-i386.iso
	Preprocessed: ubuntu 11 4 alternate i386 iso
	TF-IDF:
			alternate: 0.684
			11: 0.534
			i386: 0.393
			4: 0.200
			iso: 0.184
			ubuntu: 0.137

	Original:     ubuntu-22.04.1-desktop-amd64.iso
	Preprocessed: ubuntu 22 4 1 desktop amd64 iso
	TF-IDF:
			22: 0.599
			1: 0.581
			desktop: 0.294
			amd64: 0.271
			4: 0.250
			iso: 0.229
			ubuntu: 0.172

	Original:     ubuntu-18.04.1-desktop-amd64.iso
	Preprocessed: ubuntu 18 4 1 desktop amd64 iso
	TF-IDF:
			1: 0.593
			18: 0.576
			desktop: 0.300
			amd64: 0.276
			4: 0.255
			iso: 0.234
			ubuntu: 0.175

	Original:     Ubuntu 20.04.1 Desktop.iso
	Preprocessed: ubuntu 20 4 1 desktop iso
	TF-IDF:
			1: 0.646
			20: 0.545
			desktop: 0.327
			4: 0.278
			iso: 0.255
			ubuntu: 0.191

	Original:     ubuntu-16.04-desktop-i386.iso
	Preprocessed: ubuntu 16 4 desktop i386 iso
	TF-IDF:
			16: 0.598
			i386: 0.573
			desktop: 0.343
			4: 0.292
			iso: 0.268
			ubuntu: 0.200

	Original:     ubuntu-17.10-desktop-amd64.iso
	Preprocessed: ubuntu 17 10 desktop amd64 iso
	TF-IDF:
			17: 0.780
			10: 0.391
			desktop: 0.292
			amd64: 0.269
			iso: 0.228
			ubuntu: 0.170

	Original:     ubuntu-23.10-beta-desktop-amd64.iso
	Preprocessed: ubuntu 23 10 beta desktop amd64 iso
	TF-IDF:
			beta: 0.615
			23: 0.615
			10: 0.309
			desktop: 0.230
			amd64: 0.212
			iso: 0.180
			ubuntu: 0.134

	Original:     ubuntu-20.04.2.0-desktop-amd64.iso
	Preprocessed: ubuntu 20 4 2 0 desktop amd64 iso
	TF-IDF:
			0: 0.595
			2: 0.539
			20: 0.396
			desktop: 0.237
			amd64: 0.219
			4: 0.202
			iso: 0.185
			ubuntu: 0.139

	Original:     ubuntu-12.04-server-i386.iso
	Preprocessed: ubuntu 12 4 server i386 iso
	TF-IDF:
			12: 0.641
			i386: 0.508
			server: 0.420
			4: 0.259
			iso: 0.238
			ubuntu: 0.178

	Original:     Ubuntu Unleashed 2019 Edition
	Preprocessed: ubuntu unleashed 2019 edition
	TF-IDF:
			2019: 0.599
			unleashed: 0.599
			edition: 0.517
			ubuntu: 0.120

	Original:     Ubuntu Linux основы администрирования
	Preprocessed: ubuntu linux основы администрирования
	TF-IDF:
			администрирования: 0.589
			основы: 0.589
			linux: 0.541
			ubuntu: 0.118

	Original:     ubuntu-22.10-desktop-amd64.iso
	Preprocessed: ubuntu 22 10 desktop amd64 iso
	TF-IDF:
			22: 0.689
			10: 0.453
			desktop: 0.338
			amd64: 0.311
			iso: 0.264
			ubuntu: 0.197

	Original:     ubuntu-ultimate-1.4-dvd
	Preprocessed: ubuntu ultimate 1 4 dvd
	TF-IDF:
			ultimate: 0.623
			dvd: 0.584
			1: 0.461
			4: 0.198
			ubuntu: 0.136

	Original:     ubuntu-19.04-server-amd64.iso
	Preprocessed: ubuntu 19 4 server amd64 iso
	TF-IDF:
			19: 0.734
			server: 0.446
			amd64: 0.297
			4: 0.275
			iso: 0.252
			ubuntu: 0.189

	Original:     ubuntu-15.04-server-amd64.iso
	Preprocessed: ubuntu 15 4 server amd64 iso
	TF-IDF:
			15: 0.766
			server: 0.422
			amd64: 0.281
			4: 0.260
			iso: 0.239
			ubuntu: 0.178

	Original:     ubuntu-budgie-22.04.3-desktop-amd64.iso
	Preprocessed: ubuntu budgie 22 4 3 desktop amd64 iso
	TF-IDF:
			budgie: 0.641
			3: 0.464
			22: 0.449
			desktop: 0.221
			amd64: 0.203
			4: 0.188
			iso: 0.172
			ubuntu: 0.129

	Original:     ubuntu-16.10-desktop-i386.iso
	Preprocessed: ubuntu 16 10 desktop i386 iso
	TF-IDF:
			16: 0.563
			i386: 0.540
			10: 0.433
			desktop: 0.323
			iso: 0.252
			ubuntu: 0.189

	Original:     ubuntu-18.04.6-desktop-amd64.iso
	Preprocessed: ubuntu 18 4 6 desktop amd64 iso
	TF-IDF:
			6: 0.631
			18: 0.555
			desktop: 0.289
			amd64: 0.266
			4: 0.246
			iso: 0.226
			ubuntu: 0.169

	Original:     ubuntu-16.04.6-desktop-i386.iso
	Preprocessed: ubuntu 16 4 6 desktop i386 iso
	TF-IDF:
			6: 0.599
			16: 0.478
			i386: 0.458
			desktop: 0.275
			4: 0.234
			iso: 0.214
			ubuntu: 0.160

	Original:     ubuntu-18.04.4-desktop-amd64.iso
	Preprocessed: ubuntu 18 4 4 desktop amd64 iso
	TF-IDF:
			18: 0.627
			4: 0.555
			desktop: 0.327
			amd64: 0.301
			iso: 0.255
			ubuntu: 0.191

	Original:     ubuntu-18.04-desktop-amd64.iso
	Preprocessed: ubuntu 18 4 desktop amd64 iso
	TF-IDF:
			18: 0.715
			desktop: 0.373
			amd64: 0.343
			4: 0.317
			iso: 0.291
			ubuntu: 0.217

	Original:     ubuntu-18.10-desktop-amd64.iso
	Preprocessed: ubuntu 18 10 desktop amd64 iso
	TF-IDF:
			18: 0.667
			10: 0.466
			desktop: 0.348
			amd64: 0.320
			iso: 0.271
			ubuntu: 0.203

	Original:     ubuntu-14.04-desktop-i386.iso
	Preprocessed: ubuntu 14 4 desktop i386 iso
	TF-IDF:
			14: 0.615
			i386: 0.563
			desktop: 0.338
			4: 0.287
			iso: 0.263
			ubuntu: 0.197

	Original:     ubuntu
	Preprocessed: ubuntu
	TF-IDF:
			ubuntu: 1.000

	Original:     ubuntu-12.04.5-desktop-amd64.iso
	Preprocessed: ubuntu 12 4 5 desktop amd64 iso
	TF-IDF:
			12: 0.598
			5: 0.598
			desktop: 0.284
			amd64: 0.262
			4: 0.242
			iso: 0.222
			ubuntu: 0.166

	Original:     ubuntu-22.04.2-desktop-amd64.iso
	Preprocessed: ubuntu 22 4 2 desktop amd64 iso
	TF-IDF:
			2: 0.634
			22: 0.569
			desktop: 0.279
			amd64: 0.257
			4: 0.237
			iso: 0.218
			ubuntu: 0.163

	Original:     Ubuntu 11.10 Oneiric Ocelot
	Preprocessed: ubuntu 11 10 oneiric ocelot
	TF-IDF:
			ocelot: 0.591
			oneiric: 0.591
			11: 0.462
			10: 0.273
			ubuntu: 0.119

	Original:     ubuntu-22.04.3-live-server-amd64.iso
	Preprocessed: ubuntu 22 4 3 live server amd64 iso
	TF-IDF:
			3: 0.515
			22: 0.499
			live: 0.470
			server: 0.338
			amd64: 0.225
			4: 0.208
			iso: 0.191
			ubuntu: 0.143

	Original:     ubuntu-11.10-dvd-amd64.iso
	Preprocessed: ubuntu 11 10 dvd amd64 iso
	TF-IDF:
			dvd: 0.646
			11: 0.586
			10: 0.346
			amd64: 0.237
			iso: 0.201
			ubuntu: 0.151

	Original:     ubuntu-14.04.6-desktop-i386.iso
	Preprocessed: ubuntu 14 4 6 desktop i386 iso
	TF-IDF:
			6: 0.593
			14: 0.496
			i386: 0.453
			desktop: 0.272
			4: 0.231
			iso: 0.212
			ubuntu: 0.159

	Original:     ubuntu-20.04.1-desktop-amd64.iso
	Preprocessed: ubuntu 20 4 1 desktop amd64 iso
	TF-IDF:
			1: 0.618
			20: 0.522
			desktop: 0.313
			amd64: 0.288
			4: 0.266
			iso: 0.244
			ubuntu: 0.183

	Original:     Ubuntu 9.10
	Preprocessed: ubuntu 9 10
	TF-IDF:
			9: 0.864
			10: 0.462
			ubuntu: 0.201

	Original:     ubuntu-14.04.5-server-amd64.iso
	Preprocessed: ubuntu 14 4 5 server amd64 iso
	TF-IDF:
			5: 0.603
			14: 0.523
			server: 0.395
			amd64: 0.264
			4: 0.244
			iso: 0.224
			ubuntu: 0.167

	Original:     ubuntu-14.04-desktop-amd64.iso
	Preprocessed: ubuntu 14 4 desktop amd64 iso
	TF-IDF:
			14: 0.697
			desktop: 0.382
			amd64: 0.352
			4: 0.325
			iso: 0.298
			ubuntu: 0.223

	Original:     ubuntu-12.04.5-desktop-i386.iso
	Preprocessed: ubuntu 12 4 5 desktop i386 iso
	TF-IDF:
			12: 0.556
			5: 0.556
			i386: 0.441
			desktop: 0.264
			4: 0.225
			iso: 0.206
			ubuntu: 0.154

	Original:     ubuntu-10.10-xenon-beta5
	Preprocessed: ubuntu 10 10 xenon beta5
	TF-IDF:
			beta5: 0.588
			xenon: 0.588
			10: 0.542
			ubuntu: 0.118

	Original:     Ubuntu Server Essentials - 6685 [ECLiPSE]
	Preprocessed: ubuntu server essential 6685
	TF-IDF:
			6685: 0.664
			essential: 0.664
			server: 0.315
			ubuntu: 0.133

	Original:     ubuntu-14.04.6-desktop-amd64+mac.iso
	Preprocessed: ubuntu 14 4 6 desktop amd64 mac iso
	TF-IDF:
			mac: 0.617
			6: 0.504
			14: 0.421
			desktop: 0.231
			amd64: 0.213
			4: 0.196
			iso: 0.180
			ubuntu: 0.135

	Original:     ubuntu-pack-16.04-unity
	Preprocessed: ubuntu pack 16 4 unity
	TF-IDF:
			unity: 0.638
			pack: 0.599
			16: 0.416
			4: 0.203
			ubuntu: 0.139

	Original:     ubuntu-19.10-desktop-amd64.iso
	Preprocessed: ubuntu 19 10 desktop amd64 iso
	TF-IDF:
			19: 0.727
			10: 0.429
			desktop: 0.320
			amd64: 0.295
			iso: 0.250
			ubuntu: 0.187

	Original:     ubuntu-14.04-server-i386.iso
	Preprocessed: ubuntu 14 4 server i386 iso
	TF-IDF:
			14: 0.586
			i386: 0.536
			server: 0.443
			4: 0.273
			iso: 0.251
			ubuntu: 0.187

	Original:     ubuntu-20.04.3-desktop-amd64.iso
	Preprocessed: ubuntu 20 4 3 desktop amd64 iso
	TF-IDF:
			3: 0.642
			20: 0.509
			desktop: 0.305
			amd64: 0.281
			4: 0.259
			iso: 0.238
			ubuntu: 0.178

	Original:     ubuntu-18.04.3-live-server-amd64.iso
	Preprocessed: ubuntu 18 4 3 live server amd64 iso
	TF-IDF:
			3: 0.522
			18: 0.477
			live: 0.477
			server: 0.343
			amd64: 0.228
			4: 0.211
			iso: 0.194
			ubuntu: 0.145

	Original:     Ubuntu Server 20.04.2 LTS
	Preprocessed: ubuntu server 20 4 2 lts
	TF-IDF:
			lts: 0.661
			2: 0.516
			20: 0.379
			server: 0.314
			4: 0.193
			ubuntu: 0.133

	Original:     Ubuntu 20.04.2.0 Desktop (64-bit)
	Preprocessed: ubuntu 20 4 2 0 desktop
	TF-IDF:
			0: 0.621
			2: 0.563
			20: 0.414
			desktop: 0.248
			4: 0.211
			ubuntu: 0.145

	Original:     ubuntu-21.10-desktop-amd64.iso
	Preprocessed: ubuntu 21 10 desktop amd64 iso
	TF-IDF:
			21: 0.727
			10: 0.429
			desktop: 0.320
			amd64: 0.295
			iso: 0.250
			ubuntu: 0.187

	Original:     Ubuntu Ultimate Edition 1.9
	Preprocessed: ubuntu ultimate edition 1 9
	TF-IDF:
			ultimate: 0.546
			edition: 0.512
			9: 0.512
			1: 0.404
			ubuntu: 0.119

	Original:     Ubuntu Facile 01 2014.pdf
	Preprocessed: ubuntu facile 1 2014 pdf
	TF-IDF:
			2014: 0.561
			pdf: 0.499
			facile: 0.499
			1: 0.415
			ubuntu: 0.123

	Original:     Ubuntu Netbook Remix
	Preprocessed: ubuntu netbook remix
	TF-IDF:
			remix: 0.728
			netbook: 0.670
			ubuntu: 0.146

	Original:     Ubuntu 12.10 Desktop (i386)
	Preprocessed: ubuntu 12 10 desktop
	TF-IDF:
			12: 0.765
			10: 0.487
			desktop: 0.364
			ubuntu: 0.212

	Original:     ubuntu-22.04-live-server-amd64.iso
	Preprocessed: ubuntu 22 4 live server amd64 iso
	TF-IDF:
			22: 0.582
			live: 0.548
			server: 0.394
			amd64: 0.263
			4: 0.243
			iso: 0.223
			ubuntu: 0.167

	Original:     ubuntu-16.04.6-server-amd64.iso
	Preprocessed: ubuntu 16 4 6 server amd64 iso
	TF-IDF:
			6: 0.624
			16: 0.498
			server: 0.395
			amd64: 0.263
			4: 0.243
			iso: 0.223
			ubuntu: 0.167

	Original:     [Ubuntu] Anonymous OS 0.1
	Preprocessed: anonymous o 0 1
	TF-IDF:
			o: 0.559
			anonymous: 0.559
			0: 0.482
			1: 0.380

	Original:     ubuntu-14.04-server-amd64.ova
	Preprocessed: ubuntu 14 4 server amd64 ovum
	TF-IDF:
			ovum: 0.736
			14: 0.462
			server: 0.350
			amd64: 0.233
			4: 0.215
			ubuntu: 0.148

	Original:     ubuntu-20.04.2-desktop-amd64.iso
	Preprocessed: ubuntu 20 4 2 desktop amd64 iso
	TF-IDF:
			2: 0.671
			20: 0.493
			desktop: 0.295
			amd64: 0.272
			4: 0.251
			iso: 0.230
			ubuntu: 0.172

	Original:     Ubuntu Facile - Aprile 2015.pdf
	Preprocessed: ubuntu facile aprile 2015 pdf
	TF-IDF:
			aprile: 0.557
			2015: 0.512
			pdf: 0.455
			facile: 0.455
			ubuntu: 0.112

	Original:     ubuntu-16.04.3-server-amd64.iso
	Preprocessed: ubuntu 16 4 3 server amd64 iso
	TF-IDF:
			3: 0.610
			16: 0.505
			server: 0.400
			amd64: 0.267
			4: 0.247
			iso: 0.226
			ubuntu: 0.169

	Original:     ubuntu-20.04-desktop-amd64.iso
	Preprocessed: ubuntu 20 4 desktop amd64 iso
	TF-IDF:
			20: 0.665
			desktop: 0.398
			amd64: 0.367
			4: 0.339
			iso: 0.311
			ubuntu: 0.232

	Original:     ubuntu-16.04.7-server-amd64.iso
	Preprocessed: ubuntu 16 4 7 server amd64 iso
	TF-IDF:
			7: 0.699
			16: 0.456
			server: 0.361
			amd64: 0.241
			4: 0.223
			iso: 0.204
			ubuntu: 0.153

	Original:     ubuntu-16.04.5-desktop-amd64.iso
	Preprocessed: ubuntu 16 4 5 desktop amd64 iso
	TF-IDF:
			5: 0.635
			16: 0.525
			desktop: 0.302
			amd64: 0.278
			4: 0.257
			iso: 0.235
			ubuntu: 0.176

	Original:     Ubuntu
	Preprocessed: ubuntu
	TF-IDF:
			ubuntu: 1.000

	Original:     Ubuntu 20.04.3 (AMD64) (Server)
	Preprocessed: ubuntu 20 4 3
	TF-IDF:
			3: 0.732
			20: 0.580
			4: 0.296
			ubuntu: 0.203

	Original:     ubuntu-20.04.4-desktop-amd64.iso
	Preprocessed: ubuntu 20 4 4 desktop amd64 iso
	TF-IDF:
			4: 0.584
			20: 0.573
			desktop: 0.344
			amd64: 0.316
			iso: 0.268
			ubuntu: 0.200

	Original:     ubuntu-14.04.4-desktop-amd64.iso
	Preprocessed: ubuntu 14 4 4 desktop amd64 iso
	TF-IDF:
			14: 0.607
			4: 0.566
			desktop: 0.333
			amd64: 0.307
			iso: 0.260
			ubuntu: 0.194

	Original:     ubuntu-11.04-desktop-amd64.iso
	Preprocessed: ubuntu 11 4 desktop amd64 iso
	TF-IDF:
			11: 0.771
			desktop: 0.340
			amd64: 0.312
			4: 0.289
			iso: 0.265
			ubuntu: 0.198

	Original:     ubuntu-14.10-desktop-i386.iso
	Preprocessed: ubuntu 14 10 desktop i386 iso
	TF-IDF:
			14: 0.581
			i386: 0.532
			10: 0.427
			desktop: 0.319
			iso: 0.249
			ubuntu: 0.186

	Original:     ubuntu-16.04.7-desktop-amd64.iso
	Preprocessed: ubuntu 16 4 7 desktop amd64 iso
	TF-IDF:
			7: 0.722
			16: 0.471
			desktop: 0.270
			amd64: 0.249
			4: 0.230
			iso: 0.211
			ubuntu: 0.158

	Original:     ubuntu-19.10-live-server-amd64.iso
	Preprocessed: ubuntu 19 10 live server amd64 iso
	TF-IDF:
			19: 0.600
			live: 0.507
			server: 0.364
			10: 0.354
			amd64: 0.243
			iso: 0.206
			ubuntu: 0.154

	Original:     ubuntu-unity-22.10-desktop-amd64.iso
	Preprocessed: ubuntu unity 22 10 desktop amd64 iso
	TF-IDF:
			unity: 0.671
			22: 0.511
			10: 0.336
			desktop: 0.251
			amd64: 0.231
			iso: 0.196
			ubuntu: 0.146

	Original:     ubuntu-mate-19.10-desktop-amd64.iso
	Preprocessed: ubuntu mate 19 10 desktop amd64 iso
	TF-IDF:
			mate: 0.606
			19: 0.579
			10: 0.342
			desktop: 0.255
			amd64: 0.235
			iso: 0.199
			ubuntu: 0.149

	Original:     ubuntu-13.04-desktop-i386.iso
	Preprocessed: ubuntu 13 4 desktop i386 iso
	TF-IDF:
			13: 0.779
			i386: 0.448
			desktop: 0.268
			4: 0.228
			iso: 0.209
			ubuntu: 0.156

	Original:     ubuntu-17.04-server-amd64.iso
	Preprocessed: ubuntu 17 4 server amd64 iso
	TF-IDF:
			17: 0.786
			server: 0.406
			amd64: 0.271
			4: 0.250
			iso: 0.229
			ubuntu: 0.172

	Original:     Ubuntu reducido
	Preprocessed: ubuntu reducido
	TF-IDF:
			reducido: 0.980
			ubuntu: 0.197

	Original:     ubuntu-11.10-desktop-i386.iso
	Preprocessed: ubuntu 11 10 desktop i386 iso
	TF-IDF:
			11: 0.664
			i386: 0.488
			10: 0.392
			desktop: 0.293
			iso: 0.228
			ubuntu: 0.171

	Original:     ubuntu-16.04.6-server-i386.iso
	Preprocessed: ubuntu 16 4 6 server i386 iso
	TF-IDF:
			6: 0.580
			16: 0.463
			i386: 0.444
			server: 0.367
			4: 0.226
			iso: 0.207
			ubuntu: 0.155

	Original:     Ubuntu-Book_RU.djvu
	Preprocessed: ubuntu book ru djvu
	TF-IDF:
			djvu: 0.574
			ru: 0.574
			book: 0.574
			ubuntu: 0.115

	Original:     ubuntu-20.04.4-live-server-amd64.iso
	Preprocessed: ubuntu 20 4 4 live server amd64 iso
	TF-IDF:
			live: 0.531
			4: 0.470
			20: 0.462
			server: 0.382
			amd64: 0.255
			iso: 0.216
			ubuntu: 0.161

	Original:     ubuntu-18.04-live-server-amd64.iso
	Preprocessed: ubuntu 18 4 live server amd64 iso
	TF-IDF:
			18: 0.559
			live: 0.559
			server: 0.402
			amd64: 0.268
			4: 0.247
			iso: 0.227
			ubuntu: 0.170

	Original:     ubuntu-15.04-desktop-i386.iso
	Preprocessed: ubuntu 15 4 desktop i386 iso
	TF-IDF:
			15: 0.731
			i386: 0.487
			desktop: 0.292
			4: 0.248
			iso: 0.228
			ubuntu: 0.170

	Original:     ubuntu-15.04-desktop-amd64.iso
	Preprocessed: ubuntu 15 4 desktop amd64 iso
	TF-IDF:
			15: 0.800
			desktop: 0.320
			amd64: 0.294
			4: 0.272
			iso: 0.249
			ubuntu: 0.186

	Original:     ubuntu-mate-20.04.4-desktop-amd64.iso
	Preprocessed: ubuntu mate 20 4 4 desktop amd64 iso
	TF-IDF:
			mate: 0.632
			4: 0.452
			20: 0.444
			desktop: 0.266
			amd64: 0.245
			iso: 0.208
			ubuntu: 0.155

	Original:     ubuntu-mate-20.04.3-desktop-amd64.iso
	Preprocessed: ubuntu mate 20 4 3 desktop amd64 iso
	TF-IDF:
			mate: 0.587
			3: 0.520
			20: 0.412
			desktop: 0.247
			amd64: 0.227
			4: 0.210
			iso: 0.193
			ubuntu: 0.144

	Original:     ubuntu-14.10-server-amd64.iso
	Preprocessed: ubuntu 14 10 server amd64 iso
	TF-IDF:
			14: 0.614
			server: 0.465
			10: 0.451
			amd64: 0.310
			iso: 0.263
			ubuntu: 0.196

	Original:     ubuntu-21.04-live-server-amd64.iso
	Preprocessed: ubuntu 21 4 live server amd64 iso
	TF-IDF:
			21: 0.623
			live: 0.527
			server: 0.379
			amd64: 0.253
			4: 0.233
			iso: 0.214
			ubuntu: 0.160

	Original:     ubuntu-18.10-server-amd64.iso
	Preprocessed: ubuntu 18 10 server amd64 iso
	TF-IDF:
			18: 0.634
			server: 0.455
			10: 0.442
			amd64: 0.304
			iso: 0.257
			ubuntu: 0.193

	Original:     Ubuntu - 20.04 - X64 - UNTOUCHED - David1893
	Preprocessed: ubuntu 20 4 x64 untouched david1893
	TF-IDF:
			david1893: 0.538
			untouched: 0.538
			x64: 0.538
			20: 0.309
			4: 0.157
			ubuntu: 0.108

	Original:     ubuntu-16.10-desktop-amd64.iso
	Preprocessed: ubuntu 16 10 desktop amd64 iso
	TF-IDF:
			16: 0.631
			10: 0.485
			desktop: 0.362
			amd64: 0.333
			iso: 0.283
			ubuntu: 0.211

	Original:     ubuntu-22.04-desktop-amd64.iso
	Preprocessed: ubuntu 22 4 desktop amd64 iso
	TF-IDF:
			22: 0.735
			desktop: 0.361
			amd64: 0.332
			4: 0.307
			iso: 0.282
			ubuntu: 0.211

	Original:     Ubuntu-18.04
	Preprocessed: ubuntu 18 4
	TF-IDF:
			18: 0.881
			4: 0.390
			ubuntu: 0.268

	Original:     Ubuntu 10.04 Netbook
	Preprocessed: ubuntu 10 4 netbook
	TF-IDF:
			netbook: 0.845
			10: 0.424
			4: 0.269
			ubuntu: 0.185

	Original:     ubuntu-20.04-live-server-amd64.iso
	Preprocessed: ubuntu 20 4 live server amd64 iso
	TF-IDF:
			live: 0.582
			20: 0.506
			server: 0.418
			amd64: 0.279
			4: 0.258
			iso: 0.236
			ubuntu: 0.177

	Original:     ubuntu-21.04-desktop-amd64.iso
	Preprocessed: ubuntu 21 4 desktop amd64 iso
	TF-IDF:
			21: 0.771
			desktop: 0.340
			amd64: 0.312
			4: 0.289
			iso: 0.265
			ubuntu: 0.198

	Original:     Ubuntu Facile Marzo 2015.pdf
	Preprocessed: ubuntu facile marzo 2015 pdf
	TF-IDF:
			marzo: 0.557
			2015: 0.512
			pdf: 0.455
			facile: 0.455
			ubuntu: 0.112

	Original:     ubuntu-20.10-desktop-amd64.iso
	Preprocessed: ubuntu 20 10 desktop amd64 iso
	TF-IDF:
			20: 0.614
			10: 0.493
			desktop: 0.368
			amd64: 0.339
			iso: 0.287
			ubuntu: 0.215

	Original:     ubuntu-14.10-desktop-amd64.iso
	Preprocessed: ubuntu 14 10 desktop amd64 iso
	TF-IDF:
			14: 0.648
			10: 0.476
			desktop: 0.355
			amd64: 0.327
			iso: 0.277
			ubuntu: 0.207

	Original:     Ubuntu 16.10
	Preprocessed: ubuntu 16 10
	TF-IDF:
			16: 0.766
			10: 0.590
			ubuntu: 0.257

	Original:     ubuntu-12.04.5-dvd-i386.iso
	Preprocessed: ubuntu 12 4 5 dvd i386 iso
	TF-IDF:
			dvd: 0.566
			12: 0.475
			5: 0.475
			i386: 0.377
			4: 0.192
			iso: 0.176
			ubuntu: 0.132

	Original:     ubuntu-14.04.1-server-amd64.iso
	Preprocessed: ubuntu 14 4 1 server amd64 iso
	TF-IDF:
			1: 0.579
			14: 0.534
			server: 0.404
			amd64: 0.270
			4: 0.249
			iso: 0.229
			ubuntu: 0.171

	Original:     ubuntu-21.10-beta-pack
	Preprocessed: ubuntu 21 10 beta pack
	TF-IDF:
			beta: 0.587
			pack: 0.551
			21: 0.499
			10: 0.294
			ubuntu: 0.128

	Original:     ubuntu-16.10-server-arm64.iso
	Preprocessed: ubuntu 16 10 server arm64 iso
	TF-IDF:
			arm64: 0.724
			16: 0.434
			server: 0.344
			10: 0.334
			iso: 0.194
			ubuntu: 0.145

	Original:     Ubuntu Satanic Edition 666.4
	Preprocessed: ubuntu satanic edition 666 4
	TF-IDF:
			666: 0.590
			satanic: 0.590
			edition: 0.509
			4: 0.173
			ubuntu: 0.119

	Original:     ubuntu-12.10-desktop-i386.iso
	Preprocessed: ubuntu 12 10 desktop i386 iso
	TF-IDF:
			12: 0.636
			i386: 0.504
			10: 0.405
			desktop: 0.302
			iso: 0.236
			ubuntu: 0.176

	Original:     ubuntu-19.04-desktop-amd64.iso
	Preprocessed: ubuntu 19 4 desktop amd64 iso
	TF-IDF:
			19: 0.771
			desktop: 0.340
			amd64: 0.312
			4: 0.289
			iso: 0.265
			ubuntu: 0.198

	Original:     ubuntu-18.04.5-live-server-amd64.iso
	Preprocessed: ubuntu 18 4 5 live server amd64 iso
	TF-IDF:
			5: 0.522
			18: 0.477
			live: 0.477
			server: 0.343
			amd64: 0.228
			4: 0.211
			iso: 0.194
			ubuntu: 0.145

	Original:     Ubuntu Facile 04 2014.pdf
	Preprocessed: ubuntu facile 4 2014 pdf
	TF-IDF:
			2014: 0.605
			pdf: 0.538
			facile: 0.538
			4: 0.193
			ubuntu: 0.132

Clustering...

Clustering results by cluster, including top features and their weights:

Cluster 0 (Top Features: 10 (0.323), netbook (0.216), 9 (0.195), ubuntu (0.145), remix (0.104)):
	Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.835)
	Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.903)
	Ubuntu 9.10 (Distance to Centroid: 0.771)
	ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.839)
	Ubuntu Netbook Remix (Distance to Centroid: 0.896)
	Ubuntu 10.04 Netbook (Distance to Centroid: 0.757)
	ubuntu-21.10-beta-pack (Distance to Centroid: 0.896)

Cluster 1 (Top Features: 16 (0.529), iso (0.180), ubuntu (0.177), 4 (0.175), i386 (0.144)):
	Ubuntu-16.04.5 (Distance to Centroid: 0.750)
	ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.590)
	ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.630)
	ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.652)
	ubuntu-pack-16.04-unity (Distance to Centroid: 0.914)
	ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.654)
	ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.723)
	ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.724)
	ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.667)
	ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.721)
	ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.658)
	ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.600)
	Ubuntu 16.10 (Distance to Centroid: 0.671)
	ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.820)

Cluster 2 (Top Features: live (0.518), server (0.372), 4 (0.256), amd64 (0.248), iso (0.210)):
	ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.670)
	ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.603)
	ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.553)
	ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.531)
	ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.492)
	ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.458)
	ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.620)
	ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.486)
	ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.605)

Cluster 3 (Top Features: 10 (0.239), desktop (0.233), 12 (0.209), i386 (0.209), iso (0.208)):
	ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.865)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.840)
	ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.931)
	ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.855)
	ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.911)
	ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.793)
	ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.789)
	ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.802)
	ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.769)
	ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.890)
	ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.723)
	ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.802)
	Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.737)
	ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.750)
	ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.873)
	ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.882)
	ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.733)
	ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.861)
	ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.823)
	ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.607)

Cluster 4 (Top Features: pdf (0.278), facile (0.278), 1 (0.237), ultimate (0.167), 2014 (0.167)):
	ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.880)
	Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.895)
	Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.621)
	[Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 0.962)
	Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 0.762)
	Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 0.762)
	Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.707)

Cluster 5 (Top Features: 20 (0.490), 4 (0.264), desktop (0.239), amd64 (0.182), iso (0.173)):
	Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.655)
	ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.676)
	ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.609)
	ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.578)
	Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.879)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.748)
	ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.567)
	ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.441)
	Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.738)
	ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.474)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.640)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.679)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.995)
	ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.653)

Cluster 6 (Top Features: ubuntu (0.228), 4 (0.205), 14 (0.200), amd64 (0.176), desktop (0.173)):
	Ubuntu Linux ebook pack (Distance to Centroid: 1.061)
	ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.810)
	ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.782)
	Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.063)
	Ubuntu Linux основы администрирования (Distance to Centroid: 1.062)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.904)
	ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.793)
	ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.716)
	ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.732)
	ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.743)
	ubuntu (Distance to Centroid: 0.900)
	ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.850)
	ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.790)
	ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.798)
	ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.630)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.812)
	ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.799)
	ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.873)
	Ubuntu (Distance to Centroid: 0.900)
	ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.627)
	ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.839)
	Ubuntu reducido (Distance to Centroid: 1.055)
	Ubuntu-Book_RU.djvu (Distance to Centroid: 1.072)
	ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.855)
	ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.803)
	ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.773)
	Ubuntu-18.04 (Distance to Centroid: 0.890)
	ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.839)
	ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.744)
	ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.763)
	Ubuntu Satanic Edition 666.4 (Distance to Centroid: 1.030)

Cluster 7 (Top Features: 19 (0.379), server (0.268), amd64 (0.249), iso (0.211), 10 (0.174)):
	ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.515)
	ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.851)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.017)
	ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.602)
	ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.595)
	ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.708)
	ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.860)
	ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.795)
	ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.622)
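
For reading the weights in the listing above, here is a minimal pure-Python sketch of the TF-IDF weighting, assuming the vectorizer runs with scikit-learn's documented defaults (raw term counts, smoothed IDF, L2 normalization). The function name `tfidf_by_hand` and the three-title corpus are illustrative only and are not part of the script.

```python
import math
from collections import Counter


def tfidf_by_hand(docs):
    """Hand-rolled TF-IDF: raw counts, smoothed IDF, L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many titles each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, never below 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        raw = {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        # L2 norm of the row; fall back to 1.0 for an empty title
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Hypothetical, already preprocessed titles
corpus = ["ubuntu 18 4", "ubuntu 18 10", "ubuntu"]
for doc, vec in zip(corpus, tfidf_by_hand(corpus)):
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    print(doc, [(term, round(weight, 3)) for term, weight in ranked])
```

Under these assumptions a term that occurs in nearly every title gets an IDF close to the minimum of 1, which is why `ubuntu` is the lowest-weighted token in almost every record, and a title consisting only of `ubuntu` normalizes to a weight of 1.000.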

The script:

import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK resources used below (stopword list and WordNet data for the lemmatizer)
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
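    # Replace punctuation (including hyphens) with spaces; re.escape keeps
    # characters such as ']' and '\' literal inside the character class
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)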
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
)

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern. The default pattern, r'(?u)\b\w\w+\b', drops single-character tokens.
# r'(?u)\b\w+\b' matches any run of one or more word characters, so single letters and single digits (such as the 4 in 16.04) stay in the analysis.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal:     {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
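# Cluster using K-means; n_clusters=8 is the scikit-learn default, stated explicitly
# because the output above has eight clusters (0-7)
kmeans = KMeans(n_clusters=8, random_state=42)
kmeans.fit(X)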

# Getting cluster centroids
centroids = kmeans.cluster_centers_

# Identify the top-weighted words of each cluster centroid and store them in a dictionary
# (feature_names was already retrieved from the vectorizer above)
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    sorted_feature_indices = centroid.argsort()[::-1]
    top_n = 5  # Number of key words
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    cluster_top_features_with_weights[i] = top_features_with_weights

# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)

# Grouping titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])

# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)

# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    top_features_str = ', '.join(
        [f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]])
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
    for title in clusters[cluster]:
        # Find the index of the current title
        title_index = results.index(title)
        # Calculate "fit" metric as the distance to the centroid of its cluster
        # The distance itself is used as a metric of fit
        fit_metric = distances_to_centroids[title_index, cluster]
        print(f"\t{title} (Distance to Centroid: {fit_metric:.3f})")
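
The effect of the custom `token_pattern` can be checked in isolation with the standard `re` module. This is a minimal sketch, assuming (to my understanding of scikit-learn) that the vectorizer tokenizes by collecting all matches of the pattern. The digits-only pattern is a hypothetical variation shown for contrast, and the `tokenize` helper is illustrative, not part of the script.

```python
import re

# scikit-learn's documented default: tokens of two or more word characters
DEFAULT_PATTERN = r'(?u)\b\w\w+\b'
# Pattern used in the script: tokens of one or more word characters
SCRIPT_PATTERN = r'(?u)\b\w+\b'
# Hypothetical digits-only variant: tokens made up entirely of digits
DIGITS_PATTERN = r'(?u)\b\d+\b'


def tokenize(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)


# A preprocessed title taken from the listing above
title = 'ubuntu 16 4 5 desktop amd64 iso'
for name, pattern in [('default', DEFAULT_PATTERN),
                      ('script', SCRIPT_PATTERN),
                      ('digits', DIGITS_PATTERN)]:
    print(f'{name:8} {tokenize(pattern, title)}')
```

The default pattern loses the point-release digits `4` and `5`, the script's pattern keeps every token, and the digits-only variant keeps `16`, `4` and `5` while skipping the `64` embedded in `amd64`, because no word boundary precedes it.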

@drew2a
Contributor Author

drew2a commented Feb 15, 2024

Can you create a tiny example with identical Ubuntu-server.iso filenames and try to get numeric clusters, with 18.04 and 18.10 together, plus 22.04 and 22.10? Is the number signal being thrown away?

I've slightly modified the pattern for the TfidfVectorizer from (?u)\b\w+\b to (?u)\b\d+\b (which means "use only digits"), and obtained results quite close to what was requested:

Cluster 0 (Top Features: 17 (0.923), 10 (0.224), 4 (0.152), 9 (0.000), 2 (0.000)):
	ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.272)
	ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.272)

Cluster 1 (Top Features: 14 (0.773), 4 (0.295), 10 (0.148), 6 (0.123), 5 (0.060)):
	ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.279)
	ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.279)
	ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.279)
	ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.279)
	ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.442)
	ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.554)
	ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.554)
	ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.554)
	ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.655)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.655)
	ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.685)
	ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.708)


Cluster 2 (Top Features: 20 (0.660), 4 (0.378), 2 (0.170), 3 (0.140), 1 (0.091)):
	ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.352)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.352)
	ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.352)
	ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.423)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.423)
	ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.423)
	Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.644)
	ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.644)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.651)
	ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.651)
	Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.651)
	ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.683)
	Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.683)
	ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.752)
	ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.776)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.776)

Cluster 3 (Top Features: 21 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
	ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
	ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
	ubuntu-21.10-beta-pack (Distance to Centroid: 0.249)
	ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.373)
	ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.373)

Cluster 4 (Top Features: 10 (0.167), 4 (0.143), 12 (0.123), 19 (0.101), 11 (0.101)):
	Ubuntu Linux ebook pack (Distance to Centroid: 0.325)
	Ubuntu (Distance to Centroid: 0.325)
	ubuntu (Distance to Centroid: 0.325)
	Ubuntu-Book_RU.djvu (Distance to Centroid: 0.325)
	Ubuntu reducido (Distance to Centroid: 0.325)
	Ubuntu Linux основы администрирования (Distance to Centroid: 0.325)
	Ubuntu Netbook Remix (Distance to Centroid: 0.325)
	Ubuntu 10.04 Netbook (Distance to Centroid: 0.818)
	ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.847)
	Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.847)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.857)
	ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.872)
	ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.872)
	Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.872)
	ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.872)
	ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.872)
	ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.872)
	ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.877)
	ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.878)
	ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.892)
	ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.892)
	ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.892)
	ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.903)
	ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.903)
	ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.903)
	ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.903)
	Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.920)
	Ubuntu 9.10 (Distance to Centroid: 0.920)
	ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.937)
	ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.938)
	ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.944)
	ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.944)
	ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.944)
	Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.968)
	ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.969)
	Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.971)
	Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.983)
	ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.992)
	Ubuntu Satanic Edition 666.4 (Distance to Centroid: 0.992)
	[Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 1.000)
	Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 1.007)
	Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 1.007)
	Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.030)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.030)

Cluster 5 (Top Features: 16 (0.688), 4 (0.226), 10 (0.174), 6 (0.160), 7 (0.116)):
	ubuntu-pack-16.04-unity (Distance to Centroid: 0.416)
	ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.416)
	ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.552)
	ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.552)
	Ubuntu 16.10 (Distance to Centroid: 0.552)
	ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.552)
	ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.645)
	ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.645)
	ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.645)
	ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.694)
	Ubuntu-16.04.5 (Distance to Centroid: 0.694)
	ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.747)
	ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.760)
	ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.760)

Cluster 6 (Top Features: 18 (0.772), 4 (0.302), 10 (0.114), 6 (0.072), 5 (0.071)):
	Ubuntu-18.04 (Distance to Centroid: 0.252)
	ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.252)
	ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.252)
	ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.404)
	ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.569)
	ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.569)
	ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.648)
	ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.671)
	ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.671)
	ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.684)

Cluster 7 (Top Features: 22 (0.773), 4 (0.235), 3 (0.173), 10 (0.137), 2 (0.090)):
	ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.330)
	ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.330)
	ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.524)
	ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.524)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.561)
	ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.561)
	ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.638)
	ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.684)

The script:

import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('/ubuntu.txt').read_text().split('\n')) if r
))

first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only purely numeric tokens.
# The token_pattern r'(?u)\b\d+\b' matches runs of one or more digits delimited by word boundaries, so words and mixed tokens such as "amd64" are ignored, while single-digit numbers are kept.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal:     {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Cluster using K-means (n_clusters is left at scikit-learn's default of 8, hence the eight clusters in the output)
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Getting cluster centroids
centroids = kmeans.cluster_centers_

# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    sorted_feature_indices = centroid.argsort()[::-1]
    top_n = 5  # Number of key words
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    cluster_top_features_with_weights[i] = top_features_with_weights

# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)

# Grouping titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])

# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)

# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    top_features_str = ', '.join(
        f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
    )
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")

    # Prepare a list to hold titles and their distances
    titles_and_distances = []
    for title in clusters[cluster]:
        # Find the index of the current title
        title_index = results.index(title)
        # Calculate "fit" metric as the distance to the centroid of its cluster
        fit_metric = distances_to_centroids[title_index, cluster]
        # Add title and its distance to the list
        titles_and_distances.append((title, fit_metric))

    # Sort titles within the cluster by their distance to the centroid (ascending order)
    sorted_titles_and_distances = sorted(titles_and_distances, key=lambda x: x[1])

    # Print sorted titles by their distance to centroid
    for title, distance in sorted_titles_and_distances:
        print(f"\t{title} (Distance to Centroid: {distance:.3f})")
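
The effect of switching the token pattern from (?u)\b\w+\b to (?u)\b\d+\b can be sketched with the standard library alone. This is a minimal illustration, not the script above: `normalize` is a simplified stand-in for `preprocess_text` (no stopword removal or lemmatization), and `tokenize` is a hypothetical helper that applies the pattern with `re.findall`, which is how scikit-learn's default tokenizer uses `token_pattern`.

```python
import re
import string


def normalize(title: str) -> str:
    """Simplified stand-in for preprocess_text (assumption: no stopwords, no lemmatizer)."""
    text = title.lower()
    # Replace every punctuation character (including hyphens and dots) with a space
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
    # Strip leading zeros so that "04" and "4" become the same token
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    return ' '.join(text.split())


WORD_PATTERN = r'(?u)\b\w+\b'   # original pattern: any alphanumeric token
DIGIT_PATTERN = r'(?u)\b\d+\b'  # modified pattern: purely numeric tokens only


def tokenize(text: str, pattern: str) -> list:
    """Hypothetical helper: extract tokens the way a regex-based tokenizer would."""
    return re.findall(pattern, text)


title = 'ubuntu-18.04.1-desktop-amd64.iso'
normalized = normalize(title)
word_tokens = tokenize(normalized, WORD_PATTERN)
digit_tokens = tokenize(normalized, DIGIT_PATTERN)

print(normalized)
print(word_tokens)
print(digit_tokens)
```

With the digit-only pattern, "amd64" contributes nothing because there is no word boundary between "d" and "6", so only the version components remain as features. That is consistent with the top features in the output above being numbers such as 14, 4 and 10.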

@drew2a
Contributor Author

drew2a commented Feb 20, 2024

In the previous example, elements were grouped fairly well, but one cluster (Cluster 4) consisted mostly of elements close to noise. To identify such a cluster (and filter it out in the future), we calculated the intra-cluster dispersion: the average distance of a cluster's points from its centroid. The rationale is that a cluster whose points lie far from its centroid on average is less cohesive and likely contains more noise, which makes it a candidate for exclusion from further analysis.

To implement this, we first grouped the indices of the elements belonging to each cluster. Then, for each cluster, we built a matrix of its points by vertically stacking the corresponding rows of the TF-IDF matrix X. The pairwise_distances function gave us the Euclidean distance between each point in the cluster and the cluster's centroid, and the mean of these distances is the intra-cluster dispersion.

The resulting average distances give a simple measure of how tight each cluster is. Clusters with lower average distances are more cohesive: their elements are closely related to each other and to the cluster's overall theme. Clusters with higher average distances are candidates for exclusion, since their wide dispersion suggests the lack of a unifying theme or the presence of outliers. In the output below, the noisy cluster (Cluster 1) has the highest intra-cluster distance, 0.827, while the other clusters range from 0.299 to 0.627.
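
The dispersion measure described above can be sketched on toy data with the standard library alone. This is an illustration under assumptions rather than the script used here: each centroid is recomputed as the component-wise mean of its cluster's points (instead of being read from `kmeans.cluster_centers_`), and `NOISE_THRESHOLD` is a hypothetical cut-off that would need tuning on real data.

```python
import math
from collections import defaultdict
from statistics import mean


def mean_distance_to_centroid(points, labels):
    """Return {label: mean Euclidean distance of the cluster's points to its centroid}."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        members[label].append(point)

    dispersion = {}
    for label, cluster_points in members.items():
        # Centroid = component-wise mean of the cluster's points
        centroid = tuple(mean(axis) for axis in zip(*cluster_points))
        dispersion[label] = mean(math.dist(p, centroid) for p in cluster_points)
    return dispersion


# Toy 2-D data: cluster 0 is tight, cluster 1 is spread out
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]

dispersion = mean_distance_to_centroid(points, labels)

NOISE_THRESHOLD = 2.0  # hypothetical value, chosen only for this toy example
noisy_clusters = sorted(label for label, d in dispersion.items() if d > NOISE_THRESHOLD)

print(dispersion)
print(noisy_clusters)
```

A fixed threshold is the simplest possible filter; comparing each cluster's dispersion against the median across clusters would be another option worth trying.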

Cluster 0 (Top Features: 14 (0.773), 4 (0.295), 10 (0.148), 6 (0.123), 5 (0.060)):
Intra-cluster distance: 0.494
	ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.279)
	ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.279)
	ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.279)
	ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.279)
	ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.442)
	ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.554)
	ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.554)
	ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.554)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.655)
	ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.655)
	ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.685)
	ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.708)

Cluster 1 (Top Features: 4 (0.174), 10 (0.162), 18 (0.153), 22 (0.119), 19 (0.097)):
Intra-cluster distance: 0.827
	Ubuntu (Distance to Centroid: 0.345)
	Ubuntu Linux ebook pack (Distance to Centroid: 0.345)
	Ubuntu Linux основы администрирования (Distance to Centroid: 0.345)
	Ubuntu Netbook Remix (Distance to Centroid: 0.345)
	Ubuntu reducido (Distance to Centroid: 0.345)
	Ubuntu-Book_RU.djvu (Distance to Centroid: 0.345)
	ubuntu (Distance to Centroid: 0.345)
	ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.812)
	Ubuntu 10.04 Netbook (Distance to Centroid: 0.812)
	ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.826)
	ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.826)
	Ubuntu-18.04 (Distance to Centroid: 0.835)
	ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.835)
	ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.835)
	ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.861)
	ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.861)
	ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.870)
	ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.874)
	ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.874)
	ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.887)
	ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.887)
	ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.887)
	ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.892)
	ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.894)
	ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.894)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.894)
	ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.897)
	ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.903)
	ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.903)
	ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.922)
	ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.944)
	ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.944)
	ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.944)
	Ubuntu 9.10 (Distance to Centroid: 0.948)
	Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.948)
	ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.950)
	ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.950)
	ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.968)
	ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.968)
	Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.987)
	Ubuntu Satanic Edition 666.4 (Distance to Centroid: 0.991)
	ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.991)
	Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 1.016)
	Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 1.016)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.037)
	Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.037)

Cluster 2 (Top Features: 1 (0.694), 4 (0.200), 20 (0.153), 2014 (0.101), 9 (0.098)):
Intra-cluster distance: 0.627
	ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.394)
	Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.518)
	ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.518)
	ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.639)
	ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.656)
	Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.759)
	[Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 0.759)
	Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.776)

Cluster 3 (Top Features: 21 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299
	ubuntu-21.10-beta-pack (Distance to Centroid: 0.249)
	ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
	ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.249)
	ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.373)
	ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.373)

Cluster 4 (Top Features: 12 (0.776), 5 (0.291), 4 (0.261), 10 (0.153), 9 (0.000)):
Intra-cluster distance: 0.466
	ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.380)
	ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.429)
	ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.429)
	ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.429)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.493)
	Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.552)
	ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.552)

Cluster 5 (Top Features: 16 (0.688), 4 (0.226), 10 (0.174), 6 (0.160), 7 (0.116)):
Intra-cluster distance: 0.616
	ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.416)
	ubuntu-pack-16.04-unity (Distance to Centroid: 0.416)
	Ubuntu 16.10 (Distance to Centroid: 0.552)
	ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.552)
	ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.552)
	ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.552)
	ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.645)
	ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.645)
	ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.645)
	Ubuntu-16.04.5 (Distance to Centroid: 0.694)
	ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.694)
	ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.747)
	ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.760)
	ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.760)

Cluster 6 (Top Features: 11 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299
	Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.249)
	ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.249)
	ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.249)
	ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.373)
	ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.373)

Cluster 7 (Top Features: 20 (0.666), 4 (0.388), 2 (0.194), 3 (0.160), 0 (0.093)):
Intra-cluster distance: 0.556
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.359)
	ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.359)
	ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.359)
	ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426)
	ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.426)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426)
	Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.624)
	ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.624)
	Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.637)
	ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637)
	ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.757)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.758)
	ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.758)

The script:

import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from scipy.sparse import vstack

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))

first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only purely numeric tokens.
# The token_pattern r'(?u)\b\d+\b' matches runs of one or more digits delimited by word boundaries, so words and mixed tokens such as "amd64" are ignored, while single-digit numbers are kept.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal:     {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Cluster using K-means (n_clusters is left at scikit-learn's default of 8, hence the eight clusters in the output)
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Getting cluster centroids
centroids = kmeans.cluster_centers_
# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = {}  # cluster label -> mean distance of its points to the centroid

# Grouping titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)

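# Intra-cluster dispersion: mean Euclidean distance from each cluster's points to its centroid
for cluster, indices in clusters_indices.items():
    if indices:
        # Stack the cluster's TF-IDF rows into one sparse matrix
        points_matrix = vstack([X.getrow(i) for i in indices])
        # Distance from every point of the cluster to the cluster's own centroid
        distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
        intra_cluster_distance = np.mean(distances)
        intra_cluster_distances[cluster] = intra_cluster_distance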

# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    sorted_feature_indices = centroid.argsort()[::-1]
    top_n = 5  # Number of key words
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    cluster_top_features_with_weights[i] = top_features_with_weights

# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)

# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    top_features_str = ', '.join(
        f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
    )
    intra_cluster_distance = intra_cluster_distances[cluster]
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
    print(f'Intra-cluster distance: {intra_cluster_distance:.3f}')
    # Prepare a list to hold titles and their distances
    titles_and_distances = []
    for title in clusters[cluster]:
        # Find the index of the current title
        title_index = results.index(title)
        # Calculate "fit" metric as the distance to the centroid of its cluster
        fit_metric = distances_to_centroids[title_index, cluster]
        # Add title and its distance to the list
        titles_and_distances.append((title, fit_metric))

    # Sort titles within the cluster by their distance to the centroid (ascending order)
    sorted_titles_and_distances = sorted(titles_and_distances, key=lambda x: x[1])

    # Print sorted titles by their distance to centroid
    for title, distance in sorted_titles_and_distances:
        print(f"\t{title} (Distance to Centroid: {distance:.3f})")

@drew2a
Contributor Author

drew2a commented Feb 20, 2024

Another metric that can be used for filtering clusters is the silhouette coefficient (its values range from -1 to 1). It reflects both the separation between clusters and the cohesion within them. By calculating the silhouette coefficient for each sample, we can evaluate not just the overall clustering but also each individual cluster, which helps identify clusters that are poorly defined or contain outliers.

To implement this, we used the silhouette_samples function from scikit-learn, which computes the silhouette coefficient for each sample and so shows how well each sample fits within its assigned cluster compared to neighboring clusters. By aggregating these scores on a per-cluster basis, we computed an average silhouette score for each cluster. This average score serves as a proxy for the cluster's quality, with higher scores indicating tighter and more distinct clusters, and lower scores suggesting clusters with overlapping or diffuse boundaries.

This approach allowed us to systematically evaluate each cluster's integrity. Clusters with low average silhouette scores were flagged for exclusion. In the output below, the noisy Cluster 1 has an average silhouette score of 0.056, compared with 0.429 for Cluster 0.
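
The per-cluster aggregation described above can be sketched as follows. To stay self-contained, the sketch computes the silhouette coefficient by hand from its definition instead of calling scikit-learn's silhouette_samples. The function names, the toy data and `MIN_AVERAGE_SILHOUETTE` are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def silhouette_per_sample(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every sample.

    a: mean distance to the other members of the sample's own cluster
    b: smallest mean distance to the members of any other cluster
    Requires at least two clusters.
    """
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for single-element clusters
            continue
        a = mean(math.dist(points[i], points[j]) for j in own)
        b = min(
            mean(math.dist(points[i], points[j]) for j in other)
            for other_label, other in members.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return scores


def average_by_cluster(scores, labels):
    """Aggregate per-sample scores into one average score per cluster."""
    grouped = defaultdict(list)
    for score, label in zip(scores, labels):
        grouped[label].append(score)
    return {label: mean(values) for label, values in grouped.items()}


# Toy 1-D data: the last point sits next to cluster 0 but is labeled as cluster 1
points = [(0.0,), (1.0,), (10.0,), (11.0,), (2.0,)]
labels = [0, 0, 1, 1, 1]

scores = silhouette_per_sample(points, labels)
averages = average_by_cluster(scores, labels)

MIN_AVERAGE_SILHOUETTE = 0.25  # hypothetical cut-off for this toy example
flagged = sorted(label for label, avg in averages.items() if avg < MIN_AVERAGE_SILHOUETTE)

print(averages)
print(flagged)
```

The misplaced point gets a negative score and drags down the average of its cluster, which is the same effect that makes the noisy cluster stand out in the output that follows.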

Cluster 0 (Top Features: 14 (0.773), 4 (0.295), 10 (0.148), 6 (0.123), 5 (0.060)):
Intra-cluster distance: 0.494 (the less the better)
Average Silhouette Score = 0.429 (the higher the better)
	ubuntu-14.04-desktop-amd64.iso (Distance to Centroid: 0.279, Silhouette Score: 0.601)
	ubuntu-14.04-desktop-i386.iso (Distance to Centroid: 0.279, Silhouette Score: 0.601)
	ubuntu-14.04-server-amd64.ova (Distance to Centroid: 0.279, Silhouette Score: 0.601)
	ubuntu-14.04-server-i386.iso (Distance to Centroid: 0.279, Silhouette Score: 0.601)
	ubuntu-14.04.4-desktop-amd64.iso (Distance to Centroid: 0.442, Silhouette Score: 0.451)
	ubuntu-14.10-desktop-amd64.iso (Distance to Centroid: 0.554, Silhouette Score: 0.442)
	ubuntu-14.10-desktop-i386.iso (Distance to Centroid: 0.554, Silhouette Score: 0.442)
	ubuntu-14.10-server-amd64.iso (Distance to Centroid: 0.554, Silhouette Score: 0.442)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Distance to Centroid: 0.655, Silhouette Score: 0.343)
	ubuntu-14.04.6-desktop-i386.iso (Distance to Centroid: 0.655, Silhouette Score: 0.343)
	ubuntu-14.04.1-server-amd64.iso (Distance to Centroid: 0.685, Silhouette Score: 0.056)
	ubuntu-14.04.5-server-amd64.iso (Distance to Centroid: 0.708, Silhouette Score: 0.224)

Cluster 1 (Top Features: 4 (0.174), 10 (0.162), 18 (0.153), 22 (0.119), 19 (0.097)):
Intra-cluster distance: 0.827 (the less the better)
Average Silhouette Score = 0.056 (the higher the better)
	Ubuntu (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	Ubuntu Linux ebook pack (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	Ubuntu Linux основы администрирования (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	Ubuntu Netbook Remix (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	Ubuntu reducido (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	Ubuntu-Book_RU.djvu (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	ubuntu (Distance to Centroid: 0.345, Silhouette Score: 0.133)
	ubuntu-18.04.4-desktop-amd64.iso (Distance to Centroid: 0.812, Silhouette Score: 0.050)
	Ubuntu 10.04 Netbook (Distance to Centroid: 0.812, Silhouette Score: -0.024)
	ubuntu-18.10-desktop-amd64.iso (Distance to Centroid: 0.826, Silhouette Score: 0.089)
	ubuntu-18.10-server-amd64.iso (Distance to Centroid: 0.826, Silhouette Score: 0.089)
	Ubuntu-18.04 (Distance to Centroid: 0.835, Silhouette Score: 0.100)
	ubuntu-18.04-desktop-amd64.iso (Distance to Centroid: 0.835, Silhouette Score: 0.100)
	ubuntu-18.04-live-server-amd64.iso (Distance to Centroid: 0.835, Silhouette Score: 0.100)
	ubuntu-22.10-desktop-amd64.iso (Distance to Centroid: 0.861, Silhouette Score: 0.072)
	ubuntu-unity-22.10-desktop-amd64.iso (Distance to Centroid: 0.861, Silhouette Score: 0.072)
	ubuntu-18.04.3-live-server-amd64.iso (Distance to Centroid: 0.870, Silhouette Score: 0.016)
	ubuntu-22.04-desktop-amd64.iso (Distance to Centroid: 0.874, Silhouette Score: 0.062)
	ubuntu-22.04-live-server-amd64.iso (Distance to Centroid: 0.874, Silhouette Score: 0.062)
	ubuntu-19.10-desktop-amd64.iso (Distance to Centroid: 0.887, Silhouette Score: 0.075)
	ubuntu-19.10-live-server-amd64.iso (Distance to Centroid: 0.887, Silhouette Score: 0.075)
	ubuntu-mate-19.10-desktop-amd64.iso (Distance to Centroid: 0.887, Silhouette Score: 0.075)
	ubuntu-10.10-xenon-beta5 (Distance to Centroid: 0.892, Silhouette Score: -0.054)
	ubuntu-18.04.5-live-server-amd64.iso (Distance to Centroid: 0.894, Silhouette Score: -0.041)
	ubuntu-22.04.3-live-server-amd64.iso (Distance to Centroid: 0.894, Silhouette Score: 0.014)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Distance to Centroid: 0.894, Silhouette Score: 0.014)
	ubuntu-18.04.6-desktop-amd64.iso (Distance to Centroid: 0.897, Silhouette Score: 0.025)
	ubuntu-19.04-desktop-amd64.iso (Distance to Centroid: 0.903, Silhouette Score: 0.068)
	ubuntu-19.04-server-amd64.iso (Distance to Centroid: 0.903, Silhouette Score: 0.068)
	ubuntu-22.04.2-desktop-amd64.iso (Distance to Centroid: 0.922, Silhouette Score: -0.031)
	ubuntu-15.04-desktop-amd64.iso (Distance to Centroid: 0.944, Silhouette Score: 0.054)
	ubuntu-15.04-desktop-i386.iso (Distance to Centroid: 0.944, Silhouette Score: 0.054)
	ubuntu-15.04-server-amd64.iso (Distance to Centroid: 0.944, Silhouette Score: 0.054)
	Ubuntu 9.10 (Distance to Centroid: 0.948, Silhouette Score: 0.031)
	Ubuntu 9.10 Пользовательская сборка (Distance to Centroid: 0.948, Silhouette Score: 0.031)
	ubuntu-17.10-desktop-amd64.iso (Distance to Centroid: 0.950, Silhouette Score: 0.026)
	ubuntu-23.10-beta-desktop-amd64.iso (Distance to Centroid: 0.950, Silhouette Score: 0.026)
	ubuntu-17.04-server-amd64.iso (Distance to Centroid: 0.968, Silhouette Score: 0.026)
	ubuntu-23.04-live-server-amd64.iso (Distance to Centroid: 0.968, Silhouette Score: 0.026)
	Ubuntu Facile 04 2014.pdf (Distance to Centroid: 0.987, Silhouette Score: -0.025)
	Ubuntu Satanic Edition 666.4 (Distance to Centroid: 0.991, Silhouette Score: 0.015)
	ubuntu-13.04-desktop-i386.iso (Distance to Centroid: 0.991, Silhouette Score: 0.015)
	Ubuntu Facile - Aprile 2015.pdf (Distance to Centroid: 1.016, Silhouette Score: 0.068)
	Ubuntu Facile Marzo 2015.pdf (Distance to Centroid: 1.016, Silhouette Score: 0.068)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Distance to Centroid: 1.037, Silhouette Score: 0.046)
	Ubuntu Unleashed 2019 Edition (Distance to Centroid: 1.037, Silhouette Score: 0.046)

Cluster 2 (Top Features: 1 (0.694), 4 (0.200), 20 (0.153), 2014 (0.101), 9 (0.098)):
Intra-cluster distance: 0.627 (the less the better)
Average Silhouette Score = 0.217 (the higher the better)
	ubuntu-ultimate-1.4-dvd (Distance to Centroid: 0.394, Silhouette Score: 0.377)
	Ubuntu 20.04.1 Desktop.iso (Distance to Centroid: 0.518, Silhouette Score: 0.161)
	ubuntu-20.04.1-desktop-amd64.iso (Distance to Centroid: 0.518, Silhouette Score: 0.161)
	ubuntu-18.04.1-desktop-amd64.iso (Distance to Centroid: 0.639, Silhouette Score: 0.214)
	ubuntu-22.04.1-desktop-amd64.iso (Distance to Centroid: 0.656, Silhouette Score: 0.216)
	Ubuntu Ultimate Edition 1.9 (Distance to Centroid: 0.759, Silhouette Score: 0.199)
	[Ubuntu] Anonymous OS 0.1 (Distance to Centroid: 0.759, Silhouette Score: 0.216)
	Ubuntu Facile 01 2014.pdf (Distance to Centroid: 0.776, Silhouette Score: 0.195)

Cluster 3 (Top Features: 21 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299 (the less the better)
Average Silhouette Score = 0.712 (the higher the better)
	ubuntu-21.10-beta-pack (Distance to Centroid: 0.249, Silhouette Score: 0.758)
	ubuntu-21.10-desktop-amd64.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
	ubuntu-mate-21.10-desktop-amd64.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
	ubuntu-21.04-desktop-amd64.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)
	ubuntu-21.04-live-server-amd64.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)

Cluster 4 (Top Features: 12 (0.776), 5 (0.291), 4 (0.261), 10 (0.153), 9 (0.000)):
Intra-cluster distance: 0.466 (the less the better)
Average Silhouette Score = 0.506 (the higher the better)
	ubuntu-12.04-server-i386.iso (Distance to Centroid: 0.380, Silhouette Score: 0.513)
	ubuntu-12.04.5-desktop-amd64.iso (Distance to Centroid: 0.429, Silhouette Score: 0.573)
	ubuntu-12.04.5-desktop-i386.iso (Distance to Centroid: 0.429, Silhouette Score: 0.573)
	ubuntu-12.04.5-dvd-i386.iso (Distance to Centroid: 0.429, Silhouette Score: 0.573)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Distance to Centroid: 0.493, Silhouette Score: 0.419)
	Ubuntu 12.10 Desktop (i386) (Distance to Centroid: 0.552, Silhouette Score: 0.447)
	ubuntu-12.10-desktop-i386.iso (Distance to Centroid: 0.552, Silhouette Score: 0.447)

Cluster 5 (Top Features: 16 (0.688), 4 (0.226), 10 (0.174), 6 (0.160), 7 (0.116)):
Intra-cluster distance: 0.616 (the less the better)
Average Silhouette Score = 0.325 (the higher the better)
	ubuntu-16.04-desktop-i386.iso (Distance to Centroid: 0.416, Silhouette Score: 0.419)
	ubuntu-pack-16.04-unity (Distance to Centroid: 0.416, Silhouette Score: 0.419)
	Ubuntu 16.10 (Distance to Centroid: 0.552, Silhouette Score: 0.406)
	ubuntu-16.10-desktop-amd64.iso (Distance to Centroid: 0.552, Silhouette Score: 0.406)
	ubuntu-16.10-desktop-i386.iso (Distance to Centroid: 0.552, Silhouette Score: 0.406)
	ubuntu-16.10-server-arm64.iso (Distance to Centroid: 0.552, Silhouette Score: 0.406)
	ubuntu-16.04.6-desktop-i386.iso (Distance to Centroid: 0.645, Silhouette Score: 0.327)
	ubuntu-16.04.6-server-amd64.iso (Distance to Centroid: 0.645, Silhouette Score: 0.327)
	ubuntu-16.04.6-server-i386.iso (Distance to Centroid: 0.645, Silhouette Score: 0.327)
	Ubuntu-16.04.5 (Distance to Centroid: 0.694, Silhouette Score: 0.206)
	ubuntu-16.04.5-desktop-amd64.iso (Distance to Centroid: 0.694, Silhouette Score: 0.206)
	ubuntu-16.04.3-server-amd64.iso (Distance to Centroid: 0.747, Silhouette Score: 0.176)
	ubuntu-16.04.7-desktop-amd64.iso (Distance to Centroid: 0.760, Silhouette Score: 0.257)
	ubuntu-16.04.7-server-amd64.iso (Distance to Centroid: 0.760, Silhouette Score: 0.257)

Cluster 6 (Top Features: 11 (0.891), 10 (0.305), 4 (0.140), 9 (0.000), 2 (0.000)):
Intra-cluster distance: 0.299 (the less the better)
Average Silhouette Score = 0.712 (the higher the better)
	Ubuntu 11.10 Oneiric Ocelot (Distance to Centroid: 0.249, Silhouette Score: 0.758)
	ubuntu-11.10-desktop-i386.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
	ubuntu-11.10-dvd-amd64.iso (Distance to Centroid: 0.249, Silhouette Score: 0.758)
	ubuntu-11.04-alternate-i386.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)
	ubuntu-11.04-desktop-amd64.iso (Distance to Centroid: 0.373, Silhouette Score: 0.643)

Cluster 7 (Top Features: 20 (0.666), 4 (0.388), 2 (0.194), 3 (0.160), 0 (0.093)):
Intra-cluster distance: 0.556 (the less the better)
Average Silhouette Score = 0.390 (the higher the better)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Distance to Centroid: 0.359, Silhouette Score: 0.498)
	ubuntu-20.04-desktop-amd64.iso (Distance to Centroid: 0.359, Silhouette Score: 0.498)
	ubuntu-20.04-live-server-amd64.iso (Distance to Centroid: 0.359, Silhouette Score: 0.498)
	ubuntu-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426, Silhouette Score: 0.466)
	ubuntu-20.04.4-live-server-amd64.iso (Distance to Centroid: 0.426, Silhouette Score: 0.466)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Distance to Centroid: 0.426, Silhouette Score: 0.466)
	Ubuntu Server 20.04.2 LTS (Distance to Centroid: 0.624, Silhouette Score: 0.356)
	ubuntu-20.04.2-desktop-amd64.iso (Distance to Centroid: 0.624, Silhouette Score: 0.356)
	Ubuntu 20.04.3 (AMD64) (Server) (Distance to Centroid: 0.637, Silhouette Score: 0.367)
	ubuntu-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637, Silhouette Score: 0.367)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Distance to Centroid: 0.637, Silhouette Score: 0.367)
	ubuntu-20.10-desktop-amd64.iso (Distance to Centroid: 0.757, Silhouette Score: 0.227)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Distance to Centroid: 0.758, Silhouette Score: 0.266)
	ubuntu-20.04.2.0-desktop-amd64.iso (Distance to Centroid: 0.758, Silhouette Score: 0.266)

The script:

import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.sparse import vstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances, silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))

first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches any standalone run of one or more digits, so words such as "desktop" and mixed tokens such as "amd64" are ignored.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal:     {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)

# Getting cluster centroids
centroids = kmeans.cluster_centers_
# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = defaultdict(list)

silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
cluster_silhouette_scores = defaultdict(list)

# Grouping titles by their clusters
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    cluster_silhouette_scores[label].append(silhouette_vals[i])

for cluster, indices in clusters_indices.items():
    if indices:
        points_matrix = vstack([X.getrow(i) for i in indices])
        distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
        intra_cluster_distance = np.mean(distances)
        intra_cluster_distances[cluster] = intra_cluster_distance

# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    sorted_feature_indices = centroid.argsort()[::-1]
    top_n = 5  # Number of key words
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    cluster_top_features_with_weights[i] = top_features_with_weights

# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)

# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    top_features_str = ', '.join(
        f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
    )
    intra_cluster_distance = intra_cluster_distances[cluster]
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
    average_score = np.mean(cluster_silhouette_scores[cluster])
    print(f'Intra-cluster distance: {intra_cluster_distance:.3f} (the less the better)')
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Prepare a list to hold titles, their distances, and silhouette scores
    titles_distances_scores = []
    for i, title_index in enumerate(clusters_indices[cluster]):
        title = results[title_index]
        fit_metric = distances_to_centroids[title_index, cluster]
        silhouette_score = silhouette_vals[title_index]
        titles_distances_scores.append((title, fit_metric, silhouette_score))


    # Sort titles within the cluster by their distance to the centroid (ascending order)
    sorted_titles_distances_scores = sorted(titles_distances_scores, key=lambda x: x[1])

    # Print sorted titles by their distance to centroid and include silhouette score
    for title, distance, silhouette_score in sorted_titles_distances_scores:
        print(f"\t{title} (Distance to Centroid: {distance:.3f}, Silhouette Score: {silhouette_score:.3f})")

@drew2a
Contributor Author

drew2a commented Feb 21, 2024

The best part of any job is the visualization:
3d_all_clusters

3d_each_cluster

2d_each_cluster

The script:

# This script performs cluster analysis using the K-Means algorithm, applied to TF-IDF vectors built from titles
# loaded from a text file. It fits the K-Means model, then calculates and reports key metrics such as intra-cluster
# distances and silhouette scores in order to evaluate the cohesion and separation of clusters, and identifies the
# top features defining each cluster. Finally, it visualizes the results: a histogram of distances to the centroid
# for each cluster, and 3D PCA projections of all clusters together and of each cluster separately.
import math
import re
import string
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import euclidean_distances, pairwise_distances, silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))

first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches any standalone run of one or more digits, so words such as "desktop" and mixed tokens such as "amd64" are ignored.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal:     {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Cluster using K-means
# Initialize KMeans clustering with a fixed random state to ensure reproducibility
kmeans = KMeans(random_state=42)
# Fit the model to the data X to perform the clustering
kmeans.fit(X)

# Retrieve the centroids of the clusters formed by KMeans
centroids = kmeans.cluster_centers_
# Retrieve the labels assigned to each data point in X, indicating their cluster membership
labels = kmeans.labels_
# Initialize dictionaries to store clustering results and distances
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = defaultdict(list)

# Calculate the silhouette scores for each data point in X based on their cluster assignment
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Loop through each data point and its cluster label
for i, label in enumerate(labels):
    # Group data points by their cluster label, storing both the original titles and their indices
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    # Accumulate silhouette scores by cluster
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# For each cluster, calculate the average distance of points to the cluster's centroid
for cluster, indices in clusters_indices.items():
    if indices:
        # Convert the subset of X corresponding to the current cluster to a dense format
        points_matrix = X[indices, :].toarray()
        # Calculate pairwise Euclidean distances between points in the cluster and the cluster's centroid
        distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
        # Calculate and store the average intra-cluster distance as a measure of cluster cohesion
        intra_cluster_distance = np.mean(distances)
        intra_cluster_distances[cluster] = intra_cluster_distance

# Identify the top features (words) for each cluster based on the centroids' coordinates
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
    # Sort features in descending order of importance for the cluster
    sorted_feature_indices = centroid.argsort()[::-1]
    # Select the top N features for the cluster
    top_n = 5
    top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
    # Store the top features and their weights for each cluster
    cluster_top_features_with_weights[i] = top_features_with_weights

# Calculate distances from each data point to every cluster centroid (one column per cluster)
distances_to_centroids = kmeans.transform(X)


# Print summary information for each cluster, including its top features, average intra-cluster distance, and silhouette score
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
    # Join the top features with their weights into a string for printing
    top_features_str = ', '.join(
        f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
    )
    # Retrieve the average intra-cluster distance for the current cluster
    intra_cluster_distance = intra_cluster_distances[cluster]
    print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
    # Calculate and print the average silhouette score for the cluster
    average_score = np.mean(cluster_silhouette_scores[cluster])
    print(f'Intra-cluster distance: {intra_cluster_distance:.3f} (the less the better)')
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Initialize a list to hold titles, their distances from the centroid, and silhouette scores
    titles_distances_scores = []
    for i, title_index in enumerate(clusters_indices[cluster]):
        # Retrieve the title and its metrics
        title = results[title_index]
        fit_metric = distances_to_centroids[title_index, cluster]  # The distance of the title from its cluster centroid
        silhouette_score = silhouette_vals[title_index]  # The silhouette score of the title
        # Append the title and its metrics to the list
        titles_distances_scores.append((title, fit_metric, silhouette_score))

    # Sort titles within the cluster by their distance to the centroid in ascending order
    sorted_titles_distances_scores = sorted(titles_distances_scores, key=lambda x: x[1])

    # Print each title with its distance to the centroid and silhouette score
    for title, distance, silhouette_score in sorted_titles_distances_scores:
        print(f"\t{title} (Distance to Centroid: {distance:.3f}, Silhouette Score: {silhouette_score:.3f})")

# PLOT 1

# Assuming X, centroids, clusters_indices are already defined
fig, axs = plt.subplots(len(clusters.keys()), figsize=(10, 5 * len(clusters.keys())),)

# Convert axs to a list if there's only one subplot to standardize the iteration
if len(clusters.keys()) == 1:
    axs = [axs]

# Setting the universal scale for X-axis from 0 to 1
x_min, x_max = 0, 1

for idx, cluster in enumerate(sorted(clusters.keys())):
    indices = clusters_indices[cluster]
    if indices:
        # Convert cluster points to dense format if necessary
        cluster_points = X[indices, :].toarray()
        # Calculate distances from each point in the cluster to its centroid
        distances = euclidean_distances(cluster_points, centroids[[cluster]]).flatten()

        # Plot the histogram of distances with a uniform X-axis scale
        axs[idx].hist(distances, bins=20, alpha=0.7, label=f'Cluster {cluster}', range=(x_min, x_max))
        axs[idx].set_title(f'Distance to Centroid Distribution for Cluster {cluster}')
        axs[idx].set_xlabel('Distance to Centroid')
        axs[idx].set_ylabel('Number of Points')
        axs[idx].legend()
plt.tight_layout()

# PLOT 2
x_limits = (-0.6, 0.6)
y_limits = (-0.6, 0.6)
z_limits = (-0.6, 0.6)

# Convert data to dense format and apply PCA to reduce dimensionality to 3
pca = PCA(n_components=3)
X_pca_3d = pca.fit_transform(X.toarray())

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlim(x_limits)
ax.set_ylim(y_limits)
ax.set_zlim(z_limits)
# Get the color map for visualizing different clusters
# (plt.get_cmap replaces plt.cm.get_cmap, which was removed in Matplotlib 3.9)
colors = plt.get_cmap('tab10', len(clusters.keys()))

for cluster in sorted(clusters.keys()):
    cluster_indices = clusters_indices[cluster]
    cluster_points = X_pca_3d[cluster_indices, :]
    clr = colors(cluster)
    ax.scatter(cluster_points[:, 0], cluster_points[:, 1], cluster_points[:, 2], color=clr, label=f'Cluster {cluster}', alpha=0.6)

# Apply PCA to centroids to get their coordinates in 3D space
centroids_pca_3d = pca.transform(centroids)

# Draw centroids with corresponding colors
for i, centroid in enumerate(centroids_pca_3d):
    ax.scatter(centroid[0], centroid[1], centroid[2], color=colors(i), marker='x', s=100, edgecolor='k', linewidths=2)

ax.set_title('3D Cluster Visualization with PCA')
ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_zlabel('PCA Component 3')
ax.view_init(elev=20, azim=-35)
plt.legend()

# PLOT 3

# Determining the number of clusters
n_clusters = len(clusters.keys())

# Calculating the optimal number of rows and columns for subplots
rows = math.ceil(math.sqrt(n_clusters))
cols = math.ceil(n_clusters / rows)

# Creating a figure for subplots
fig = plt.figure(figsize=(cols * 6, rows * 5))

# Drawing each cluster in its own subplot
for idx, cluster in enumerate(sorted(clusters.keys())):
    ax = fig.add_subplot(rows, cols, idx + 1, projection='3d')
    cluster_indices = clusters_indices[cluster]
    cluster_points = X_pca_3d[cluster_indices, :]
    centroids_pca_3d = pca.transform([centroids[cluster]])
    ax.set_xlim(x_limits)
    ax.set_ylim(y_limits)
    ax.set_zlim(z_limits)
    # Visualizing cluster points
    ax.scatter(cluster_points[:, 0], cluster_points[:, 1], cluster_points[:, 2], label=f'Cluster {cluster}', alpha=0.6)

    # Visualizing the centroid
    ax.scatter(centroids_pca_3d[:, 0], centroids_pca_3d[:, 1], centroids_pca_3d[:, 2], color='black', marker='x', s=100, label='Centroid')

    ax.set_title(f'Cluster {cluster}')
    ax.set_xlabel('PCA Component 1')
    ax.set_ylabel('PCA Component 2')
    ax.set_zlabel('PCA Component 3')
    ax.view_init(elev=20, azim=-35)

plt.tight_layout()
plt.show()

@drew2a
Copy link
Contributor Author

drew2a commented Feb 27, 2024

Instead of integrating the current algorithm into Tribler, I decided to focus on its improvement and dedicate half of the current week to this task.

I haven't yet focused on measuring the algorithm's performance because I want to first ensure that the clustering results are as accurate as possible. There are two main areas I'm currently working on to improve the quality of the clustering:

  1. Figuring out how to determine the optimal number of clusters, which is crucial for accurately grouping the data.
  2. Incorporating the position of words within the text into the algorithm, which I believe will greatly enhance the quality of the results.

Once I'm confident that the algorithm is producing the best possible clustering outcomes, I'll turn my attention to optimizing its performance.

So, the next iteration of the algorithm contains two modifications:

Transition from KMeans to HDBSCAN for Clustering

Initially, our algorithm employed KMeans for clustering, which necessitates specifying the number of clusters a priori. This requirement posed a significant limitation, as determining the optimal number of clusters is not straightforward and can vary significantly depending on the dataset's nature and size. To address this challenge, we transitioned to using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). Unlike KMeans, HDBSCAN does not require pre-specification of the number of clusters. Instead, it dynamically identifies clusters based on data density, offering several advantages:
Adaptability: HDBSCAN adapts to the inherent structure of the data, leading to more meaningful and natural groupings.
Noise Handling: It identifies and isolates noise, improving the overall quality of the clusters. Unlike KMeans, where every point is assigned to a cluster regardless of how well it fits, HDBSCAN can leave points unassigned (labeled as -1).
Variable Cluster Sizes: The algorithm accommodates clusters of varying densities and sizes, aligning closer with real-world data distributions.

This shift aims to achieve more accurate and representative clustering by leveraging the data's natural structure, potentially enhancing the user experience through more precise content categorization.

Incorporating N-Grams into TF-IDF Vectorization

The original vectorization approach using TF-IDF (Term Frequency-Inverse Document Frequency) treated each term independently, without considering the order or proximity of terms. To capture the sequence in which terms appear, we integrated n-grams into our TF-IDF vectorization (TfidfVectorizer(token_pattern=r'(?u)\b\d+\b', ngram_range=(1, 2))). N-grams are contiguous sequences of n items from a given sample of text; with ngram_range=(1, 2), the features are single tokens plus pairs of adjacent tokens. By incorporating n-grams:

Contextual Awareness: The algorithm can now recognize and give weight to term proximity and order, capturing more nuanced meanings.
Feature Enrichment: Including n-grams expands the feature set with phrase-level information, which is particularly beneficial for understanding the context and thematic content.
Quality Improvement: This adjustment is anticipated to significantly enhance the quality of clustering by providing a richer, more contextually informed feature set for analysis.
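
To make the effect of `ngram_range=(1, 2)` concrete, here is a small standard-library sketch of which features a single title contributes. `digit_ngrams` is a hypothetical helper written for this illustration; it approximates the preprocessing and tokenization steps of the script (it skips stopword removal, lemmatization, and bracket stripping) and is not part of scikit-learn.

```python
import re
import string


def digit_ngrams(title, ngram_range=(1, 2)):
    """Return the numeric n-grams a title would contribute (hypothetical helper)."""
    # Simplified preprocessing: lowercase, replace punctuation with spaces,
    # then strip leading zeros ("04" -> "4").
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', title.lower())
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Same token pattern as in the script: standalone runs of digits only.
    tokens = re.findall(r'(?u)\b\d+\b', text)
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


print(digit_ngrams('ubuntu-20.04.1-desktop-amd64.iso'))
# ['20', '4', '1', '20 4', '4 1']
print(digit_ngrams('Ubuntu 16.10'))
# ['16', '10', '16 10']
print(digit_ngrams('Ubuntu Netbook Remix'))
# []
```

The bigrams `20 4` and `4 1` are what let the vectorizer tell 20.04.1 apart from other titles that merely contain a `20`, a `4`, and a `1` somewhere; titles without any numeric token end up with an empty feature set.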

Clustering results by cluster:

Cluster 14 (features: ):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu (Silhouette Score: 1.000)
	Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 1.000)
	Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 1.000)
	Ubuntu Linux ebook pack (Silhouette Score: 1.000)
	Ubuntu Linux основы администрирования (Silhouette Score: 1.000)
	Ubuntu Netbook Remix (Silhouette Score: 1.000)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: 1.000)
	Ubuntu Unleashed 2019 Edition (Silhouette Score: 1.000)
	Ubuntu reducido (Silhouette Score: 1.000)
	Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)
	ubuntu (Silhouette Score: 1.000)

Cluster 31 (features: 20 4: 3.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 1.000)
	ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 13 (features: 11 10: 3.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
	ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 1.000)

Cluster 12 (features: 12 10: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
	ubuntu-12.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 11 (features: 16 10: 4.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 16.10 (Silhouette Score: 1.000)
	ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-16.10-server-arm64.iso (Silhouette Score: 1.000)

Cluster 18 (features: 4 1: 1.600, 20 4: 1.200):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
	ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 26 (features: 2 0: 1.371, 4 2: 1.165, 20 4: 0.874):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 1.000)
	ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 32 (features: 4 3: 2.332, 20 4: 1.888):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
	ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 10 (features: 9 10: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 9.10 (Silhouette Score: 1.000)
	Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 25 (features: 4 2: 1.600, 20 4: 1.200):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu Server 20.04.2 LTS (Silhouette Score: 1.000)
	ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 23 (features: 4 5: 1.477, 16 4: 1.348):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu-16.04.5 (Silhouette Score: 1.000)
	ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 27 (features: 18 4: 3.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu-18.04 (Silhouette Score: 1.000)
	ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 2 (features: 11 4: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-11.04-alternate-i386.iso (Silhouette Score: 1.000)
	ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 1 (features: 14 10: 3.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 22 (features: 16 4: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-pack-16.04-unity (Silhouette Score: 1.000)

Cluster 17 (features: 4 6: 2.252, 16 4: 1.982):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 16 (features: 4 7: 1.624, 16 4: 1.167):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 3 (features: 18 10: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 6 (features: 19 4: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-19.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 7 (features: 19 10: 3.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 24 (features: 4 4: 2.365, 20 4: 1.846):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 9 (features: 21 10: 3.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-21.10-beta-pack (Silhouette Score: 1.000)
	ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 20 (features: 22 4: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 21 (features: 22 4: 1.439, 4 3: 1.389):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 29 (features: 14 4: 4.684, 4 5: 0.729):
Average Silhouette Score = 0.639 (the higher the better)
	ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.755)
	ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.755)
	ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.755)
	ubuntu-14.04-server-i386.iso (Silhouette Score: 0.755)
	ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 0.173)

Cluster 15 (features: 12 4: 3.922, 4 5: 2.039, 4 4: 0.693):
Average Silhouette Score = 0.405 (the higher the better)
	ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.574)
	ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.574)
	ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.574)
	ubuntu-12.04-server-i386.iso (Silhouette Score: 0.266)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.040)

Cluster 0 (features: 15 4: 3.000, 17 4: 1.000):
Average Silhouette Score = 0.323 (the higher the better)
	ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 0.529)
	ubuntu-15.04-desktop-i386.iso (Silhouette Score: 0.529)
	ubuntu-15.04-server-amd64.iso (Silhouette Score: 0.529)
	ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.293)

Cluster 30 (features: 14 4: 2.014, 4 6: 1.483, 4 4: 0.741):
Average Silhouette Score = 0.198 (the higher the better)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 0.388)
	ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 0.388)
	ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: -0.183)

Cluster 8 (features: 21 4: 2.000, 20 10: 1.000):
Average Silhouette Score = 0.098 (the higher the better)
	ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.293)
	ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.293)
	ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.293)

Cluster 4 (features: 22 10: 2.000, 23 4: 1.000):
Average Silhouette Score = 0.098 (the higher the better)
	ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 0.293)
	ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: 0.293)
	ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.293)

Cluster 19 (features: 18 4: 1.365, 4 4: 0.731, 4 6: 0.731):
Average Silhouette Score = -0.229 (the higher the better)
	ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: -0.229)
	ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.229)

Cluster 28 (features: 18 4: 1.391, 4 3: 0.719, 4 5: 0.719):
Average Silhouette Score = -0.232 (the higher the better)
	ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.232)
	ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.232)

Cluster 5 (features: 0 1: 1.000, 10 10: 1.000):
Average Silhouette Score = -0.293 (the higher the better)
	[Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.293)
	ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.293)

Cluster -1 (features: 4 1: 2.220, 22 4: 1.386, 1 2014: 1.000, 1 4: 1.000, 1 9: 1.000):
Average Silhouette Score = -0.332 (the higher the better)
	Ubuntu 10.04 Netbook (Silhouette Score: -0.293)
	Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.293)
	Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.293)
	Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.293)
	Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.293)
	ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.293)
	ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.293)
	ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: -0.293)
	ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.293)
	ubuntu-14.04.1-server-amd64.iso (Silhouette Score: -0.349)
	ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.393)
	ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: -0.403)
	ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.429)
	ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: -0.434)

The script:

# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by N-gram TF-IDF vectorization, applied 
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number, 
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The script also explores the integration of word position into the clustering process through N-gram vectorization, 
# aiming to capture more nuanced relationships between terms.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining 
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure 
# of the text data and the effectiveness of the modified clustering strategy.
import re
import string
from collections import defaultdict
from enum import Enum, auto
from pathlib import Path

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')


class Vectorizer(Enum):
    TFIDF = auto()
    TFIDF_NGRAMM = auto()


vectorize_type = Vectorizer.TFIDF_NGRAMM
# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))


# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation (including hyphens and dots) with spaces.
    # re.escape keeps characters such as ']' and '\' literal inside the character class.
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
    # Remove leading zeros, e.g. "04" -> "4"
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    words = [word for word in text.split() if word not in STOP_WORDS]
    # Lemmatize
    words = [LEMMATIZER.lemmatize(word) for word in words]
    return ' '.join(words)


first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]

# Initialize TfidfVectorizer with a custom token pattern that keeps only digit-only tokens, including single digits.
# The token_pattern r'(?u)\b\d+\b' matches runs of one or more digits delimited by word boundaries, so words such as
# "desktop" and mixed tokens such as "amd64" are ignored and only version-like numbers become features.
if vectorize_type == Vectorizer.TFIDF:
    vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
    X = vectorizer.fit_transform(preprocessed_results)
elif vectorize_type == Vectorizer.TFIDF_NGRAMM:
    vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b', ngram_range=(1, 2))
    X = vectorizer.fit_transform(preprocessed_results)

# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

print(f'Features: \n{feature_names}')

# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
    # Accessing the i-th TF-IDF vector in sparse format directly
    tfidf_vector = X[i]
    # Extracting indices of non-zero elements (words that are actually present in the document)
    non_zero_indices = tfidf_vector.nonzero()[1]
    # Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
    tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
    # Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
    sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
    # Formatting the sorted TF-IDF values into a string for easy display
    sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
    # Print sorted TF-IDF values
    print(f'\tOriginal:     {original}')
    print(f'\tPreprocessed: {preprocessed}')
    print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')

print("Clustering...")
# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2)
hdbscan.fit(X)

# Retrieve cluster labels
labels = hdbscan.labels_

# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)

# Calculate silhouette scores for each data point in X based on their cluster membership.
# Note: silhouette_samples treats the HDBSCAN noise label (-1) as one ordinary cluster.
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')

# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Group data points by their cluster label
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# Initialize a dictionary to store the sum of TF-IDF values for features by cluster
cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1]))

# Sum up TF-IDF values for each feature within each cluster
for i, label in enumerate(labels):
    cluster_feature_sums[label] += X[i].toarray()[0]

# Number of top features to select for each cluster
top_n_features = 5
feature_names = vectorizer.get_feature_names_out()

# Dictionary to store the top N features for each cluster
top_features_per_cluster = {}

for cluster, sums in cluster_feature_sums.items():
    # Indices of features with sums greater than 0, sorted by their sum in descending order
    positive_indices = [index for index, value in enumerate(sums) if value > 0]
    top_indices = sorted(positive_indices, key=lambda index: sums[index], reverse=True)[:top_n_features]
    # Extract the feature names and their sums for the top features with values greater than 0
    top_features = [(feature_names[index], sums[index]) for index in top_indices if sums[index] > 0]
    top_features_per_cluster[cluster] = top_features


# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}

# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)

# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
    features = top_features_per_cluster[cluster]
    average_score = average_scores[cluster]
    features_str = (f"{feature}: {value:.3f}" for feature, value in features)
    features_line = ', '.join(features_str)

    print(f"\nCluster {cluster} (features: {features_line}):")
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Prepare and sort titles within the cluster by their silhouette score
    titles_scores = []
    for title_index in clusters_indices[cluster]:
        title = results[title_index]
        silhouette_score = silhouette_vals[title_index]
        titles_scores.append((title, silhouette_score))

    sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)

    # Print each title with its silhouette score
    for title, silhouette_score in sorted_titles_scores:
        print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
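For readers unfamiliar with the metric, the following is a small pure-Python sketch of the silhouette coefficient, s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the mean distance to the nearest other cluster. The toy vectors and labels are made up for illustration, and the helper functions are not part of the script above, which relies on `sklearn.metrics.silhouette_samples`.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def mean_distance(point, others):
    """Mean Euclidean distance from point to every vector in others."""
    return sum(euclidean(point, o) for o in others) / len(others)


def silhouette(points, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for point i."""
    own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    if not own:
        return 0.0  # convention for single-member clusters
    a = mean_distance(points[i], own)
    b = min(
        mean_distance(points[i], [p for j, p in enumerate(points) if labels[j] == other])
        for other in set(labels) - {labels[i]}
    )
    return 0.0 if max(a, b) == 0 else (b - a) / max(a, b)


# Toy data (hypothetical): two orthogonal unit vectors labelled as noise (-1)
# and two all-zero vectors forming cluster 0.
points = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)]
labels = [-1, -1, 0, 0]

for i, point in enumerate(points):
    print(f"{point} label={labels[i]} silhouette={silhouette(points, labels, i):.3f}")
```

In this toy setup the two identical zero vectors score 1.000, and the two orthogonal unit vectors score -0.293 (that is, 1/√2 - 1), because they are farther from each other than from the zero-vector cluster. This is one configuration that reproduces the 1.000 and -0.293 values seen in the output above. Whether it is the actual cause there is an assumption.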


drew2a commented Feb 29, 2024

To achieve more specific clustering results, such as differentiating between clusters for "Ubuntu 20.04.X" instead of a more general "Ubuntu 20.04," the following HDBSCAN constructor parameters can be adjusted: min_samples and cluster_selection_epsilon.

  • min_samples: This parameter controls how conservatively the algorithm decides what counts as a dense cluster. Increasing min_samples restricts clusters to denser regions and marks more points as noise, which can lead to more specific, tightly-knit clusters. For example, a higher value could help distinguish between subversions of Ubuntu 20.04 by requiring a denser concentration of points to form a cluster.

  • cluster_selection_epsilon: This parameter is a distance threshold: clusters that are closer to each other than cluster_selection_epsilon are merged rather than kept separate. A lower value therefore leads to more, smaller clusters by preventing nearby clusters from merging. This is particularly useful for distinguishing closely related but distinct versions or configurations, such as different patch releases of Ubuntu 20.04.

Conversely, to configure HDBSCAN for creating more general groups, the same parameters can be adjusted in the opposite direction. Decreasing min_samples allows for more lenient cluster formation, potentially grouping various subversions of Ubuntu 20.04 into a single cluster. Similarly, increasing cluster_selection_epsilon encourages the merging of nearby clusters into larger, more general groups.

By fine-tuning these parameters, HDBSCAN can be tailored to identify clusters at the desired level of specificity, from highly detailed clusters differentiating between minor variations to broader groups encompassing more general categories.
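As a rough sketch, the two directions of tuning could be captured as parameter presets. The preset names and the `build_clusterer` helper are hypothetical; the values are those of the two runs reported in this comment, and min_cluster_size=2 is assumed to match the script above.

```python
# Hypothetical presets: favour fine-grained clusters vs. broader groups.
SPECIFIC = {"min_cluster_size": 2, "min_samples": 1, "cluster_selection_epsilon": 0.0}
GENERAL = {"min_cluster_size": 2, "min_samples": 3, "cluster_selection_epsilon": 0.5}


def build_clusterer(params):
    """Create an HDBSCAN model from a preset (needs scikit-learn 1.3 or newer)."""
    # Imported lazily so the presets can be inspected without scikit-learn installed
    from sklearn.cluster import HDBSCAN
    return HDBSCAN(**params)


# Usage, assuming X is the TF-IDF matrix produced by the script above:
# labels = build_clusterer(SPECIFIC).fit_predict(X)
```

Keeping the presets as plain dictionaries makes it easy to log which configuration produced a given set of clusters and to try intermediate values.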

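Below are two examples. Both parameters change between the runs, so the output shows their combined effect rather than the effect of each parameter in isolation. The first run yields more specific clusters (for example, 20.04.2 and 20.04.2.0 are separate clusters), while the second merges them into more general ones: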

  • min_samples=1 and cluster_selection_epsilon=0
Cluster 9 (features: ):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu (Silhouette Score: 1.000)
	Ubuntu Linux ebook pack (Silhouette Score: 1.000)
	Ubuntu Linux основы администрирования (Silhouette Score: 1.000)
	Ubuntu Netbook Remix (Silhouette Score: 1.000)
	Ubuntu reducido (Silhouette Score: 1.000)
	Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)
	ubuntu (Silhouette Score: 1.000)

Cluster 31 (features: 20 4: 2.019, 20: 1.977, 4: 1.007):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 1.000)
	ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 0 (features: 11 10: 2.066, 11: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
	ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 1.000)

Cluster 1 (features: 12 10: 1.462, 12: 1.151, 10: 0.733):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
	ubuntu-12.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 2 (features: 16 10: 2.937, 16: 2.152, 10: 1.656):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 16.10 (Silhouette Score: 1.000)
	ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-16.10-server-arm64.iso (Silhouette Score: 1.000)

Cluster 21 (features: 4 1: 1.155, 1: 1.005, 20 4: 0.866, 20: 0.849, 4: 0.432):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
	ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 23 (features: 2 0: 0.973, 0: 0.913, 2: 0.827, 4 2: 0.827, 20 4: 0.621):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 1.000)
	ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 28 (features: 3: 1.616, 4 3: 1.616, 20 4: 1.308, 20: 1.281, 4: 0.653):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
	ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 8 (features: 9 10: 1.370, 9: 1.285, 10: 0.687):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 9.10 (Silhouette Score: 1.000)
	Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 3 (features: 2015: 2.000):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 1.000)
	Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 1.000)

Cluster 22 (features: 2: 1.111, 4 2: 1.111, 20 4: 0.833, 20: 0.816, 4: 0.416):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu Server 20.04.2 LTS (Silhouette Score: 1.000)
	ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 25 (features: 4 5: 1.043, 5: 1.043, 16 4: 0.951, 16: 0.862, 4: 0.421):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu-16.04.5 (Silhouette Score: 1.000)
	ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 7 (features: 11 4: 1.481, 11: 1.259, 4: 0.471):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-11.04-alternate-i386.iso (Silhouette Score: 1.000)
	ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 11 (features: 12 4: 1.556, 12: 1.442, 4 5: 1.442, 5: 1.442, 4: 0.583):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 1.000)

Cluster 15 (features: 4 6: 1.044, 6: 1.044, 14 4: 0.945, 14: 0.873, 4: 0.407):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)
	ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 4 (features: 14 10: 2.226, 14: 1.621, 10: 1.191):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 13 (features: 15: 2.063, 15 4: 2.063, 4: 0.700):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-15.04-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-15.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 20 (features: 4 6: 1.589, 6: 1.589, 16 4: 1.399, 16: 1.268, 4: 0.619):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 10 (features: 4 7: 1.147, 7: 1.147, 16 4: 0.824, 16: 0.747, 4: 0.365):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 16 (features: 18 10: 1.504, 18: 1.081, 10: 0.754):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 19 (features: 19 10: 2.066, 19: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 32 (features: 4 4: 1.792, 20 4: 1.399, 4: 1.397, 20: 1.371):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 29 (features: 21 10: 2.066, 21: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-21.10-beta-pack (Silhouette Score: 1.000)
	ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 26 (features: 22 4: 1.406, 22: 1.312, 4: 0.548):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 27 (features: 22 4: 1.015, 3: 0.979, 4 3: 0.979, 22: 0.947, 4: 0.395):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 30 (features: 22 10: 1.477, 22: 1.126, 10: 0.741):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 17 (features: 18 4: 2.606, 18: 2.457, 4: 1.304, 4 4: 0.554):
Average Silhouette Score = 0.632 (the higher the better)
	Ubuntu-18.04 (Silhouette Score: 0.732)
	ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.732)
	ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.732)
	ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.332)

Cluster 24 (features: 16 4: 1.884, 16: 1.708, 4: 0.834, 3: 0.521, 4 3: 0.521):
Average Silhouette Score = 0.364 (the higher the better)
	ubuntu-16.04-desktop-i386.iso (Silhouette Score: 0.500)
	ubuntu-pack-16.04-unity (Silhouette Score: 0.500)
	ubuntu-16.04.3-server-amd64.iso (Silhouette Score: 0.091)

Cluster 14 (features: 14 4: 4.275, 14: 3.947, 4: 2.060, 4 4: 0.566, 4 1: 0.551):
Average Silhouette Score = 0.355 (the higher the better)
	ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.541)
	ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.541)
	ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.541)
	ubuntu-14.04-server-i386.iso (Silhouette Score: 0.541)
	ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.224)
	ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 0.067)
	ubuntu-14.04.1-server-amd64.iso (Silhouette Score: 0.034)

Cluster 12 (features: 12 4: 1.254, 12: 1.162, 4: 0.674, 4 4: 0.526):
Average Silhouette Score = 0.267 (the higher the better)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.338)
	ubuntu-12.04-server-i386.iso (Silhouette Score: 0.196)

Cluster 5 (features: 19 4: 1.481, 19: 1.259, 17 4: 0.720, 4: 0.682, 17: 0.662):
Average Silhouette Score = 0.116 (the higher the better)
	ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 0.311)
	ubuntu-19.04-server-amd64.iso (Silhouette Score: 0.311)
	ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.275)

Cluster 6 (features: 21 4: 1.481, 21: 1.259, 23 4: 0.720, 4: 0.682, 23: 0.662):
Average Silhouette Score = 0.116 (the higher the better)
	ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.311)
	ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.311)
	ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.275)

Cluster 18 (features: 18 4: 0.976, 18: 0.920, 3: 0.504, 4 3: 0.504, 4 5: 0.504):
Average Silhouette Score = -0.194 (the higher the better)
	ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.194)
	ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.194)

Cluster -1 (features: 1: 2.837, 10: 2.097, 4: 1.897, 2014: 1.267, 4 1: 1.066):
Average Silhouette Score = -0.305 (the higher the better)
	ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.254)
	Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.258)
	Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.268)
	[Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.268)
	Ubuntu 10.04 Netbook (Silhouette Score: -0.268)
	ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.271)
	Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.276)
	ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.279)
	ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.280)
	ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: -0.280)
	Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.286)
	ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.286)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: -0.293)
	Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.293)
	ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.377)
	ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.399)
	ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: -0.428)
	ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: -0.433)

  • min_samples=3 and cluster_selection_epsilon=0.5
Cluster 5 (features: 9 10: 1.370, 9: 1.285, 10: 0.687):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 9.10 (Silhouette Score: 1.000)
	Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 14 (features: 4 6: 1.044, 6: 1.044, 14 4: 0.945, 14: 0.873, 4: 0.407):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)
	ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 0 (features: 15: 2.063, 15 4: 2.063, 4: 0.700):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-15.04-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-15.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 19 (features: 4 6: 1.589, 6: 1.589, 16 4: 1.399, 16: 1.268, 4: 0.619):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 9 (features: 4 7: 1.147, 7: 1.147, 16 4: 0.824, 16: 0.747, 4: 0.365):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 10 (features: 19 10: 2.066, 19: 1.873, 10: 1.105):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 22 (features: 4 4: 1.792, 20 4: 1.399, 4: 1.397, 20: 1.371):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 15 (features: 18 4: 2.606, 18: 2.457, 4: 1.304, 4 4: 0.554):
Average Silhouette Score = 0.632 (the higher the better)
	Ubuntu-18.04 (Silhouette Score: 0.732)
	ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.732)
	ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.732)
	ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.332)

Cluster 17 (features: 2: 1.938, 4 2: 1.938, 20 4: 1.454, 20: 1.424, 2 0: 0.973):
Average Silhouette Score = 0.489 (the higher the better)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 0.524)
	ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 0.524)
	Ubuntu Server 20.04.2 LTS (Silhouette Score: 0.454)
	ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 0.454)

Cluster 8 (features: 12 4: 2.810, 12: 2.604, 4 5: 1.442, 5: 1.442, 4: 1.257):
Average Silhouette Score = 0.455 (the higher the better)
	ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.596)
	ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.596)
	ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.596)
	ubuntu-12.04-server-i386.iso (Silhouette Score: 0.306)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.184)

Cluster 13 (features: 14 4: 4.275, 14: 3.947, 4: 2.060, 4 4: 0.566, 4 1: 0.551):
Average Silhouette Score = 0.381 (the higher the better)
	ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.541)
	ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.541)
	ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.541)
	ubuntu-14.04-server-i386.iso (Silhouette Score: 0.541)
	ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.224)
	ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 0.143)
	ubuntu-14.04.1-server-amd64.iso (Silhouette Score: 0.139)

Cluster 11 (features: 21 10: 2.066, 21: 1.873, 10: 1.477, 20 10: 0.805, 20: 0.462):
Average Silhouette Score = 0.362 (the higher the better)
	ubuntu-21.10-beta-pack (Silhouette Score: 0.562)
	ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 0.562)
	ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 0.562)
	ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.239)

Cluster 12 (features: 22 4: 3.414, 22: 3.187, 4: 1.330, 3: 0.979, 4 3: 0.979):
Average Silhouette Score = 0.243 (the higher the better)
	ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.395)
	ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 0.395)
	ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 0.251)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 0.251)
	ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 0.107)
	ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: 0.059)

Cluster 18 (features: 16 4: 2.835, 16: 2.570, 4: 1.255, 4 5: 1.043, 5: 1.043):
Average Silhouette Score = 0.232 (the higher the better)
	Ubuntu-16.04.5 (Silhouette Score: 0.337)
	ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 0.337)
	ubuntu-16.04-desktop-i386.iso (Silhouette Score: 0.265)
	ubuntu-pack-16.04-unity (Silhouette Score: 0.265)
	ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.041)

Cluster 4 (features: 21 4: 1.481, 21: 1.259, 19 4: 0.741, 4: 0.707, 19: 0.629):
Average Silhouette Score = 0.118 (the higher the better)
	ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.313)
	ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.313)
	ubuntu-19.04-server-amd64.iso (Silhouette Score: -0.272)

Cluster 3 (features: 14 10: 2.226, 14: 1.621, 11 4: 1.481, 11: 1.259, 10: 1.191):
Average Silhouette Score = -0.106 (the higher the better)
	ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 0.057)
	ubuntu-14.10-desktop-i386.iso (Silhouette Score: 0.057)
	ubuntu-14.10-server-amd64.iso (Silhouette Score: 0.057)
	ubuntu-11.04-alternate-i386.iso (Silhouette Score: -0.142)
	ubuntu-11.04-desktop-amd64.iso (Silhouette Score: -0.142)
	ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.284)
	ubuntu-19.04-desktop-amd64.iso (Silhouette Score: -0.343)

Cluster 16 (features: 18 4: 0.976, 18: 0.920, 3: 0.504, 4 3: 0.504, 4 5: 0.504):
Average Silhouette Score = -0.194 (the higher the better)
	ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.194)
	ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.194)

Cluster -1 (features: 10: 2.910, 1: 2.022, 18: 1.990, 4: 1.748, 4 1: 1.696):
Average Silhouette Score = -0.271 (the higher the better)
	ubuntu-18.10-desktop-amd64.iso (Silhouette Score: -0.196)
	ubuntu-18.10-server-amd64.iso (Silhouette Score: -0.196)
	ubuntu-22.10-desktop-amd64.iso (Silhouette Score: -0.213)
	ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: -0.213)
	Ubuntu 10.04 Netbook (Silhouette Score: -0.255)
	ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: -0.257)
	ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.260)
	Ubuntu 12.10 Desktop (i386) (Silhouette Score: -0.268)
	ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.271)
	ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.272)
	Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.285)
	Ubuntu 20.04.1 Desktop.iso (Silhouette Score: -0.324)
	ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: -0.324)
	ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.352)
	ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.385)

Cluster 7 (features: 2015: 2.000, 1: 1.363, 10: 1.045, 2019: 1.000, 6685: 1.000):
Average Silhouette Score = -0.272 (the higher the better)
	Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: -0.214)
	Ubuntu Facile Marzo 2015.pdf (Silhouette Score: -0.214)
	Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.277)
	[Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.277)
	Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.277)
	ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.283)
	ubuntu-12.10-desktop-i386.iso (Silhouette Score: -0.283)
	Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.291)
	ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.291)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: -0.293)
	Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.293)

The parameter values passed to the HDBSCAN constructor are not definitive; they only illustrate that clustering quality can be improved through tuning. The best settings depend on the clustering goal: identifying broad topics may call for one set of parameters, while uncovering finer, denser groups may call for another.
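
Since the listings above rank clusters by silhouette score, a small illustration of how that score behaves may help when comparing parameter settings. The sketch below is a simplified, standard-library reimplementation for illustration only; the name `silhouette_samples_py` and the toy points are assumptions, and the scripts in this thread use `sklearn.metrics.silhouette_samples` instead. It shows two effects visible in the output: points whose vectors are identical to their cluster mates score 1.000, and a point that sits closer to another cluster gets a negative score.

```python
from math import dist


def silhouette_samples_py(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), using Euclidean distance."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster label
        by_label = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_label.setdefault(label, []).append(dist(p, q))
        own = by_label.pop(own_label, [])
        if not own or not by_label:
            # Singleton cluster, or no other cluster to compare with: score 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


# Cluster mates with identical vectors: a = 0, so every score is 1.0
identical = silhouette_samples_py([(0, 0), (0, 0), (3, 4), (3, 4)], [0, 0, 1, 1])
print(identical)  # [1.0, 1.0, 1.0, 1.0]

# The last point is labelled 1 but lies slightly closer to cluster 0
stray = silhouette_samples_py(
    [(0, 0), (1, 0), (10, 0), (11, 0), (5, 0)], [0, 0, 1, 1, 1]
)
print(round(stray[4], 3))  # -0.182
```

Note that the scripts pass HDBSCAN's noise label `-1` to the metric like any other label, so noise points are scored as if they formed one cluster.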

@drew2a
Contributor Author

drew2a commented Feb 29, 2024

The next step was a closer look at vectorization algorithms, to see whether anything more advanced than TF-IDF would suit our needs better. This led us to experiment with FastText, a word embedding technique that captures word semantics and relationships more effectively than TF-IDF. FastText uses a neural network model to generate word vectors that reflect both the context in which words appear and the morphology of the words themselves.
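
The "morphology" part comes from FastText representing each word through its character n-grams. As a rough sketch of that decomposition (the helper name `char_ngrams` is hypothetical; the `min_n=3` and `max_n=6` defaults mirror gensim's `FastText` defaults, and the real implementation additionally hashes the n-grams into buckets and combines their learned vectors):

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Character n-grams of `token`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{token}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, min(len(wrapped), max_n) + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("16"))  # ['<16', '16>', '<16>']

# Related tokens share many n-grams, which is what ties their vectors together
shared = set(char_ngrams("server")) & set(char_ngrams("servers"))
print(sorted(shared))
```

Because a vector can be assembled from n-grams, a model can return a vector even for a token it never saw during training, which a plain vocabulary lookup cannot do.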

The results obtained with FastText were practically identical to those achieved with TF-IDF, with one key distinction: FastText can analyze all tokens in a title, not just the numeric ones, as was the case in the previous version using TF-IDF.

Cluster 1:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu (Silhouette Score: 1.000)
	ubuntu (Silhouette Score: 1.000)

Cluster 6:
Average Silhouette Score = 0.668 (the higher the better)
	ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 0.669)
	ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 0.668)

Cluster 3:
Average Silhouette Score = 0.621 (the higher the better)
	Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 0.629)
	Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 0.613)

Cluster 15:
Average Silhouette Score = 0.571 (the higher the better)
	ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: 0.576)
	ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 0.565)

Cluster 7:
Average Silhouette Score = 0.565 (the higher the better)
	ubuntu-18.10-server-amd64.iso (Silhouette Score: 0.574)
	ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 0.556)

Cluster 18:
Average Silhouette Score = 0.524 (the higher the better)
	ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 0.530)
	ubuntu-19.04-server-amd64.iso (Silhouette Score: 0.517)

Cluster 23:
Average Silhouette Score = 0.520 (the higher the better)
	ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 0.521)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 0.518)

Cluster 21:
Average Silhouette Score = 0.495 (the higher the better)
	ubuntu-16.04.6-server-i386.iso (Silhouette Score: 0.552)
	ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 0.483)
	ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 0.450)

Cluster 9:
Average Silhouette Score = 0.485 (the higher the better)
	ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 0.522)
	ubuntu-15.04-desktop-i386.iso (Silhouette Score: 0.468)
	ubuntu-15.04-server-amd64.iso (Silhouette Score: 0.463)

Cluster 8:
Average Silhouette Score = 0.467 (the higher the better)
	ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 0.488)
	ubuntu-14.10-server-amd64.iso (Silhouette Score: 0.477)
	ubuntu-14.10-desktop-i386.iso (Silhouette Score: 0.435)

Cluster 14:
Average Silhouette Score = 0.395 (the higher the better)
	ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 0.429)
	ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: 0.387)
	ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 0.368)

Cluster 22:
Average Silhouette Score = 0.388 (the higher the better)
	ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 0.418)
	ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 0.358)

Cluster 10:
Average Silhouette Score = 0.376 (the higher the better)
	ubuntu-16.10-desktop-i386.iso (Silhouette Score: 0.406)
	ubuntu-16.10-server-arm64.iso (Silhouette Score: 0.378)
	ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 0.344)

Cluster 0:
Average Silhouette Score = 0.374 (the higher the better)
	Ubuntu 9.10 (Silhouette Score: 0.497)
	Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 0.251)

Cluster 11:
Average Silhouette Score = 0.344 (the higher the better)
	ubuntu-11.04-alternate-i386.iso (Silhouette Score: 0.379)
	ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 0.310)

Cluster 25:
Average Silhouette Score = 0.338 (the higher the better)
	ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 0.436)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: 0.325)
	ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 0.253)

Cluster 20:
Average Silhouette Score = 0.321 (the higher the better)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 0.365)
	ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 0.278)

Cluster 28:
Average Silhouette Score = 0.313 (the higher the better)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 0.327)
	ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 0.300)

Cluster 27:
Average Silhouette Score = 0.261 (the higher the better)
	ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.284)
	ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 0.238)

Cluster 4:
Average Silhouette Score = 0.243 (the higher the better)
	ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 0.248)
	ubuntu-11.10-desktop-i386.iso (Silhouette Score: 0.238)

Cluster 24:
Average Silhouette Score = 0.242 (the higher the better)
	ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 0.277)
	ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 0.207)

Cluster 17:
Average Silhouette Score = 0.239 (the higher the better)
	ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.414)
	ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.378)
	ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.361)
	ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 0.116)
	ubuntu-14.04.5-server-amd64.iso (Silhouette Score: -0.075)

Cluster 19:
Average Silhouette Score = 0.228 (the higher the better)
	ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.334)
	ubuntu-14.04-server-i386.iso (Silhouette Score: 0.302)
	ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.264)
	ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.014)

Cluster 16:
Average Silhouette Score = 0.211 (the higher the better)
	Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 0.296)
	ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 0.271)
	ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: 0.146)
	ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 0.129)

Cluster 5:
Average Silhouette Score = 0.186 (the higher the better)
	Ubuntu 12.10 Desktop (i386) (Silhouette Score: 0.297)
	ubuntu-12.10-desktop-i386.iso (Silhouette Score: 0.074)

Cluster 12:
Average Silhouette Score = 0.157 (the higher the better)
	ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 0.273)
	ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: 0.263)
	ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: 0.076)
	ubuntu-20.10-desktop-amd64.iso (Silhouette Score: 0.016)

Cluster 26:
Average Silhouette Score = 0.126 (the higher the better)
	ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.314)
	ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.209)
	ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.082)
	ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.100)

Cluster 13:
Average Silhouette Score = 0.116 (the higher the better)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 0.311)
	ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 0.242)
	ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 0.043)
	ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: -0.132)

Cluster 2:
Average Silhouette Score = 0.113 (the higher the better)
	Ubuntu Linux основы администрирования (Silhouette Score: 0.195)
	Ubuntu Linux ebook pack (Silhouette Score: 0.175)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: 0.110)
	Ubuntu reducido (Silhouette Score: 0.063)
	Ubuntu Unleashed 2019 Edition (Silhouette Score: 0.020)

Cluster -1:
Average Silhouette Score = -0.382 (the higher the better)
	[Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.175)
	Ubuntu-Book_RU.djvu (Silhouette Score: -0.209)
	Ubuntu 10.04 Netbook (Silhouette Score: -0.231)
	Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.233)
	Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.249)
	Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.264)
	Ubuntu Facile 01 2014.pdf (Silhouette Score: -0.301)
	ubuntu-pack-16.04-unity (Silhouette Score: -0.311)
	Ubuntu Netbook Remix (Silhouette Score: -0.317)
	Ubuntu-16.04.5 (Silhouette Score: -0.330)
	ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.337)
	ubuntu-10.10-xenon-beta5 (Silhouette Score: -0.376)
	Ubuntu Server 20.04.2 LTS (Silhouette Score: -0.382)
	ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.386)
	Ubuntu-18.04 (Silhouette Score: -0.388)
	ubuntu-12.04-server-i386.iso (Silhouette Score: -0.406)
	ubuntu-14.04-server-amd64.ova (Silhouette Score: -0.411)
	ubuntu-17.04-server-amd64.iso (Silhouette Score: -0.418)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: -0.422)
	Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: -0.425)
	ubuntu-14.04.1-server-amd64.iso (Silhouette Score: -0.433)
	ubuntu-16.04-desktop-i386.iso (Silhouette Score: -0.438)
	ubuntu-21.10-beta-pack (Silhouette Score: -0.448)
	ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.454)
	ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.458)
	Ubuntu 16.10 (Silhouette Score: -0.487)
	ubuntu-23.04-live-server-amd64.iso (Silhouette Score: -0.495)
	ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.496)
	Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: -0.513)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: -0.514)
	ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.521)

The script:

# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by FastText vectorization, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.

import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
from gensim.models import FastText
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.metrics import silhouette_samples

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


first_title = results[0]

# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]


def extract_digits(tokens):
    return [token for token in tokens if re.match(r'^\d+$', token)]


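# Train FastText on the digit-only token lists
digit_only_titles = [extract_digits(title.split()) for title in preprocessed_results]
model = FastText(sentences=digit_only_titles, vector_size=100, window=5, min_count=1, workers=4)

# Represent each title as the mean of the vectors of all its tokens, not only the digits.
# FastText composes vectors for tokens it was not trained on from their character n-grams.
X = np.array([np.mean([model.wv[word] for word in title.split() if word in model.wv], axis=0) for title in
              preprocessed_results])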

print("Clustering...")
# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0)
hdbscan.fit(X)

# Retrieve cluster labels
labels = hdbscan.labels_

# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)

# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')

# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Group data points by their cluster label
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}

# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)

# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
    average_score = average_scores[cluster]

    print(f"\nCluster {cluster}:")
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Prepare and sort titles within the cluster by their silhouette score
    titles_scores = []
    for title_index in clusters_indices[cluster]:
        title = results[title_index]
        silhouette_score = silhouette_vals[title_index]
        titles_scores.append((title, silhouette_score))

    sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)

    # Print each title with its silhouette score
    for title, silhouette_score in sorted_titles_scores:
        print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")

@drew2a
Contributor Author

drew2a commented Feb 29, 2024

This experiment was an attempt to use a transformer, specifically the BERT model, to vectorize titles in our clustering process. With BERT, the results are worse than those of the earlier approaches, which analyzed either the entire titles or only the extracted digits. BERT is also noticeably slower and more resource-intensive.

I acknowledge that I haven't delved deeply into BERT's intricacies, as it's a complex model, and gaining a thorough understanding would require a substantial investment of time. My goal was to create a simple, almost out-of-the-box example to get a sense of how it operates.

Cluster 30:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu (Silhouette Score: 1.000)
	ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 12:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 10.04 Netbook (Silhouette Score: 1.000)
	Ubuntu-18.04 (Silhouette Score: 1.000)
	Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)

Cluster 11:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
	ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 10:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
	ubuntu-16.04.3-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 19:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 16.10 (Silhouette Score: 1.000)
	ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 3:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
	ubuntu-19.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 22:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 1.000)
	ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 5:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
	Ubuntu 9.10 (Silhouette Score: 1.000)

Cluster 20:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu Linux ebook pack (Silhouette Score: 1.000)
	ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 26:
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu Server 20.04.2 LTS (Silhouette Score: 1.000)
	ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 35:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-14.04-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 7:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04-server-amd64.ova (Silhouette Score: 1.000)
	ubuntu-14.04-server-i386.iso (Silhouette Score: 1.000)
	ubuntu-14.04.1-server-amd64.iso (Silhouette Score: 1.000)

Cluster 33:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-14.04.5-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)

Cluster 6:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-23.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 34:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-15.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-15.04-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 9:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-15.04-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 13:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 23:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 21:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 31:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 28:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 17:
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 32:
Average Silhouette Score = 0.645 (the higher the better)
	ubuntu-12.04-server-i386.iso (Silhouette Score: 0.773)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.773)
	ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 0.773)
	ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 0.773)
	ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 0.131)

Cluster 8:
Average Silhouette Score = 0.526 (the higher the better)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: 0.681)
	ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 0.681)
	ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 0.681)
	ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 0.058)

Cluster 25:
Average Silhouette Score = 0.519 (the higher the better)
	ubuntu-10.10-xenon-beta5 (Silhouette Score: 0.689)
	ubuntu-11.04-alternate-i386.iso (Silhouette Score: 0.689)
	ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 0.689)
	ubuntu-13.04-desktop-i386.iso (Silhouette Score: 0.008)

Cluster 14:
Average Silhouette Score = 0.482 (the higher the better)
	Ubuntu reducido (Silhouette Score: 0.597)
	Ubuntu-16.04.5 (Silhouette Score: 0.597)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 0.251)

Cluster 2:
Average Silhouette Score = 0.435 (the higher the better)
	Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 0.517)
	Ubuntu Facile 04 2014.pdf (Silhouette Score: 0.517)
	Ubuntu Satanic Edition 666.4 (Silhouette Score: 0.270)

Cluster 24:
Average Silhouette Score = 0.428 (the higher the better)
	ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 0.545)
	ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 0.545)
	ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 0.194)

Cluster 29:
Average Silhouette Score = 0.400 (the higher the better)
	ubuntu-21.10-beta-pack (Silhouette Score: 0.530)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: 0.530)
	ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.139)

Cluster 18:
Average Silhouette Score = 0.363 (the higher the better)
	ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.516)
	ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.516)
	ubuntu-16.04.6-server-i386.iso (Silhouette Score: 0.057)

Cluster 1:
Average Silhouette Score = 0.304 (the higher the better)
	Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 0.307)
	Ubuntu Facile 01 2014.pdf (Silhouette Score: 0.300)

Cluster 4:
Average Silhouette Score = 0.295 (the higher the better)
	Ubuntu Netbook Remix (Silhouette Score: 0.338)
	ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: 0.253)

Cluster 0:
Average Silhouette Score = 0.264 (the higher the better)
	Ubuntu Linux основы администрирования (Silhouette Score: 0.312)
	Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 0.216)

Cluster 15:
Average Silhouette Score = 0.171 (the higher the better)
	ubuntu-12.10-desktop-i386.iso (Silhouette Score: 0.191)
	ubuntu (Silhouette Score: 0.152)

Cluster 27:
Average Silhouette Score = 0.135 (the higher the better)
	ubuntu-16.10-server-arm64.iso (Silhouette Score: 0.209)
	ubuntu-16.10-desktop-i386.iso (Silhouette Score: 0.156)
	ubuntu-17.04-server-amd64.iso (Silhouette Score: 0.041)

Cluster 16:
Average Silhouette Score = 0.048 (the higher the better)
	ubuntu-20.10-desktop-amd64.iso (Silhouette Score: 0.130)
	ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 0.053)
	ubuntu-21.04-desktop-amd64.iso (Silhouette Score: -0.040)

Cluster -1:
Average Silhouette Score = -0.466 (the higher the better)
	Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.285)
	ubuntu-16.10-desktop-amd64.iso (Silhouette Score: -0.450)
	[Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.480)
	ubuntu-14.10-desktop-amd64.iso (Silhouette Score: -0.486)
	Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.496)
	ubuntu-20.04-live-server-amd64.iso (Silhouette Score: -0.520)
	ubuntu-17.10-desktop-amd64.iso (Silhouette Score: -0.547)

The script:

# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by BERT embeddings, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.

import re
import string
from collections import defaultdict
from pathlib import Path

import nltk
import numpy as np
import torch
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.metrics import silhouette_samples
from transformers import BertModel, BertTokenizer

# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# Load titles from a text file
results = list(sorted(
    r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))


def preprocess_text(text):
    # Remove text inside parentheses and brackets
    text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and hyphens with spaces
    text = re.sub(r'[' + string.punctuation + ']', ' ', text)
    # Remove leading zeros
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word and word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)


first_title = results[0]

# Preprocess titles
print("Preprocessing titles...")
preprocessed_results = [preprocess_text(title) for title in results]

print("Loading BERT model and tokenizer...")
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


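def get_bert_embedding(text):
    # Tokenize a single string; input longer than 512 tokens is truncated
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state over all tokens to get one fixed-size vector per text
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings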


def extract_digits(text):
    return ' '.join(re.findall(r'\b\d+\b', text))


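print("Extracting digits from titles...")
digit_only_titles = [extract_digits(title) for title in preprocessed_results]
# Titles without digits yield an empty string and are skipped. Filter `results` the same way,
# so that row i of X still corresponds to results[i] when the clusters are printed.
kept = [(title, digits) for title, digits in zip(results, digit_only_titles) if digits.strip() != '']
results = [title for title, _ in kept]
print("Transforming titles to BERT embeddings...")
X = np.array([get_bert_embedding(digits) for _, digits in kept])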

print("Clustering...")
# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0)
hdbscan.fit(X)

# Retrieve cluster labels
labels = hdbscan.labels_

# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)

# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')

# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)

# Group data points by their cluster label
for i, label in enumerate(labels):
    clusters[label].append(results[i])
    clusters_indices[label].append(i)
    cluster_silhouette_scores[label].append(silhouette_vals[i])

# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}

# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)

# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
    average_score = average_scores[cluster]

    print(f"\nCluster {cluster}:")
    print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")

    # Prepare and sort titles within the cluster by their silhouette score
    titles_scores = []
    for title_index in clusters_indices[cluster]:
        title = results[title_index]
        silhouette_score = silhouette_vals[title_index]
        titles_scores.append((title, silhouette_score))

    sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)

    # Print each title with its silhouette score
    for title, silhouette_score in sorted_titles_scores:
        print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
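
The regex-based normalization and digit extraction used in the script above can be checked in isolation. The following is a minimal sketch that uses only the standard library: stopword removal and lemmatization are left out on purpose, and `normalize_title` is a hypothetical helper name, not part of the original script.

```python
import re
import string


def normalize_title(title: str) -> str:
    """Regex-only part of the preprocessing (no stopwords, no lemmatization)."""
    # Drop text inside brackets and parentheses
    text = re.sub(r'\[.*?\]|\(.*?\)', '', title)
    text = text.lower()
    # Replace every punctuation character (including hyphens and dots) with a space
    text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
    # Strip leading zeros from standalone numbers: "04" -> "4"
    text = re.sub(r'\b0+(\d+)\b', r'\1', text)
    return ' '.join(text.split())


def extract_digits(text: str) -> str:
    """Keep only standalone numeric tokens, in their original order."""
    return ' '.join(re.findall(r'\b\d+\b', text))


normalized = normalize_title('ubuntu-mate-20.04.4-desktop-amd64.iso')
print(normalized)                  # ubuntu mate 20 4 4 desktop amd64 iso
print(extract_digits(normalized))  # 20 4 4
```

Note that `amd64` contributes nothing to the digit string, because `64` is not a standalone token.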

Ref:

@drew2a
Contributor Author

drew2a commented Feb 29, 2024

The final improvement in this iteration was an attempt to modify the standard TF-IDF algorithm to account for the position of tokens. It led to better results than all the other experiments conducted (comparable to N-grams with TF-IDF). By incorporating token position, we can differentiate between identical tokens based on where they occur in the text, so the vectorizer captures part of the title's structure and not only which numbers it contains. Our implementation is somewhat naive and likely not the most efficient in terms of performance, but it serves as a quick and straightforward prototype.

The trick is:

from sklearn.feature_extraction.text import TfidfVectorizer


class PositionalDigitTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Custom analyzer that emits every numeric token twice:
        # once tagged with its position in the document and once as a plain token
        def positional_digit_analyzer(doc):
            # Split the (already preprocessed) document into words
            words = doc.split()
            positional_digits = []
            # Positions are 1-based and count every word, not only the numeric ones
            for index, word in enumerate(words, start=1):
                if word.isdigit():  # The word consists only of digits
                    # Positional token in the format "<digits>_<position>"
                    positional_digits.append(f"{word}_{index}")
                    # Plain token, so equal numbers at different positions still share a feature
                    positional_digits.append(word)
            return positional_digits

        return positional_digit_analyzer

This modification leads to tokens that look like:

	Original:     ubuntu-mate-20.04.4-desktop-amd64.iso
	Preprocessed: ubuntu mate 20 4 4 desktop amd64 iso
	TF-IDF:
			4_5: 0.562
			20_3: 0.527
			4_4: 0.393
			4: 0.358
			20: 0.351

	Original:     ubuntu-mate-21.10-desktop-amd64.iso
	Preprocessed: ubuntu mate 21 10 desktop amd64 iso
	TF-IDF:
			21_3: 0.624
			10_4: 0.538
			21: 0.488
			10: 0.288
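
To see which tokens the analyzer produces before any TF-IDF weighting is applied, its logic can be run on its own. This is a minimal standalone sketch; `positional_digit_tokens` is a hypothetical helper that mirrors the analyzer above without depending on scikit-learn.

```python
from collections import Counter


def positional_digit_tokens(doc: str) -> list[str]:
    """Emit "<digits>_<position>" plus the plain digits for every numeric word."""
    tokens = []
    for index, word in enumerate(doc.split(), start=1):  # 1-based word positions
        if word.isdigit():
            tokens.append(f"{word}_{index}")
            tokens.append(word)
    return tokens


mate = positional_digit_tokens("ubuntu mate 20 4 4 desktop amd64 iso")
plain = positional_digit_tokens("ubuntu 20 4 4 desktop amd64 iso")

print(mate)   # ['20_3', '20', '4_4', '4', '4_5', '4']
print(plain)  # ['20_2', '20', '4_3', '4', '4_4', '4']

# The plain token "4" occurs twice, while each positional token occurs once
print(Counter(mate)["4"])  # 2

# Features shared by the two titles
print(sorted(set(mate) & set(plain)))  # ['20', '4', '4_4']
```

Both titles describe version 20.04.4, yet the extra word `mate` shifts every position by one, so they share only the plain tokens and, coincidentally, `4_4`. This may be one reason why the `ubuntu-mate-*` titles end up apart from the corresponding `ubuntu-*` releases in the results.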

For further refinement of the algorithm, the paper "Optimized TF-IDF Algorithm with the Adaptive Weight of Position of Word" (https://www.atlantis-press.com/proceedings/aiie-16/25866330) could be used as a reference. It points to more advanced strategies for incorporating positional information into text vectorization, which could make our approach more effective.
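
For reading the per-title silhouette scores in the results that follow, it may help to recall how a silhouette value is computed: for each sample, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the score is `(b - a) / max(a, b)`. The sketch below is a plain-Python illustration of that definition with Euclidean distance; `silhouette_samples_simple` is a hypothetical name, it assumes at least two distinct labels, and the scripts in this thread use `sklearn.metrics.silhouette_samples` instead.

```python
import math


def silhouette_samples_simple(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b) with Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # A cluster with a single member gets a score of 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return scores


# Identical vectors inside each cluster: a == 0, so every score is 1.0
print(silhouette_samples_simple([(0, 0), (0, 0), (3, 4), (3, 4)], [0, 0, 1, 1]))
# [1.0, 1.0, 1.0, 1.0]

# Spread-out cluster 0 and a singleton cluster 1
print(silhouette_samples_simple([(0, 0), (0, 2), (0, 10)], [0, 0, 1]))
# [0.8, 0.75, 0.0]
```

A score of 1.000 is therefore what one would expect when all titles in a cluster map to exactly the same vector, and a negative score means the title is on average closer to another cluster than to its own. The scripts pass the HDBSCAN noise label `-1` to the metric like any other label, which is why "Cluster -1" also receives scores.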

Cluster 4 (features: ):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu (Silhouette Score: 1.000)
	Ubuntu Linux ebook pack (Silhouette Score: 1.000)
	Ubuntu Linux основы администрирования (Silhouette Score: 1.000)
	Ubuntu Netbook Remix (Silhouette Score: 1.000)
	Ubuntu reducido (Silhouette Score: 1.000)
	Ubuntu-Book_RU.djvu (Silhouette Score: 1.000)
	ubuntu (Silhouette Score: 1.000)

Cluster 31 (features: 20_2: 1.950, 20: 1.827, 4_3: 0.997, 4: 0.931):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu - 20.04 - X64 - UNTOUCHED - David1893 (Silhouette Score: 1.000)
	ubuntu-20.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 14 (features: 11: 1.812, 11_2: 1.812, 10_3: 1.135, 10: 1.069):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 11.10 Oneiric Ocelot (Silhouette Score: 1.000)
	ubuntu-11.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-11.10-dvd-amd64.iso (Silhouette Score: 1.000)

Cluster 18 (features: 12: 1.182, 12_2: 1.182, 10_3: 0.799, 10: 0.753):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 12.10 Desktop (i386) (Silhouette Score: 1.000)
	ubuntu-12.10-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 26 (features: 1_4: 1.058, 1: 0.993, 20_2: 0.895, 20: 0.839, 4_3: 0.458):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.1 Desktop.iso (Silhouette Score: 1.000)
	ubuntu-20.04.1-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 22 (features: 3_4: 1.098, 3: 1.018, 20_2: 0.861, 20: 0.807, 4_3: 0.441):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 20.04.3 (AMD64) (Server) (Silhouette Score: 1.000)
	ubuntu-20.04.3-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 8 (features: 9_2: 1.287, 9: 1.207, 10_3: 0.685, 10: 0.646):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu 9.10 (Silhouette Score: 1.000)
	Ubuntu 9.10 Пользовательская сборка (Silhouette Score: 1.000)

Cluster 5 (features: 2015: 1.414, 2015_4: 1.414):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu Facile - Aprile 2015.pdf (Silhouette Score: 1.000)
	Ubuntu Facile Marzo 2015.pdf (Silhouette Score: 1.000)

Cluster 30 (features: 5: 1.033, 5_4: 1.033, 16_2: 0.874, 16: 0.854, 4_3: 0.447):
Average Silhouette Score = 1.000 (the higher the better)
	Ubuntu-16.04.5 (Silhouette Score: 1.000)
	ubuntu-16.04.5-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 13 (features: 11: 1.318, 11_2: 1.318, 4_3: 0.529, 4: 0.494):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-11.04-alternate-i386.iso (Silhouette Score: 1.000)
	ubuntu-11.04-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 12 (features: 12: 1.438, 12_2: 1.438, 5: 1.438, 5_4: 1.438, 4_3: 0.623):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-12.04.5-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-12.04.5-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-12.04.5-dvd-i386.iso (Silhouette Score: 1.000)

Cluster 21 (features: 6: 1.036, 6_4: 1.036, 14: 0.866, 14_2: 0.866, 4_3: 0.433):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.04.6-desktop-amd64+mac.iso (Silhouette Score: 1.000)
	ubuntu-14.04.6-desktop-i386.iso (Silhouette Score: 1.000)

Cluster 28 (features: 14: 1.691, 14_2: 1.691, 10_3: 1.319, 10: 1.242):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-14.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-14.10-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-14.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 25 (features: 6: 1.575, 6_4: 1.575, 16_2: 1.285, 16: 1.257, 4_3: 0.658):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.6-desktop-i386.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.6-server-i386.iso (Silhouette Score: 1.000)

Cluster 11 (features: 7: 1.138, 7_4: 1.138, 16_2: 0.759, 16: 0.742, 4_3: 0.388):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-16.04.7-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-16.04.7-server-amd64.iso (Silhouette Score: 1.000)

Cluster 24 (features: 18: 1.148, 18_2: 1.148, 10_3: 0.850, 10: 0.801):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-18.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-18.10-server-amd64.iso (Silhouette Score: 1.000)

Cluster 2 (features: 19_2: 1.352, 19: 1.292, 4_3: 0.518, 4: 0.484):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-19.04-server-amd64.iso (Silhouette Score: 1.000)

Cluster 3 (features: 19_2: 1.243, 19: 1.188, 10_3: 0.744, 10: 0.701):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-19.10-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-19.10-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 32 (features: 4_4: 1.030, 20_2: 0.981, 4: 0.937, 20: 0.919, 4_3: 0.502):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-20.04.4-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-20.04.4-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 6 (features: 21_2: 1.352, 21: 1.292, 4_3: 0.518, 4: 0.484):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-21.04-desktop-amd64.iso (Silhouette Score: 1.000)
	ubuntu-21.04-live-server-amd64.iso (Silhouette Score: 1.000)

Cluster 7 (features: 21_2: 1.243, 21: 1.188, 10_3: 0.744, 10: 0.701):
Average Silhouette Score = 1.000 (the higher the better)
	ubuntu-21.10-beta-pack (Silhouette Score: 1.000)
	ubuntu-21.10-desktop-amd64.iso (Silhouette Score: 1.000)

Cluster 27 (features: 14: 3.036, 14_2: 3.036, 4: 1.644, 4_3: 1.517, 4_4: 0.502):
Average Silhouette Score = 0.725 (the higher the better)
	ubuntu-14.04-desktop-amd64.iso (Silhouette Score: 0.810)
	ubuntu-14.04-desktop-i386.iso (Silhouette Score: 0.810)
	ubuntu-14.04-server-amd64.ova (Silhouette Score: 0.810)
	ubuntu-14.04-server-i386.iso (Silhouette Score: 0.810)
	ubuntu-14.04.4-desktop-amd64.iso (Silhouette Score: 0.385)

Cluster 23 (features: 18: 2.431, 18_2: 2.431, 4: 1.299, 4_3: 1.153, 4_4: 0.490):
Average Silhouette Score = 0.657 (the higher the better)
	Ubuntu-18.04 (Silhouette Score: 0.744)
	ubuntu-18.04-desktop-amd64.iso (Silhouette Score: 0.744)
	ubuntu-18.04-live-server-amd64.iso (Silhouette Score: 0.744)
	ubuntu-18.04.4-desktop-amd64.iso (Silhouette Score: 0.396)

Cluster 29 (features: 16_2: 2.890, 16: 2.825, 10_3: 1.798, 10: 1.693, 4_3: 0.327):
Average Silhouette Score = 0.652 (the higher the better)
	Ubuntu 16.10 (Silhouette Score: 0.807)
	ubuntu-16.10-desktop-amd64.iso (Silhouette Score: 0.807)
	ubuntu-16.10-desktop-i386.iso (Silhouette Score: 0.807)
	ubuntu-16.10-server-arm64.iso (Silhouette Score: 0.807)
	ubuntu-16.04-desktop-i386.iso (Silhouette Score: 0.033)

Cluster 20 (features: 2_4: 1.399, 2: 1.337, 20_2: 1.048, 20: 0.982, 0_5: 0.949):
Average Silhouette Score = 0.496 (the higher the better)
	Ubuntu 20.04.2.0 Desktop (64-bit) (Silhouette Score: 0.653)
	ubuntu-20.04.2.0-desktop-amd64.iso (Silhouette Score: 0.653)
	ubuntu-20.04.2-desktop-amd64.iso (Silhouette Score: 0.183)

Cluster 15 (features: 17: 1.300, 17_2: 1.300, 10_3: 0.334, 10: 0.315, 4_3: 0.229):
Average Silhouette Score = 0.441 (the higher the better)
	ubuntu-17.04-server-amd64.iso (Silhouette Score: 0.441)
	ubuntu-17.10-desktop-amd64.iso (Silhouette Score: 0.441)

Cluster 16 (features: 23: 1.300, 23_2: 1.300, 10_3: 0.334, 10: 0.315, 4_3: 0.229):
Average Silhouette Score = 0.441 (the higher the better)
	ubuntu-23.04-live-server-amd64.iso (Silhouette Score: 0.441)
	ubuntu-23.10-beta-desktop-amd64.iso (Silhouette Score: 0.441)

Cluster 0 (features: 10_2: 1.486, 10: 1.077, 10_3: 0.352, 4_3: 0.281, 4: 0.263):
Average Silhouette Score = 0.399 (the higher the better)
	Ubuntu 10.04 Netbook (Silhouette Score: 0.399)
	ubuntu-10.10-xenon-beta5 (Silhouette Score: 0.399)

Cluster 17 (features: 12: 1.177, 12_2: 1.177, 4: 0.688, 4_3: 0.510, 4_4: 0.466):
Average Silhouette Score = 0.302 (the higher the better)
	ubuntu-12.04.4-desktop-amd64+mac.iso (Silhouette Score: 0.384)
	ubuntu-12.04-server-i386.iso (Silhouette Score: 0.220)

Cluster 19 (features: 22_2: 3.424, 22: 3.196, 4_3: 1.175, 4: 1.096, 2_4: 0.515):
Average Silhouette Score = 0.195 (the higher the better)
	ubuntu-22.04-desktop-amd64.iso (Silhouette Score: 0.419)
	ubuntu-22.04-live-server-amd64.iso (Silhouette Score: 0.419)
	ubuntu-22.10-desktop-amd64.iso (Silhouette Score: 0.166)
	ubuntu-22.04.2-desktop-amd64.iso (Silhouette Score: 0.079)
	ubuntu-22.04.1-desktop-amd64.iso (Silhouette Score: 0.073)
	ubuntu-22.04.3-live-server-amd64.iso (Silhouette Score: 0.017)

Cluster 10 (features: 1_3: 1.215, 2014: 1.203, 2014_4: 1.203, 1: 0.898, 4_4: 0.479):
Average Silhouette Score = -0.016 (the higher the better)
	Ubuntu Facile 01 2014.pdf (Silhouette Score: 0.153)
	Ubuntu Facile 04 2014.pdf (Silhouette Score: -0.065)
	ubuntu-ultimate-1.4-dvd (Silhouette Score: -0.137)

Cluster 9 (features: 20_3: 1.513, 4_4: 1.489, 3_5: 1.049, 20: 1.008, 4: 0.856):
Average Silhouette Score = -0.046 (the higher the better)
	ubuntu-mate-20.04.3-desktop-amd64.iso (Silhouette Score: 0.075)
	ubuntu-mate-20.04.4-desktop-amd64.iso (Silhouette Score: -0.063)
	Ubuntu Server 20.04.2 LTS (Silhouette Score: -0.067)
	ubuntu-budgie-22.04.3-desktop-amd64.iso (Silhouette Score: -0.130)

Cluster 1 (features: 15: 2.001, 15_2: 2.001, 4_3: 0.940, 4: 0.877, 1_4: 0.877):
Average Silhouette Score = -0.176 (the higher the better)
	ubuntu-15.04-desktop-amd64.iso (Silhouette Score: -0.000)
	ubuntu-15.04-desktop-i386.iso (Silhouette Score: -0.000)
	ubuntu-15.04-server-amd64.iso (Silhouette Score: -0.000)
	Ubuntu Ultimate Edition 1.9 (Silhouette Score: -0.272)
	[Ubuntu] Anonymous OS 0.1 (Silhouette Score: -0.272)
	ubuntu-13.04-desktop-i386.iso (Silhouette Score: -0.278)
	Ubuntu Server Essentials - 6685 [ECLiPSE] (Silhouette Score: -0.293)
	Ubuntu Unleashed 2019 Edition (Silhouette Score: -0.293)

Cluster -1 (features: 18: 1.811, 18_2: 1.811, 4: 1.806, 10_4: 1.646, 4_3: 1.523):
Average Silhouette Score = -0.338 (the higher the better)
	ubuntu-unity-22.10-desktop-amd64.iso (Silhouette Score: -0.261)
	ubuntu-mate-19.10-desktop-amd64.iso (Silhouette Score: -0.263)
	ubuntu-mate-21.10-desktop-amd64.iso (Silhouette Score: -0.263)
	ubuntu-pack-16.04-unity (Silhouette Score: -0.277)
	Ubuntu Satanic Edition 666.4 (Silhouette Score: -0.284)
	ubuntu-18.04.3-live-server-amd64.iso (Silhouette Score: -0.350)
	ubuntu-18.04.5-live-server-amd64.iso (Silhouette Score: -0.364)
	ubuntu-16.04.3-server-amd64.iso (Silhouette Score: -0.366)
	ubuntu-18.04.6-desktop-amd64.iso (Silhouette Score: -0.372)
	ubuntu-18.04.1-desktop-amd64.iso (Silhouette Score: -0.375)
	ubuntu-14.04.5-server-amd64.iso (Silhouette Score: -0.383)
	ubuntu-14.04.1-server-amd64.iso (Silhouette Score: -0.393)
	ubuntu-20.10-desktop-amd64.iso (Silhouette Score: -0.439)

Disclaimer: The code for the scripts above was generated by ChatGPT.
