KnowledgeCommunity: Content Bundle #7837
We deployed Network Buzz within Tribler in 2010. This is what Reddit had to say about it back then:

Please do a 10-day prototype @drew2a. You now did top-down design exploration; time for bottom-up "learn-by-doing". We do not know whether "content bundling" can be done exclusively and perfectly with local heuristics and zero database changes, or whether we need to store and gather rich metadata and offer content enrichment plus database changes. What about the near-duplicates we studied for years (never could fix)? Let's leave the anti-spam for future sprints ❗ No need to distract @grimadas from deployment and debugging of the low-level rendezvous component. Only in 2025 we will re-visit the
The first attempt to group search results locally doesn't offer much hope: it tends to group fairly random torrents together rather than organizing related ones into the same content group. The developed script:
from collections import defaultdict
from pathlib import Path
import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
# Load titles from a text file
results = list(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
)
print('Results:')
for r in results[:50]:
print(f'\t{r}')
first_title = results[0]
# Preprocess each title
print('\nPreprocessed results:')
preprocessed_results = [preprocess_text(title) for title in results]
for r in preprocessed_results:
print(f'\t{r}')
# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_results)
print("Clustering...")
# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)
# Output clustering results
labels = kmeans.labels_
clusters = defaultdict(list)
# Group titles by their clusters
for i, label in enumerate(labels):
clusters[label].append(results[i])
# Print clustering results by cluster
print("Clustering results by cluster:")
for cluster, titles in clusters.items():
print(f"\nCluster {cluster}:")
for title in titles:
print(f"- {title}")

Results for ubuntu:
Expanding on the same idea: what if, instead of searching for all similar entries in the search results, we look only for entries similar to the first (most relevant) result? The results of this experiment, however, are still far from ideal and even from a minimally viable product. The developed script:

from pathlib import Path
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
# Load titles from a text file
results = list(
r for r in set(Path('/ubuntu.txt').read_text().split('\n')) if r
)
print('Results:')
for r in results[:50]:
print(f'\t{r}')
first_title = results[0]
# Preprocess each title
preprocessed_results = [preprocess_text(title) for title in results]
for r in preprocessed_results:
print(f'\t{r}')
# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocessed_results)
# Calculate cosine similarity
similarity_matrix = cosine_similarity(X[0:1], X)
# Define a similarity threshold
similarity_threshold = 0.4 # Adjust this threshold as needed
# Filter indices by similarity threshold and keep similarity values
filtered_indices_and_similarity = [(i, similarity_matrix[0, i]) for i in range(similarity_matrix.shape[1]) if
similarity_matrix[0, i] >= similarity_threshold]
# Sort filtered indices by similarity with the first title, keeping similarity values
sorted_filtered_indices_and_similarity = sorted(filtered_indices_and_similarity, key=lambda x: -x[1])
# Gather similar titles with their similarity values
similar_titles_with_similarity = [(results[i], similarity) for i, similarity in sorted_filtered_indices_and_similarity]
# Print similar titles with similarity values
print(f"\nTitles similar to the first title ({first_title}):")
for title, similarity in similar_titles_with_similarity:
print(f"- {title} (Similarity: {similarity:.3f})")

Output:
We learned something! I genuinely don't find it a bad start. Can you create a tiny example with equal Ubuntu-server.iso filenames and try to get -numeric- clusters? With 18.04 and 18.10 together plus 22.04 and 22.10. Seems number signal is thrown away?
I've modified the original script to address the question "Seems number signal is thrown away?" and to print all TF-IDF values for each term. Indeed, it turned out that certain digits were being ignored, specifically those consisting of a single character. This happens because the vectorizer, by default, disregards all terms composed of only one character. In the example below, the number 5 is ignored:
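As an illustration only (not the original output), a minimal sketch of the default behavior versus a custom token pattern, using made-up titles:

# Minimal sketch: the default TfidfVectorizer token pattern requires tokens of
# at least two characters, so a standalone "5" is silently dropped.
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["ubuntu 22 04 5 server", "ubuntu 22 04 server"]  # made-up example titles

default_vec = TfidfVectorizer()  # default token_pattern=r"(?u)\b\w\w+\b"
print(sorted(default_vec.fit(titles).get_feature_names_out()))
# ['04', '22', 'server', 'ubuntu']  -- '5' is missing

custom_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1-char tokens
print(sorted(custom_vec.fit(titles).get_feature_names_out()))
# ['04', '22', '5', 'server', 'ubuntu']  -- '5' is kept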
I've fixed it and also added more output. To discern the terms around which titles were grouped into clusters, I analyzed the centroids of the clusters determined by the K-means algorithm. The centroids represent the "center" or "mean" vector of each cluster in the feature space, essentially capturing the average importance of each term within the cluster. By examining these centroids, we can identify which terms have the highest TF-IDF values across the documents in a cluster, giving us insight into the thematic essence of each cluster. This should give us a better understanding of how items end up grouped into clusters.
The script:

from collections import defaultdict
from pathlib import Path
import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
# Load titles from a text file
results = list(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
)
first_title = results[0]
# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]
# Initialize TfidfVectorizer with a custom token pattern to include single-character tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\w+\b' matches any word of one or more alphanumeric characters, allowing the inclusion of single-letter words and digits in the analysis.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(preprocessed_results)
# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')
# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
# Accessing the i-th TF-IDF vector in sparse format directly
tfidf_vector = X[i]
# Extracting indices of non-zero elements (words that are actually present in the document)
non_zero_indices = tfidf_vector.nonzero()[1]
# Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
# Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
# Formatting the sorted TF-IDF values into a string for easy display
sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
# Print sorted TF-IDF values
print(f'\tOriginal: {original}')
print(f'\tPreprocessed: {preprocessed}')
print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')
print("Clustering...")
# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)
# Getting cluster centroids
centroids = kmeans.cluster_centers_
# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
sorted_feature_indices = centroid.argsort()[::-1]
top_n = 5 # Number of key words
top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
cluster_top_features_with_weights[i] = top_features_with_weights
# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
# Grouping titles by their clusters
for i, label in enumerate(labels):
clusters[label].append(results[i])
# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)
# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
top_features_str = ', '.join(
[f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]])
print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
for title in clusters[cluster]:
# Find the index of the current title
title_index = results.index(title)
# Calculate "fit" metric as the distance to the centroid of its cluster
# The distance itself is used as a metric of fit
fit_metric = distances_to_centroids[title_index, cluster]
print(f"\t{title} (Distance to Centroid: {fit_metric:.3f})")
I've slightly modified the token pattern for the vectorizer so that it now matches only numeric tokens (token_pattern=r'(?u)\b\d+\b' in the script below).
The script:

from collections import defaultdict
from pathlib import Path
import nltk
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
# Load titles from a text file
results = list(sorted(
r for r in set(Path('/ubuntu.txt').read_text().split('\n')) if r
))
first_title = results[0]
# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]
# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches runs of digits, so only numbers contribute to the analysis.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)
# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')
# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
# Accessing the i-th TF-IDF vector in sparse format directly
tfidf_vector = X[i]
# Extracting indices of non-zero elements (words that are actually present in the document)
non_zero_indices = tfidf_vector.nonzero()[1]
# Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
# Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
# Formatting the sorted TF-IDF values into a string for easy display
sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
# Print sorted TF-IDF values
print(f'\tOriginal: {original}')
print(f'\tPreprocessed: {preprocessed}')
print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')
print("Clustering...")
# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)
# Getting cluster centroids
centroids = kmeans.cluster_centers_
# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
sorted_feature_indices = centroid.argsort()[::-1]
top_n = 5 # Number of key words
top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
cluster_top_features_with_weights[i] = top_features_with_weights
# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
# Grouping titles by their clusters
for i, label in enumerate(labels):
clusters[label].append(results[i])
# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)
# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
top_features_str = ', '.join(
f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
)
print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
# Prepare a list to hold titles and their distances
titles_and_distances = []
for title in clusters[cluster]:
# Find the index of the current title
title_index = results.index(title)
# Calculate "fit" metric as the distance to the centroid of its cluster
fit_metric = distances_to_centroids[title_index, cluster]
# Add title and its distance to the list
titles_and_distances.append((title, fit_metric))
# Sort titles within the cluster by their distance to the centroid (ascending order)
sorted_titles_and_distances = sorted(titles_and_distances, key=lambda x: x[1])
# Print sorted titles by their distance to centroid
for title, distance in sorted_titles_and_distances:
print(f"\t{title} (Distance to Centroid: {distance:.3f})")
In the previous example, elements were grouped fairly well, but there was a cluster containing elements close to noise (Cluster 4). To identify such a cluster (and filter it out in the future), we attempted to calculate the intra-cluster dispersion. This approach aimed to quantify the cohesion within each cluster by measuring the average distance of points from their cluster centroid. The rationale behind this method is that a cluster with a higher average distance among its points might be less cohesive and potentially contain more noise, making it a candidate for exclusion from further analysis.

To implement this, we first grouped the indices of the elements belonging to each cluster. Then, for each cluster, we constructed a matrix of its points by vertically stacking the corresponding rows from the TF-IDF matrix X using the indices we had collected. This allowed us to calculate pairwise distances between each point in a cluster and its centroid, using the pairwise_distances function. By computing the mean of these distances, we obtained a measure of intra-cluster dispersion.

The calculated average distances provided a clear metric to assess the tightness of each cluster. Clusters with lower average distances were deemed more cohesive, indicating that their elements were closely related to each other and to the cluster's overall theme. Conversely, clusters with higher average distances were scrutinized for potential exclusion, as their wide dispersion suggested a lack of a unifying theme or the presence of outlier elements. This methodological adjustment offered a systematic way to identify and potentially remove clusters that detract from the clarity and relevance of the clustering outcome, thereby refining the analysis.
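In isolation, the dispersion computation boils down to the following (a sketch only; X, labels and centroids are assumed to come from the fitted TF-IDF matrix and KMeans model, as in the full script below):

# Sketch of the intra-cluster dispersion step, extracted for clarity.
import numpy as np
from scipy.sparse import vstack
from sklearn.metrics import pairwise_distances

def intra_cluster_dispersion(X, labels, centroids):
    # Mean euclidean distance of each cluster's points to its own centroid
    dispersion = {}
    for cluster in set(labels):
        indices = [i for i, label in enumerate(labels) if label == cluster]
        points = vstack([X.getrow(i) for i in indices])
        distances = pairwise_distances(points, centroids[[cluster]], metric='euclidean')
        dispersion[cluster] = float(np.mean(distances))
    return dispersion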
The script:

import re
import string
from collections import defaultdict
from pathlib import Path
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from scipy.sparse import vstack
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
# Load titles from a text file
results = list(sorted(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
first_title = results[0]
# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]
# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches runs of digits, so only numbers contribute to the analysis.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)
# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')
# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
# Accessing the i-th TF-IDF vector in sparse format directly
tfidf_vector = X[i]
# Extracting indices of non-zero elements (words that are actually present in the document)
non_zero_indices = tfidf_vector.nonzero()[1]
# Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
# Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
# Formatting the sorted TF-IDF values into a string for easy display
sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
# Print sorted TF-IDF values
print(f'\tOriginal: {original}')
print(f'\tPreprocessed: {preprocessed}')
print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')
print("Clustering...")
# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)
# Getting cluster centroids
centroids = kmeans.cluster_centers_
# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = defaultdict(list)
# Grouping titles by their clusters
for i, label in enumerate(labels):
clusters[label].append(results[i])
clusters_indices[label].append(i)
for cluster, indices in clusters_indices.items():
if indices:
points_matrix = vstack([X.getrow(i) for i in indices])
distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
intra_cluster_distance = np.mean(distances)
intra_cluster_distances[cluster] = intra_cluster_distance
# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
sorted_feature_indices = centroid.argsort()[::-1]
top_n = 5 # Number of key words
top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
cluster_top_features_with_weights[i] = top_features_with_weights
# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)
# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
top_features_str = ', '.join(
f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
)
intra_cluster_distance = intra_cluster_distances[cluster]
print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
print(f'Intra-cluster distance: {intra_cluster_distance:.3f}')
# Prepare a list to hold titles and their distances
titles_and_distances = []
for title in clusters[cluster]:
# Find the index of the current title
title_index = results.index(title)
# Calculate "fit" metric as the distance to the centroid of its cluster
fit_metric = distances_to_centroids[title_index, cluster]
# Add title and its distance to the list
titles_and_distances.append((title, fit_metric))
# Sort titles within the cluster by their distance to the centroid (ascending order)
sorted_titles_and_distances = sorted(titles_and_distances, key=lambda x: x[1])
# Print sorted titles by their distance to centroid
for title, distance in sorted_titles_and_distances:
print(f"\t{title} (Distance to Centroid: {distance:.3f})")
Another metric that can be utilized for filtering clusters is the silhouette coefficient (its values range from -1 to 1). This metric provides insight into the distance between clusters and the cohesion within them. By calculating the silhouette coefficient for each sample within the dataset, we gain the ability to evaluate not just the overall clustering performance but also the individual performance of each cluster. This granular analysis is crucial for identifying clusters that may not be well-defined or might contain elements that are essentially outliers, potentially skewing the overall analysis.

To implement this, we used the silhouette_samples function from scikit-learn, which computes the silhouette coefficient for each sample, giving us a detailed breakdown of how well each sample fits within its assigned cluster compared to neighboring clusters. By aggregating these scores on a per-cluster basis, we were able to compute an average silhouette score for each cluster. This average score serves as a proxy for the cluster's quality, with higher scores indicating tighter and more distinct clusters, and lower scores suggesting clusters with overlapping or diffuse boundaries.

This approach allowed us to systematically evaluate each cluster's integrity. Clusters with low average silhouette scores were flagged for potential exclusion.
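In isolation, the per-cluster silhouette aggregation is roughly the following (a sketch only; X and labels are assumed to come from the fitted vectorizer and KMeans model, as in the full script below):

# Sketch of the per-cluster average silhouette score, extracted for clarity.
from collections import defaultdict
import numpy as np
from sklearn.metrics import silhouette_samples

def average_silhouette_per_cluster(X, labels):
    # Per-sample silhouette values, then averaged per cluster label
    silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
    per_cluster = defaultdict(list)
    for label, value in zip(labels, silhouette_vals):
        per_cluster[label].append(value)
    return {cluster: float(np.mean(values)) for cluster, values in per_cluster.items()}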
The script:

import re
import string
from collections import defaultdict
from pathlib import Path
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.sparse import vstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances, silhouette_samples
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
# Load titles from a text file
results = list(sorted(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
first_title = results[0]
# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]
# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches runs of digits, so only numbers contribute to the analysis.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)
# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')
# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
# Accessing the i-th TF-IDF vector in sparse format directly
tfidf_vector = X[i]
# Extracting indices of non-zero elements (words that are actually present in the document)
non_zero_indices = tfidf_vector.nonzero()[1]
# Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
# Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
# Formatting the sorted TF-IDF values into a string for easy display
sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
# Print sorted TF-IDF values
print(f'\tOriginal: {original}')
print(f'\tPreprocessed: {preprocessed}')
print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')
print("Clustering...")
# Cluster using K-means
kmeans = KMeans(random_state=42)
kmeans.fit(X)
# Getting cluster centroids
centroids = kmeans.cluster_centers_
# Output clustering results by cluster, including top features
labels = kmeans.labels_
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
intra_cluster_distances = defaultdict(list)
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
cluster_silhouette_scores = defaultdict(list)
# Grouping titles by their clusters
for i, label in enumerate(labels):
clusters[label].append(results[i])
clusters_indices[label].append(i)
cluster_silhouette_scores[label].append(silhouette_vals[i])
for cluster, indices in clusters_indices.items():
if indices:
points_matrix = vstack([X.getrow(i) for i in indices])
distances = pairwise_distances(points_matrix, centroids[[cluster]], metric='euclidean')
intra_cluster_distance = np.mean(distances)
intra_cluster_distances[cluster] = intra_cluster_distance
# Identifying key words for each cluster and storing them in a dictionary
feature_names = vectorizer.get_feature_names_out()
cluster_top_features_with_weights = {}
for i, centroid in enumerate(centroids):
sorted_feature_indices = centroid.argsort()[::-1]
top_n = 5 # Number of key words
top_features_with_weights = [(feature_names[index], centroid[index]) for index in sorted_feature_indices[:top_n]]
cluster_top_features_with_weights[i] = top_features_with_weights
# Calculate distances of each point to cluster centroids
distances_to_centroids = kmeans.transform(X)
# Printing clustering results by cluster, including top features for each cluster
print("\nClustering results by cluster, including top features and their weights:")
for cluster in sorted(clusters.keys()):
top_features_str = ', '.join(
f"{word} ({weight:.3f})" for word, weight in cluster_top_features_with_weights[cluster]
)
intra_cluster_distance = intra_cluster_distances[cluster]
print(f"\nCluster {cluster} (Top Features: {top_features_str}):")
average_score = np.mean(cluster_silhouette_scores[cluster])
print(f'Intra-cluster distance: {intra_cluster_distance:.3f} (the less the better)')
print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")
# Prepare a list to hold titles, their distances, and silhouette scores
titles_distances_scores = []
for i, title_index in enumerate(clusters_indices[cluster]):
title = results[title_index]
fit_metric = distances_to_centroids[title_index, cluster]
silhouette_score = silhouette_vals[title_index]
titles_distances_scores.append((title, fit_metric, silhouette_score))
# Sort titles within the cluster by their distance to the centroid (ascending order)
sorted_titles_distances_scores = sorted(titles_distances_scores, key=lambda x: x[1])
# Print sorted titles by their distance to centroid and include silhouette score
for title, distance, silhouette_score in sorted_titles_distances_scores:
print(f"\t{title} (Distance to Centroid: {distance:.3f}, Silhouette Score: {silhouette_score:.3f})")
Instead of integrating the current algorithm into Tribler, I decided to focus on its improvement and dedicate half of the current week to this task. I haven't yet focused on measuring the algorithm's performance because I want to first ensure that the clustering results are as accurate as possible. There are two main areas I'm currently working on to improve the quality of the clustering:
Once I'm confident that the algorithm is producing the best possible clustering outcomes, I'll turn my attention to optimizing its performance. So, the next iteration of the algorithm contains two modifications:

Transition from KMeans to HDBSCAN for clustering

Initially, our algorithm employed KMeans for clustering, which necessitates specifying the number of clusters a priori. This requirement posed a significant limitation, as determining the optimal number of clusters is not straightforward and can vary significantly depending on the dataset's nature and size. To address this challenge, we transitioned to HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). Unlike KMeans, HDBSCAN does not require pre-specifying the number of clusters; instead, it dynamically identifies clusters based on data density, offering several advantages. This shift aims to achieve more accurate and representative clustering by leveraging the data's natural structure, potentially enhancing the user experience through more precise content categorization.

Incorporating N-grams into TF-IDF vectorization

The original vectorization approach using TF-IDF (Term Frequency-Inverse Document Frequency) focused on individual terms without considering the order or proximity of words. To capture contextual nuances and the sequence in which terms appear, we integrated n-grams into our TF-IDF vectorization (ngram_range=(1, 2) in the script below). Contextual awareness: the algorithm can now recognize and give weight to term proximity and order, capturing more nuanced meanings.
The script:

# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by N-gram TF-IDF vectorization, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The script also explores the integration of word position into the clustering process through N-gram vectorization,
# aiming to capture more nuanced relationships between terms.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.
import re
import string
from collections import defaultdict
from enum import Enum, auto
from pathlib import Path
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
class Vectorizer(Enum):
TFIDF = auto()
TFIDF_NGRAMM = auto()
vectorize_type = Vectorizer.TFIDF_NGRAMM
# Load titles from a text file
results = list(sorted(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
first_title = results[0]
# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]
# Initialize TfidfVectorizer with a custom token pattern that keeps only numeric tokens (including single-digit numbers).
# The token_pattern r'(?u)\b\d+\b' matches runs of digits; the N-gram variant additionally uses ngram_range=(1, 2) to capture pairs of adjacent numbers.
if vectorize_type == Vectorizer.TFIDF:
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b')
X = vectorizer.fit_transform(preprocessed_results)
elif vectorize_type == Vectorizer.TFIDF_NGRAMM:
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\d+\b', ngram_range=(1, 2))
X = vectorizer.fit_transform(preprocessed_results)
# Get feature names (words) used by the TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()
print(f'Features: \n{feature_names}')
# Output original and preprocessed titles and their TF-IDF vectors
print("\nOriginal and preprocessed titles with their TF-IDF vectors:\n")
for i, (original, preprocessed) in enumerate(zip(results, preprocessed_results)):
# Accessing the i-th TF-IDF vector in sparse format directly
tfidf_vector = X[i]
# Extracting indices of non-zero elements (words that are actually present in the document)
non_zero_indices = tfidf_vector.nonzero()[1]
# Creating a list of tuples with feature names and their corresponding TF-IDF values for the current title
tfidf_tuples = [(feature_names[j], tfidf_vector[0, j]) for j in non_zero_indices]
# Sorting the tuples by TF-IDF values in descending order to get the most relevant words on top
sorted_tfidf_tuples = sorted(tfidf_tuples, key=lambda x: x[1], reverse=True)
# Formatting the sorted TF-IDF values into a string for easy display
sorted_tfidf_str = "\n\t\t\t".join([f"{word}: {value:.3f}" for word, value in sorted_tfidf_tuples])
# Print sorted TF-IDF values
print(f'\tOriginal: {original}')
print(f'\tPreprocessed: {preprocessed}')
print(f'\tTF-IDF:\n\t\t\t{sorted_tfidf_str}\n')
print("Clustering...")
# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2)
hdbscan.fit(X)
# Retrieve cluster labels
labels = hdbscan.labels_
# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)
# Group data points by their cluster label
for i, label in enumerate(labels):
clusters[label].append(results[i])
clusters_indices[label].append(i)
cluster_silhouette_scores[label].append(silhouette_vals[i])
# Initialize a dictionary to store the sum of TF-IDF values for features by cluster
cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1]))
# Sum up TF-IDF values for each feature within each cluster
for i, label in enumerate(labels):
cluster_feature_sums[label] += X[i].toarray()[0]
# Number of top features to select for each cluster
top_n_features = 5
feature_names = vectorizer.get_feature_names_out()
# Dictionary to store the top N features for each cluster
top_features_per_cluster = {}
for cluster, sums in cluster_feature_sums.items():
# Indices of features with sums greater than 0, sorted by their sum in descending order
positive_indices = [index for index, value in enumerate(sums) if value > 0]
top_indices = sorted(positive_indices, key=lambda index: sums[index], reverse=True)[:top_n_features]
# Extract the feature names and their sums for the top features with values greater than 0
top_features = [(feature_names[index], sums[index]) for index in top_indices if sums[index] > 0]
top_features_per_cluster[cluster] = top_features
# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}
# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)
# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
features = top_features_per_cluster[cluster]
average_score = average_scores[cluster]
features_str = (f"{feature}: {value:.3f}" for feature, value in features)
features_line = ', '.join(features_str)
print(f"\nCluster {cluster} (features: {features_line}):")
print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")
# Prepare and sort titles within the cluster by their silhouette score
titles_scores = []
for title_index in clusters_indices[cluster]:
title = results[title_index]
silhouette_score = silhouette_vals[title_index]
titles_scores.append((title, silhouette_score))
sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)
# Print each title with its silhouette score
for title, silhouette_score in sorted_titles_scores:
print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
To achieve more specific clustering results, such as differentiating between clusters for "Ubuntu 20.04.X" instead of a more general "Ubuntu 20.04," the following HDBSCAN constructor parameters can be adjusted:
Conversely, to configure HDBSCAN for creating more general groups, the same parameters can be adjusted in the opposite direction. By fine-tuning these parameters, HDBSCAN can be tailored to identify clusters at the desired level of specificity, from highly detailed clusters differentiating between minor variations to broader groups encompassing more general categories. Below are two examples:
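The original examples are screenshots and are not reproduced here; purely as a rough sketch (the exact values used in the issue may differ), the two directions could be configured along these lines:

# Sketch only: illustrative parameter settings, not the exact values from the issue.
from sklearn.cluster import HDBSCAN

# More specific clusters (e.g. separating "Ubuntu 20.04.X" point releases):
# allow very small clusters and do not merge nearby clusters.
fine_grained = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0.0)

# More general groups (e.g. a single "Ubuntu 20.04" cluster):
# require larger, denser clusters and merge clusters that lie close together.
coarse = HDBSCAN(min_cluster_size=5, min_samples=3, cluster_selection_epsilon=0.5)

# Usage: fine_grained.fit(X) or coarse.fit(X) on the TF-IDF (or embedding) matrix X.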
The selected parameter values for the HDBSCAN constructor are not definitive but are intended to illustrate the potential for enhancing clustering quality through careful optimization. The optimal settings can vary significantly, underscoring the importance of adjusting them based on the specific clustering goals. Intuitively, the choice of these values should align with the user's objectives: broader topic identification might necessitate one set of parameters, while uncovering more detailed, dense information may require a different configuration.
The next step involved a deeper exploration of vectorization algorithms to determine whether there are more advanced options beyond TF-IDF that could better suit our needs. This exploration led us to experiment with FastText, a word embedding technique known for capturing the nuances of word semantics and relationships more effectively than traditional TF-IDF. FastText, by leveraging neural network models, generates vector representations of words that incorporate the context in which words appear, as well as the morphology of the words themselves. While the results obtained with FastText were practically identical to those achieved with TF-IDF, a key distinction emerged: FastText can analyze all of the presented tokens, not just the numeric ones, as was the case in the previous TF-IDF version.
The script:

# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by FastText vectorization, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.
import re
import string
from collections import defaultdict
from pathlib import Path
import nltk
import numpy as np
from gensim.models import FastText
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.metrics import silhouette_samples
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
# Load titles from a text file
results = list(sorted(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
first_title = results[0]
# Preprocess titles
preprocessed_results = [preprocess_text(title) for title in results]
def extract_digits(tokens):
return [token for token in tokens if re.match(r'^\d+$', token)]
digit_only_titles = [extract_digits(title.split()) for title in preprocessed_results]
model = FastText(sentences=digit_only_titles, vector_size=100, window=5, min_count=1, workers=4)
X = np.array([np.mean([model.wv[word] for word in title.split() if word in model.wv], axis=0) for title in
preprocessed_results])
print("Clustering...")
# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0)
hdbscan.fit(X)
# Retrieve cluster labels
labels = hdbscan.labels_
# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)
# Group data points by their cluster label
for i, label in enumerate(labels):
clusters[label].append(results[i])
clusters_indices[label].append(i)
cluster_silhouette_scores[label].append(silhouette_vals[i])
# Leftover from the TF-IDF variant: dictionary for per-cluster feature sums (not used with FastText embeddings)
cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1]))
# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}
# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)
# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
average_score = average_scores[cluster]
print(f"\nCluster {cluster}:")
print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")
# Prepare and sort titles within the cluster by their silhouette score
titles_scores = []
for title_index in clusters_indices[cluster]:
title = results[title_index]
silhouette_score = silhouette_vals[title_index]
titles_scores.append((title, silhouette_score))
sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)
# Print each title with its silhouette score
for title, silhouette_score in sorted_titles_scores:
print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
This endeavor was an attempt to leverage transformers, specifically the BERT model, to vectorize titles in our clustering process. With BERT embeddings, we observe inferior results compared to analyzing either the entire titles or only the extracted digits. Additionally, BERT is noticeably slower and more resource-intensive. I acknowledge that I haven't delved deeply into BERT's intricacies, as it's a complex model and gaining a thorough understanding would require a substantial investment of time. My goal was to create a simple, almost out-of-the-box example to get a sense of how it operates.
The script:

# This script performs cluster analysis using the HDBSCAN algorithm, enhanced by BERT embeddings, applied
# to text data.
# It includes steps for fitting the HDBSCAN model to identify optimal clusters without pre-specifying the number,
# calculating and interpreting key metrics like silhouette scores to evaluate cluster quality.
# The focus is on assessing the cohesion and separation of clusters, identifying the most significant features defining
# each cluster, and understanding the contextual relationships within the data.
# This approach provides a detailed exploration of the clustering results, offering deeper insights into the structure
# of the text data and the effectiveness of the modified clustering strategy.
import re
import string
from collections import defaultdict
from pathlib import Path
import nltk
import numpy as np
import torch
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.cluster import HDBSCAN
from sklearn.metrics import silhouette_samples
from transformers import BertModel, BertTokenizer
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
# Load titles from a text file
results = list(sorted(
r for r in set(Path('ubuntu.txt').read_text().split('\n')) if r
))
def preprocess_text(text):
# Remove text inside parentheses and brackets
text = re.sub(r'\[.*?\]|\(.*?\)', '', text)
# Convert to lowercase
text = text.lower()
# Replace punctuation and hyphens with spaces
text = re.sub(r'[' + string.punctuation + ']', ' ', text)
# Remove leading zeros
text = re.sub(r'\b0+(\d+)\b', r'\1', text)
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = text.split()
words = [word for word in words if word and word not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(words)
first_title = results[0]
# Preprocess titles
print("Preprocessing titles...")
preprocessed_results = [preprocess_text(title) for title in results]
print("Loading BERT model and tokenizer...")
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
def get_bert_embedding(text):
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
with torch.no_grad():
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
return embeddings
def extract_digits(text):
return ' '.join(re.findall(r'\b\d+\b', text))
print("Loading and preprocessing titles...")
digit_only_titles = [extract_digits(title) for title in preprocessed_results]
print("Transforming titles to BERT embeddings...")
X = np.array([get_bert_embedding(title) for title in digit_only_titles if title.strip() != ''])
print("Clustering...")
# Initialize and fit the HDBSCAN model
hdbscan = HDBSCAN(min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0)
hdbscan.fit(X)
# Retrieve cluster labels
labels = hdbscan.labels_
# Initialize dictionaries for storing clustering results
clusters = defaultdict(list)
clusters_indices = defaultdict(list)
# Calculate silhouette scores for each data point in X based on their cluster membership
silhouette_vals = silhouette_samples(X, labels, metric='euclidean')
# Store silhouette scores for each cluster for later analysis
cluster_silhouette_scores = defaultdict(list)
# Group data points by their cluster label
for i, label in enumerate(labels):
clusters[label].append(results[i])
clusters_indices[label].append(i)
cluster_silhouette_scores[label].append(silhouette_vals[i])
# Leftover from the TF-IDF variant: dictionary for per-cluster feature sums (not used with BERT embeddings)
cluster_feature_sums = defaultdict(lambda: np.zeros(X.shape[1]))
# First, calculate the average silhouette score for each cluster
average_scores = {cluster: np.mean(scores) for cluster, scores in cluster_silhouette_scores.items()}
# Then, sort the clusters by their average silhouette score
sorted_clusters = sorted(average_scores.keys(), key=lambda cluster: average_scores[cluster], reverse=True)
# Output clustering results, now sorted by the average silhouette score
print("\nClustering results by cluster:")
for cluster in sorted_clusters:
average_score = average_scores[cluster]
print(f"\nCluster {cluster}:")
print(f"Average Silhouette Score = {average_score:.3f} (the higher the better)")
# Prepare and sort titles within the cluster by their silhouette score
titles_scores = []
for title_index in clusters_indices[cluster]:
title = results[title_index]
silhouette_score = silhouette_vals[title_index]
titles_scores.append((title, silhouette_score))
sorted_titles_scores = sorted(titles_scores, key=lambda x: x[1], reverse=True)
# Print each title with its silhouette score
for title, silhouette_score in sorted_titles_scores:
print(f"\t{title} (Silhouette Score: {silhouette_score:.3f})")
The final improvement in this iteration was an attempt to modify the standard TF-IDF algorithm to account for the position of tokens, which led to better results (comparable to N-Grams with TFIDF) than all other experiments conducted. By incorporating token positioning, we were able to differentiate between identical tokens based on their locations within the text, offering a deeper insight into the document's structure. Though our implementation is somewhat naive and likely not the most efficient in terms of performance, it represents a swift and straightforward prototype. The trick is:

class PositionalDigitTfidfVectorizer(TfidfVectorizer):
def build_analyzer(self):
# Custom analyzer that extracts digits and their position within the document
def positional_digit_analyzer(doc):
# Splitting the document into words
words = doc.split()
# Initializing a list to store digits with their position
positional_digits = []
# Iterating over words and their indexes in the list
for index, word in enumerate(words, start=1): # Indexing starts from 1
if word.isdigit(): # Checking if the word is a digit
# Adding to the list in the format "digit_position_in_text"
positional_digits.append(f"{word}_{index}")
positional_digits.append(f"{word}")
return positional_digits
return positional_digit_analyzer

This modification leads to tokens that look like:
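For instance (an illustration derived from the class above, applied to a made-up preprocessed title):

# Illustrative usage of the PositionalDigitTfidfVectorizer defined above.
pos_vectorizer = PositionalDigitTfidfVectorizer()
analyzer = pos_vectorizer.build_analyzer()
print(analyzer("ubuntu 22 04 3 live server amd64"))
# ['22_2', '22', '04_3', '04', '3_4', '3']
# Each digit token appears both with its 1-based position ("22_2") and as a
# plain token ("22"), so identical numbers in different positions become
# distinguishable while plain numeric overlap is still captured.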
For further refinement of the algorithm, the following paper could be used as a reference: "Optimized TF-IDF Algorithm with the Adaptive Weight of Position of Word" available at https://www.atlantis-press.com/proceedings/aiie-16/25866330. This research suggests possible avenues for enhancing the complexity and effectiveness of our approach, indicating more advanced strategies for incorporating positional information into text vectorization.
Disclaimer: The code for the scripts above was generated by ChatGPT.
"Content Bundle" is a strategic feature in Tribler aimed at enhancing the organization and accessibility of digital content. It acts as an aggregation point for Content Items, bundling them together under a single, cohesive unit. This structure allows users to efficiently manage and access groups of related Content Items, simplifying navigation and retrieval. Ideal for categorizing content that shares common themes, attributes, or sources, the Content Bundle provides a streamlined way to handle complex sets of information, making it easier for users to find and interact with a rich array of content within the Tribler network.
The current representation of Content Items can be seen in the following picture:
We want them to have another layer of grouping:
Everything that we need already exists in our Knowledge Database; we can reuse the existing CONTENT_ITEM as follows:

Or another structure:

Or
So it's an open question regarding the structure. Please suggest your ideas.
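Purely as an illustration of the kind of structure under discussion (the names CONTENT_BUNDLE, HAS_CONTENT_ITEM and HAS_TORRENT below are hypothetical placeholders, not existing identifiers), the extra grouping layer could be expressed as plain subject-predicate-object statements on top of the existing CONTENT_ITEM relation:

# Hypothetical sketch only: the predicate and resource names are placeholders.
statements = [
    # bundle -> content items (the new grouping layer)
    ("Ubuntu 22.04", "HAS_CONTENT_ITEM", "Ubuntu 22.04.1"),
    ("Ubuntu 22.04", "HAS_CONTENT_ITEM", "Ubuntu 22.04.2"),
    # content item -> torrents (the grouping Content Items already provide)
    ("Ubuntu 22.04.1", "HAS_TORRENT", "ubuntu-22.04.1-live-server-amd64.iso"),
    ("Ubuntu 22.04.2", "HAS_TORRENT", "ubuntu-22.04.2-desktop-amd64.iso"),
]

def items_in_bundle(bundle, statements):
    # Resolve a bundle to its content items with a simple statement lookup
    return [obj for subj, pred, obj in statements
            if subj == bundle and pred == "HAS_CONTENT_ITEM"]

print(items_in_bundle("Ubuntu 22.04", statements))  # ['Ubuntu 22.04.1', 'Ubuntu 22.04.2']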
To complete this task, we need to:
Related: