Welcome to the NLP Data Visualization project! This repository contains a collection of Python scripts designed to help you analyze and visualize text data in various ways. Each script focuses on a specific type of analysis or visualization, providing a comprehensive toolkit for text data exploration.
This project utilizes a range of powerful Python libraries such as matplotlib, seaborn, wordcloud, spacy, and textblob to perform tasks like frequency analysis, sentiment analysis, and parts of speech tagging. The resulting visualizations, including bar plots, histograms, pie charts, treemaps, violin plots, and word clouds, offer clear and insightful representations of the underlying data.
This project includes:
-
๐ Well-known Data Science libraries ๐
- Some them are:
matplotlib
,seaborn
,wordcloud
,spacy
,textblob
- Some them are:
-
๐ Colorful visualizations ๐
- Each graph is crafted by inspiration from its own document and its function is left open to configure freely. The graphs are:
bar plots
,histograms
,pie charts
,treemaps
,violin plots
,word clouds
- Each graph is crafted by inspiration from its own document and its function is left open to configure freely. The graphs are:
-
โจ Creative file handlings (writings+readings), tailored for large datasets โจ
- These are:
txt
,xml
,json
,xlsx
,csv
,sgm
- These are:
-
๐ Each scripts is a combination from one of these bulletpoints!
This repository contains Python scripts designed for text data analysis and visualization. Each script focuses on a specific type of analysis or visualization, allowing users to gain insights into text data through various methods.
๐งฌ Comprehensive Text Analysis ๐งฌ -> Perform sentiment analysis, frequency analysis, and more.
๐ฎ Diverse Visualizations ๐ฎ -> Generate bar plots, histograms, pie charts, treemaps, violin plots, and word clouds.
๐ Library Utilization ๐ -> Utilizes powerful Python libraries such as matplotlib, seaborn, wordcloud, spacy, and textblob.
๐งฎ Easy to Use ๐งฎ -> Scripts are designed to be easily run from the command line.
โ Configurable โ -> Input and output paths can be easily configured for different datasets.
โ๏ธ Logging โ๏ธ -> Each script includes logging to track the execution process and capture errors.
๐ project-root
โโโ ๐ config
โ โโโ ๐ __init__.py
โ โโโ ๐ common_config.py
โ โโโ ๐ constants.py
โ
โโโ ๐ scripts
โ โโโ ๐ __init__.py
โ โโโ ๐ entity_treemap.py
โ โโโ ๐ freq_barplot.py
โ โโโ ๐ len_histogram.py
โ โโโ ๐ len_violin.py
โ โโโ ๐ pos_cloud.py
โ โโโ ๐ sentiment_piechart.py
โ
โโโ ๐ (logs)
โ ...
โ
โโโ ๐ .gitignore
โโโ ๐ .gitattributes
โโโ ๐ main.py
config/: Contains configuration files.
- _init_.py: Imports configuration and utility functions.
- common_config.py: Contains common configuration functions and logging setup.
- constants.py: Defines constants used throughout the application.
src/: Contains source code files for the project.
- _init_.py: Initializes the source package and imports functions from individual modules.
Script | Description | Input File | Ouput File | Graph |
---|---|---|---|---|
freq_barplot.py | Generates frequency bar plots from XML data. | xml | txt | Bar Plot |
len_histogram.py | Creates histograms of sentence lengths from text files. | txt | - | Histogram |
sentiment_piechart.py | Generates pie charts based on sentiment analysis from text data. | sgm | - | Pie Chart |
entity_treemap.py | Categorizes entities and build treemaps from XML data using them. | xml | json | Tree Map |
len_violin_plot.py | Visualizes the length of email addresses using violin plots. | txt | xlsx | Violin Plot |
pos_cloud.py | Generates word clouds based on parts of speech from text data. | sgm | csv | Word Cloud |
logs: Stores log files after every execution. If removed or not present, its created again after execution.
.gitattributes: Ensures consistent line endings across different operating systems in the repository.
.gitignore: Specifies files and directories to be ignored by Git (e.g., virtual environments, build artifacts).
main.py: In our case, giving the complexity of tasks, input and outputs of each task, this file is left empty. Please consult each script separately.
The bar plot visualizes the frequency of specific entities or attributes. The script freq_barplot.py
is used to create a bar plot of the most frequent parts of speech (verbs, subjects, and objects).
def get_bar_plot(destination, frequency_dict, x_name, y_name, top_n=10, title=None, palette='viridis'):
try:
df = DataFrame(frequency_dict.items(), columns=[x_name, y_name])
df_sorted = df.sort_values(by=y_name, ascending=False).head(top_n)
plt.figure(figsize=(12, 8))
barplot(x=x_name, y=y_name, data=df_sorted, palette=palette, hue=x_name, legend=False)
plt.xlabel = x_name
plt.ylabel = y_name
if title:
plt.title(title, fontweight = "bold")
plt.savefig(destination)
plt.close()
logger.info(f"Graph created in {destination}")
except Exception as e:
logger.exception(f"Graph failed: {e} in {destination}")
The histogram displays the distribution of numerical data. The script len_histogram.py
generates a histogram showing the distribution of sentence lengths.
def get_histogram(data, destination, color = 'red', bins=20,
x_label = 'Sentence Length', y_label ='Number of Sentences',
graph_name='histogram.png', title="Distribution of Sentence Lengths"):
try:
plt.hist(data, edgecolor='black', histtype='bar', bins=bins, color=color, alpha=0.7, density=1)
plt.xlabel(x_label)
plt.ylabel(y_label)
if title:
plt.title("Distribution of Sentence Lengths", fontweight = "bold")
plt.savefig(destination)
plt.close()
logger.info(f"Graph: {graph_name} created in {destination}")
except Exception as e:
logger.exception(f"Graph: {graph_name} failed: {e} in {destination}")
return None
The pie chart shows the proportion of different categories within a dataset. The script sentiment_pychart.py
creates a pie chart of sentiment distribution.
def get_pie_chart(destination, sentiments_categorized, title=None, graph_name='pie_chart.png'):
try:
plt.figure(figsize=(8, 8))
plt.pie(
sentiments_categorized.values(),
labels=sentiments_categorized.keys(),
autopct='%1.1f%%',
colors=['#65d14f', '#66c2a5', '#c43737'],
explode=(0.1, 0, 0),
shadow=True,
startangle=90
)
if title:
plt.title(title, fontweight = "bold")
plt.savefig(destination)
plt.close()
logger.info(f"Pie chart: {graph_name} created in {destination}")
except Exception as e:
logger.error(f"Pie chart: {graph_name} failed: {e} in {destination}")
return None
The treemap provides a hierarchical view of data with nested rectangles. The script entity_treemap.py
generates a treemap visualization based on XML file content.
def get_treemap(destination, entity_dict, title='Distribution of Entity Labels'):
try:
labels = list(entity_dict.keys())
frequencies = list(entity_dict.values())
squarify.plot(sizes=frequencies, label=labels, color=color_palette("Spectral", len(labels)), alpha=0.7, pad=2)
if title:
plt.title(title, fontweight = "bold")
plt.savefig(destination)
plt.close()
logger.info(f"Treemap created in {destination}")
except Exception as e:
logger.exception(f"Treemap failed: {e} in {destination}")
The violin plot shows data distribution across several categories. The script len_violin.py
generates a violin plot of email lengths.
def get_violin_plot(destination, data, column_name, title=None, color='Yellow'):
try:
data_df = DataFrame(data=data, columns=[column_name])
plt.figure(figsize=(10, 6))
violinplot(x=column_name, data=data_df, color=color)
plt.xlabel(column_name)
if title:
plt.title(title, fontweight = "bold")
plt.savefig(destination)
plt.close()
logger.info(f"Graph created in {destination}")
except Exception as e:
logger.exception(f"Graph failed: {e} in {destination}")
The word cloud visualizes the frequency of words in a text. The script pos_cloud.py
creates a word cloud of the most frequent verbs.
def get_word_cloud(destination, word_counts, title=None):
try:
wordcloud = WordCloud(
width=800,height=800,
background_color='white',
min_font_size=10
).generate_from_frequencies(word_counts)
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
if title:
plt.title(title, fontweight = "bold")
plt.savefig(destination)
plt.close()
logger.info(f"Graph created in {destination}")
except Exception as e:
logger.exception(f"Failed graph: {e} in {destination}")
The scripts included in this project analyze text data by performing tasks such as frequency analysis, sentiment analysis, and length distribution analysis. They generate visualizations that help in understanding the underlying patterns and characteristics of the text data.
-
Ensure you have Python 3.x installed.
-
Install the required libraries and instances when needed.
-
Run the scripts by executing them from the command line:
python freq_barplot.py
python len_histogram.py
python sentiment_piechart.py
python entity_treemap.py
python len_violin.py
python pos_cloud.py
This project is licensed under the GNU General Public License v3.0 (GPL-3.0) - see the LICENSE file for details.
Let me know if there are any specific details youโd like to adjust or additional sections you want to include!
- Email: kivancgordu@hotmail.com
- Version: 1.0.0
- Date: 31-07-2024