Task Assigned by Black Coffer
This project extracts article text from a given list of URLs and performs text analysis in Python to compute a set of sentiment and readability variables. The code is also available as a Colab notebook, which is more flexible to work with than a standalone script.
- Ensure you have Python 3.x installed.
- Install the required libraries using the following command:
pip install requests pandas beautifulsoup4 textblob nltk openpyxl
The project proceeds in the following steps:
- Extracting the data from the given URLs using the BeautifulSoup library (web scraping).
- Cleaning the data by removing the HTML tags and converting the text to lowercase.
  - Find the title of the article using the <h1> tag and obtain the content using the <div> tag and class name selectors.
  - Since different articles may have different class selectors, try-except blocks are used to handle any exceptions.
- Tokenizing the text into individual words and removing the stop words.
- Performing sentiment analysis using the TextBlob library.
- Computing the sentiment score and polarity of the text.
- Saving the analysis results to an Excel file and the extracted articles to a CSV file.
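The extraction step can be sketched as follows. This is a minimal illustration, not the code from main.py: the function name parse_article and the class names in CANDIDATE_CLASSES are hypothetical, and the sketch parses an HTML string so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical class names; real pages will need their own selectors.
CANDIDATE_CLASSES = ["article-body", "post-content"]


def parse_article(html):
    """Return (title, content) from an HTML string, lowercased and tag-free."""
    soup = BeautifulSoup(html, "html.parser")

    # The title is taken from the first <h1> tag, if there is one.
    try:
        title = soup.find("h1").get_text(strip=True)
    except AttributeError:  # soup.find returned None
        title = ""

    # Try each candidate class in turn until one yields a <div>.
    content = ""
    for class_name in CANDIDATE_CLASSES:
        try:
            div = soup.find("div", class_=class_name)
            content = div.get_text(separator=" ", strip=True)
            break
        except AttributeError:  # no <div> with this class on the page
            continue

    return title.lower(), content.lower()


sample_html = """
<html><body>
  <h1>Sample Title</h1>
  <div class="post-content"><p>First paragraph.</p><p>Second one.</p></div>
</body></html>
"""
title, content = parse_article(sample_html)
print(title)
print(content)
```

In a real run the HTML string would presumably come from requests.get(url).text, and a page that matches none of the candidate classes simply yields empty content instead of stopping the whole batch.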
To run the script:
- Place the main.py script in the working directory.
- Update the paths in the script to point to the input Excel file and the folders containing the positive/negative words and stopwords.
- Run the script using the following command:
python main.py
- The script will generate an analysis_results.xlsx file containing the analysis results.
main.py is organized as follows:
- load_words(file_path, encoding='utf-8'): Reads words from a file and returns a set of words. Handles different file encodings.
- load_specific_stopwords(folder_path, filenames, encoding='utf-8'): Reads stopwords from multiple files and returns a set of stopwords. Handles different file encodings.
- save_text_to_file(text, file_path): Saves text to a file.
- extract_article_content(url): Extracts the title and content of an article from a given URL.
- count_syllables(word): Counts the number of syllables in a word.
- analyze_text(text, positive_words, negative_words, stopwords): Analyzes text and calculates various linguistic and sentiment metrics.
- main workflow: Reads input data, loads dictionaries and stopwords, extracts articles, analyzes text, and saves results.
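To illustrate the dictionary-based part of analyze_text, here is a minimal sketch of tokenizing, removing stop words, and counting positive and negative words. The function name score_sentiment, the regular expression used for tokenizing, and the polarity and subjectivity formulas are assumptions made for this example; the source does not state how main.py combines the counts.

```python
import re


def score_sentiment(text, positive_words, negative_words, stopwords):
    """Count dictionary hits after lowercasing and stop-word removal."""
    # Tokenize into lowercase alphabetic words (apostrophes are kept).
    tokens = re.findall(r"[a-z']+", text.lower())
    words = [t for t in tokens if t not in stopwords]

    positive = sum(1 for w in words if w in positive_words)
    negative = sum(1 for w in words if w in negative_words)

    # Assumed formulas; the small constant avoids division by zero.
    polarity = (positive - negative) / (positive + negative + 1e-6)
    subjectivity = (positive + negative) / (len(words) + 1e-6)

    return {
        "positive_score": positive,
        "negative_score": negative,
        "polarity_score": polarity,
        "subjectivity_score": subjectivity,
        "word_count": len(words),
    }


result = score_sentiment(
    "The service was good and great but the delivery was bad.",
    positive_words={"good", "great"},
    negative_words={"bad"},
    stopwords={"the", "was", "and", "but"},
)
print(result)
```

Passing the word lists as sets keeps each membership check constant-time, which matters when the dictionaries contain thousands of entries.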
File-encoding errors (e.g., "'utf-8' codec can't decode byte") are handled by the script itself: it reads the affected files using the 'latin1' encoding, so the input files do not need to be converted or re-saved.
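One way such a fallback could look is sketched below. It assumes that the word lists hold one word per line and that UTF-8 is tried first with latin1 as the fallback; the actual load_words in main.py may differ in both respects.

```python
import os
import tempfile


def load_words(file_path, encoding="utf-8"):
    """Read one word per line into a set, retrying with latin1 on decode errors."""
    try:
        with open(file_path, encoding=encoding) as handle:
            lines = handle.readlines()
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this read cannot fail to decode.
        with open(file_path, encoding="latin1") as handle:
            lines = handle.readlines()
    # Assumption: words are normalized to lowercase and blank lines are skipped.
    return {line.strip().lower() for line in lines if line.strip()}


# Demonstration with a temporary file containing a byte that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"good\ncaf\xe9\n")
    path = tmp.name

words = load_words(path)
os.remove(path)
print(words)
```

Because latin1 never raises a decode error, a file that is really in some other encoding will load without complaint but may contain wrong characters, so it is worth spot-checking the loaded words.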
The analysis results are saved in analysis_results.xlsx, containing metrics such as:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Percentage of Complex Words
- Fog Index
- Average Number of Words per Sentence
- Complex Word Count
- Word Count
- Syllables per Word
- Personal Pronouns
- Average Word Length
Additionally, the extracted articles are saved in extracted_articles.csv.
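The readability metrics in the list above can be sketched as follows. This is an illustration under stated assumptions: the syllable counter is a simple vowel-group heuristic, a complex word is taken to be one with three or more syllables, and the Fog Index uses the standard Gunning formula with the share of complex words expressed as a percentage (0 to 100). The source does not say which conventions main.py follows, and the name readability is hypothetical.

```python
import re


def count_syllables(word):
    """Estimate syllables by counting groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic: a trailing "es" or "ed" usually does not add a syllable.
    if word.endswith(("es", "ed")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Compute sentence-, word-, and syllable-based readability metrics."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = [count_syllables(w) for w in words]
    complex_count = sum(1 for s in syllables if s >= 3)

    # Guard the denominators so that empty input yields zeros, not an error.
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    avg_sentence_length = len(words) / n_sentences
    pct_complex = 100 * complex_count / n_words
    return {
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
        "complex_word_count": complex_count,
        "word_count": len(words),
        "syllables_per_word": sum(syllables) / n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words,
    }


metrics = readability("The report is short. Sentiment analysis is interesting.")
print(metrics)
```

The vowel-group heuristic overcounts words with a silent final "e" (for example "complete"), so a real implementation would likely need extra rules or a pronunciation dictionary.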
The project consists of the following files:
- main.py: The script containing the complete code.
- analysis_results.xlsx: The output file that will be generated after running the script.
- README.md: Instructions on how to run the script and an explanation of the approach.