Automation to extract information from news on the NY Times website.
Your challenge is to automate the process of extracting data from the news site. Link to the news site: www.nytimes.com
You must have 3 configured variables (you can save them in the configuration file, but it is better to store them in Robocorp Cloud Work Items):
- search phrase
- news category or section
- number of months for which you need to receive news
Example of how this should work: 0 or 1 - only the current month, 2 - current and previous month, 3 - current and two previous months, and so on
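The month rule above can be turned into a cutoff date. The following is a minimal sketch, not the project's actual code: the function name `earliest_date` is hypothetical, and it assumes the period starts on the first day of the earliest included month.

```python
from datetime import date


def earliest_date(months: int, today: date) -> date:
    """Return the first day of the earliest month to include.

    Hypothetical helper: 0 and 1 both mean "current month only",
    2 means current and previous month, and so on.
    """
    months_back = max(months, 1) - 1
    # Work with a running month index so that year boundaries
    # are handled by divmod instead of manual carry logic.
    total = today.year * 12 + (today.month - 1) - months_back
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, 1)


if __name__ == "__main__":
    print(earliest_date(3, date(2023, 1, 10)))
```

A news item would then be kept only if its date is on or after the returned value.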
The main steps:
1. Open the site by following the link
2. Enter a phrase in the search field
3. On the result page, apply the following filters:
   - select a news category or section. Your automation should have the option to choose from none to any number of categories/sections. This should be specified via the config file and/or Robocorp Cloud Work Items
   - choose the latest (i.e., newest) news
4. Get the values: title, date, and description.
5. Store in an Excel file:
   - title
   - date
   - description (if available)
   - picture filename
   - count of search phrases in the title and description
   - True or False, depending on whether the title or description contains any amount of money. Possible formats: $11.1 | $111,111.11 | 11 dollars | 11 USD
6. Download the news picture and specify the file name in the Excel file
7. Follow steps 4-6 for all news that falls within the required time period
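The two computed Excel columns (phrase count and money flag) can be sketched as below. This is an illustration under assumptions: the names `MONEY_PATTERN`, `count_phrase` and `contains_money` are hypothetical, the regex is one possible pattern for the four listed formats and not necessarily the one stored in the project's config, and matching is assumed to be case-insensitive.

```python
import re

# Assumed pattern covering: $11.1 | $111,111.11 | 11 dollars | 11 USD
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?"                      # dollar sign followed by an amount
    r"|\b\d[\d,]*(?:\.\d+)?\s?(?:dollars|USD)\b",  # amount followed by a unit
    re.IGNORECASE,
)


def count_phrase(phrase: str, title: str, description: str = "") -> int:
    """Count case-insensitive occurrences of the phrase in title and description."""
    needle = phrase.strip().lower()
    if not needle:
        return 0
    # Count each field separately so a match cannot span the two texts.
    return title.lower().count(needle) + description.lower().count(needle)


def contains_money(title: str, description: str = "") -> bool:
    """Return True if either text contains an amount in one of the assumed formats."""
    return any(MONEY_PATTERN.search(text) for text in (title, description))


if __name__ == "__main__":
    print(count_phrase("climate", "Climate talks", "The climate summit"))
    print(contains_money("Budget rises to $111,111.11"))
```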
The project is divided into three folders:
- config: configuration files;
- outputs: output folders, each named with the current datetime so that it is unique, containing the Excel file, a log file and the images folder with all downloaded images;
- src: folder containing all scripts.
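As an illustration of the datetime-named output folders, here is a minimal sketch. The function name `create_run_folder`, the `%Y%m%d_%H%M%S` format and the `images` subfolder name are assumptions, not taken from the project's scripts.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def create_run_folder(base_dir: str, now: Optional[datetime] = None) -> Path:
    """Create a per-run folder named after the current datetime.

    Hypothetical helper: the folder name format and the "images"
    subfolder are assumed for illustration.
    """
    now = now or datetime.now()
    run_dir = Path(base_dir) / now.strftime("%Y%m%d_%H%M%S")
    # parents=True also creates the run folder and base folder if missing.
    (run_dir / "images").mkdir(parents=True, exist_ok=True)
    return run_dir


if __name__ == "__main__":
    print(create_run_folder("outputs"))
```

With second-level resolution, two runs started within the same second would share a folder; a finer timestamp could be used if that matters.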
Main libraries used in this project (also available in requirements.txt):
- openpyxl==3.1.2
- pandas==2.0.2
- selenium==4.9.1
- tqdm==4.65.0
- urllib3==2.0.2
- webdriver-manager==3.8.6
Python version: 3.11
The config.ini file is located in the config folder and contains all the configuration parameters divided by section:
- website_parameters: URLs, XPaths, ids and parameters related to the website http://www.nytimes.com;
- input_parameters: the variables configured by the user:
  - search phrase: words must be separated by spaces;
  - news category or section: must be a list, e.g. [Books,Fashion,Movies,Opinion,U.S.];
  - number of months for which you need to receive news: must be an integer;
- browser_parameters: the Chrome version;
- general_parameters: folder paths, wait times for clicks and regex patterns.
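Reading the input_parameters section could look like the sketch below, using the standard library's configparser. The section name comes from the list above, but the key names (`search_phrase`, `categories`, `months`) and the helper names are hypothetical, since the actual keys in config.ini are not shown here.

```python
import configparser

# Sample content with assumed key names, for illustration only.
SAMPLE_CONFIG = """
[input_parameters]
search_phrase = climate change
categories = [Books,Fashion,Movies,Opinion,U.S.]
months = 2
"""


def parse_category_list(raw: str) -> list:
    """Parse a value like "[Books,Fashion]" into a list of names.

    The list is not valid JSON (items are unquoted), so the brackets
    are stripped and the rest is split on commas.
    """
    inner = raw.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


def load_inputs(config_text: str) -> dict:
    """Load the three user-configured variables from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["input_parameters"]
    return {
        "search_phrase": section.get("search_phrase", ""),
        "categories": parse_category_list(section.get("categories", "[]")),
        "months": section.getint("months", fallback=0),
    }


if __name__ == "__main__":
    print(load_inputs(SAMPLE_CONFIG))
```

An empty list (`[]`) yields no categories, which matches the requirement that none to any number of sections can be selected.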
Possible improvements:
- Generate an .exe file with a Tkinter interface to make the tool more user-friendly;
- Use Docker to ensure the application works in other environments;
- Improve exception handling in the code to avoid robot crashes.