rpa-project-nytimes

Automation that extracts information from news articles on the NY Times website.

Your challenge is to automate the process of extracting data from the news site. Link to the news site: www.nytimes.com

You must have 3 configured variables (you can save them in the configuration file, but it is better to put them in Robocorp Cloud Work Items):

search phrase
news category or section
number of months for which you need to receive news

Example of how this should work: 0 or 1 - only the current month, 2 - current and previous month, 3 - current and two previous months, and so on
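
The month rule above can be sketched as a small date calculation. This is only an illustration, not the code in src: the function name `start_of_period` is hypothetical, and it assumes the period always begins on the first day of the earliest included month.

```python
from datetime import date


def start_of_period(months: int, today: date) -> date:
    """Return the first day of the earliest month to include (hypothetical helper)."""
    # 0 and 1 both mean "current month only".
    months_back = max(months, 1) - 1
    # Count months on a single axis so that year rollovers are handled by divmod.
    total = today.year * 12 + (today.month - 1) - months_back
    year, month_index = divmod(total, 12)
    return date(year, month_index + 1, 1)


if __name__ == "__main__":
    for months in (0, 1, 2, 3):
        print(months, start_of_period(months, date.today()))
```

A news item would then fall within the period if its date is on or after the returned date.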

The main steps:

1. Open the site by following the link
2. Enter a phrase in the search field
3. On the result page, apply the following filters:
   - select a news category or section. Your automation should have the option to choose from none to any number of categories/sections. This should be specified via the config file and/or Robocorp Cloud Work Items
   - choose the latest (i.e., newest) news
4. Get the values: title, date, and description.
5. Store in an Excel file:
   - title
   - date
   - description (if available)
   - picture filename
   - count of search phrases in the title and description
   - True or False, depending on whether the title or description contains any amount of money. Possible formats: $11.1 | $111,111.11 | 11 dollars | 11 USD
6. Download the news picture and specify the file name in the Excel file
7. Follow steps 4-6 for all news that falls within the required time period
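
The last two Excel columns (the search phrase count and the money flag) can be sketched with plain string handling and one regular expression. This is a minimal sketch and not the regex stored in config.ini: the names `MONEY_PATTERN`, `count_phrase` and `contains_money` are hypothetical, and the count is assumed to be case-insensitive.

```python
import re

# Assumed pattern covering the four formats listed above:
# $11.1 | $111,111.11 | 11 dollars | 11 USD
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?"                          # $11.1, $111,111.11
    r"|\b\d[\d,]*(?:\.\d+)?\s?(?:dollars?|USD)\b",   # 11 dollars, 11 USD
    re.IGNORECASE,
)


def count_phrase(phrase: str, title: str, description: str = "") -> int:
    """Count case-insensitive occurrences of the phrase in title and description."""
    if not phrase.strip():
        return 0
    text = f"{title} {description}".lower()
    return text.count(phrase.lower())


def contains_money(title: str, description: str = "") -> bool:
    """Return True if the title or description mentions an amount of money."""
    return MONEY_PATTERN.search(f"{title} {description}") is not None


if __name__ == "__main__":
    title = "Climate fund raises $111,111.11"
    description = "Leaders debate climate policy."
    print(count_phrase("climate", title, description))
    print(contains_money(title, description))
```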

Project structure

The project is divided into three folders:

config: configuration files;
outputs: one folder per run, named after the current datetime so that each run is unique, containing the Excel file, a log file and the images folder with all downloaded images;
src: folder containing all scripts
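
As an illustration of the per-run output layout, the paths could be built as below. This is a sketch under assumptions: the timestamp format `%Y%m%d_%H%M%S` and the file names `news.xlsx` and `run.log` are hypothetical and may differ from what the scripts in src actually use.

```python
from datetime import datetime
from pathlib import Path


def build_run_paths(base: str, now: datetime) -> dict[str, Path]:
    """Return the paths for one run without touching the filesystem."""
    # Assumed naming scheme: one folder per run, named after the start time.
    run_dir = Path(base) / now.strftime("%Y%m%d_%H%M%S")
    return {
        "run": run_dir,
        "images": run_dir / "images",
        "excel": run_dir / "news.xlsx",  # hypothetical file name
        "log": run_dir / "run.log",      # hypothetical file name
    }


if __name__ == "__main__":
    paths = build_run_paths("outputs", datetime.now())
    # Creating the images folder also creates the run folder above it.
    paths["images"].mkdir(parents=True, exist_ok=True)
    print(paths["run"])
```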

Libraries

Main libraries used in this project (also available in requirements.txt):

openpyxl==3.1.2
pandas==2.0.2
selenium==4.9.1
tqdm==4.65.0
urllib3==2.0.2
webdriver-manager==3.8.6

Python version: 3.11

Configuration file

The config.ini file is located in the config folder and contains all the configuration parameters divided by section:

website_parameters: URLs, XPaths, ids and parameters related to the website http://www.nytimes.com;
input_parameters: the variables configured by the user:
- search phrase: words must be separated by spaces;
- news category or section: must be a list, e.g. [Books,Fashion,Movies,Opinion,U.S.];
- number of months for which you need to receive news: must be an integer
browser_parameters: the Chrome version;
general_parameters: folder paths, wait time after clicks and regex patterns
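
Reading the input_parameters section could look like the sketch below, which uses the standard library `configparser`. The section name comes from the description above, but the key names (`search_phrase`, `news_categories`, `number_of_months`) and the helpers `parse_list` and `load_inputs` are assumptions for illustration; check config.ini for the real keys.

```python
import configparser

# Hypothetical excerpt of config.ini; the real key names may differ.
SAMPLE = """
[input_parameters]
search_phrase = climate change
news_categories = [Books,Fashion,Movies,Opinion,U.S.]
number_of_months = 2
"""


def parse_list(raw: str) -> list[str]:
    """Turn '[Books,Fashion]' into ['Books', 'Fashion']; '[]' gives an empty list."""
    inner = raw.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


def load_inputs(text: str) -> dict:
    """Parse the user inputs from the text of an INI file."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["input_parameters"]
    return {
        "search_phrase": section["search_phrase"],
        "categories": parse_list(section.get("news_categories", "[]")),
        "months": section.getint("number_of_months"),
    }


if __name__ == "__main__":
    print(load_inputs(SAMPLE))
```

An empty list `[]` maps to no category filter, which matches the requirement to allow from none to any number of categories.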

Future improvements

Generate an .exe file with a Tkinter interface to make the tool more user-friendly;
Use Docker to ensure the application works in other environments;
Improve exception handling in the code to avoid robot crashes.
