lucaszanov/rpa-project-nytimes

rpa-project-nytimes

Automation that extracts information from news articles on the NY Times website.

Your challenge is to automate the process of extracting data from the news site. Link to the news site: www.nytimes.com

You must have 3 configured variables (you can save them in the configuration file, but it is better to store them in Robocorp Cloud Work Items):

  • search phrase

  • news category or section

  • number of months for which you need to receive news

    Example of how this should work: 0 or 1 - only the current month, 2 - current and previous month, 3 - current and two previous months, and so on
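The month-count rule above can be sketched as a small date calculation. This is only an illustration of the rule as stated, not the project's actual code; the function name `earliest_date` is hypothetical.

```python
from datetime import date


def earliest_date(months: int, today: date) -> date:
    """Return the first day of the oldest month covered by `months`.

    Hypothetical helper: 0 or 1 means the current month only,
    2 means the current and previous month, and so on.
    """
    # 0 and 1 both mean "current month only", so go back zero months.
    months_back = max(months, 1) - 1
    # Work in "months since year 0" so year boundaries need no special case.
    total = today.year * 12 + (today.month - 1) - months_back
    return date(total // 12, total % 12 + 1, 1)
```

An article would then be kept when its publication date is on or after `earliest_date(months, date.today())`.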

The main steps:

  1. Open the site by following the link

  2. Enter a phrase in the search field

  3. On the result page, apply the following filters:

    • select a news category or section

      Your automation should allow choosing anywhere from no categories/sections to any number of them. This should be specified via the config file and/or Robocorp Cloud Work Items

    • choose the latest (i.e., newest) news

  4. Get the values: title, date, and description.

  5. Store in an Excel file:

    • title
    • date
    • description (if available)
    • picture filename
    • count of search phrases in the title and description
    • True or False, depending on whether the title or description contains any amount of money

    Possible formats: $11.1 | $111,111.11 | 11 dollars | 11 USD

  6. Download the news picture and specify the file name in the Excel file

  7. Follow steps 4-6 for all news that falls within the required time period
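The derived columns from step 5 (phrase count, money flag, picture filename) can be sketched with the standard library alone. This is a minimal sketch under assumptions: the names `MONEY_RE`, `count_phrase`, `contains_money`, `picture_filename` and `build_row` are hypothetical, the regex covers only the four formats listed above, and the phrase count is assumed to be case-insensitive. The project keeps its own regex structures in the config file, which may differ.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Covers only the documented formats: $11.1 | $111,111.11 | 11 dollars | 11 USD
MONEY_RE = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?"                          # $11.1, $111,111.11
    r"|\b\d[\d,]*(?:\.\d+)?\s*(?:dollars|USD)\b",    # 11 dollars, 11 USD
    re.IGNORECASE,
)


def count_phrase(phrase: str, title: str, description: str) -> int:
    """Count case-insensitive occurrences of the phrase in title and description."""
    if not phrase:
        return 0
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return len(pattern.findall(title)) + len(pattern.findall(description or ""))


def contains_money(title: str, description: str) -> bool:
    """True if the title or description mentions an amount of money."""
    return bool(MONEY_RE.search(title) or MONEY_RE.search(description or ""))


def picture_filename(url: str) -> str:
    """Take the last path segment of the image URL, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).name


def build_row(title, date, description, picture_url, phrase):
    """Assemble one spreadsheet row as a dict keyed by column name."""
    return {
        "title": title,
        "date": date,
        "description": description or "",
        "picture_filename": picture_filename(picture_url),
        "phrase_count": count_phrase(phrase, title, description),
        "contains_money": contains_money(title, description),
    }
```

A list of such dicts can be passed to `pandas.DataFrame(rows).to_excel(path, index=False)` to produce the Excel file, with openpyxl as the writer engine.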

Project structure

The project is divided into three folders:

  • config: configuration files;
  • outputs: one folder per run, named after the current datetime so that each is unique, containing the Excel file, a log file and an images folder with all downloaded images;
  • src: all scripts.
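One way to create a datetime-named run folder like the one described for outputs is sketched below. The timestamp format and the function name `create_run_folder` are assumptions; the project may name its folders differently.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def create_run_folder(base: Path, now: Optional[datetime] = None) -> Path:
    """Create <base>/<timestamp>/images and return the run folder.

    Hypothetical helper: the "%Y%m%d_%H%M%S" format is an assumption.
    """
    now = now or datetime.now()
    run_dir = base / now.strftime("%Y%m%d_%H%M%S")
    # parents=True also creates the run folder itself and any missing parents.
    (run_dir / "images").mkdir(parents=True, exist_ok=True)
    return run_dir
```

The Excel file and log file would then be written directly under the returned folder, and downloaded pictures under its `images` subfolder.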

Libraries

Main libraries used in this project (also available in requirements.txt):

  • openpyxl==3.1.2
  • pandas==2.0.2
  • selenium==4.9.1
  • tqdm==4.65.0
  • urllib3==2.0.2
  • webdriver-manager==3.8.6

Python version: 3.11

Configuration file

The config.ini file is located in the config folder and contains all the configuration parameters, grouped by section:

  • website_parameters: URLs, XPaths, ids and other parameters related to the website http://www.nytimes.com;
  • input_parameters: the variables configured by the user:
    • search phrase: words must be separated by spaces;
    • news category or section: must be a list, e.g. [Books,Fashion,Movies,Opinion,U.S.];
    • number of months for which you need to receive news: must be an integer
  • browser_parameters: currently just the Chrome version;
  • general_parameters: folder paths, time to wait on clicks and regex structures
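Reading the input_parameters section could look roughly like the sketch below, using the standard `configparser` module. The option names `search_phrase`, `categories` and `months` are assumptions made for illustration; only the section name and the value formats come from the description above.

```python
import configparser

# Sample text standing in for config/config.ini; option names are hypothetical.
SAMPLE = """
[input_parameters]
search_phrase = climate change
categories = [Books,Fashion,Movies,Opinion,U.S.]
months = 2
"""


def parse_categories(raw: str) -> list:
    """Turn "[Books,Fashion]" into ["Books", "Fashion"]; "[]" gives an empty list."""
    inner = raw.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


def load_inputs(text: str) -> dict:
    """Parse the user-configured variables from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["input_parameters"]
    return {
        "search_phrase": section.get("search_phrase", fallback=""),
        "categories": parse_categories(section.get("categories", fallback="[]")),
        "months": section.getint("months", fallback=0),
    }
```

An empty list is accepted so that a run with no category filter remains possible, as the task requires.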

Future improvements

  • Package the robot as an .exe file with a Tkinter interface to make it more user-friendly;
  • Use Docker to ensure functional application in other environments;
  • Improve exception handling in code to avoid robot crashes.
