Skip to content

Performed data collection using R programming language by scrapping and crawling the require fields from a given website for any year user wants the data for. The data scrapped is stored in the text file as per the requirement.

Notifications You must be signed in to change notification settings

Chan2k20/Data-Collection-using-R

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Project Title: Data Collection using R

Description/Problem Statement: The project involved developing two scripts: Scrapping.R and After.R, for comprehensive data collection from the Mobile DNA journal. The Scrapping.R script utilizes various R packages such as stringr, xml2, XML, writexl, and rvest to retrieve article information, including title, authors, affiliations, publication date, abstract, keywords, and more, from Mobile DNA articles for a specified year. The collected data is then organized into a dataframe and exported to both a plain text file (Summary.txt) and an Excel file (summary.xlsx), excluding full-text due to character limitations.

Skills Utilized:

  • Data Scraping
  • R Programming
  • Data Manipulation
  • Package Management (stringr, xml2, XML, writexl, rvest)

Solution:

  • The Scrapping.R script first retrieves the URLs for all articles published in Mobile DNA for a specified year provided by the user.
  • It then crawls each URL to extract relevant information about each article, such as title, authors, affiliations, publication date, abstract, and keywords.
  • The collected data is structured into a dataframe and exported to a plain text file (Summary.txt) and an Excel file (summary.xlsx), omitting the full-text due to Excel's character limit.
  • To collect data for multiple years, the After.R script interacts with the user, prompting for an input year. It then iterates through all years from the input year to 2020, calling Scrapping.R for each year.
  • Upon completion, Summary.txt contains comprehensive information for all Mobile DNA articles published in and after the input year.

About

Performed data collection using R programming language by scrapping and crawling the require fields from a given website for any year user wants the data for. The data scrapped is stored in the text file as per the requirement.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages