
Scrapes information about studies from an input GEO search html page for accelerating meta-analysis.


bioinfo-hcpa/GEO_scraper

 
 


Installation

  • Download the program files: click on '<> Code' and then on 'Download ZIP'.
  • Place the 'GEO_scraper-main.zip' file in the installation directory of your choice.
  • Change into the installation directory and unzip the file (unzip GEO_scraper-main.zip).
  • Install the necessary packages: selenium, beautifulsoup4, requests and lxml (pip install selenium beautifulsoup4 requests lxml).
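
To confirm that the installation worked, a short check along these lines can be run. This is a minimal sketch and not part of GEO_scraper; the helper name `missing_packages` is hypothetical. Note that beautifulsoup4 is imported under the module name `bs4`.

```python
import importlib.util

# Map each pip package name to the module name used when importing it.
REQUIRED = {
    "selenium": "selenium",
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "lxml": "lxml",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```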

Input File Preparation

  • This program scrapes data from any list of studies returned by a Gene Expression Omnibus (GEO) search. The input consists of one or more html files with search results. Prepare the files as follows:
  • 1 - Run the GEO search (https://www.ncbi.nlm.nih.gov/geo/) with the appropriate keywords or GEO accessions and apply the desired filters.
  • 2 - If the search returns more than 20 results, click on '20 per page' and change the items per page as necessary.
  • 3 - Right-click on the page, select 'Save Page As' and save the page as an .html file.
  • 4 - If the search returns more than 500 results, save each additional results page under a different file name and use it alongside the first one.
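
Before running the scraper, it can be useful to confirm which saved pages will be picked up. The sketch below simply lists the .html files in a directory; the function name `find_input_pages` is hypothetical, and how GEO_scraper itself discovers its input files is not shown here.

```python
from pathlib import Path


def find_input_pages(input_dir):
    """Return the saved .html search-result pages in input_dir, sorted by name."""
    folder = Path(input_dir)
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".html"
    )


# Example usage (the path is the input directory named in this README):
# for page in find_input_pages("GEO_scraper-main/input"):
#     print(page.name)
```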

GEO_scraper Utilization

  • 1 - Add the GEO search results html files to the input directory (GEO_scraper-main/input).
  • 2 - Run the main.py file (python3 main.py).
  • 3 - A .csv file with basic information about these studies will be created in the output folder.
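
Once the .csv file exists, it can be loaded for further analysis with the standard library. The following is a minimal sketch that assumes a comma-delimited, UTF-8 encoded file with a header row; those details, the function name `load_studies` and the example path are assumptions rather than documented behaviour.

```python
import csv


def load_studies(csv_path):
    """Read the CSV output into a list of dicts keyed by the column headers."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Example usage (assumed location of the default output file):
# studies = load_studies("output/output.csv")
# print(len(studies), "studies loaded")
# print(list(studies[0].keys()))  # column headers as written by the scraper
```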

GEO_scraper Options

Currently, there are three options you can set when running the script: output name, verbose and quick-mode.

  • output name (-o): Use it to define the output name. Example: "-o My_data" will produce the My_data.csv output file. If not specified, the output file will be named 'output.csv'.
  • verbose (-v): Use it if you want the terminal to print the information as it is scraped. This doesn't alter the execution time or the final output.
  • quick-mode (-q): Use it if you want the program to run in quick mode, i.e., gather only basic information (it doesn't collect data on Tissue type, Cell type and Race/Ethnicity/Ancestry from the study samples). This considerably reduces the execution time but excludes relevant study information.
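
The options can be combined on the command line, for example `python3 main.py -o My_data -q`. To illustrate how flags like these behave, here is a sketch of one way they could be declared with `argparse`. It is not taken from main.py: only the short flags `-o`, `-v` and `-q` and the default name come from this README, while the long names `--output`, `--verbose` and `--quick` are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the three options described above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of GEO_scraper-style options")
    # Long option names below are hypothetical; only -o, -v and -q are documented.
    parser.add_argument("-o", "--output", default="output",
                        help="output name without extension (default: output)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print information while scraping")
    parser.add_argument("-q", "--quick", action="store_true",
                        help="gather only basic information")
    return parser


# Simulate the command line: python3 main.py -o My_data -q
args = build_parser().parse_args(["-o", "My_data", "-q"])
print(f"{args.output}.csv", args.verbose, args.quick)  # My_data.csv False True
```

Using `store_true` for `-v` and `-q` matches their description as switches that take no value, whereas `-o` expects a name and falls back to a default when omitted.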

This version retrieves the following information: Accession code, GEO link, Citation, Experiment type, Platform, Organisms, Number of samples, SRA availability, SRA link, Tissues, Cell Types, Cell Lines, Race/Ethnicity/Ancestry and Title.
