Skip to content

michael-pagan/BIOF-309-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

48 Commits
 
 
 
 
 
 
 
 

Repository files navigation

BIOF 309 Project - Michael Pagan & Maria Casal-Dominguez

Accessing PubMed To Analyze ENCODE Project Publications

Introduction:

The goal of The Encyclopedia Of DNA Elements (ENCODE) Project is to identify all of the elements in the human and mouse genomes and make this information available as a resource to the biomedical community. The ENCODE Project is a collaboration of research groups funded by the National Human Genome Research Institute (NHGRI) that was planned as a follow-up to the Human Genome Project after it's conclusion in 2003. In February 2017, ENCODE began it's fourth funding phase. A large project that has been around for 15 years, the ENCODE Project has produced a lot of data and hundreds of publications. NHGRI is interested in curating the Consortium's publication information in-house to track the Consortium's progress.

Objective:

It is important for both researchers and the public to be able to access the information from databases like PubMed, GEO, etc. Here, we developed a method to access and extract publication information from specified PMIDs in PubMed utilzing the Entrez package within the Biopython module. We then assessed trends such as the number of ENCODE publications per year, the number of ENCODE publications over time and the journal most published in.

Methods:

Using Entrez, define a function to extract PMID's publication information from PubMed. More information on Biopython's Entrez.eftech can be found here.

#Import modules
import pandas as pd
import numpy as np
from Bio import Entrez
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from collections import Counter 

#Register with Entrez
Entrez.email = "maria.casal.dominguez@gmail.com"

#Define a function to grab the desired attributes of the PMIDs from PubMed
def get_pubmed_data(pmids, attrb_list):
   
    """Returns full PubMed data records for desired PMIDs in XML format. PMIDs can be found online 
    in PubMed and can be accepted individually or as a list. Desired data from PMIDs ('attrb_list')
    can be viewed here https://github.com/michael-pagan/BIOF-309-Project/blob/master/PubMed.txt"""
        
    pubs_list = []
    pmid_number = len(pmids.index)
    iteration_number = 1

    while True:
        if (iteration_number*200 > pmid_number):
            upper_limit = pmid_number
        else:
            upper_limit = iteration_number*200

        #Convert the CSV file into a list
        pmids_list = pmids["PMID"][iteration_number*200-200:upper_limit].to_csv(index=False)

        xml_doc = Entrez.efetch(db='pubmed', id=pmids_list, retmode='xml', rettype='docsum')

        for pub in Entrez.parse(xml_doc):
            pub_attribs = []
            for attrib in attrb_list:
                pub_attribs.append(pub[attrib])
            pubs_list.append(pub_attribs)

        if (iteration_number*200 > pmid_number):
            break
        else:
            iteration_number += 1

    pd_docs = pd.DataFrame(pubs_list, columns=attrb_list)

    #Show the DataFrame
    display(pd_docs.head(10))
    print(pd_docs.shape)

    #Export to .csv
    pd_docs.to_csv("pd_docs.csv")
    
    return pd_docs

Use the get_pubmed_data function to create a DataFrame of the desired publication information

#Import the csv with the pmids as a DataFrame
pmids = pd.read_csv("publications.csv")
  
#List desired PMID attributes to get from PubMed
attrb_list = ["Id", "FullJournalName", "LastAuthor", "PubDate"]

#Call the function to grab the data
pd_docs = get_pubmed_data(pmids, attrb_list)
PMID Full Journal Name Last Author Publication Date
0 18665130 Nature genetics Gingeras TR 2008
1 17568007 Genome research Stamatoyannopoulos JA 2007
2 17166863 Nucleic acids research Kent WJ 2007
3 17567993 Genome research Gerstein MB 2007
4 17567995 Genome research Sidow A 2007
5 18258921 Genome research Liu XS 2008
6 17568011 Genome research Baxevanis AD 2007
7 19425134 Genome informatics. International Confer Tullius TD 2008
8 21439813 Current opinion in structural biology Tullius TD 2011
9 19286520 Science (New York, N.Y.) Margulies EH 2009
(691, 4)

Clean the DataFrame

#Clean the data
pd_docs = pd_docs.rename(index=str, columns={"Id": "PMID", "FullJournalName": "Full Journal Name", 
                            "LastAuthor": "Last Author", "PubDate": "Publication Date"})
pd_docs['Publication Date'] = pd_docs['Publication Date'].apply(lambda x: str(x)[:4])
pd_docs['Full Journal Name'] = pd_docs['Full Journal Name'].apply(lambda x: str(x)[:40])
pd_docs["Author: Journal"] = pd_docs["Last Author"] + ": " + pd_docs["Full Journal Name"]

display(pd_docs.head(10))
print(pd_docs.shape)
PMID Full Journal Name Last Author Publication Date Author: Journal
0 18665130 Nature genetics Gingeras TR 2008 Gingeras TR: Nature genetics
1 17568007 Genome research Stamatoyannopoulos JA 2007 Stamatoyannopoulos JA: Genome research
2 17166863 Nucleic acids research Kent WJ 2007 Kent WJ: Nucleic acids research
3 17567993 Genome research Gerstein MB 2007 Gerstein MB: Genome research
4 17567995 Genome research Sidow A 2007 Sidow A: Genome research
5 18258921 Genome research Liu XS 2008 Liu XS: Genome research
6 17568011 Genome research Baxevanis AD 2007 Baxevanis AD: Genome research
7 19425134 Genome informatics. International Confer Tullius TD 2008 Tullius TD: Genome informatics. International ...
8 21439813 Current opinion in structural biology Tullius TD 2011 Tullius TD: Current opinion in structural biology
9 19286520 Science (New York, N.Y.) Margulies EH 2009 Margulies EH: Science (New York, N.Y.)
(691,5)

Define a function to sort the data from most to least frequently appearing

#Define a function to sort desired data
def get_variable(dataFrame, variable):
    
    """Gets desired column data from DataFrame and sorts from most to least frequent. 
    Data are assigned to 'variable_results', data frequency is assigned to 'counts'."""
      
    #Create an empty list
    frequency_list = []
   
    #Fill the list with the frequency of the data
    for i in dataFrame[variable]:
        frequency_list.append(i)
         
    #Use Counter to count how many times a journal appears
    journal_count = Counter(frequency_list)

    #Sort the frequency data by most to least common
    sorted_journal_count = journal_count.most_common()

    #Create 2 new lists for the variables that you want
    global variable_results
    variable_results = []
    global counts
    counts = []
    
    #Fill variable_results and counts
    for i in sorted_journal_count:
        key, count = i
        variable_results.append(key)
        counts.append(count)

    return variable_results, counts

Create the graphs

#Authenticate plotly
plotly.tools.set_credentials_file(username='mpagan2', api_key='9oJfHnTtef1NWTokn4lI')

def hbargraph(xaxis, yaxis, title):

    """Using plotly, develops a horizontal bargraph with hover-over value labels and a custom title"""
    
    data = [go.Bar(
                x= xaxis,
                y= yaxis,
                orientation = 'h',
                marker = dict(
                    color = 'rgba(0, 158, 28, 0.6)',
                    line = dict(
                        color = 'rgba(0, 158, 28, 1.0)',
                    width = 3))
                )
           ]

    layout = go.Layout(
            title = title,
            autosize=False,
            width=1000,
            height=500,
            margin=go.Margin(
                l=300,
                r=50,
                b=100,
                t=100,
                pad=4
            ),
        )

    fig = go.Figure(data=data, layout=layout)
    graph = py.iplot(fig, filename=title)
    
    return graph

#Call ‘get_variable’ to obtain the desired data to fill the horizontal bar graph with
get_variable(pd_docs, "Full Journal Name")

#Call ‘hbargraph’ to create the plot
hbargraph(counts[:10], variable_results[:10], "Top 10 Journals ENCODE Authors Publish In")
Top 10 Journals ENCODE Authors Publish In

View in plotly

get_variable(pd_docs, "Author: Journal")
hbargraph(counts[:10], variable_results[:10], "Top 10 Journals Published In By Single ENCODE Author")
Top 10 Journals Published In By Single ENCODE Author

View in plotly

get_variable(pd_docs, "Last Author")
hbargraph(counts[:10], variable_results[:10], "Top 10 ENCODE Authors")
Top 10 ENCODE Authors

View in plotly

#Define function to make bar graph
def bargraph(xaxis, yaxis, bar_labels, title):

    """Using plotly, develops a bar graph with custom hover-over value label descriptions and a custom title"""
    
    trace0 = go.Bar(
        x = xaxis,
        y = yaxis,
        text = bar_labels,
        marker=dict(
            color='rgb(153, 153, 255)',
            line=dict(
                color='rgb(8,48,107)',
                width=1.5,
            )
        ),
        opacity=0.6
    )

    data = [trace0]
    layout = go.Layout(
        title= title,
        xaxis=dict(
            autotick=False,
            ticks='outside',
            tick0=0,
            dtick=1,
            ticklen=8,
            tickwidth=4,
            tickcolor='#000'
        ),
    )

    fig = go.Figure(data=data, layout=layout)
    graph = py.iplot(fig, filename=title)
    return graph

#Grab unique years from "PubDate" column of pd_docs, sort the years, and create "'Year' Publications" labels
get_year_labels = pd_docs["Publication Date"].unique()
sorted_year_labels = sorted(get_year_labels)
labels = [i + " Publications" for i in sorted_year_labels]

#Get x-axis years and sort chronologically
years = pd_docs["Publication Date"].unique().tolist()
sorted_years = sorted(years)

#Get y-axis values
pubs_by_year = pd_docs.groupby('Publication Date')['PMID'].nunique().tolist()

#Call the function
bargraph(sorted_years, pubs_by_year, labels, "ENCODE Publications By Year")
ENCODE Publications By Year

View in plotly

def linegraph(xaxis, yaxis, line_labels, title):

    
    """Using plotly, develops a line graph with custom hover-over value label descriptions and a custom title"""
    
    trace = go.Scatter(
        x = xaxis,
        y = yaxis,
        text = line_labels,
        marker=dict(
                color='rgb(153, 153, 255)'
        ),
    )

    line_layout = go.Layout(
            title= title,
            xaxis=dict(
                autotick=False,
                ticks='outside',
                tick0=0,
                dtick=1,
                ticklen=8,
                tickwidth=4,
                tickcolor='#000'
            ),
    )

    line = [trace]

    fig = go.Figure(data=line, layout=line_layout)
    graph = py.iplot(fig, filename='Publications Over Time')
    
    return graph

#Sum the publications sequentially by year
sum_pubs = np.cumsum(pubs_by_year)

#Grab unique years from "PubDate" column of pd_docs, sort the years, 
#and create "Total Number of Publications in 'Year'" labels
get_total_year_labels = pd_docs["Publication Date"].unique()
sorted_total_year_labels = sorted(get_total_year_labels)
total_labels = ["Total Number of Publications in " + i for i in sorted_total_year_labels]

#Call the line graph function
linegraph(sorted_years, sum_pubs, total_labels, "ENCODE Publications Over Time")
Publications Over Time - Line

View in plotly

Special Thanks

Thanks to Martin, Ben, and Michael for a great semester as we took our first stab at Python! We are grateful for this experience and look forward to becoming even better programmers with the foundation we gained here!

Releases

No releases published

Packages

No packages published