---
title: "pm566_hw3"
author: "Yiping Li"
output: github_document
date: "`r Sys.Date()`"
always_allow_html: true
---
```{r}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(rvest)
library(tidyverse)
library(httr)
library(stringr)
```
API
Question 1: Using the NCBI API, look for papers that show up under the term "sars-cov-2 trial vaccine." Look for the data in the PubMed database, and then retrieve the details of the papers as shown in lab 7. How many papers were you able to find?
```{r, wk7-lab}
# Downloading the website
website <- xml2::read_html("https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2+trial+vaccine")
# Finding the counts
counts <- xml2::xml_find_first(website, "/html/body/main/div[9]/div[2]/div[2]/div[1]/div[1]")
# Turning it into text
counts <- as.character(counts)
# Extracting the data using regex
stringr::str_extract(counts, "[0-9,]+")
#gives 4010
#Query & Get details about the articles
query_ids <- GET(
  url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  query = list(db = 'pubmed',
               term = 'sars-cov-2 trial vaccine',
               retmax = 5000)
)
# Extracting the content of the response of GET
ids <- httr::content(query_ids)
# Turn the result into a character vector
ids <- as.character(ids)
# Find all the ids
ids <- stringr::str_extract_all(ids,
                                "<Id>[[:digit:]]+</Id>")[[1]]
# Remove all the leading and trailing <Id> </Id>. Make use of "|"
ids <- stringr::str_remove_all(ids, "</?Id>")
head(ids)
length(ids)
```
A total of 4,010 papers were found on the PubMed website, while the esearch call to the PubMed API returned 1,083 paper IDs.
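As a cross-check (a minimal sketch, not part of the original lab), the esearch response itself reports the total number of hits in a `<Count>` tag, which can be compared against the number scraped from the website:

```{r, count-check, eval=FALSE}
# Ask esearch only for the hit count (retmax = 0 returns no IDs, just metadata)
count_resp <- GET(
  url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  query = list(db = 'pubmed', term = 'sars-cov-2 trial vaccine', retmax = 0)
)
# The first <Count> element in the XML holds the total number of matching papers
stringr::str_extract(as.character(httr::content(count_resp)), "(?<=<Count>)[0-9]+")
```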
Question 2: Using the list of PubMed IDs you retrieved, download each paper's details using the query parameter rettype = abstract. If you get more than 250 IDs, just keep the first 250.
```{r}
ids <- ids[1:250]
publications <- GET(
  url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
  query = list(
    db      = 'pubmed',
    id      = paste(ids, collapse = ','),
    retmax  = 5000,
    rettype = 'abstract'
  )
)
# Turning the output into character vector
publications <- httr::content(publications)
publications_txt <- as.character(publications)
```
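As a quick sanity check (an optional sketch, assuming the response is the usual PubMed XML in which each record is wrapped in a `<PubmedArticle>` element), the number of records returned can be compared with the number of IDs requested:

```{r, fetch-check, eval=FALSE}
# Should equal length(ids), i.e. 250, if every requested record came back
stringr::str_count(publications_txt, "<PubmedArticle>")
```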
Question 3: As we did in lab 7 (see Dr. Siegmund's lab 7), create a dataset containing the following:

- PubMed ID number,
- Title of the paper,
- Name of the journal where it was published,
- Publication date, and
- Abstract of the paper (if any).
```{r}
#We want to build a dataset which includes the title and the abstract of the paper. The title of all records is enclosed by the HTML tag ArticleTitle, and the abstract by Abstract.
#Before applying the functions to extract text directly, it will help to process the XML a bit. We will use the xml2::xml_children() function to keep one element per id. This way, if a paper is missing the abstract, or something else, we will be able to properly match PUBMED IDS with their corresponding records.
pub_char_list <- xml2::xml_children(publications)
pub_char_list <- sapply(pub_char_list, as.character)
#Now, extract the abstract and article title for each one of the elements of pub_char_list. You can either use sapply() as we just did, or simply take advantage of vectorization of stringr::str_extract
abstracts <- str_extract(pub_char_list, "<Abstract>[[:print:][:space:]]+</Abstract>")
abstracts <- str_remove_all(abstracts, "</?[[:alnum:]- =\"]+>")
abstracts <- str_replace_all(abstracts, "[[:space:]]+"," ")
#Now get the titles:
titles <- str_extract(pub_char_list, "<ArticleTitle>[[:print:][:space:]]+</ArticleTitle>")
titles <- str_remove_all(titles, "</?[[:alnum:]- =\"]+>")
#Now get the journal names:
journals <- str_extract(pub_char_list, "<Title>[[:print:][:space:]]+</Title>")
journals <- str_remove_all(journals, "</?[[:alnum:]- =\"]+>")
#Now get the dates:
dates <- str_extract(pub_char_list, "<PubDate>[[:print:][:space:]]+</PubDate>")
dates <- str_remove_all(dates, "</?[[:alnum:]]+>")
dates <- str_replace_all(dates, "[[:space:]]+"," ")
#Finally the dataset:
database <- data.frame(
  PubMedId = ids,
  Title    = titles,
  Journal  = journals,
  Date     = dates,
  Abstract = abstracts
)
knitr::kable(database[1:8,], caption = "Some papers about sars-cov-2 trial vaccine")
```
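Since not every record has an abstract, a quick check (a small optional sketch) of how many fields came back missing helps confirm the extraction matched records correctly:

```{r, missing-check, eval=FALSE}
# Number of NA values in each extracted column of the dataset
colSums(is.na(database[, c("Title", "Journal", "Date", "Abstract")]))
```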
Text Mining:
A new dataset has been added to the data science data repository https://github.com/USCbiostats/data-science-data/tree/master/03_pubmed. The dataset contains 3,241 abstracts from articles across 5 search terms. Your job is to analyse these abstracts to find interesting insights.
Question 1: Tokenize the abstracts and count the number of each token. Do you see anything interesting? Does removing stop words change what tokens appear as the most frequent? What are the 5 most common tokens for each search term after removing stop words?
```{r}
library(readr)
urlfile1 <- "https://raw.githubusercontent.com/USCbiostats/data-science-data/master/03_pubmed/pubmed.csv"
pub <- read_csv(url(urlfile1))
```
```{r, wk6-lab-tokenize-abstracts}
library(tidytext)
pub %>%
unnest_tokens(token, abstract) %>%
count(token, sort = TRUE) %>%
top_n(20, n) %>%
knitr::kable()
```
The 5 most common tokens are stop words: "the", "of", "and", "in", and "to".
```{r, remove-stop-words}
pub %>%
unnest_tokens(output = word, input = abstract) %>%
anti_join(stop_words, by = "word") %>%
count(word, sort=TRUE)%>%
top_n(20, n) %>%
knitr::kable()
```
Removing stop words changes which tokens appear as the most frequent. After removing stop words, the 5 most common disease-relevant tokens are "covid", "patients", "cancer", "prostate", and "disease".
```{r, five-most-frequent-tokens-per-term}
# Show the 5 most frequent tokens for each of the 5 search terms:
# covid, cystic fibrosis, meningitis, preeclampsia, prostate cancer
#"term" is one column from pub
pub %>%
filter(term == "covid") %>%
unnest_tokens(output = token, input = abstract) %>%
count(token, sort = TRUE) %>%
anti_join(stop_words, by = c("token" = "word")) %>%
top_n(5,n) %>%
knitr::kable(caption = "5 Most Frequent Tokens for covid")
pub %>%
filter(term == "cystic fibrosis") %>%
unnest_tokens(output = token, input = abstract) %>%
count(token, sort = TRUE) %>%
anti_join(stop_words, by = c("token" = "word")) %>%
top_n(5,n) %>%
knitr::kable(caption = "5 Most Frequent Tokens for cystic fibrosis")
pub %>%
filter(term == "meningitis ") %>%
unnest_tokens(output = token, input = abstract) %>%
count(token, sort = TRUE) %>%
anti_join(stop_words, by = c("token" = "word")) %>%
top_n(5,n) %>%
knitr::kable(caption = "5 Most Frequent Tokens for meningitis")
pub %>%
filter(term == "preeclampsia") %>%
unnest_tokens(output = token, input = abstract) %>%
count(token, sort = TRUE) %>%
anti_join(stop_words, by = c("token" = "word")) %>%
top_n(5,n) %>%
knitr::kable(caption = "5 Most Frequent Tokens for preeclampsia")
pub %>%
filter(term == "prostate cancer") %>%
unnest_tokens(output = token, input = abstract) %>%
count(token, sort = TRUE) %>%
anti_join(stop_words, by = c("token" = "word")) %>%
top_n(5,n) %>%
knitr::kable(caption = "5 Most Frequent Tokens for prostate cancer")
```
```{r, five-most-frequent-tokens-per-term-alternative}
# The other method to view the 5 most common tokens for each search term
pub %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words, by = "word") %>%
group_by(term) %>%
count(word, term, sort = TRUE) %>%
top_n(5, n) %>%
arrange(term, desc(n)) %>%
knitr::kable()
```
Question 2: Tokenize the abstracts into bigrams. Find the 10 most common bigrams and visualize them with ggplot2.
```{r}
library(ggplot2)
pub %>%
unnest_ngrams(bigram, abstract, n=2) %>%
count(bigram, sort = TRUE) %>%
top_n(10, n) %>%
ggplot(aes(n, fct_reorder(bigram,n))) +
geom_col()
```
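Many of the top bigrams are likely pairs of stop words ("of the", "in the", and so on). As an optional extension (a sketch, not required by the question), dropping bigrams in which either word is a stop word makes the remaining pairs more informative:

```{r, bigrams-no-stopwords, eval=FALSE}
pub %>%
  unnest_ngrams(bigram, abstract, n = 2) %>%
  # Split each bigram into its two words so stop words can be filtered out
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  top_n(10, n) %>%
  ggplot(aes(n, fct_reorder(bigram, n))) +
  geom_col() +
  labs(x = "Count", y = "Bigram")
```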
Question 3: Calculate the TF-IDF value for each word-search term combination (here you want the search term to be the "document"). What are the 5 tokens from each search term with the highest TF-IDF value? How are the results different from the answers you got in Question 1?
```{r, wk6-slide}
pub %>%
unnest_tokens(word, abstract) %>%
count(word, term) %>%
bind_tf_idf(word, term, n) %>%
group_by(term) %>%
arrange(term, desc(tf_idf)) %>% # sort by term, then by descending tf-idf within each term
top_n(5, tf_idf) %>%
select(term, word, n, tf, idf, tf_idf) %>%
knitr::kable()
```
In Question 1, the most frequent non-stop-word tokens for the search terms covid, cystic fibrosis, meningitis, preeclampsia, and prostate cancer were covid, fibrosis, patients, pre, and cancer, respectively. In Question 3, the tokens with the highest TF-IDF values for those same terms are covid, cf, meningitis, eclampsia, and prostate, respectively. The results differ because TF-IDF down-weights words that appear in abstracts from all five search terms (such as "patients"), so the top-ranked tokens become the ones that are distinctive to each term.
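To make this comparison concrete, a small optional sketch pulls out just the single highest-TF-IDF token for each search term:

```{r, top-tfidf-token-per-term, eval=FALSE}
pub %>%
  unnest_tokens(word, abstract) %>%
  count(word, term) %>%
  bind_tf_idf(word, term, n) %>%
  group_by(term) %>%
  slice_max(tf_idf, n = 1) %>% # keep the top token per search term
  select(term, word, tf_idf) %>%
  knitr::kable()
```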