<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>François de Ryckel</title>
    <link>/</link>
      <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
    <description>François de Ryckel</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sun, 07 Jun 2020 00:00:00 +0000</lastBuildDate>
    <image>
      <url>/images/icon_hu0b7a4cb9992c9ac0e91bd28ffd38dd00_9727_512x512_fill_lanczos_center_2.png</url>
      <title>François de Ryckel</title>
      <link>/</link>
    </image>
    
    <item>
      <title>Example Page 1</title>
      <link>/courses/example/example1/</link>
      <pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
      <guid>/courses/example/example1/</guid>
      <description>&lt;p&gt;In this tutorial, I&amp;rsquo;ll share my top 10 tips for getting started with Academic:&lt;/p&gt;
&lt;h2 id=&#34;tip-1&#34;&gt;Tip 1&lt;/h2&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.&lt;/p&gt;
&lt;p&gt;Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.&lt;/p&gt;
&lt;p&gt;Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.&lt;/p&gt;
&lt;p&gt;Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.&lt;/p&gt;
&lt;p&gt;Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.&lt;/p&gt;
&lt;h2 id=&#34;tip-2&#34;&gt;Tip 2&lt;/h2&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.&lt;/p&gt;
&lt;p&gt;Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.&lt;/p&gt;
&lt;p&gt;Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.&lt;/p&gt;
&lt;p&gt;Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.&lt;/p&gt;
&lt;p&gt;Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Example Page 2</title>
      <link>/courses/example/example2/</link>
      <pubDate>Sun, 05 May 2019 00:00:00 +0100</pubDate>
      <guid>/courses/example/example2/</guid>
      <description>&lt;p&gt;Here are some more tips for getting started with Academic:&lt;/p&gt;
&lt;h2 id=&#34;tip-3&#34;&gt;Tip 3&lt;/h2&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.&lt;/p&gt;
&lt;p&gt;Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.&lt;/p&gt;
&lt;p&gt;Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.&lt;/p&gt;
&lt;p&gt;Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.&lt;/p&gt;
&lt;p&gt;Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.&lt;/p&gt;
&lt;h2 id=&#34;tip-4&#34;&gt;Tip 4&lt;/h2&gt;
&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.&lt;/p&gt;
&lt;p&gt;Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.&lt;/p&gt;
&lt;p&gt;Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.&lt;/p&gt;
&lt;p&gt;Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.&lt;/p&gt;
&lt;p&gt;Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Disaster Tweets - Part III</title>
      <link>/post/disaster-tweets-part-iii/</link>
      <pubDate>Sun, 07 Jun 2020 00:00:00 +0000</pubDate>
      <guid>/post/disaster-tweets-part-iii/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)      # to read and write (import / export) any type into our R console.
library(dplyr)      # for pretty much all our data wrangling
library(ggplot2)
library(stringr)
library(forcats)
library(purrr)

library(janitor)    # to clean variable names with clean_names()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;using-glove-embedding&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Using GloVe embeddings&lt;/h1&gt;
&lt;p&gt;GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;GloVe encodes the ratios of word-word co-occurrence probabilities, which is thought to represent some crude form of meaning associated with the abstract concept of the word, as vector difference. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.&lt;/p&gt;
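&lt;p&gt;To make that objective a bit more concrete, here is a rough sketch of one term of the GloVe loss for a single word pair, as we understand it from the GloVe paper: a weighted squared difference between the dot product (plus two bias terms) and the log of the co-occurrence count. The function names glove_weight() and glove_pair_loss(), the toy vectors and the count are all made up for illustration; the defaults x_max = 100 and alpha = 0.75 are the values suggested in the paper.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# weighting function: rare pairs count less, very frequent pairs are capped at 1
glove_weight &amp;lt;- function(x, x_max = 100, alpha = 0.75) {
  ifelse(x &amp;lt; x_max, (x / x_max)^alpha, 1)
}

# one term of the loss: weight * (w_i . w_j + b_i + b_j - log(x_ij))^2
glove_pair_loss &amp;lt;- function(w_i, w_j, b_i, b_j, x_ij) {
  glove_weight(x_ij) * (sum(w_i * w_j) + b_i + b_j - log(x_ij))^2
}

# hypothetical 3d word vectors and a made-up co-occurrence count of 20
w_ice   &amp;lt;- c(0.5, -0.2, 0.1)
w_solid &amp;lt;- c(0.4, 0.3, -0.1)
glove_pair_loss(w_ice, w_solid, b_i = 0.1, b_j = 0.2, x_ij = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Training adjusts the vectors and biases so that the sum of these terms over all word pairs gets as small as possible. We do not train anything ourselves here; we only reuse pre-trained vectors.&lt;/p&gt;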
&lt;p&gt;The simple workflow for vectorizing tweet text into GloVe embeddings, taken from Aditya Mangal’s blog (&lt;a href=&#34;https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/&#34; class=&#34;uri&#34;&gt;https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/&lt;/a&gt;), is as follows:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Tokenize incoming tweet texts in the training data.&lt;/li&gt;
&lt;li&gt;Download and parse the GloVe embeddings into an embedding matrix for the tokenized words.&lt;/li&gt;
&lt;li&gt;Generate the embedding vectors for the tweet text in the training data.&lt;/li&gt;
&lt;li&gt;Generate the embedding vectors for the tweet text in the test data.&lt;/li&gt;
&lt;li&gt;Append them to the given tweet features and export.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We will not stem or lemmatize the tweets at first; this keeps most of the meaning of the words used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clean_tweets &amp;lt;- function(df){
  df &amp;lt;- df  %&amp;gt;% 
    mutate(number_hashtag = str_count(string = text, pattern = &amp;quot;#&amp;quot;), 
           number_number = str_count(string = text, pattern = &amp;quot;[0-9]&amp;quot;) %&amp;gt;% as.numeric(), 
           number_http = str_count(string = text, pattern = &amp;quot;http&amp;quot;) %&amp;gt;% as.numeric(), 
           number_mention = str_count(string = text, pattern = &amp;quot;@&amp;quot;) %&amp;gt;% as.numeric(), 
           number_location = if_else(!is.na(location), 1, 0), 
           number_keyword = if_else(!is.na(keyword), 1, 0), 
           number_repeated_char = str_count(string = text, pattern = &amp;quot;([a-z])\\1{2}&amp;quot;) %&amp;gt;% as.numeric(),  
           text = str_replace_all(string = text, pattern = &amp;quot;http[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;@[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           number_char = nchar(text),   #add the length of the tweet in character. 
           number_word = str_count(string = text, pattern = &amp;quot;\\w+&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;[0-9]&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = future_map(text, function(.x) stringi::stri_trans_general(.x, &amp;quot;Latin-ASCII&amp;quot;)) %&amp;gt;% unlist(.), 
           text = str_replace_all(string = text, pattern  = &amp;quot;\u0089&amp;quot;, replacement = &amp;quot;&amp;quot;)) %&amp;gt;% 
  select(-keyword, -location) 
  return(df)
}

library(furrr)
plan(&amp;quot;multicore&amp;quot;)
df_train &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/train.csv&amp;quot;) %&amp;gt;% clean_tweets()

# sorting out the same tweets, different target issues 
temp &amp;lt;- df_train %&amp;gt;% group_by(text) %&amp;gt;% 
  mutate(mean_target = mean(target), 
         new_target = if_else(mean_target &amp;gt; 0.5, 1, 0)) %&amp;gt;% ungroup() %&amp;gt;% 
  mutate(target = new_target, 
         target_bin = factor(if_else(target == 1, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))) %&amp;gt;% 
  select(-new_target, -mean_target, -target)

df_train &amp;lt;- temp&lt;/code&gt;&lt;/pre&gt;
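&lt;p&gt;To see what the duplicate handling above does, here is a small sketch on a made-up tibble (the toy tweets and the names toy and toy_resolved are ours, just for illustration). Identical texts get the majority label, and because the test is a strict “&amp;gt; 0.5”, a perfect tie falls back to 0.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

# made-up data: the same text labelled differently across rows
toy &amp;lt;- tibble(text   = c(&amp;quot;fire downtown&amp;quot;, &amp;quot;fire downtown&amp;quot;, &amp;quot;fire downtown&amp;quot;, &amp;quot;what a blast&amp;quot;, &amp;quot;what a blast&amp;quot;), 
              target = c(1, 1, 0, 0, 1))

toy_resolved &amp;lt;- toy %&amp;gt;% 
  group_by(text) %&amp;gt;% 
  mutate(target = if_else(mean(target) &amp;gt; 0.5, 1, 0)) %&amp;gt;%   # majority vote within each text
  ungroup()

toy_resolved$target   # 1 1 1 0 0&lt;/code&gt;&lt;/pre&gt;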
&lt;p&gt;We use keras’ text_tokenizer() to tokenize the text in the tweets dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(keras)

# we assign each word in the whole tweets df corpus an ID 
tokenizer &amp;lt;- text_tokenizer() %&amp;gt;% fit_text_tokenizer(df_train$text)

# if we want to check how many different words were in the corpus. 
# we add 1 because index 0 is reserved for padding (word indices start at 1). 
num_words &amp;lt;- length(tokenizer$word_index) + 1

# Using the fitted tokenizer above, we now convert all the text into actual sequences of indices.
sequences &amp;lt;- texts_to_sequences(tokenizer, df_train$text)

## how long is the longest tweet?  32 words! We can use that as the base for padding. 
summary(map_int(sequences, length))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   13.00   13.64   18.00   32.00&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;max_tweet_length &amp;lt;- max(map_int(sequences, length))

# now, we need to pad all the other tweets to a length of 32. 
# by default the padding (zeros) comes first, then the text. 
padded_sequences &amp;lt;- pad_sequences(sequences = sequences, maxlen = max_tweet_length)

# checking that we do have a 7613 tweets x 32 columns matrix. 
dim(padded_sequences) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 7613   32&lt;/code&gt;&lt;/pre&gt;
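&lt;p&gt;For intuition, the pre-padding can be mimicked in a few lines of base R. This is only a sketch of the default behaviour as we understand it (zeros in front, and sequences that are too long lose their first tokens); pad_pre() and the toy sequences are hypothetical, not part of keras.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# hypothetical helper that mimics pre-padding with zeros
pad_pre &amp;lt;- function(seqs, maxlen) {
  t(vapply(seqs, function(s) {
    s &amp;lt;- tail(s, maxlen)                  # keep only the last maxlen tokens
    c(rep(0L, maxlen - length(s)), s)     # zeros first, then the tokens
  }, integer(maxlen)))
}

# three made-up tweets, already converted to word indices
toy_sequences &amp;lt;- list(c(5L, 2L, 9L), c(7L), c(3L, 3L, 8L, 1L))
toy_padded &amp;lt;- pad_pre(toy_sequences, maxlen = 4)
toy_padded&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each row is one tweet and every row has the same length, which is what the embedding layer later on needs.&lt;/p&gt;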
&lt;p&gt;Let’s have a look at the first 5 tweets, their conversion into indices and their final padded form.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the first 5 tweets in words
df_train$text[1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all&amp;quot;                                                                
## [2] &amp;quot;Forest fire near La Ronge Sask. Canada&amp;quot;                                                                                               
## [3] &amp;quot;All residents asked to &amp;#39;shelter in place&amp;#39; are being notified by officers. No other evacuation or shelter in place orders are expected&amp;quot;
## [4] &amp;quot;, people receive #wildfires evacuation orders in California&amp;quot;                                                                          
## [5] &amp;quot;Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# the first 5 tweets in indices
sequences[1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [[1]]
##  [1]  113 4389   20    1  830    5   18  247  135 1562 4390   84   36
## 
## [[2]]
## [1]  184   42  215  764 6440 6441 1354
## 
## [[3]]
##  [1]   36 1690 1563    4 6442    3 6443   20  128 6444   17 1691   35  419  241
## [16]   53 2085    3  686 1355   20 1070
## 
## [[4]]
## [1]   58 4391 1447  241 1355    3   91
## 
## [[5]]
##  [1]   30   92 1182   18  312   19 6445 2356   26  256   19 1447 6446   66    2
## [16]  179&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# And the first 5 tweets with padding 
padded_sequences[1:5, ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
## [2,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
## [3,]    0    0    0    0    0    0    0    0    0     0    36  1690  1563     4
## [4,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
## [5,]    0    0    0    0    0    0    0    0    0     0     0     0     0     0
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,]     0     0     0     0     0   113  4389    20     1   830     5    18
## [2,]     0     0     0     0     0     0     0     0     0     0     0   184
## [3,]  6442     3  6443    20   128  6444    17  1691    35   419   241    53
## [4,]     0     0     0     0     0     0     0     0     0     0     0    58
## [5,]     0     0    30    92  1182    18   312    19  6445  2356    26   256
##      [,27] [,28] [,29] [,30] [,31] [,32]
## [1,]   247   135  1562  4390    84    36
## [2,]    42   215   764  6440  6441  1354
## [3,]  2085     3   686  1355    20  1070
## [4,]  4391  1447   241  1355     3    91
## [5,]    19  1447  6446    66     2   179&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A total of 15092 unique words were assigned an index in the tokenization (num_words is 15093 once the reserved index 0 is added).&lt;/p&gt;
&lt;p&gt;We borrow the code from Aditya Mangal’s blog &lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;, taken from his deepSentimentR package, for parsing the GloVe embeddings and generating the embedding matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;parse_glove_embeddings &amp;lt;- function(file_path) {
  lines &amp;lt;- readLines(file_path)
  embeddings_index &amp;lt;- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_along(lines)) {
    line &amp;lt;- lines[[i]]
    values &amp;lt;- strsplit(line, &amp;quot; &amp;quot;)[[1]]
    word &amp;lt;- values[[1]]
    embeddings_index[[word]] &amp;lt;- as.double(values[-1])
  }
  cat(&amp;quot;Found&amp;quot;, length(embeddings_index), &amp;quot;word vectors.\n&amp;quot;)
  return(embeddings_index)
}

generate_embedding_matrix &amp;lt;- function(word_index, embedding_dim, max_words, glove_file_path) {
  embeddings_index &amp;lt;- parse_glove_embeddings(glove_file_path)

  embedding_matrix &amp;lt;- array(0, c(max_words, embedding_dim))
  for (word in names(word_index)) {
    index &amp;lt;- word_index[[word]]
    if (index &amp;lt; max_words) {
      embedding_vector &amp;lt;- embeddings_index[[word]]
      if (!is.null(embedding_vector)) {
        embedding_matrix[index+1,] &amp;lt;- embedding_vector
      }
    }
  }

  return(embedding_matrix)
}&lt;/code&gt;&lt;/pre&gt;
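&lt;p&gt;To see the logic of these two functions without downloading anything, here is a self-contained sketch on three made-up lines in the GloVe text format (a word followed by its vector). Everything prefixed with toy_ is hypothetical. Note the index + 1: row 1 of the matrix stays at zero because it belongs to the padding index 0, and words without a GloVe vector (here “tsunami”) also stay at zero.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# three made-up lines in the GloVe text format
toy_lines &amp;lt;- c(&amp;quot;fire 0.1 0.2 0.3&amp;quot;, &amp;quot;flood -0.4 0.5 0.6&amp;quot;, &amp;quot;the 0.0 0.1 0.0&amp;quot;)

# parse: first field is the word, the rest is the vector
toy_parts &amp;lt;- strsplit(toy_lines, &amp;quot; &amp;quot;, fixed = TRUE)
toy_embeddings &amp;lt;- lapply(toy_parts, function(p) as.double(p[-1]))
names(toy_embeddings) &amp;lt;- vapply(toy_parts, `[[`, character(1), 1)

# a made-up word index, shaped like the one a fitted tokenizer returns
toy_word_index &amp;lt;- list(the = 1, fire = 2, tsunami = 3)

# one row per index, plus one row for the padding index 0
toy_embedding_matrix &amp;lt;- matrix(0, nrow = length(toy_word_index) + 1, ncol = 3)
for (word in names(toy_word_index)) {
  vec &amp;lt;- toy_embeddings[[word]]
  if (!is.null(vec)) {
    toy_embedding_matrix[toy_word_index[[word]] + 1, ] &amp;lt;- vec
  }
}
toy_embedding_matrix&lt;/code&gt;&lt;/pre&gt;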
&lt;p&gt;The GloVe project provides word vectors pre-trained on a Twitter corpus of 2B tweets (27B tokens). They come in 25d, 50d, 100d or 200d.&lt;/p&gt;
&lt;p&gt;We’ll try different variants and adjust based on our results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# To pick the length of each word vectors 
embedding_dim &amp;lt;- 25
#embedding_dim &amp;lt;- 50

# this operation is the crux of the whole numerization of our text. 
# we basically assign a word vector to each word. We start with 25d dense vectors.  
embedding_matrix &amp;lt;- generate_embedding_matrix(tokenizer$word_index, embedding_dim = embedding_dim, max_words = num_words, 
                                             &amp;quot;~/glove/glove.twitter.27B.25d.txt&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Found 1193514 word vectors.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#embedding_matrix &amp;lt;- generate_embedding_matrix(tokenizer$word_index, embedding_dim = 50, max_words = num_words, 
 #                                            &amp;quot;data_glove.twitter.27B/glove.twitter.27B.50d.txt&amp;quot;)

# there are 15,092 different words in all the tweets, plus the reserved index 0.  We have changed each of these words into a 25d vector. 
# so now we should have a matrix of dimension 15093 by 25
dim(embedding_matrix)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 15093    25&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Let&amp;#39;s save that precious matrix for further use
#write_rds(x = embedding_matrix, path = &amp;quot;data/embedding_matrix_50d.rds&amp;quot;)
write_rds(x = embedding_matrix, path = &amp;quot;~/disaster_tweets/data/embedding_matrix_25d.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We now use the Keras modeling framework to generate embeddings for the training data. We basically create a simple sequential model with one embedding layer, whose weights we set from the embedding matrix created above and then freeze, and a flattening layer that flattens the output into a 2D matrix of dimensions (7613, 32x25) for 25d word vectors, or (7613, 32x50) for 50d word vectors.&lt;/p&gt;
&lt;p&gt;Remember that the longest tweet had 32 words. Each word is a 25d vector here, so we want to end up with a matrix of 7613 x 800 (32x25); with 50d vectors it would be 7613 x 1600. For many tweets, the row is going to start with a bunch of zeros because of the padding. Remember that the padding is at the start in our case.&lt;/p&gt;
&lt;p&gt;So now we need to apply that embedding to each of the 7613 tweets. Keras will do that for us.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;embedding_matrix &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/embedding_matrix_25d.rds&amp;quot;)
#embedding_matrix &amp;lt;- read_rds(&amp;quot;data/embedding_matrix_50d.rds&amp;quot;)

model_embedding &amp;lt;- keras_model_sequential() %&amp;gt;% 
  layer_embedding(input_dim = num_words, #number of total words in all of the tweets  
                  output_dim = embedding_dim, #the length of our embedding vectors (25d in this case)
                  input_length = max_tweet_length, #the number of words of the longest tweet.  All other tweets will be padded to have that length
                  name = &amp;quot;embedding&amp;quot;) %&amp;gt;% 
  layer_flatten(name = &amp;quot;flatten&amp;quot;)

model_embedding %&amp;gt;% 
  get_layer(name = &amp;quot;embedding&amp;quot;) %&amp;gt;% 
  set_weights(list(embedding_matrix)) %&amp;gt;% 
  freeze_weights()

tweets_embedding &amp;lt;- model_embedding %&amp;gt;% predict(padded_sequences)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, let’s make sense of what is happening. Each tweet is now 800 variables long (32 words x 25d). The first tweet was: “Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all”. This tweet is 13 words long, so the last 325 variables (13 x 25) should be filled, while the first 475 should be 0s. Let’s check that.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(tweets_embedding)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  num [1:7613, 1:800] 0 0 0 0 0 0 0 0 0 0 ...&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# and part of the first tweet. 
tweets_embedding[1, 450:500]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
##  [8]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## [15]  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## [22]  0.000000  0.000000  0.000000  0.000000  0.000000 -0.420470  0.565260
## [29] -0.033577  0.310190  0.189300 -0.645880  1.387600 -0.574840 -0.138960
## [36] -0.390030 -0.169110 -0.073094 -5.702100  0.812640 -0.412840 -0.438670
## [43]  0.361850 -0.344710  0.146530  0.076999 -1.275600 -0.631900 -0.635160
## [50] -0.517290 -0.901670&lt;/code&gt;&lt;/pre&gt;
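&lt;p&gt;What the embedding and flatten layers did can also be written by hand, which may help to read the output above. The sketch below is only our mental model of the two layers on made-up numbers (toy_emb, toy_idx and embed_and_flatten() are hypothetical): each index is replaced by its row in the embedding matrix, and the rows are then laid out one after the other, word by word.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# made-up 2d embedding matrix: row 1 belongs to the padding index 0
toy_emb &amp;lt;- rbind(c(0, 0), c(1, 2), c(3, 4), c(5, 6))

# two made-up padded tweets of length 4
toy_idx &amp;lt;- rbind(c(0L, 0L, 2L, 1L), 
                 c(0L, 3L, 1L, 1L))

# look up each index (index + 1 = row number), then flatten word by word
embed_and_flatten &amp;lt;- function(padded, emb) {
  t(apply(padded, 1, function(idx) as.vector(t(emb[idx + 1, , drop = FALSE]))))
}

toy_flat &amp;lt;- embed_and_flatten(toy_idx, toy_emb)
dim(toy_flat)     # 2 tweets x (4 positions x 2 dimensions)
toy_flat[1, ]     # 0 0 0 0 3 4 1 2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The zeros at the start of the first row are the two padded positions, exactly like the 475 leading zeros of our first real tweet.&lt;/p&gt;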
&lt;p&gt;We can now add this matrix to our initial df.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_train_glove &amp;lt;- bind_cols(df_train, as_tibble(tweets_embedding, .name_repair = &amp;quot;unique&amp;quot;) %&amp;gt;% clean_names()) %&amp;gt;% 
  clean_names()

# and let&amp;#39;s save all this hard work! 
write_rds(x = df_train_glove, path = &amp;quot;~/disaster_tweets/data/train_glove_25d.rds&amp;quot;)
#write_rds(x = df_train_glove, path = &amp;quot;data/train_glove_50d.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before we go on and model, we still need to process our test data.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://nlp.stanford.edu/projects/glove/&#34; class=&#34;uri&#34;&gt;https://nlp.stanford.edu/projects/glove/&lt;/a&gt;&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/&#34; class=&#34;uri&#34;&gt;https://www.adityamangal.com/2020/02/nlp-with-disaster-tweets-part-1/&lt;/a&gt;&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Disaster Tweets - Part II</title>
      <link>/post/disaster-tweets-part-ii/</link>
      <pubDate>Tue, 26 May 2020 00:00:00 +0000</pubDate>
      <guid>/post/disaster-tweets-part-ii/</guid>
      <description>


&lt;p&gt;In the second part of this NLP task, we will use Singular Value Decomposition (SVD) to transform a sparse matrix (the Document Term Matrix, or dtm) into a dense one. This is still very much a BOW approach. Combined with xgboost, it gave us our best results without using word-embedding (or word-vector) techniques. That said, we are not sure how this approach would work in production, as it seems we would have to constantly regenerate the dense matrix (which is quite computationally intensive). We would love to hear from others on how to use SVD in this type of task.&lt;/p&gt;
&lt;p&gt;In a sense, SVD can be seen as a dimensionality reduction technique: going from a very wide sparse matrix (as many columns as there are different words in all the tweets) to a dense one.&lt;/p&gt;
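&lt;p&gt;To make the idea concrete, here is a minimal sketch on a made-up matrix (4 documents, 6 terms), using base R’s svd() rather than irlba. The toy values and the choice of k = 2 are assumptions for illustration only; the real matrix is built later in this post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy document-term matrix: 4 documents (rows) x 6 terms (columns)
toy &amp;lt;- matrix(c(1, 1, 0, 0, 0, 0,
                1, 0, 1, 0, 0, 0,
                0, 0, 0, 1, 1, 0,
                0, 0, 0, 0, 1, 1), nrow = 4, byrow = TRUE)

# decompose the transposed matrix (terms x documents), so that
# each row of v describes one document
s &amp;lt;- svd(t(toy))

# keep only the first k right singular vectors: 4 documents x 2 dense columns
k &amp;lt;- 2
dense &amp;lt;- s$v[, 1:k]
dim(dense)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each document is now described by k numbers instead of one column per word.&lt;/p&gt;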
&lt;p&gt;So let’s first build that sparse matrix: on the rows, the document number (in this case the tweet ID); on the columns, the words (1 word per column).&lt;/p&gt;
&lt;p&gt;Because the dimensionality reduction is based on the words, we need to use the whole dataset (training and test sets together) for this task. Of course, this is not really realistic when new, unseen tweets have to be classified.&lt;/p&gt;
&lt;p&gt;Also, since we have already developed a whole cleaning workflow, let’s re-use it on the whole df.&lt;/p&gt;
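&lt;p&gt;On the concern about new tweets: one classical option from latent semantic analysis is to project (“fold in”) a new document onto the term vectors that were already computed, instead of redoing the whole decomposition. This is not what this post does; the sketch below only illustrates the idea on a made-up matrix with base R’s svd(), and fold_in() is a hypothetical helper name. In a real setting, the new tweet would also have to be weighted with the idf values of the original corpus, and words never seen before would simply be ignored.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# toy document-term matrix: 4 documents (rows) x 6 terms (columns)
toy &amp;lt;- matrix(c(1, 1, 0, 0, 0, 0,
                1, 0, 1, 0, 0, 0,
                0, 0, 0, 1, 1, 0,
                0, 0, 0, 0, 1, 1), nrow = 4, byrow = TRUE)

s &amp;lt;- svd(t(toy))          # terms x documents, as in the rest of this post
k &amp;lt;- 2
U_k &amp;lt;- s$u[, 1:k]         # term vectors
d_k &amp;lt;- s$d[1:k]           # singular values

# hypothetical helper: project a term vector onto the existing k dimensions
fold_in &amp;lt;- function(term_vector) {
  drop(crossprod(U_k, term_vector)) / d_k
}

# sanity check: folding in an existing document gives back its row of v
all.equal(fold_in(toy[3, ]), s$v[3, 1:k])

# a new, unseen document that uses terms 4 and 6
new_doc &amp;lt;- c(0, 0, 0, 1, 0, 1)
fold_in(new_doc)&lt;/code&gt;&lt;/pre&gt;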
&lt;div id=&#34;setting-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Setting up&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)      # to read and write (import / export) any type into our R console.
library(dplyr)      # for pretty much all our data wrangling
library(ggplot2)
library(stringr)
library(forcats)
library(purrr)

library(kableExtra)

library(rsample)    # to use initial_split() and some other resampling techniques later on. 
library(recipes)      # to use the recipe() and step_() functions
library(parsnip)      # the main engine that runs the models 
library(workflows)    # to use workflow()
library(tune)         # to fine tune the hyperparameters with tune() and tune_grid()
library(dials)        # to use grid_regular() and penalty()
library(yardstick)    # to create the measure of accuracy, f1 score and ROC-AUC 

library(doParallel)   #to parallelize the work - useful  in tune()

library(tidytext)
library(textrecipes)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We’ll be reusing the same clean_tweets() function we used in Part I to clean the tweets. We just copy-paste it here and repurpose it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_train &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/train.csv&amp;quot;) %&amp;gt;% as_tibble() %&amp;gt;% select(id, text, keyword, location) 
df_test &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/test.csv&amp;quot;) %&amp;gt;% as_tibble() %&amp;gt;% select(id, text, keyword, location)
df_all &amp;lt;- bind_rows(df_train, df_test)

clean_tweets &amp;lt;- function(df){
  df &amp;lt;- df  %&amp;gt;% 
    mutate(number_hashtag = str_count(string = text, pattern = &amp;quot;#&amp;quot;), 
           number_number = str_count(string = text, pattern = &amp;quot;[0-9]&amp;quot;) %&amp;gt;% as.numeric(), 
           number_http = str_count(string = text, pattern = &amp;quot;http&amp;quot;) %&amp;gt;% as.numeric(), 
           number_mention = str_count(string = text, pattern = &amp;quot;@&amp;quot;) %&amp;gt;% as.numeric(), 
           number_location = if_else(!is.na(location), 1, 0), 
           number_keyword = if_else(!is.na(keyword), 1, 0), 
           number_repeated_char = str_count(string = text, pattern = &amp;quot;([a-z])\\1{2}&amp;quot;) %&amp;gt;% as.numeric(),  
           text = str_replace_all(string = text, pattern = &amp;quot;http[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;@[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           number_char = nchar(text),   #add the length of the tweet in character. 
           number_word = str_count(string = text, pattern = &amp;quot;\\w+&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;[0-9]&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = map(text, textstem::lemmatize_strings) %&amp;gt;% unlist(.), 
           text = map(text, function(.x) stringi::stri_trans_general(.x, &amp;quot;Latin-ASCII&amp;quot;)) %&amp;gt;% unlist(.), 
           text = str_replace_all(string = text, pattern  = &amp;quot;\u0089&amp;quot;, replacement = &amp;quot;&amp;quot;)) %&amp;gt;% 
  select(-keyword, -location) 
  return(df)
}

df_all &amp;lt;- clean_tweets(df_all)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;finding-the-svd-matrix&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Finding the SVD matrix&lt;/h1&gt;
&lt;p&gt;Let’s now work on our sparse matrix with the bind_tf_idf() function. First, we’ll need to tokenize the tweets and remove stop-words. To be able to compute the tf_idf, we’ll also need to count the occurrences of each word in each tweet.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_all_tok &amp;lt;- df_all %&amp;gt;% 
  unnest_tokens(word, text) %&amp;gt;% anti_join(stop_words %&amp;gt;% filter(lexicon == &amp;quot;snowball&amp;quot;)) %&amp;gt;% 
  mutate(word_stem = textstem::stem_words(word)) %&amp;gt;% count(id, word_stem)

df_all_tf_idf &amp;lt;- df_all_tok %&amp;gt;% bind_tf_idf(term = word_stem, document = id, n = n)

# turning the tf_idf into a matrix. 
dtm_df_all &amp;lt;- cast_dtm(term = word_stem, document = id, value = tf_idf, data = df_all_tf_idf)
mat_df_all &amp;lt;- as.matrix(dtm_df_all)
dim(mat_df_all)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10873 13802&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;length(unique(df_all$id)) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10876&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# We have a problem! Some tweets have not made it into our matrix.  
# That&amp;#39;s probably because they contained just a link, just a number or just stop words.  
# We will identify them below.   This is also why I have changed the corpus of stop-words. 
# So 3 tweets have not made it at all if we consider both training and testing sets. &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s have a look at our sparse matrix to better understand what’s going on.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mat_df_all[1:10, 1:20]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     Terms
## Docs       car     crash    happen      just  terribl     allah     deed
##    0 0.8580183 0.7837519 0.9588457 0.6432791 1.330996 0.0000000 0.000000
##    1 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.9851632 1.228699
##    2 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##    3 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##    4 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##    5 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##    6 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##    7 0.0000000 0.0000000 0.0000000 0.3216396 0.000000 0.0000000 0.000000
##    8 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##    9 0.0000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000
##     Terms
## Docs earthquak    forgiv       mai    reason         u      citi   differ
##    0 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    1 0.7270493 0.9851632 0.6197444 0.7839108 0.4343156 0.0000000 0.000000
##    2 0.7270493 0.0000000 0.0000000 0.0000000 0.0000000 0.6864859 0.873712
##    3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    4 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    5 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    6 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    7 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    8 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##    9 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.000000
##     Terms
## Docs   everyon      hear     safe      stai    across      fire
##    0 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
##    1 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
##    2 0.7207918 0.6628682 0.873712 0.7776986 0.0000000 0.0000000
##    3 0.0000000 0.0000000 0.000000 0.0000000 0.6664668 0.3448581
##    4 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.4433889
##    5 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
##    6 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
##    7 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000
##    8 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.2586435
##    9 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The values in the matrix are not raw word counts but tf_idf weights.&lt;/p&gt;
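&lt;p&gt;As a reminder of what these weights are: bind_tf_idf() computes tf as the share of a term within its document, and idf as the natural log of the number of documents divided by the number of documents containing the term. Here is a minimal base R sketch on made-up counts (not taken from the tweets), just to show the arithmetic:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# made-up counts: rows = documents, columns = terms
counts &amp;lt;- matrix(c(2, 1, 0,
                   0, 1, 1,
                   0, 0, 3), nrow = 3, byrow = TRUE,
                 dimnames = list(c(&amp;quot;d1&amp;quot;, &amp;quot;d2&amp;quot;, &amp;quot;d3&amp;quot;),
                                 c(&amp;quot;fire&amp;quot;, &amp;quot;just&amp;quot;, &amp;quot;safe&amp;quot;)))

# term frequency: share of each term within its document
tf &amp;lt;- counts / rowSums(counts)

# inverse document frequency: ln(number of documents / documents containing the term)
idf &amp;lt;- log(nrow(counts) / colSums(counts &amp;gt; 0))

# multiply each column of tf by the idf of its term
tf_idf &amp;lt;- sweep(tf, 2, idf, &amp;quot;*&amp;quot;)
round(tf_idf, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A word that is frequent in one tweet but rare across all tweets gets a high weight; a word that appears in every tweet would get a weight of 0.&lt;/p&gt;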
&lt;p&gt;Let’s now fix the issue of the missing tweets, or we will run into problems later on during the modeling workflow. We see that the matrix is ordered by ID.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Let&amp;#39;s identify which tweets didn&amp;#39;t make it into our matrix and save them. 
df_mat_rowname &amp;lt;- tibble(id = as.numeric(rownames(mat_df_all)))
df_rowname &amp;lt;- tibble(id = df_all$id)
missing_id &amp;lt;- df_rowname %&amp;gt;% anti_join(df_mat_rowname)

# Let&amp;#39;s add empty rows with the right id as rowname to our matrix. 
yo &amp;lt;- matrix(0.0, nrow = nrow(missing_id), ncol = ncol(mat_df_all))
rownames(yo) &amp;lt;- missing_id$id

mat_df &amp;lt;- rbind(mat_df_all, yo)
dim(mat_df)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10876 13802&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#mat_df3[7601:7613, 11290:11302]

### trying to keep track of the order of the matrix
mat_df_id &amp;lt;- rownames(mat_df)
head(mat_df_id, 20)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;0&amp;quot;  &amp;quot;1&amp;quot;  &amp;quot;2&amp;quot;  &amp;quot;3&amp;quot;  &amp;quot;4&amp;quot;  &amp;quot;5&amp;quot;  &amp;quot;6&amp;quot;  &amp;quot;7&amp;quot;  &amp;quot;8&amp;quot;  &amp;quot;9&amp;quot;  &amp;quot;10&amp;quot; &amp;quot;11&amp;quot; &amp;quot;12&amp;quot; &amp;quot;13&amp;quot; &amp;quot;14&amp;quot;
## [16] &amp;quot;15&amp;quot; &amp;quot;16&amp;quot; &amp;quot;17&amp;quot; &amp;quot;18&amp;quot; &amp;quot;19&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tail(mat_df_id, 20)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  [1] &amp;quot;10859&amp;quot; &amp;quot;10860&amp;quot; &amp;quot;10861&amp;quot; &amp;quot;10862&amp;quot; &amp;quot;10863&amp;quot; &amp;quot;10864&amp;quot; &amp;quot;10865&amp;quot; &amp;quot;10866&amp;quot; &amp;quot;10867&amp;quot;
## [10] &amp;quot;10868&amp;quot; &amp;quot;10869&amp;quot; &amp;quot;10870&amp;quot; &amp;quot;10871&amp;quot; &amp;quot;10872&amp;quot; &amp;quot;10873&amp;quot; &amp;quot;10874&amp;quot; &amp;quot;10875&amp;quot; &amp;quot;6394&amp;quot; 
## [19] &amp;quot;9697&amp;quot;  &amp;quot;43&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have solved that issue of missing rows (which took almost a whole day to figure out), we can move on to finding the dense matrix. We will use the &lt;strong&gt;irlba&lt;/strong&gt; library to help with the decomposition.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;incomplete.cases &amp;lt;- which(!complete.cases(mat_df))
mat_df[incomplete.cases,] &amp;lt;- rep(0.0, ncol(mat_df))
dim(mat_df) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10876 13802&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svd_mat &amp;lt;- irlba::irlba(t(mat_df), nv = 750, maxit = 2000)
write_rds(x = svd_mat, path = &amp;quot;~/disaster_tweets/data/svd.rds&amp;quot;)

# And then save the whole df with ID + svd
svd_mat &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/svd.rds&amp;quot;)
yo &amp;lt;- as_tibble(svd_mat$v)
dim(yo)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 10876   750&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df4 &amp;lt;- bind_cols(id = as.numeric(mat_df_id), yo)
write_rds(x = df4, path = &amp;quot;~/disaster_tweets/data/svd_df_all750.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is worth mentioning that the singular value decomposition didn’t parallelize on my machine and it took a bit over 3 hours to get the matrix. That’s why we have saved it for further use.
[When I used irlba on our university computer (84 cores, over 750 GB of RAM), it did parallelize very nicely on all cores and it didn’t take more than 5 minutes.]&lt;/p&gt;
&lt;p&gt;Now that we have our dense matrix, we can start to fit back all the pieces together for our modelling process.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_train &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/train.csv&amp;quot;) %&amp;gt;% clean_tweets()

# sorting out the same tweets, different target issues 
temp &amp;lt;- df_train %&amp;gt;% group_by(text) %&amp;gt;% 
  mutate(mean_target = mean(target), 
         new_target = if_else(mean_target &amp;gt; 0.5, 1, 0)) %&amp;gt;% ungroup() %&amp;gt;% 
  mutate(target = new_target, 
         target_bin = factor(if_else(target == 1, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))) %&amp;gt;% 
  select(-new_target, -mean_target, -target)


df_svd &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/svd_df_all750.rds&amp;quot;)

df_train &amp;lt;- left_join(temp, df_svd, by = &amp;quot;id&amp;quot;) %&amp;gt;% 
  select(-text)&lt;/code&gt;&lt;/pre&gt;
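&lt;p&gt;The “same tweets, different target” step in the code above relabels identical tweets by majority vote: every copy of a text gets target 1 only if more than half of its copies were labelled 1. A minimal base R sketch on made-up rows (the texts and labels are invented for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# made-up example: the same text appears several times with conflicting labels
toy &amp;lt;- data.frame(
  text   = c(&amp;quot;fire downtown&amp;quot;, &amp;quot;fire downtown&amp;quot;, &amp;quot;fire downtown&amp;quot;,
             &amp;quot;nice sunset&amp;quot;, &amp;quot;nice sunset&amp;quot;),
  target = c(1, 1, 0, 0, 1)
)

# mean label per identical text, repeated on every row of that text
toy$mean_target &amp;lt;- ave(toy$target, toy$text, FUN = mean)

# majority vote: a tie (mean of exactly 0.5) falls back to 0
toy$new_target &amp;lt;- ifelse(toy$mean_target &amp;gt; 0.5, 1, 0)
toy&lt;/code&gt;&lt;/pre&gt;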
&lt;/div&gt;
&lt;div id=&#34;svd-with-lasso&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;SVD with Lasso&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(0109)
rsplit_df &amp;lt;- initial_split(df_train, strata = target_bin, prop = 0.85)
df_train_tr &amp;lt;- training(rsplit_df)
df_train_te &amp;lt;- testing(rsplit_df)

# reusing the same df_train, df_train_tr, df_train_te from before.  
recipe_tweet &amp;lt;- recipe(formula = target_bin ~ ., data = df_train_tr) %&amp;gt;% 
  update_role(id, new_role = &amp;quot;ID&amp;quot;) %&amp;gt;% 
  step_zv(all_numeric(), -all_outcomes()) %&amp;gt;% 
  step_normalize(all_numeric())

# we&amp;#39;ll assign 45 different values for our penalty. 
# we noticed earlier that the best values are between penalties 0.001 and 0.005
grid_lambda &amp;lt;- expand.grid(penalty = seq(0.0014,0.005, length = 45)) 

# This time we&amp;#39;ll use 10 folds cross-validation
set.seed(0109)
folds_training &amp;lt;- vfold_cv(df_train, v = 10, repeats = 1) 

model_lasso &amp;lt;- logistic_reg(mode = &amp;quot;classification&amp;quot;, 
                            penalty = tune(), mixture = 1) %&amp;gt;% 
  set_engine(&amp;quot;glmnet&amp;quot;) 

# starting our workflow
wf_lasso &amp;lt;- workflow() %&amp;gt;% 
  add_recipe(recipe_tweet) %&amp;gt;% 
  add_model(model_lasso) 

library(doParallel)
registerDoParallel(cores = 64)

# run a lasso regression with cross-validation, on 45 different levels of penalty
tune_lasso &amp;lt;- tune_grid(
  wf_lasso, 
  resamples = folds_training, 
  grid = grid_lambda, 
  metrics = metric_set(roc_auc, f_meas, accuracy), 
  control = control_grid(verbose = TRUE)
) 

tune_lasso %&amp;gt;% collect_metrics() %&amp;gt;% 
  write_csv(&amp;quot;~/disaster_tweets/data/metrics_lasso_svd750.csv&amp;quot;)

best_metric &amp;lt;- tune_lasso %&amp;gt;% select_best(metric = &amp;quot;f_meas&amp;quot;)

wf_lasso &amp;lt;- finalize_workflow(wf_lasso, best_metric)

last_fit(wf_lasso, rsplit_df) %&amp;gt;% collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 accuracy binary         0.798
## 2 roc_auc  binary         0.860&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save the final lasso model
model_lasso_svd &amp;lt;- fit(wf_lasso, df_train)
write_rds(x = model_lasso_svd, path = &amp;quot;~/disaster_tweets/data/model_lasso_svd750.rds&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note 1 (Lasso): svd 1000 wide, all numeric variables normalized, penalty 0.001681; scores: f1 = 73.99, accuracy = 79.3, roc = 85.4.&lt;/p&gt;
&lt;div id=&#34;analysis-of-grid-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis of grid results&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# we read the results of our sample to see the penalty values and their performances. 
metrics &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/metrics_lasso_svd750.csv&amp;quot;) 

metrics %&amp;gt;% 
  ggplot(aes(x = penalty, y = mean, color = .metric)) + 
  geom_line() + 
  facet_wrap(~.metric) + 
  scale_x_log10()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/disaster-tweets-II/index_files/figure-html/grid-lasso-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;make-predictions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Make predictions&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_test &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/test.csv&amp;quot;)  %&amp;gt;% clean_tweets()
df_svd &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/svd_df_all750.rds&amp;quot;)
df_test &amp;lt;- left_join(df_test, df_svd, by = &amp;quot;id&amp;quot;) 

library(glmnet)
# predict() returns a tibble; the predicted class is in its .pred_class column
prediction_lasso_svd &amp;lt;- tibble(id = df_test$id, 
                               target = if_else(predict(model_lasso_svd, new_data = df_test)$.pred_class == &amp;quot;a_truth&amp;quot;, 1, 0))

prediction_lasso_svd %&amp;gt;% write_csv(path = &amp;quot;~/disaster_tweets/data/prediction_svd_lasso750.csv&amp;quot;)

# clean everything 
rm(list =  ls())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the training set with cross-validation, this model, with a penalty of 0.001681, gave us f1 = 73.99, accuracy = 79.3, roc = 85.4. On Kaggle, it gave us a public score of 76.79. This is not really good considering we got much better results earlier with our &lt;a href=&#34;https://fderyckel.github.io/post/disaster-tweets-part-i/#baseline-with-some-additional-features&#34;&gt;enhanced approach&lt;/a&gt;.&lt;/p&gt;
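&lt;p&gt;For readers less familiar with these scores, here is a minimal base R sketch of how accuracy and f1 are computed from predicted and true classes. The eight labels below are made up for illustration, and we assume that “a_truth” (the first factor level) is the class of interest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# made-up true classes and predictions
truth &amp;lt;- factor(c(&amp;quot;a_truth&amp;quot;, &amp;quot;a_truth&amp;quot;, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;,
                  &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;a_truth&amp;quot;))
pred  &amp;lt;- factor(c(&amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;,
                  &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;a_truth&amp;quot;),
                levels = levels(truth))

# cells of the confusion matrix, with a_truth as the positive class
tp &amp;lt;- sum(pred == &amp;quot;a_truth&amp;quot; &amp;amp; truth == &amp;quot;a_truth&amp;quot;)   # true positives
fp &amp;lt;- sum(pred == &amp;quot;a_truth&amp;quot; &amp;amp; truth == &amp;quot;b_false&amp;quot;)   # false positives
fn &amp;lt;- sum(pred == &amp;quot;b_false&amp;quot; &amp;amp; truth == &amp;quot;a_truth&amp;quot;)   # false negatives

precision &amp;lt;- tp / (tp + fp)
recall    &amp;lt;- tp / (tp + fn)
f1        &amp;lt;- 2 * precision * recall / (precision + recall)
accuracy  &amp;lt;- mean(pred == truth)

c(f1 = f1, accuracy = accuracy)&lt;/code&gt;&lt;/pre&gt;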
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;svd-with-xgboost&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;SVD with Xgboost&lt;/h1&gt;
&lt;p&gt;We can use the same idea with xgboost.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clean_tweets &amp;lt;- function(df){
  df &amp;lt;- df  %&amp;gt;% 
    mutate(number_hashtag = str_count(string = text, pattern = &amp;quot;#&amp;quot;), 
           number_number = str_count(string = text, pattern = &amp;quot;[0-9]&amp;quot;) %&amp;gt;% as.numeric(), 
           number_http = str_count(string = text, pattern = &amp;quot;http&amp;quot;) %&amp;gt;% as.numeric(), 
           number_mention = str_count(string = text, pattern = &amp;quot;@&amp;quot;) %&amp;gt;% as.numeric(), 
           number_location = if_else(!is.na(location), 1, 0), 
           number_keyword = if_else(!is.na(keyword), 1, 0), 
           number_repeated_char = str_count(string = text, pattern = &amp;quot;([a-z])\\1{2}&amp;quot;) %&amp;gt;% as.numeric(),  
           text = str_replace_all(string = text, pattern = &amp;quot;http[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;@[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           number_char = nchar(text),   #add the length of the tweet in character. 
           number_word = str_count(string = text, pattern = &amp;quot;\\w+&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;[0-9]&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = map(text, textstem::lemmatize_strings) %&amp;gt;% unlist(.), 
           text = map(text, function(.x) stringi::stri_trans_general(.x, &amp;quot;Latin-ASCII&amp;quot;)) %&amp;gt;% unlist(.), 
           text = str_replace_all(string = text, pattern  = &amp;quot;\u0089&amp;quot;, replacement = &amp;quot;&amp;quot;)) %&amp;gt;% 
  select(-keyword, -location) 
  return(df)
}

df_train &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/train.csv&amp;quot;) %&amp;gt;% clean_tweets()

# sorting out the same tweets, different target issues 
temp &amp;lt;- df_train %&amp;gt;% group_by(text) %&amp;gt;% 
  mutate(mean_target = mean(target), 
         new_target = if_else(mean_target &amp;gt; 0.5, 1, 0)) %&amp;gt;% ungroup() %&amp;gt;% 
  mutate(target = new_target, 
         target_bin = factor(if_else(target == 1, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))) %&amp;gt;% 
  select(-new_target, -mean_target, -target)


df_svd &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/svd_df_all750.rds&amp;quot;)

df_train &amp;lt;- left_join(temp, df_svd, by = &amp;quot;id&amp;quot;) %&amp;gt;% 
  select(-text)

recipe_tweet &amp;lt;- recipe(formula = target_bin ~ ., data = df_train) %&amp;gt;% 
  update_role(id, new_role = &amp;quot;ID&amp;quot;)

# xgboost classification, tuning on trees, tree-depth  and mtry
model_xgboost &amp;lt;- boost_tree(mode = &amp;quot;classification&amp;quot;, trees = tune(), 
                            learn_rate = 0.01, tree_depth = tune(), mtry = tune()) %&amp;gt;% 
  set_engine(&amp;quot;xgboost&amp;quot;, nthread = 64)

# starting our workflow
wf_xgboost &amp;lt;- workflow() %&amp;gt;% 
  add_recipe(recipe_tweet) %&amp;gt;% 
  add_model(model_xgboost)

# This time we use 5 folds cross-validation.  
#  xgboost is extremely resource intensive on wide df. 
set.seed(0109)
folds_training &amp;lt;- vfold_cv(df_train, v = 5, repeats = 1)
grid_xgboost &amp;lt;- expand.grid(trees = c(2000), 
                            tree_depth = c(5, 6), 
                            mtry = c(150, 300))

library(doParallel)
registerDoParallel(cores = 64)

# run a xgboost classification with cross-validation
tune_xgboost &amp;lt;- tune_grid(
  wf_xgboost, 
  resamples = folds_training, 
  grid = grid_xgboost, 
  metrics = metric_set(roc_auc, f_meas, accuracy), 
  control = control_grid(verbose = TRUE, save_pred = TRUE)
)

tune_xgboost %&amp;gt;% collect_metrics() %&amp;gt;% 
  write_csv(&amp;quot;~/disaster_tweets/data/metrics_xgboost_svd750.csv&amp;quot;)

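best_metric &amp;lt;- tune_xgboost %&amp;gt;% select_best(metric = &amp;quot;f_meas&amp;quot;)

wf_xgboost &amp;lt;- finalize_workflow(wf_xgboost, best_metric)

# rsplit_df was removed by rm(list = ls()) earlier, so we rebuild the split with the same seed
set.seed(0109)
rsplit_df &amp;lt;- initial_split(df_train, strata = target_bin, prop = 0.85)

last_fit(wf_xgboost, rsplit_df) %&amp;gt;% collect_metrics()&lt;/code&gt;&lt;/pre&gt;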
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 accuracy binary         0.825
## 2 roc_auc  binary         0.883&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save the final xgboost model
model_xgboost_svd &amp;lt;- fit(wf_xgboost, df_train)
write_rds(x = model_xgboost_svd, path = &amp;quot;~/disaster_tweets/data/model_xgboost_svd750.rds&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using xgboost in combination with svd gives much better results. Here are a few things that we have tried with our training data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;svd 1000 wide matrix and xgboost with 150 mtry, 2500 trees, 5 tree-depth, gave us f1 = 74.77, accuracy = 80.90, roc = 86.45&lt;/li&gt;
&lt;li&gt;svd 750 wide matrix and xgboost with 150 mtry, 2000 trees, 6 tree-depth, gave us f1 = 74.99, accuracy = 81.05, roc = 87&lt;/li&gt;
&lt;li&gt;svd 500 wide matrix and xgboost with 200 mtry, 2000 trees, 6 tree-depth, gave us f1 = 75.11, accuracy = 81.02, roc = 86.87&lt;/li&gt;
&lt;li&gt;svd 250 wide matrix and xgboost with 125 mtry, 1500 trees, 5 tree-depth, gave us f1 = 74.93, accuracy = 80.81, roc = 86.62&lt;/li&gt;
&lt;/ul&gt;
&lt;div id=&#34;variable-importance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable importance&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(vip)
model_xgboost_svd %&amp;gt;% 
  pull_workflow_fit() %&amp;gt;% 
  vip::vip(geom = &amp;quot;point&amp;quot;, num_features = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/disaster-tweets-II/index_files/figure-html/vip-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Clearly, we can no longer interpret our variables, as they are the result of the singular value decomposition of a tf-idf sparse matrix. However, we are happy to see that our extra variables have played a role in determining whether a tweet was about a real disaster or not.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;submission-of-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Submission of results&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_test &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/test.csv&amp;quot;)  %&amp;gt;% clean_tweets()
df_svd &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/svd_df_all750.rds&amp;quot;)
df_test &amp;lt;- left_join(df_test, df_svd, by = &amp;quot;id&amp;quot;) 

library(xgboost)
# predict() returns a tibble; the predicted class is in its .pred_class column
prediction_xgboost_svd &amp;lt;- tibble(id = df_test$id, 
                                 target = if_else(predict(model_xgboost_svd, new_data = df_test)$.pred_class == &amp;quot;a_truth&amp;quot;, 1, 0))

prediction_xgboost_svd %&amp;gt;% write_csv(path = &amp;quot;~/disaster_tweets/data/prediction_svd_xgboost750.csv&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note 1: majority voting, svd with 850 wide, using lasso, got 77% public score.&lt;/p&gt;
&lt;p&gt;Note 2: majority voting, svd 500 wide, using xgboost with 200 mtry, 2000 trees, 6 tree-depth, got a 80.01 public score.&lt;/p&gt;
&lt;p&gt;Note 3: majority voting, svd with 750 wide, using xgboost with 200 mtry, 2000 trees, 6 tree-depth, got 81.29% public score. Yeahhh!!!!!!!&lt;/p&gt;
&lt;p&gt;Here is a screenshot of our results:&lt;br /&gt;
&lt;img src=&#34;/img/screenshot-results.png&#34; alt=&#34;screenshot of results&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;To help with the use of irlba and &lt;a href=&#34;https://www.kaggle.com/barun2104/nlp-with-disaster-eda-dfm-svd-ensemble&#34;&gt;check for the complete matrix&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Disaster Tweets - Part I</title>
      <link>/post/disaster-tweets-part-i/</link>
      <pubDate>Mon, 25 May 2020 00:00:00 +0000</pubDate>
      <guid>/post/disaster-tweets-part-i/</guid>
      <description>
&lt;script src=&#34;/rmarkdown-libs/kePrint/kePrint.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#baseline-model---lasso-model-on-just-text&#34;&gt;Baseline model - Lasso model on just text&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#creating-a-model-workflow&#34;&gt;Creating a model workflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#analysis-of-results&#34;&gt;Analysis of results&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#picking-the-best-model&#34;&gt;Picking the best model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#variable-importance&#34;&gt;variable importance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#submission-of-results&#34;&gt;Submission of results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#baseline-with-some-additional-features&#34;&gt;Baseline with some additional features&lt;/a&gt;&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#rebuilding-the-data-frame-and-variables&#34;&gt;Rebuilding the data frame and variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#creating-and-tuning-a-model&#34;&gt;Creating and tuning a model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#variable-importances&#34;&gt;Variable importances&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#submission-of-results-1&#34;&gt;Submission of results&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#wonderings-and-lessons-learned.&#34;&gt;Wonderings and lessons learned.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#references&#34;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Real or Not? NLP with Disaster Tweets&lt;/em&gt; Predict which Tweets are about real disasters and which ones are not.&lt;br /&gt;
The task comes from a &lt;a href=&#34;https://www.kaggle.com/c/nlp-getting-started&#34;&gt;Kaggle competition&lt;/a&gt; which is to detect if a tweet about an emergency disaster is real. Hence, this is an NLP classification problem.&lt;/p&gt;
&lt;p&gt;It is kind of easy for a human to see if a tweet is real or not, but it is harder for a machine to detect it. For instance, the tweet &lt;em&gt;“look at the sky last night, it was ABLAZE”&lt;/em&gt;. Although there is the use of a disaster keyword like “ablaze”, the use of that word in this context wasn’t meant to refer to an emergency disaster. This task is seen as &lt;em&gt;“a getting started”&lt;/em&gt; problem by Kaggle.&lt;/p&gt;
&lt;p&gt;As I have been a volunteer firefighter in my local community for the last 3 years, this Kaggle task struck a chord with me. And yes, that is me in the picture. Imagine this heavy, well-insulated PPE, a super intense physical challenge and then the Saudi heat with the humidity of the Red Sea ;-)&lt;/p&gt;
&lt;p&gt;I am planning a three-part post.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first part is very much a BOW (bag-of-words) approach using Lasso.&lt;/li&gt;
&lt;li&gt;The second part is still a BOW approach, this time using SVD, with Lasso and Xgboost as models.&lt;/li&gt;
&lt;li&gt;The third part is word embedding using Glove. (Still trying to make it work with Bert pre-trained models. Maybe I’ll have that sorted out by the end.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Throughout these posts, I will use packages from 3 main sets: the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt; for data wrangling, the &lt;a href=&#34;https://www.tidymodels.org/&#34;&gt;tidymodels&lt;/a&gt; for modelling and the &lt;a href=&#34;https://www.tidytextmining.com/&#34;&gt;tidytext&lt;/a&gt; for dealing with text data. These sets of packages make a coherent whole and, in my opinion, make it easier to learn the data analysis &amp;amp; modelling workflow. This is, of course, not the only option. There are many other alternatives in R.&lt;/p&gt;
&lt;p&gt;Loading the libraries first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readr)      # to read and write (import / export) any type into our R console.
library(dplyr)      # for pretty much all our data wrangling
library(stringr)    # to deal with strings: str_remove() and many other regex functions.  this is a NLP task, so lots of it ;-) 
library(purrr)      # to map functions over rows
library(forcats)    # to deal with categorical variables: the fct_reorder() function
library(ggplot2)    # to plot

library(kableExtra) # for making pretty table on html

library(rsample)    # to split df with initial_split() 
                    # to use resampling techniques with bootstrap() and vfold_cv()
library(parsnip)    # the main engine that runs the models 
library(recipes)    # to use the recipe() functions
library(textrecipes) # to use the step_tokenize()  and step_tfidf()
library(workflows)  # to use workflow()
library(tune)       # to fine tune the hyper-parameters using tune() and tune_grid()
library(dials)      # to create grid of parameters using grid_regular() and penalty()
library(yardstick)  # to create the measure of accuracy, f1 score and ROC-AUC 

library(glmnet)     # to use lasso, it is called automatically when calling set_engine() 
                    # but it isn&amp;#39;t called later on when using predict()

library(vip)        # tidy framework to check variables importance&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without further ado, let’s get started by loading our training set and checking its structure.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# loading our training data  
df_train &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/train.csv&amp;quot;) %&amp;gt;% as_tibble()  

# let&amp;#39;s have a look at it 
skimr::skim(df_train)&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#39;width: auto;&#39;
        class=&#39;table table-condensed&#39;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:loading-data&#34;&gt;Table 1: &lt;/span&gt;Data summary
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Name
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
df_train
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Number of rows
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
7613
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Number of columns
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
5
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
_______________________
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Column type frequency:
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
character
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
3
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
numeric
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
________________________
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Group variables
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
None
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: character&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
skim_variable
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n_missing
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
complete_rate
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
min
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
max
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
empty
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n_unique
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
whitespace
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
keyword
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
61
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.99
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
21
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
221
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
location
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2534
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.67
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
49
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3279
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
text
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.00
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
157
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7503
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: numeric&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
skim_variable
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n_missing
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
complete_rate
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
mean
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
sd
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p0
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p25
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p50
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p75
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p100
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
hist
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
id
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5441.93
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3137.12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2734
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5408
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8146
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10873
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▇▇▇▇
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
target
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.43
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.50
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▁▁▁▆
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At first look: 7613 observations and 5 variables (one target + 4 predictors). There are many missing values for the &lt;strong&gt;location&lt;/strong&gt; variable and a few for the &lt;strong&gt;keyword&lt;/strong&gt; variable as well. The &lt;strong&gt;text&lt;/strong&gt; variable (the tweets themselves) has no missing values. Notice that we also have an &lt;strong&gt;id&lt;/strong&gt; variable, which is unlikely to be of any use as a predictor.&lt;/p&gt;
&lt;p&gt;Let’s have a look at a sample of tweets. The &lt;em&gt;target&lt;/em&gt; column tells us whether the tweet is considered to be about a real emergency or disaster.&lt;/p&gt;
&lt;table class=&#34;table table-striped&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:first-ten-tweet&#34;&gt;Table 2: &lt;/span&gt;10 random tweets
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
target
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
text
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
First night with retainers in. It’s quite weird. Better get used to it; I have to wear them every single night for the next year at least.
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Deputies: Man shot before Brighton home set ablaze &lt;a href=&#34;http://t.co/gWNRhMSO8k&#34; class=&#34;uri&#34;&gt;http://t.co/gWNRhMSO8k&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Man wife get six years jail for setting ablaze niece
&lt;a href=&#34;http://t.co/eV1ahOUCZA&#34; class=&#34;uri&#34;&gt;http://t.co/eV1ahOUCZA&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintendent Lanford Salmon has r … - &lt;a href=&#34;http://t.co/vplR5Hka2u&#34; class=&#34;uri&#34;&gt;http://t.co/vplR5Hka2u&lt;/a&gt; &lt;a href=&#34;http://t.co/SxHW2TNNLf&#34; class=&#34;uri&#34;&gt;http://t.co/SxHW2TNNLf&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Police: Arsonist Deliberately Set Black Church In North CarolinaåÊAblaze &lt;a href=&#34;http://t.co/pcXarbH9An&#34; class=&#34;uri&#34;&gt;http://t.co/pcXarbH9An&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Noches El-Bestia ‘&lt;span class=&#34;citation&#34;&gt;@Alexis_Sanchez&lt;/span&gt;: happy to see my teammates and training hard ?? goodnight gunners.?????? &lt;a href=&#34;http://t.co/uc4j4jHvGR&#34; class=&#34;uri&#34;&gt;http://t.co/uc4j4jHvGR&lt;/a&gt;’
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
#Kurds trampling on Turkmen flag later set it ablaze while others vandalized offices of Turkmen Front in #Diyala &lt;a href=&#34;http://t.co/4IzFdYC3cg&#34; class=&#34;uri&#34;&gt;http://t.co/4IzFdYC3cg&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
TRUCK ABLAZE : R21. VOORTREKKER AVE. OUTSIDE OR TAMBO INTL. CARGO SECTION. &lt;a href=&#34;http://t.co/8kscqKfKkF&#34; class=&#34;uri&#34;&gt;http://t.co/8kscqKfKkF&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Set our hearts ablaze and every city was a gift And every skyline was like a kiss upon the lips @Û_ &lt;a href=&#34;https://t.co/cYoMPZ1A0Z&#34; class=&#34;uri&#34;&gt;https://t.co/cYoMPZ1A0Z&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
They sky was ablaze tonight in Los Angeles. I’m expecting IG and FB to be filled with sunset shots if I know my peeps!!
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
How the West was burned: Thousands of wildfires ablaze in #California alone &lt;a href=&#34;http://t.co/iCSjGZ9tE1&#34; class=&#34;uri&#34;&gt;http://t.co/iCSjGZ9tE1&lt;/a&gt; #climate #energy &lt;a href=&#34;http://t.co/9FxmN0l0Bd&#34; class=&#34;uri&#34;&gt;http://t.co/9FxmN0l0Bd&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Because this is a classification problem, we need to make our target variable a factor.&lt;/p&gt;
&lt;p&gt;Although we will only use the training data frame for modelling, we’ll still split it to get a testing set (on which we will test our models).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The target variable should be a factor as this is a classification problem. 
df_train &amp;lt;- df_train %&amp;gt;% 
  mutate(target_bin = factor(if_else(target == 1, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))) %&amp;gt;% 
  select(-target)

# Just checking how balanced our data is.  It seems well balanced. 
prop.table(table(df_train$target_bin))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   a_truth   b_false 
## 0.4296598 0.5703402&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# initial split with strata will keep the same proportion of target variable as in the original df. 
set.seed(0109)
rsplit_df &amp;lt;- initial_split(df_train, strata = target_bin, prop = 0.85)

# With resampling (cross-validation or bootstrapping) on the training data, this held-out split is not strictly needed, 
# but we still use it to check our accuracy on unseen data.
df_train_tr &amp;lt;- training(rsplit_df)

# and just checking again about the ratio of target variable
prop.table(table(df_train_tr$target_bin))   # same as original set. &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
##   a_truth   b_false 
## 0.4296972 0.5703028&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;df_train_te &amp;lt;- testing(rsplit_df)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;initial_split()&lt;/strong&gt; function returns an rsplit object (rsplit_df in our case) that can be passed to the training() and testing() functions to extract the data in each split. The strata argument “help ensure that the number of data points in the training data is equivalent to the proportions in the original data set.”&lt;/p&gt;
&lt;p&gt;A good thing to notice is that the data set is reasonably well balanced, with a 57% - 43% split between the outcomes (0, 1). So we won’t need to over-sample or under-sample to compensate. That’s one less problem to deal with.&lt;/p&gt;
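&lt;p&gt;To make the effect of the strata argument more concrete, here is a minimal base R sketch. It is not how rsample implements the split; the toy data frame and the stratified_idx() helper are made up for illustration. It samples 85% of the rows inside each class, so the class proportions of the training part stay the same as in the full data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Toy data (made up): 1000 rows with a 40% / 60% class split
set.seed(109)
toy &amp;lt;- data.frame(
  id = 1:1000,
  target_bin = factor(rep(c(&amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;), times = c(400, 600)))
)

# Hypothetical helper: draw a share `prop` of the row indices inside each class
stratified_idx &amp;lt;- function(strata, prop) {
  idx_by_class &amp;lt;- split(seq_along(strata), strata)
  picked &amp;lt;- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx &amp;lt;- stratified_idx(toy$target_bin, prop = 0.85)
toy_train &amp;lt;- toy[train_idx, ]
toy_test  &amp;lt;- toy[-train_idx, ]

# Class proportions in the full data and in the training part
prop.table(table(toy$target_bin))
prop.table(table(toy_train$target_bin))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A plain random split would only match these proportions on average; sampling within each class makes them match by construction (up to rounding).&lt;/p&gt;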
&lt;/div&gt;
&lt;div id=&#34;baseline-model---lasso-model-on-just-text&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Baseline model - Lasso model on just text&lt;/h1&gt;
&lt;p&gt;In this post, we skip some of the usual data exploration (word clouds are pretty, but are they really useful?) and jump straight into building a model. This very first model will be our base case. We will build a Lasso classification model based only on a cleaned version of the text of the tweets.&lt;/p&gt;
&lt;p&gt;A Lasso model can only use numerical predictors, which then need to be normalized. It also cannot handle missing values, so we drop the columns that have some (keyword and location). That basically leaves the text as our only predictor. We’ll turn that text column into numbers using tf_idf. For more on transforming text into tf_idf, you can check &lt;a href=&#34;https://www.tidytextmining.com/tfidf.html#the-bind_tf_idf-function&#34;&gt;this section&lt;/a&gt; of the David Robinson &amp;amp; Julia Silge book on &lt;em&gt;Tidy Text Mining&lt;/em&gt;. Most of the ideas here come from Julia Silge’s book and &lt;a href=&#34;https://juliasilge.com/&#34;&gt;blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To clean our tweets, we will use the &lt;strong&gt;recipes&lt;/strong&gt; and &lt;strong&gt;textrecipes&lt;/strong&gt; packages, so that the exact same steps can later be applied easily to the testing set.&lt;/p&gt;
&lt;p&gt;In order, we’ll tokenize the tweets (which also removes punctuation and lowercases all text), remove stop words, and keep only the 1250 most frequent tokens. In the last steps, we’ll turn our words into numerical values by computing their tf_idf and normalizing them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidytext)
library(textrecipes)

recipe_tweet &amp;lt;- recipe(formula = target_bin ~ text + id, data = df_train_tr) %&amp;gt;% 
  update_role(id, new_role = &amp;quot;ID&amp;quot;) %&amp;gt;% 
  step_tokenize(text) %&amp;gt;%        # Tokenize the tweets into words
  step_stopwords(text) %&amp;gt;%       # Filter stop words out of the token list
  step_tokenfilter(text, max_tokens = 1250) %&amp;gt;%  # Only keep the 1250 most frequent words
  step_tfidf(text) %&amp;gt;%           # Transform each word into its tf_idf value
  step_normalize(all_numeric())  # Normalize the tf_idf values&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once a recipe is written, we can check what it does to the original data frame by prepping and then juicing it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# checking what the recipe produces with prep() and juice(). 
df_train_tr_processed &amp;lt;- recipe_tweet %&amp;gt;% prep() %&amp;gt;% juice()
dim(df_train_tr_processed)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 6472 1252&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we now have 1252 columns: the id column + the target variable + the 1250 tokens.&lt;/p&gt;
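&lt;p&gt;To get a feel for what each of those 1250 token columns contains, here is a small base R sketch of the textbook tf_idf computation on three made-up, already tokenized tweets. It only illustrates the idea: step_tfidf() may apply extra options (such as smoothing of the idf term), so its exact numbers can differ from this sketch.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Three made-up, already tokenized tweets
docs &amp;lt;- list(
  c(&amp;quot;forest&amp;quot;, &amp;quot;fire&amp;quot;, &amp;quot;near&amp;quot;, &amp;quot;town&amp;quot;),
  c(&amp;quot;fire&amp;quot;, &amp;quot;truck&amp;quot;, &amp;quot;on&amp;quot;, &amp;quot;fire&amp;quot;),
  c(&amp;quot;happy&amp;quot;, &amp;quot;birthday&amp;quot;, &amp;quot;town&amp;quot;)
)

vocab  &amp;lt;- sort(unique(unlist(docs)))
n_docs &amp;lt;- length(docs)

# Term frequency: share of each word within a tweet (rows = tweets, columns = words)
tf &amp;lt;- t(sapply(docs, function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab))) / length(tokens)
}))
colnames(tf) &amp;lt;- vocab

# Inverse document frequency: ln(number of tweets / number of tweets containing the word)
doc_freq &amp;lt;- colSums(tf &amp;gt; 0)
idf &amp;lt;- log(n_docs / doc_freq)

# tf_idf: words frequent in one tweet but rare overall get the largest weights
tf_idf &amp;lt;- sweep(tf, 2, idf, `*`)
round(tf_idf, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each row of the result plays the role of one tweet in the processed data frame, and each column the role of one tfidf_text_ variable.&lt;/p&gt;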
&lt;div id=&#34;creating-a-model-workflow&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating a model workflow&lt;/h2&gt;
&lt;p&gt;With the new tidymodels API, you can create a workflow for a model that can be reused later on.&lt;/p&gt;
&lt;p&gt;To find the most appropriate penalty for this data set, we’ll evaluate each candidate penalty on 25 bootstrap resamples. We’ll create them using the &lt;strong&gt;rsample&lt;/strong&gt; library.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(doParallel)
registerDoParallel(cores = 16)   #let&amp;#39;s work on 16 cores. 

# defining our model.  It is a logistic regression, using glmnet. 
# notice that the penalty is set to tune()... We&amp;#39;ll create a grid for that. 
# mixture = 1 means we are dealing with LASSO. 
model_lasso &amp;lt;- logistic_reg(mode = &amp;quot;classification&amp;quot;, 
                            penalty = tune(), mixture = 1) %&amp;gt;% 
  set_engine(&amp;quot;glmnet&amp;quot;)

# if we are to tune the penalty parameter, let&amp;#39;s create a grid of possible values
# we&amp;#39;ll try 40 different values for our penalty. 
grid_lambda &amp;lt;- grid_regular(penalty(), levels = 40)

# And we&amp;#39;ll test each penalty value on 25 bootstrap resamples. 
## So that&amp;#39;s 40 x 25 = 1000 models to fit.  
folds_training &amp;lt;- bootstraps(df_train_tr, strata = target_bin, times = 25)

# starting our workflow
wf_lasso &amp;lt;- workflow() %&amp;gt;% 
  add_recipe(recipe_tweet) %&amp;gt;% 
  add_model(model_lasso)

# the tune_grid() will fit our 1000 models using parallel processing
# we are looking at 3 measures of validity: roc, f1 and accuracy. 
tune_lasso &amp;lt;- tune_grid(
  wf_lasso, 
  resamples = folds_training, 
  grid = grid_lambda, 
  metrics = metric_set(roc_auc, f_meas, accuracy), 
  control   = control_grid(verbose = TRUE)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I believe that the tune_grid() function is the only place in our modelling workflow where parallel processing is being used. All the other tasks are single core processing.&lt;/p&gt;
&lt;p&gt;What have we done? First, we have created a model workflow, with workflow(), using a recipe for pre-processing and a model type, logistic_reg(). Second, we fine-tuned the parameters of our model. In this case, we only fine-tuned the penalty parameter.&lt;/p&gt;
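&lt;p&gt;As a rough illustration of what was set up here, the base R sketch below rebuilds the grid of penalties and one bootstrap resample by hand. It assumes that penalty() keeps a default range of 1e-10 to 1 on a log10 scale (an assumption about the dials defaults, not something set in the code above); the object names are made up for this sketch.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Assumption: the penalty range is 1e-10 to 1, evenly spaced on a log10 scale
levels_n &amp;lt;- 40
grid_sketch &amp;lt;- 10^seq(-10, 0, length.out = levels_n)
range(grid_sketch)

# Neighbouring penalties are separated by one constant step in log10
unique(round(diff(log10(grid_sketch)), 6))

# One bootstrap resample of 10 row indices: rows are drawn with replacement,
# and the rows that were never drawn (out-of-bag) are used for assessment
set.seed(109)
n &amp;lt;- 10
in_bag &amp;lt;- sample.int(n, size = n, replace = TRUE)
out_of_bag &amp;lt;- setdiff(seq_len(n), in_bag)

# Every penalty is evaluated on every resample
n_bootstraps &amp;lt;- 25
levels_n * n_bootstraps&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the grid is regular on a log scale, most of the 40 candidates are tiny penalties; only a handful fall in the range where the Lasso actually starts dropping tokens.&lt;/p&gt;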
&lt;/div&gt;
&lt;div id=&#34;analysis-of-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis of results&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# we collect the results of our sample to see what penalty value we will pick. 
metrics &amp;lt;- tune_lasso %&amp;gt;% collect_metrics()

# save the metric for later use
metrics %&amp;gt;% write_csv(&amp;quot;~/disaster_tweets/data/metrics_lasso_base.csv&amp;quot;) 

metrics %&amp;gt;% 
  ggplot(aes(x = penalty, y = mean, color = .metric)) + 
  geom_line() + 
  facet_wrap(~.metric) + 
  scale_x_log10()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/disaster-tweets-I/index_files/figure-html/checking-metrics_lasso-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Using the plots, we can see that there is only a small window of penalty values that increases the performance of the model. We’ll keep that in mind for the next time we create our grid of penalties.&lt;/p&gt;
&lt;p&gt;Because our dataset is fairly balanced, we could choose &lt;strong&gt;accuracy&lt;/strong&gt; as a measure of model validity. The best accuracy on this base model is 76.3 % on the bootstrap resamples of the training set. That said, because Kaggle uses F1 as its performance metric, let’s choose the penalty with the highest &lt;strong&gt;f_meas&lt;/strong&gt; and evaluate it on our testing set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# let&amp;#39;s check the penalty values that give the best performances
metrics %&amp;gt;% group_by(.metric) %&amp;gt;% top_n(4, mean) %&amp;gt;% arrange(.metric, desc(mean)) %&amp;gt;% 
  kable(&amp;quot;html&amp;quot;, caption = &amp;quot;Penalties with best performances&amp;quot;) %&amp;gt;% 
  kable_styling(bootstrap_options = c(&amp;quot;striped&amp;quot;, &amp;quot;hover&amp;quot;), full_width = FALSE, position = &amp;quot;center&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table table-striped&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:metric-base-model&#34;&gt;Table 3: &lt;/span&gt;Penalties with best performances
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
penalty
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
.metric
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
.estimator
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
mean
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
std_err
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0049239
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7633420
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0012918
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0088862
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7591922
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0012877
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0027283
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7584935
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0011643
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0015118
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7507548
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0013591
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0027283
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.7020865
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0017578
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0015118
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.6994592
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0017849
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0049239
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.6977248
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0019471
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0008377
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.6921877
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0022374
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0049239
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8183548
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0014059
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0088862
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8160248
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0015715
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0027283
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8128313
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0013520
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0015118
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8045593
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
25
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0013837
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
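&lt;p&gt;Table 3 shows that accuracy and F1 do not always favour the same penalty. To recall how the two measures differ, here is a small base R sketch on made-up labels and predictions, taking a_truth (the first factor level) as the class of interest. It is only meant to show the arithmetic, not to reproduce the numbers above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Made-up labels and predictions for 10 tweets
truth &amp;lt;- factor(c(&amp;quot;a_truth&amp;quot;, &amp;quot;a_truth&amp;quot;, &amp;quot;a_truth&amp;quot;, &amp;quot;a_truth&amp;quot;,
                  &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;),
                levels = c(&amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))
pred  &amp;lt;- factor(c(&amp;quot;a_truth&amp;quot;, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;,
                  &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;b_false&amp;quot;, &amp;quot;a_truth&amp;quot;),
                levels = c(&amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))

tp &amp;lt;- sum(pred == &amp;quot;a_truth&amp;quot; &amp;amp; truth == &amp;quot;a_truth&amp;quot;)  # real disasters we caught
fp &amp;lt;- sum(pred == &amp;quot;a_truth&amp;quot; &amp;amp; truth == &amp;quot;b_false&amp;quot;)  # false alarms
fn &amp;lt;- sum(pred == &amp;quot;b_false&amp;quot; &amp;amp; truth == &amp;quot;a_truth&amp;quot;)  # real disasters we missed

accuracy  &amp;lt;- mean(pred == truth)
precision &amp;lt;- tp / (tp + fp)
recall    &amp;lt;- tp / (tp + fn)
f1        &amp;lt;- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this toy case the accuracy is helped by the many correctly rejected tweets, while F1 only looks at how well the a_truth class is recovered, which is why it comes out lower.&lt;/p&gt;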
&lt;/div&gt;
&lt;div id=&#34;picking-the-best-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Picking the best model&lt;/h2&gt;
&lt;p&gt;We’ll pick the penalty that gives us the best F1 score. The finalize_workflow() function takes the existing workflow and adds the chosen parameters to it. Finally, we will save our model for later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;best_metric &amp;lt;- tune_lasso %&amp;gt;% select_best(metric = &amp;quot;f_meas&amp;quot;)

wf_lasso &amp;lt;- finalize_workflow(wf_lasso, best_metric)

# to summarize, this is what our workflow looks like 
wf_lasso&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 5 Recipe Steps
## 
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_normalize()
## 
## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.00272833337648676
##   mixture = 1
## 
## Computational engine: glmnet&lt;/code&gt;&lt;/pre&gt;
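&lt;p&gt;In essence, the selection step keeps the candidate with the best mean value of the chosen metric. A rough base R equivalent, using the four f_meas rows of Table 3 (the data frame f1_rows is made up for this sketch), would be:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# f_meas rows copied from Table 3 (mean over the 25 bootstrap resamples)
f1_rows &amp;lt;- data.frame(
  penalty = c(0.0027283, 0.0015118, 0.0049239, 0.0008377),
  mean    = c(0.7020865, 0.6994592, 0.6977248, 0.6921877)
)

# Keep the penalty whose mean F1 is the highest
best_row &amp;lt;- f1_rows[which.max(f1_rows$mean), ]
best_row$penalty&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The selected value agrees, up to rounding, with the penalty shown in the finalized workflow.&lt;/p&gt;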
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# to check our model performance on the unseen data set, we use the last_fit() 
## Notice how the last_fit() works on the rsplit object
last_fit(wf_lasso, rsplit_df) %&amp;gt;% collect_metrics()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 accuracy binary         0.777
## 2 roc_auc  binary         0.831&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# got 77.7% accuracy on unseen data (df_train_te) with 83.1% ROC AUC. 

# To save our model for later use, we first need to fit it on the whole training data, 
model_lasso_base &amp;lt;- fit(wf_lasso, df_train)
write_rds(model_lasso_base, &amp;quot;~/disaster_tweets/data/model_lasso_base.rds&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;variable-importance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable importance&lt;/h2&gt;
&lt;p&gt;We can also check for the important variables that determine if a tweet is about a real emergency or not.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;read_rds(&amp;quot;~/disaster_tweets/data/model_lasso_base.rds&amp;quot;) %&amp;gt;% 
  pull_workflow_fit() %&amp;gt;% 
  vi(lambda = best_metric$penalty) %&amp;gt;% 
  group_by(Sign) %&amp;gt;%
  top_n(20, wt = abs(Importance)) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(Importance = abs(Importance), 
         Variable = str_remove(Variable, &amp;quot;tfidf_text_&amp;quot;), 
         Variable = fct_reorder(Variable, Importance)) %&amp;gt;%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = &amp;quot;free_y&amp;quot;) +
  labs(y = NULL) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;/post/disaster-tweets-I/index_files/figure-html/checking-variable-importance-lasso-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The labels of each graph still need changing, as they are easy to misread: glmnet models the probability of the second factor level (b_false here), so the “POS” terms are the ones that push a tweet towards &lt;em&gt;not&lt;/em&gt; being about a real emergency.&lt;/p&gt;
&lt;p&gt;Although I am glad that the “lmao” expression is more often related to tweets that are not about a real emergency, I am confused as to why “https” is on one side and “t.co” on the other. They are both about links. 🤔&lt;/p&gt;
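&lt;p&gt;The direction of the signs can be checked on a small simulated example. In the sketch below, glm() is used as a stand-in for glmnet (an assumption of this sketch: both treat the second factor level as the modelled class), and the tfidf_lmao feature is made up so that it is larger, on average, for the b_false tweets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Made-up feature that is larger, on average, for the &amp;quot;b_false&amp;quot; class
set.seed(109)
toy &amp;lt;- data.frame(
  target_bin = factor(rep(c(&amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;), each = 100)),
  tfidf_lmao = c(rnorm(100, mean = 0), rnorm(100, mean = 1))
)

# Plain logistic regression: the first level is the reference, the second is modelled
fit &amp;lt;- glm(target_bin ~ tfidf_lmao, data = toy, family = binomial)

levels(toy$target_bin)[2]   # the class whose probability is modelled
coef(fit)[&amp;quot;tfidf_lmao&amp;quot;]     # a positive sign pushes towards that class&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this coding, a positive coefficient means “more likely b_false”, which matches the reading of the “POS” panel above.&lt;/p&gt;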
&lt;/div&gt;
&lt;div id=&#34;submission-of-results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Submission of results&lt;/h2&gt;
&lt;p&gt;Let’s now apply our base model to our test set to create the prediction (target variable).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test &amp;lt;- read_csv(&amp;quot;~/disaster_tweets/data/test.csv&amp;quot;)

prediction_lasso_base &amp;lt;- tibble(id = test$id,  
                                # predict() returns a tibble; keep its .pred_class column as a plain factor
                                prediction = predict(read_rds(&amp;quot;~/disaster_tweets/data/model_lasso_base.rds&amp;quot;), new_data = test)$.pred_class) %&amp;gt;% 
  mutate(target = if_else(prediction == &amp;quot;a_truth&amp;quot;, 1, 0))

prediction_lasso_base %&amp;gt;% select(id, target) %&amp;gt;% write_csv(&amp;quot;~/disaster_tweets/data/prediction_lasso_base.csv&amp;quot;)
# this submission gave 77.8 % on the Kaggle public score. &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;baseline-with-some-additional-features&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Baseline with some additional features&lt;/h1&gt;
&lt;p&gt;In this new model, we will do some feature engineering at a basic level and see if that helps to increase the model performance and especially its accuracy. We continue to use the same lasso model. That means everything has to be converted back to numerical variables.&lt;/p&gt;
&lt;p&gt;We create another version of the basic data frame.
The feature engineering steps we will take are listed below.&lt;/p&gt;
&lt;div id=&#34;rebuilding-the-data-frame-and-variables&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Rebuilding the data frame and variables&lt;/h2&gt;
&lt;p&gt;The following steps took a lot of trial and error regarding the order in which the changes are performed on the tweets.&lt;/p&gt;
&lt;p&gt;We add the following variables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;add a variable for the number of hashtags in a tweet (like #xxx)&lt;/li&gt;
&lt;li&gt;add a variable for the number of http links in a tweet (like &lt;a href=&#34;http://xxxx&#34; class=&#34;uri&#34;&gt;http://xxxx&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;add a variable for the number of mentions in a tweet (like &lt;span class=&#34;citation&#34;&gt;@xxxx&lt;/span&gt;)&lt;/li&gt;
&lt;li&gt;add a variable indicating whether the tweet has a location&lt;/li&gt;
&lt;li&gt;add a variable indicating whether the tweet has a keyword&lt;/li&gt;
&lt;li&gt;remove all mentions and links&lt;/li&gt;
&lt;li&gt;add a variable for the number of digits in a tweet&lt;/li&gt;
&lt;li&gt;add a variable for the number of characters in a tweet&lt;/li&gt;
&lt;li&gt;add a variable for the number of words in a tweet&lt;/li&gt;
&lt;li&gt;remove all the numbers in tweets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On the text itself, we perform the following steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;remove all digits (although I suspect that 4-digit years like 2015 or 2017 could have an influence)&lt;/li&gt;
&lt;li&gt;remove all mentions (who is being mentioned probably adds little value, and we already have a variable counting mentions)&lt;/li&gt;
&lt;li&gt;remove all http links. The link itself is not discriminative (it does not add any information). We recorded the number of links in the previous step, so we can delete them&lt;/li&gt;
&lt;li&gt;lemmatize all words&lt;/li&gt;
&lt;li&gt;transliterate accented and other non-ASCII Latin characters to ASCII&lt;/li&gt;
&lt;li&gt;count the runs of repeated characters (like omggggg). My thinking is that real emergency tweets might use fewer of these.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because there is a lot of cleaning, we wrap it in a function that we can apply to the training set now and to the test set later on.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clean_tweets &amp;lt;- function(file_path){
  df &amp;lt;- read_csv(file_path) %&amp;gt;% as_tibble()  %&amp;gt;% 
    mutate(number_hashtag = str_count(string = text, pattern = &amp;quot;#&amp;quot;), 
           number_number = str_count(string = text, pattern = &amp;quot;[0-9]&amp;quot;) %&amp;gt;% as.numeric(), 
           number_http = str_count(string = text, pattern = &amp;quot;http&amp;quot;) %&amp;gt;% as.numeric(), 
           number_mention = str_count(string = text, pattern = &amp;quot;@&amp;quot;) %&amp;gt;% as.numeric(), 
           number_location = if_else(!is.na(location), 1, 0), 
           number_keyword = if_else(!is.na(keyword), 1, 0), 
           number_repeated_char = str_count(string = text, pattern = &amp;quot;([a-z])\\1{2}&amp;quot;) %&amp;gt;% as.numeric(),  
           text = str_replace_all(string = text, pattern = &amp;quot;http[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;@[^[:space:]]*&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           number_char = nchar(text),   #add the length of the tweet in character. 
           number_word = str_count(string = text, pattern = &amp;quot;\\w+&amp;quot;), 
           text = str_replace_all(string = text, pattern = &amp;quot;[0-9]&amp;quot;, replacement = &amp;quot;&amp;quot;), 
           text = map(text, textstem::lemmatize_strings) %&amp;gt;% unlist(.), 
           text = map(text, function(.x) stringi::stri_trans_general(.x, &amp;quot;Latin-ASCII&amp;quot;)) %&amp;gt;% unlist(.), 
           text = str_replace_all(string = text, pattern  = &amp;quot;\u0089&amp;quot;, replacement = &amp;quot;&amp;quot;)) %&amp;gt;% 
  select(-keyword, -location) 
  return(df)
}

df_train &amp;lt;- clean_tweets(&amp;quot;~/disaster_tweets/data/train.csv&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
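&lt;p&gt;To see what the counting patterns in clean_tweets() pick up, here is a minimal sketch on a made-up tweet (the string toy is an assumption for illustration, it is not from the dataset). It only needs stringr. Note that the repeated-character pattern uses [a-z], so it is case sensitive: an upper-case run like OMGGG is not counted unless the text is lower-cased first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)

# made-up tweet, for illustration only
toy &amp;lt;- &amp;quot;OMGGG fire near @bob right now!!! see http://t.co/abc #fire #help 2015&amp;quot;

n_hashtag &amp;lt;- str_count(toy, &amp;quot;#&amp;quot;)        # 2 hashtags
n_http    &amp;lt;- str_count(toy, &amp;quot;http&amp;quot;)     # 1 link
n_mention &amp;lt;- str_count(toy, &amp;quot;@&amp;quot;)        # 1 mention
n_digit   &amp;lt;- str_count(toy, &amp;quot;[0-9]&amp;quot;)    # 4 digits (2015)

# a lower-case letter followed by 2 more copies of itself
n_repeated &amp;lt;- str_count(toy, &amp;quot;([a-z])\\1{2}&amp;quot;)                      # 0: GGG is upper case
n_repeated_lower &amp;lt;- str_count(str_to_lower(toy), &amp;quot;([a-z])\\1{2}&amp;quot;)  # 1: ggg&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If upper-case runs should count as well, lower-casing the text before counting (as in the last line) is one simple option.&lt;/p&gt;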
&lt;p&gt;A little more checking.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# to help me see what other changes still have to be made
yo &amp;lt;- df_train %&amp;gt;% select(id, text) %&amp;gt;% 
  unnest_tokens(word, text) %&amp;gt;% 
  anti_join(stop_words %&amp;gt;% filter(lexicon == &amp;quot;snowball&amp;quot;)) %&amp;gt;% 
  count(word) %&amp;gt;% arrange(desc(n))

# I wanted to check what are the different stop-words dictionary 
# and see if it could make a difference. 
yo &amp;lt;- stop_words %&amp;gt;% group_by(lexicon) %&amp;gt;% summarize(n = n())

# just checking if everything is as expected
skimr::skim(df_train)&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#39;width: auto;&#39;
        class=&#39;table table-condensed&#39;&gt;
&lt;caption&gt;
Data summary
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Name
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
df_train
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Number of rows
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
7613
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Number of columns
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
12
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
_______________________
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Column type frequency:
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
character
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
numeric
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
11
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
________________________
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Group variables
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
None
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: character&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
skim_variable
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n_missing
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
complete_rate
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
min
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
max
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
empty
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n_unique
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
whitespace
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
text
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
157
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6890
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Variable type: numeric&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
skim_variable
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n_missing
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
complete_rate
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
mean
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
sd
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p0
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p25
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p50
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p75
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
p100
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
hist
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
id
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5441.93
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3137.12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2734
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5408
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8146
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10873
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▇▇▇▇
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
target
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.43
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.50
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▁▁▁▆
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_hashtag
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.45
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1.10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
13
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▁▁▁▁
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_number
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
2.04
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3.01
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
39
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▁▁▁▁
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_http
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.62
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.66
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▇▂▁▁
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_mention
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.36
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.72
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▁▁▁▁
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_location
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.67
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.47
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▃▁▁▁▇
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_keyword
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.99
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.09
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▁▁▁▁▇
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_repeated_char
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.02
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.17
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▇▁▁▁▁
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_char
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
83.20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
32.32
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
59
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
84
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
112
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
157
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▂▆▇▇▂
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
number_word
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
14.24
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
6.11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
14
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
18
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
34
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
▃▇▆▃▁
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Checking the tweets, I can see that some digits are left. I am not sure how that happened (any suggestion is welcome!)&lt;/p&gt;
&lt;p&gt;Also, I notice that around 2000 words appear at least 6 times (n &amp;gt;= 6).&lt;/p&gt;
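&lt;p&gt;One possible explanation for the leftover digits (an assumption, not verified on the actual tweets) is the order of the steps in clean_tweets(). The digits are stripped before the lemmatization and the Latin-ASCII transliteration, so any digit introduced by these later steps survives. The sketch below shows the idea for the transliteration step on a made-up string; whether the lemmatization also introduces digits is worth checking the same way.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)
library(stringi)

# made-up example: \u00bd is the single character &amp;quot;one half&amp;quot;
toy &amp;lt;- &amp;quot;rain 2\u00bd inches&amp;quot;

# same order as in clean_tweets(): digits first, transliteration second
step_1 &amp;lt;- str_replace_all(toy, &amp;quot;[0-9]&amp;quot;, &amp;quot;&amp;quot;)          # removes the 2, not the fraction character
step_2 &amp;lt;- stri_trans_general(step_1, &amp;quot;Latin-ASCII&amp;quot;)  # the fraction is spelled out in ASCII
str_detect(step_2, &amp;quot;[0-9]&amp;quot;)                          # TRUE if digits came back

# reversed order: transliterate first, then strip the digits
fixed &amp;lt;- str_replace_all(stri_trans_general(toy, &amp;quot;Latin-ASCII&amp;quot;), &amp;quot;[0-9]&amp;quot;, &amp;quot;&amp;quot;)
str_detect(fixed, &amp;quot;[0-9]&amp;quot;)                           # no digit can be left here&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If this is indeed the cause, moving the digit removal to the end of the mutate() call would be the smallest change to try.&lt;/p&gt;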
&lt;p&gt;Some tweets are in the dataset more than once, but … they do not have the same target value. Yes, that’s weird, and it is due to bad encoding. So we’ll use a voting system to make them equal: each group of identical tweets gets the majority target.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# majority vote: identical texts get the target held by more than half of their copies
yo &amp;lt;- df_train %&amp;gt;% group_by(text) %&amp;gt;% 
  mutate(mean_target = mean(target), 
         new_target = if_else(mean_target &amp;gt; 0.5, 1, 0)) %&amp;gt;% 
  ungroup()   # drop the grouping so later steps see a plain tibble

df_train &amp;lt;- yo %&amp;gt;% 
  mutate(target = new_target, 
         target_bin = factor(if_else(target == 1, &amp;quot;a_truth&amp;quot;, &amp;quot;b_false&amp;quot;))) %&amp;gt;% 
  select(-new_target, -mean_target, -target)&lt;/code&gt;&lt;/pre&gt;
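&lt;p&gt;The voting rule can be checked on a toy data frame (made-up rows, not from the dataset). Note a consequence of the strict &amp;gt; 0.5: when the copies of a tweet are split evenly, the tie goes to 0.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)

# made-up duplicates with conflicting labels
toy &amp;lt;- tibble(text   = c(&amp;quot;flood in town&amp;quot;, &amp;quot;flood in town&amp;quot;, &amp;quot;flood in town&amp;quot;, &amp;quot;nice day&amp;quot;, &amp;quot;nice day&amp;quot;), 
              target = c(1, 1, 0, 0, 1))

toy_voted &amp;lt;- toy %&amp;gt;% 
  group_by(text) %&amp;gt;% 
  mutate(mean_target = mean(target), 
         new_target = if_else(mean_target &amp;gt; 0.5, 1, 0)) %&amp;gt;% 
  ungroup()

toy_voted %&amp;gt;% distinct(text, mean_target, new_target)
# flood in town: 2 votes out of 3 for 1, so new_target = 1
# nice day: 1 vote out of 2, a tie, so new_target = 0&lt;/code&gt;&lt;/pre&gt;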
&lt;p&gt;Now, we start our modeling workflow.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-and-tuning-a-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating and tuning a model&lt;/h2&gt;
&lt;p&gt;There are 2 things we’d like to try in this step. Would lemmatizing or stemming work better? We’ll try both and keep the one with the best result. We also check whether a dimensionality reduction with PCA helps.&lt;/p&gt;
&lt;p&gt;Also, because we use tf_idf, should we normalize? Does this step add anything to our model?&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;recipe_tweet &amp;lt;- recipe(formula = target_bin ~ ., data = df_train) %&amp;gt;% 
  update_role(id, new_role = &amp;quot;ID&amp;quot;) %&amp;gt;% 
  step_normalize(contains(&amp;quot;number&amp;quot;), -id) %&amp;gt;% 
  step_tokenize(text) %&amp;gt;%        
  step_stopwords(text, stopword_source = &amp;quot;snowball&amp;quot;) %&amp;gt;%  
  step_tokenfilter(text, max_tokens = 2500) %&amp;gt;%  
  step_tfidf(text) %&amp;gt;%    
  step_pca(contains(&amp;quot;tfidf&amp;quot;), threshold = 0.95)

# to check how our df is now looking as it has been pre-processed. 
#df_train_processed &amp;lt;- recipe_tweet %&amp;gt;% prep() %&amp;gt;% juice()
#dim(df_train_processed)

registerDoParallel(cores = 16)

# we&amp;#39;ll assign 40 different values for our penalty. 
# we noticed earlier that best values are between penalties 0.001 and 0.005
grid_lambda &amp;lt;- expand.grid(penalty = seq(0.0017,0.005, length = 40)) 

# This time we&amp;#39;ll use 10-fold cross-validation, repeated twice
set.seed(0109)
folds_training &amp;lt;- vfold_cv(df_train, v = 10, repeats = 2) 

model_lasso &amp;lt;- logistic_reg(mode = &amp;quot;classification&amp;quot;, 
                            penalty = tune(), mixture = 1) %&amp;gt;% 
  set_engine(&amp;quot;glmnet&amp;quot;) 

# starting our workflow
wf_lasso &amp;lt;- workflow() %&amp;gt;% 
  add_recipe(recipe_tweet) %&amp;gt;% 
  add_model(model_lasso) 

# run a lasso regression on the cross-validation folds, on 40 different levels of penalty
tune_lasso &amp;lt;- tune_grid(
  wf_lasso, 
  resamples = folds_training, 
  grid = grid_lambda, 
  metrics = metric_set(roc_auc, f_meas, accuracy), 
  control = control_grid(verbose = TRUE)
) 

tune_lasso %&amp;gt;% collect_metrics() %&amp;gt;% 
  write_csv(&amp;quot;~/disaster_tweets/data/metrics_lasso_enhanced.csv&amp;quot;) 

tune_lasso %&amp;gt;% collect_metrics() %&amp;gt;% 
  group_by(.metric) %&amp;gt;% top_n(4, mean) %&amp;gt;% arrange(.metric, desc(mean)) %&amp;gt;% 
  kable() %&amp;gt;% 
  kable_styling(bootstrap_options = c(&amp;quot;striped&amp;quot;, &amp;quot;hover&amp;quot;), full_width = F, position = &amp;quot;center&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table table-striped&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
penalty
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
.metric
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
.estimator
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
mean
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
n
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
std_err
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0028000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8003415
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0036401
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0028846
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8001444
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0038903
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0032231
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8001444
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0037499
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0031385
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
accuracy
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8000787
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0037480
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0048308
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8343883
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0030775
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0049154
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8340468
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0030167
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0050000
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8340169
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0030807
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0032231
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
f_meas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8338193
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0031357
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0035615
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8588161
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0030108
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0036462
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8588096
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0030060
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0037308
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8588078
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0029970
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0038154
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
roc_auc
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
binary
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.8588028
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
20
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0.0029785
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Note 1: with max_tokens = 1750 and PCA at 95% of variance, the transformed df after the recipe steps is 7613 x 1237. We only do PCA on the words (not on the extra variables). Accuracy = 79.8% and F1 = 74.6%.&lt;/p&gt;
&lt;p&gt;Note 2: with max_tokens = 2000, PCA at a 95% threshold and no normalization, we got accuracy = 79.8% and F1 = 74.9%. The df is 7613 x 1369.&lt;/p&gt;
&lt;p&gt;Note 3: with max_tokens = 1750 and no PCA, we got accuracy = 79.9% and F1 = 74.2%.&lt;/p&gt;
&lt;p&gt;Note 4: with max_tokens = 2500, PCA at a 95% threshold and no normalization, we got accuracy = 80.0% and F1 = 74.98%. The df is 7613 x 1630. We get the exact same results with normalization.&lt;/p&gt;
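&lt;p&gt;To make the 95% threshold of step_pca() concrete: the step keeps enough components to cover that fraction of the total variance, which is why the processed df is narrower than the number of tokens. A minimal sketch of the same idea with base R prcomp() on simulated data (the matrix x is made up, and this is an illustration of the principle, not the internals of the recipe step):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

# simulated data: 200 rows, 10 columns, the last 5 being noisy copies of the first 5
base_cols &amp;lt;- matrix(rnorm(200 * 5), ncol = 5)
x &amp;lt;- cbind(base_cols, base_cols + matrix(rnorm(200 * 5, sd = 0.1), ncol = 5))

pca &amp;lt;- prcomp(x, center = TRUE, scale. = TRUE)

# cumulative share of the total variance carried by the first k components
cum_var &amp;lt;- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# smallest number of components reaching 95% of the variance
n_comp &amp;lt;- which(cum_var &amp;gt;= 0.95)[1]
n_comp   # fewer components than the 10 original columns&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The more redundant the columns are, the fewer components are needed to reach the threshold.&lt;/p&gt;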
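&lt;p&gt;All this fine tuning gave us a very small increase in both accuracy and F1 scores. There is also an increase in ROC-AUC. Keep in mind that these are cross-validated scores on the training data. Since the recipe is part of the workflow, normalization, tf-idf and PCA should be re-estimated within each resample, which limits leakage from those steps; the cleaning and the majority voting, however, were applied to the whole training set beforehand, so these might still be optimistic results. Let’s see.&lt;/p&gt;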
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;best_metric &amp;lt;- tune_lasso %&amp;gt;% select_best(metric = &amp;quot;f_meas&amp;quot;)

wf_lasso &amp;lt;- finalize_workflow(wf_lasso, best_metric)

wf_lasso&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## ══ Workflow ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_normalize()
## ● step_tokenize()
## ● step_stopwords()
## ● step_tokenfilter()
## ● step_tfidf()
## ● step_pca()
## 
## ── Model ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.00483076923076923
##   mixture = 1
## 
## Computational engine: glmnet&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#save the final lasso model
fit(wf_lasso, df_train) %&amp;gt;% write_rds(path = &amp;quot;~/disaster_tweets/data/model_lasso_enhanced.rds&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;variable-importances&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variable importances&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(vip)

read_rds(&amp;quot;~/disaster_tweets/data/model_lasso_enhanced.rds&amp;quot;) %&amp;gt;% pull_workflow_fit() %&amp;gt;% 
  vi(lambda = best_metric$penalty) %&amp;gt;% 
  group_by(Sign) %&amp;gt;%
  top_n(20, wt = abs(Importance)) %&amp;gt;%
  ungroup() %&amp;gt;%
  mutate(Importance = abs(Importance), 
         #Variable = str_remove(Variable, &amp;quot;tfidf_text_&amp;quot;), 
         Variable = fct_reorder(Variable, Importance)) %&amp;gt;%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = &amp;quot;free_y&amp;quot;) +
  labs(y = NULL, x = &amp;quot;Importance&amp;quot;)  &lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:vip-lasso-enhanced&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;/post/disaster-tweets-I/index_files/figure-html/vip-lasso-enhanced-1.png&#34; alt=&#34;Most important variables.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Most important variables.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;One thing worth noticing is that few of our engineered features made it into the top 20 of important variables.&lt;br /&gt;
What about all our fancy extra variables like the number of characters? The number of http links? The number of # and @?&lt;br /&gt;
Well …&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;read_rds(&amp;quot;~/disaster_tweets/data/model_lasso_enhanced.rds&amp;quot;) %&amp;gt;% pull_workflow_fit() %&amp;gt;% 
  vi(lambda = best_metric$penalty) %&amp;gt;% 
  filter(str_detect(Variable, pattern = &amp;quot;number&amp;quot;)) %&amp;gt;% 
  arrange(desc(Importance))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 9 x 3
##   Variable             Importance Sign 
##   &amp;lt;chr&amp;gt;                     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;
## 1 number_char              0.941  POS  
## 2 number_http              0.376  POS  
## 3 number_number            0.102  POS  
## 4 number_mention           0.0639 POS  
## 5 number_hashtag           0.0358 POS  
## 6 number_location         -0.0122 NEG  
## 7 number_keyword          -0.0584 NEG  
## 8 number_repeated_char    -0.0912 NEG  
## 9 number_word             -0.310  NEG&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ugh! They do not look that important! That’s pretty sad, I thought I was becoming THE feature engineering guy of NLP. Nope! Just humble pie instead…&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;submission-of-results-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Submission of results&lt;/h2&gt;
&lt;p&gt;We now apply the model to the test data. We first need to reprocess the data with the same cleaning function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test &amp;lt;- clean_tweets(&amp;quot;~/disaster_tweets/data/test.csv&amp;quot;) 

model_lasso_enhanced &amp;lt;- read_rds(&amp;quot;~/disaster_tweets/data/model_lasso_enhanced.rds&amp;quot;)

library(glmnet)

# predict() returns a tibble; keep only its .pred_class column
prediction_lasso_enhanced &amp;lt;- tibble(id = test$id, 
                                    prediction = predict(model_lasso_enhanced, new_data = test)$.pred_class) %&amp;gt;% 
  mutate(target = if_else(prediction == &amp;quot;a_truth&amp;quot;, 1, 0))

write_csv(prediction_lasso_enhanced %&amp;gt;% select(id, target), path = &amp;quot;~/disaster_tweets/data/prediction_lasso_enhanced.csv&amp;quot;)

rm(list = ls())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note 1: with maxtokens = 1750 and a 95% threshold for PCA, I got a 76.8% public score.&lt;/p&gt;
&lt;p&gt;Note 2: with maxtokens = 2000 and a 95% threshold for PCA, I actually got a worse accuracy score.&lt;/p&gt;
&lt;p&gt;Note 3: with majority voting, maxtokens = 1750 and NO PCA, I got 78.7%.&lt;/p&gt;
&lt;p&gt;Note 4: with majority voting, maxtokens = 2500 and PCA at 95% (the df is 1630 columns wide), I got 78.3% (exact same result with normalization).&lt;/p&gt;
&lt;p&gt;Note 5: with majority voting, maxtokens = 4000 and PCA at 90%, I got a 78.3% public score.&lt;/p&gt;
&lt;p&gt;Yep! That second attempt at feature engineering is only a small success. It added about 1 percentage point of accuracy to the submission (78.7% against 77.8% for the baseline). It took a lot of work to make it happen but did not bring much of an increase in F1 score. It’s part of the game!&lt;/p&gt;
&lt;p&gt;We need another method to turn our tweets into numbers. This is what we’ll do in Part II and III as we consider SVD and word embeddings.&lt;/p&gt;
&lt;p&gt;I am looking forward to comments and feedback.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wonderings-and-lessons-learned.&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Wonderings and lessons learned.&lt;/h1&gt;
&lt;p&gt;There are a few things I still wonder how to improve in the modelling workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the relationship between tf_idf and max-tokens in recipe. It seemed that increasing the max_tokens before the tf_idf step didn’t add much values. I am wondering until what point that is the case&lt;/li&gt;
&lt;li&gt;can we parallelize the map functions from the purrr package? It takes quite a bit of times for instance to do lemmatize each tweets using map(text, textstemm::lemmatize()). I do think that might be possible.&lt;/li&gt;
&lt;li&gt;can we parallelize the recipe() %&amp;gt;% prep() %&amp;gt;% juice()? I also find that steps very slow. I do not think that it is possible. The only reason I like to run that line is to check that the recipe steps are doing what I intended them to do&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any feedback or advice on the points above is really welcome ;-)&lt;/p&gt;
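&lt;p&gt;On the purrr question: in R, the furrr package provides &lt;code&gt;future_map()&lt;/code&gt; as a parallel counterpart to &lt;code&gt;map()&lt;/code&gt;. The general pattern, a map whose calls run concurrently while the output keeps the input order, can be sketched in Python as below. The &lt;code&gt;normalize&lt;/code&gt; function is a hypothetical stand-in for a slow per-tweet step such as lemmatization, and the sample tweets are made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from concurrent.futures import ThreadPoolExecutor


def normalize(text):
    """Toy stand-in for a slow per-tweet step such as lemmatization."""
    return " ".join(text.lower().split())


# Hypothetical sample tweets, for illustration only
tweets = ["Forest  FIRE near the lake", "Residents asked to SHELTER in place"]

# executor.map preserves the input order, like a sequential map
with ThreadPoolExecutor(max_workers=2) as executor:
    cleaned = list(executor.map(normalize, tweets))

print(cleaned)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Threads are enough for this toy example; for CPU-bound work in Python, &lt;code&gt;ProcessPoolExecutor&lt;/code&gt; from the same module exposes the same &lt;code&gt;map&lt;/code&gt; interface.&lt;/p&gt;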
&lt;p&gt;Lessons learned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The simplest model, from the second lasso attempt, was the best. None of the fancier, more computationally intensive attempts provided better results.&lt;/li&gt;
&lt;li&gt;Normalization has been recommended for Lasso models. In our case, normalizing after the tf_idf step or after the PCA didn’t change our model much. So, for the sake of parsimony, I would say I can skip these steps in the future.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;p&gt;I have copied ideas from several Kaggle notebooks and blogs.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extensive use of the &lt;strong&gt;textrecipes&lt;/strong&gt; library by Emil Hvitfeldt. Go check the posts on &lt;a href=&#34;https://www.hvitfeldt.me/post/&#34;&gt;his blog&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For ideas on how to use Lasso modeling on an NLP task with the tidymodels framework, see Julia Silge’s blog: &lt;a href=&#34;https://juliasilge.com/blog/animal-crossing/&#34;&gt;Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;For ideas on feature engineering the original tweets: &lt;a href=&#34;https://www.kaggle.com/barun2104/nlp-with-disaster-eda-dfm-svd-ensemble&#34;&gt;NLP with Disaster - EDA | DFM | SVD | Ensemble&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Writing technical content in Academic</title>
      <link>/post/writing-technical-content/</link>
      <pubDate>Fri, 12 Jul 2019 00:00:00 +0000</pubDate>
      <guid>/post/writing-technical-content/</guid>
      <description>&lt;p&gt;Academic is designed to give technical content creators a seamless experience. You can focus on the content and Academic handles the rest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Highlight your code snippets, take notes on math classes, and draw diagrams from textual representation.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On this page, you&amp;rsquo;ll find some examples of the types of technical content that can be rendered with Academic.&lt;/p&gt;
&lt;h2 id=&#34;examples&#34;&gt;Examples&lt;/h2&gt;
&lt;h3 id=&#34;code&#34;&gt;Code&lt;/h3&gt;
&lt;p&gt;Academic supports a Markdown extension for highlighting code syntax. You can enable this feature by toggling the &lt;code&gt;highlight&lt;/code&gt; option in your &lt;code&gt;config/_default/params.toml&lt;/code&gt; file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```python
import pandas as pd
data = pd.read_csv(&amp;quot;data.csv&amp;quot;)
data.head()
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import pandas as pd
data = pd.read_csv(&amp;quot;data.csv&amp;quot;)
data.head()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;math&#34;&gt;Math&lt;/h3&gt;
&lt;p&gt;Academic supports a Markdown extension for $\LaTeX$ math. You can enable this feature by toggling the &lt;code&gt;math&lt;/code&gt; option in your &lt;code&gt;config/_default/params.toml&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;To render &lt;em&gt;inline&lt;/em&gt; or &lt;em&gt;block&lt;/em&gt; math, wrap your LaTeX math with &lt;code&gt;$...$&lt;/code&gt; or &lt;code&gt;$$...$$&lt;/code&gt;, respectively.&lt;/p&gt;
&lt;p&gt;Example &lt;strong&gt;math block&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-tex&#34;&gt;$$\gamma_{n} = \frac{ 
\left | \left (\mathbf x_{n} - \mathbf x_{n-1} \right )^T 
\left [\nabla F (\mathbf x_{n}) - \nabla F (\mathbf x_{n-1}) \right ] \right |}
{\left \|\nabla F(\mathbf{x}_{n}) - \nabla F(\mathbf{x}_{n-1}) \right \|^2}$$
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;p&gt;$$\gamma_{n} = \frac{ \left | \left (\mathbf x_{n} - \mathbf x_{n-1} \right )^T \left [\nabla F (\mathbf x_{n}) - \nabla F (\mathbf x_{n-1}) \right ] \right |}{\left \|\nabla F(\mathbf{x}_{n}) - \nabla F(\mathbf{x}_{n-1}) \right \|^2}$$&lt;/p&gt;
&lt;p&gt;Example &lt;strong&gt;inline math&lt;/strong&gt; &lt;code&gt;$\nabla F(\mathbf{x}_{n})$&lt;/code&gt; renders as $\nabla F(\mathbf{x}_{n})$.&lt;/p&gt;
&lt;p&gt;Example &lt;strong&gt;multi-line math&lt;/strong&gt; using the &lt;code&gt;\\\\&lt;/code&gt; math linebreak:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-tex&#34;&gt;$$f(k;p_0^*) = \begin{cases} p_0^* &amp;amp; \text{if }k=1, \\\\
1-p_0^* &amp;amp; \text {if }k=0.\end{cases}$$
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;p&gt;$$f(k;p_0^*) = \begin{cases} p_0^* &amp;amp; \text{if }k=1, \\&lt;br&gt;
1-p_0^* &amp;amp; \text {if }k=0.\end{cases}$$&lt;/p&gt;
&lt;h3 id=&#34;diagrams&#34;&gt;Diagrams&lt;/h3&gt;
&lt;p&gt;Academic supports a Markdown extension for diagrams. You can enable this feature by toggling the &lt;code&gt;diagram&lt;/code&gt; option in your &lt;code&gt;config/_default/params.toml&lt;/code&gt; file or by adding &lt;code&gt;diagram: true&lt;/code&gt; to your page front matter.&lt;/p&gt;
&lt;p&gt;An example &lt;strong&gt;flowchart&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```mermaid
graph TD
A[Hard] --&amp;gt;|Text| B(Round)
B --&amp;gt; C{Decision}
C --&amp;gt;|One| D[Result 1]
C --&amp;gt;|Two| E[Result 2]
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-mermaid&#34;&gt;graph TD
A[Hard] --&amp;gt;|Text| B(Round)
B --&amp;gt; C{Decision}
C --&amp;gt;|One| D[Result 1]
C --&amp;gt;|Two| E[Result 2]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An example &lt;strong&gt;sequence diagram&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```mermaid
sequenceDiagram
Alice-&amp;gt;&amp;gt;John: Hello John, how are you?
loop Healthcheck
    John-&amp;gt;&amp;gt;John: Fight against hypochondria
end
Note right of John: Rational thoughts!
John--&amp;gt;&amp;gt;Alice: Great!
John-&amp;gt;&amp;gt;Bob: How about you?
Bob--&amp;gt;&amp;gt;John: Jolly good!
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-mermaid&#34;&gt;sequenceDiagram
Alice-&amp;gt;&amp;gt;John: Hello John, how are you?
loop Healthcheck
    John-&amp;gt;&amp;gt;John: Fight against hypochondria
end
Note right of John: Rational thoughts!
John--&amp;gt;&amp;gt;Alice: Great!
John-&amp;gt;&amp;gt;Bob: How about you?
Bob--&amp;gt;&amp;gt;John: Jolly good!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An example &lt;strong&gt;Gantt diagram&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```mermaid
gantt
section Section
Completed :done,    des1, 2014-01-06,2014-01-08
Active        :active,  des2, 2014-01-07, 3d
Parallel 1   :         des3, after des1, 1d
Parallel 2   :         des4, after des1, 1d
Parallel 3   :         des5, after des3, 1d
Parallel 4   :         des6, after des4, 1d
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-mermaid&#34;&gt;gantt
section Section
Completed :done,    des1, 2014-01-06,2014-01-08
Active        :active,  des2, 2014-01-07, 3d
Parallel 1   :         des3, after des1, 1d
Parallel 2   :         des4, after des1, 1d
Parallel 3   :         des5, after des3, 1d
Parallel 4   :         des6, after des4, 1d
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An example &lt;strong&gt;class diagram&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```mermaid
classDiagram
Class01 &amp;lt;|-- AveryLongClass : Cool
&amp;lt;&amp;lt;interface&amp;gt;&amp;gt; Class01
Class09 --&amp;gt; C2 : Where am i?
Class09 --* C3
Class09 --|&amp;gt; Class07
Class07 : equals()
Class07 : Object[] elementData
Class01 : size()
Class01 : int chimp
Class01 : int gorilla
class Class10 {
  &amp;lt;&amp;lt;service&amp;gt;&amp;gt;
  int id
  size()
}
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-mermaid&#34;&gt;classDiagram
Class01 &amp;lt;|-- AveryLongClass : Cool
&amp;lt;&amp;lt;interface&amp;gt;&amp;gt; Class01
Class09 --&amp;gt; C2 : Where am i?
Class09 --* C3
Class09 --|&amp;gt; Class07
Class07 : equals()
Class07 : Object[] elementData
Class01 : size()
Class01 : int chimp
Class01 : int gorilla
class Class10 {
  &amp;lt;&amp;lt;service&amp;gt;&amp;gt;
  int id
  size()
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An example &lt;strong&gt;state diagram&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;```mermaid
stateDiagram
[*] --&amp;gt; Still
Still --&amp;gt; [*]
Still --&amp;gt; Moving
Moving --&amp;gt; Still
Moving --&amp;gt; Crash
Crash --&amp;gt; [*]
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-mermaid&#34;&gt;stateDiagram
[*] --&amp;gt; Still
Still --&amp;gt; [*]
Still --&amp;gt; Moving
Moving --&amp;gt; Still
Moving --&amp;gt; Crash
Crash --&amp;gt; [*]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&#34;todo-lists&#34;&gt;Todo lists&lt;/h3&gt;
&lt;p&gt;You can even write your todo lists in Academic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;- [x] Write math example
- [x] Write diagram example
- [ ] Do something else
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Write math example&lt;/li&gt;
&lt;li&gt;&lt;input checked=&#34;&#34; disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Write diagram example&lt;/li&gt;
&lt;li&gt;&lt;input disabled=&#34;&#34; type=&#34;checkbox&#34;&gt; Do something else&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;tables&#34;&gt;Tables&lt;/h3&gt;
&lt;p&gt;Represent your data in tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;| First Header  | Second Header |
| ------------- | ------------- |
| Content Cell  | Content Cell  |
| Content Cell  | Content Cell  |
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;First Header&lt;/th&gt;
&lt;th&gt;Second Header&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Content Cell&lt;/td&gt;
&lt;td&gt;Content Cell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content Cell&lt;/td&gt;
&lt;td&gt;Content Cell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;asides&#34;&gt;Asides&lt;/h3&gt;
&lt;p&gt;Academic supports a 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/#alerts&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;shortcode for asides&lt;/a&gt;, also referred to as &lt;em&gt;notices&lt;/em&gt;, &lt;em&gt;hints&lt;/em&gt;, or &lt;em&gt;alerts&lt;/em&gt;. By wrapping a paragraph in &lt;code&gt;{{% alert note %}} ... {{% /alert %}}&lt;/code&gt;, it will render as an aside.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{% alert note %}}
A Markdown aside is useful for displaying notices, hints, or definitions to your readers.
{{% /alert %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    A Markdown aside is useful for displaying notices, hints, or definitions to your readers.
  &lt;/div&gt;
&lt;/div&gt;
&lt;h3 id=&#34;spoilers&#34;&gt;Spoilers&lt;/h3&gt;
&lt;p&gt;Add a spoiler to a page to reveal text, such as an answer to a question, after a button is clicked.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; spoiler text=&amp;quot;Click to view the spoiler&amp;quot; &amp;gt;}}
You found me!
{{&amp;lt; /spoiler &amp;gt;}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;div class=&#34;spoiler &#34; &gt;
  &lt;p&gt;
    &lt;a class=&#34;btn btn-primary&#34; data-toggle=&#34;collapse&#34; href=&#34;#spoiler-1&#34; role=&#34;button&#34; aria-expanded=&#34;false&#34; aria-controls=&#34;spoiler-1&#34;&gt;
      Click to view the spoiler
    &lt;/a&gt;
  &lt;/p&gt;
  &lt;div class=&#34;collapse card &#34; id=&#34;spoiler-1&#34;&gt;
    &lt;div class=&#34;card-body&#34;&gt;
      You found me!
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;h3 id=&#34;icons&#34;&gt;Icons&lt;/h3&gt;
&lt;p&gt;Academic enables you to use a wide range of 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/page-builder/#icons&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;icons from &lt;em&gt;Font Awesome&lt;/em&gt; and &lt;em&gt;Academicons&lt;/em&gt;&lt;/a&gt; in addition to 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/#emojis&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;emojis&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here are some examples using the &lt;code&gt;icon&lt;/code&gt; shortcode to render icons:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; icon name=&amp;quot;terminal&amp;quot; pack=&amp;quot;fas&amp;quot; &amp;gt;}} Terminal  
{{&amp;lt; icon name=&amp;quot;python&amp;quot; pack=&amp;quot;fab&amp;quot; &amp;gt;}} Python  
{{&amp;lt; icon name=&amp;quot;r-project&amp;quot; pack=&amp;quot;fab&amp;quot; &amp;gt;}} R
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;renders as&lt;/p&gt;
&lt;p&gt;
  &lt;i class=&#34;fas fa-terminal  pr-1 fa-fw&#34;&gt;&lt;/i&gt; Terminal&lt;br&gt;

  &lt;i class=&#34;fab fa-python  pr-1 fa-fw&#34;&gt;&lt;/i&gt; Python&lt;br&gt;

  &lt;i class=&#34;fab fa-r-project  pr-1 fa-fw&#34;&gt;&lt;/i&gt; R&lt;/p&gt;
&lt;h3 id=&#34;did-you-find-this-page-helpful-consider-sharing-it-&#34;&gt;Did you find this page helpful? Consider sharing it 🙌&lt;/h3&gt;
</description>
    </item>
    
    <item>
      <title>An example preprint / working paper</title>
      <link>/publication/preprint/</link>
      <pubDate>Sun, 07 Apr 2019 00:00:00 +0000</pubDate>
      <guid>/publication/preprint/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Display Jupyter Notebooks with Academic</title>
      <link>/post/jupyter/</link>
      <pubDate>Tue, 05 Feb 2019 00:00:00 +0000</pubDate>
      <guid>/post/jupyter/</guid>
      <description>&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from IPython.core.display import Image
Image(&#39;https://www.python.org/static/community_logos/python-logo-master-v3-TM-flattened.png&#39;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;./index_1_0.png&#34; alt=&#34;png&#34;&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;print(&amp;quot;Welcome to Academic!&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;Welcome to Academic!
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;install-python-and-jupyterlab&#34;&gt;Install Python and JupyterLab&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&#34;https://www.anaconda.com/distribution/#download-section&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Install Anaconda&lt;/a&gt; which includes Python 3 and JupyterLab.&lt;/p&gt;
&lt;p&gt;Alternatively, install JupyterLab with &lt;code&gt;pip3 install jupyterlab&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;create-or-upload-a-jupyter-notebook&#34;&gt;Create or upload a Jupyter notebook&lt;/h2&gt;
&lt;p&gt;Run the following commands in your Terminal, substituting &lt;code&gt;&amp;lt;MY-WEBSITE-FOLDER&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;SHORT-POST-TITLE&amp;gt;&lt;/code&gt; with the file path to your Academic website folder and a short title for your blog post (use hyphens instead of spaces), respectively:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;mkdir -p &amp;lt;MY-WEBSITE-FOLDER&amp;gt;/content/post/&amp;lt;SHORT-POST-TITLE&amp;gt;/
cd &amp;lt;MY-WEBSITE-FOLDER&amp;gt;/content/post/&amp;lt;SHORT-POST-TITLE&amp;gt;/
jupyter lab index.ipynb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;jupyter&lt;/code&gt; command above will launch the JupyterLab editor, allowing us to add Academic metadata and write the content.&lt;/p&gt;
&lt;h2 id=&#34;edit-your-post-metadata&#34;&gt;Edit your post metadata&lt;/h2&gt;
&lt;p&gt;The first cell of your Jupyter notebook will contain your post metadata (
&lt;a href=&#34;https://sourcethemes.com/academic/docs/front-matter/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;front matter&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In Jupyter, choose &lt;em&gt;Markdown&lt;/em&gt; as the type of the first cell and wrap your Academic metadata in three dashes, indicating that it is YAML front matter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;---
title: My post&#39;s title
date: 2019-09-01

# Put any other Academic metadata here...
---
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Edit the metadata of your post, using the 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;documentation&lt;/a&gt; as a guide to the available options.&lt;/p&gt;
&lt;p&gt;To set a 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#featured-image&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;featured image&lt;/a&gt;, place an image named &lt;code&gt;featured&lt;/code&gt; into your post&amp;rsquo;s folder.&lt;/p&gt;
&lt;p&gt;For other tips, such as using math, see the guide on 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;writing content with Academic&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;convert-notebook-to-markdown&#34;&gt;Convert notebook to Markdown&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&#34;language-bash&#34;&gt;jupyter nbconvert index.ipynb --to markdown --NbConvertApp.output_files_dir=.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id=&#34;example&#34;&gt;Example&lt;/h2&gt;
&lt;p&gt;This post was created with Jupyter. The original files can be found at &lt;a href=&#34;https://github.com/gcushen/hugo-academic/tree/master/exampleSite/content/post/jupyter&#34;&gt;https://github.com/gcushen/hugo-academic/tree/master/exampleSite/content/post/jupyter&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Slides</title>
      <link>/slides/example/</link>
      <pubDate>Tue, 05 Feb 2019 00:00:00 +0000</pubDate>
      <guid>/slides/example/</guid>
      <description>&lt;h1 id=&#34;create-slides-in-markdown-with-academic&#34;&gt;Create slides in Markdown with Academic&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Academic&lt;/a&gt; | 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;features&#34;&gt;Features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Efficiently write slides in Markdown&lt;/li&gt;
&lt;li&gt;3-in-1: Create, Present, and Publish your slides&lt;/li&gt;
&lt;li&gt;Supports speaker notes&lt;/li&gt;
&lt;li&gt;Mobile friendly slides&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;controls&#34;&gt;Controls&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Next: &lt;code&gt;Right Arrow&lt;/code&gt; or &lt;code&gt;Space&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Previous: &lt;code&gt;Left Arrow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Start: &lt;code&gt;Home&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finish: &lt;code&gt;End&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Overview: &lt;code&gt;Esc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Speaker notes: &lt;code&gt;S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Fullscreen: &lt;code&gt;F&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Zoom: &lt;code&gt;Alt + Click&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://github.com/hakimel/reveal.js#pdf-export&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PDF Export&lt;/a&gt;: &lt;code&gt;E&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;code-highlighting&#34;&gt;Code Highlighting&lt;/h2&gt;
&lt;p&gt;Inline code: &lt;code&gt;variable&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Code block:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;porridge = &amp;quot;blueberry&amp;quot;
if porridge == &amp;quot;blueberry&amp;quot;:
    print(&amp;quot;Eating...&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;math&#34;&gt;Math&lt;/h2&gt;
&lt;p&gt;In-line math: $x + y = z$&lt;/p&gt;
&lt;p&gt;Block math:&lt;/p&gt;
&lt;p&gt;$$
f\left( x \right) = \;\frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}}
$$&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;fragments&#34;&gt;Fragments&lt;/h2&gt;
&lt;p&gt;Make content appear incrementally&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press &lt;code&gt;Space&lt;/code&gt; to play!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
One
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Two&lt;/strong&gt;
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
Three
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A fragment can accept two optional parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;class&lt;/code&gt;: use a custom style (requires definition in custom CSS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weight&lt;/code&gt;: sets the order in which a fragment appears&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;speaker-notes&#34;&gt;Speaker Notes&lt;/h2&gt;
&lt;p&gt;Add speaker notes to your presentation&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press the &lt;code&gt;S&lt;/code&gt; key to view the speaker notes!&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Only the speaker can read these notes&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;S&lt;/code&gt; key to view&lt;/li&gt;
&lt;/ul&gt;

&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;themes&#34;&gt;Themes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;black: Black background, white text, blue links (default)&lt;/li&gt;
&lt;li&gt;white: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;league: Gray background, white text, blue links&lt;/li&gt;
&lt;li&gt;beige: Beige background, dark text, brown links&lt;/li&gt;
&lt;li&gt;sky: Blue background, thin dark text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;night: Black background, thick white text, orange links&lt;/li&gt;
&lt;li&gt;serif: Cappuccino background, gray text, brown links&lt;/li&gt;
&lt;li&gt;simple: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;solarized: Cream-colored background, dark green text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/img/boards.jpg&#34;
  &gt;

&lt;h2 id=&#34;custom-slide&#34;&gt;Custom Slide&lt;/h2&gt;
&lt;p&gt;Customize the slide style and background&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; slide background-image=&amp;quot;/img/boards.jpg&amp;quot; &amp;gt;}}
{{&amp;lt; slide background-color=&amp;quot;#0000FF&amp;quot; &amp;gt;}}
{{&amp;lt; slide class=&amp;quot;my-style&amp;quot; &amp;gt;}}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;custom-css-example&#34;&gt;Custom CSS Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s make headers navy colored.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;assets/css/reveal_custom.css&lt;/code&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://spectrum.chat/academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ask&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A Philosophical Odyssey in The Concept of Time</title>
      <link>/talk/timeodyssey/</link>
      <pubDate>Wed, 23 Jan 2019 11:00:00 +0000</pubDate>
      <guid>/talk/timeodyssey/</guid>
      <description>&lt;div class=&#34;responsive-wrap&#34;&gt;
  &lt;iframe src=&#34;https://docs.google.com/presentation/d/e/2PACX-1vQa6PAaw5udo0gszSTiniZ8-0hJ3h4k55CBG0e-zaG8y3-Jf6ZSjqYlHLnEF5tLYbUOrCo4Bp2_z8Av/embed?start=true&amp;amp;loop=false&amp;amp;delayms=5000&#34; frameborder=&#34;0&#34; width=&#34;960&#34; height=&#34;569&#34; allowfullscreen=&#34;true&#34; mozallowfullscreen=&#34;true&#34; webkitallowfullscreen=&#34;true&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Privacy Policy</title>
      <link>/privacy/</link>
      <pubDate>Thu, 28 Jun 2018 00:00:00 +0100</pubDate>
      <guid>/privacy/</guid>
      <description>&lt;p&gt;&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Terms</title>
      <link>/terms/</link>
      <pubDate>Thu, 28 Jun 2018 00:00:00 +0100</pubDate>
      <guid>/terms/</guid>
      <description>&lt;p&gt;&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>External Project</title>
      <link>/project/external-project/</link>
      <pubDate>Wed, 27 Apr 2016 00:00:00 +0000</pubDate>
      <guid>/project/external-project/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Internal Project</title>
      <link>/project/internal-project/</link>
      <pubDate>Wed, 27 Apr 2016 00:00:00 +0000</pubDate>
      <guid>/project/internal-project/</guid>
      <description>&lt;p&gt;Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.&lt;/p&gt;
&lt;p&gt;Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.&lt;/p&gt;
&lt;p&gt;Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.&lt;/p&gt;
&lt;p&gt;Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.&lt;/p&gt;
&lt;p&gt;Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Academic: the website builder for Hugo</title>
      <link>/post/getting-started/</link>
      <pubDate>Wed, 20 Apr 2016 00:00:00 +0000</pubDate>
      <guid>/post/getting-started/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Create a free website with Academic using Markdown, Jupyter, or RStudio. Choose a beautiful color theme and build anything with the Page Builder - over 40 &lt;em&gt;widgets&lt;/em&gt;, &lt;em&gt;themes&lt;/em&gt;, and &lt;em&gt;language packs&lt;/em&gt; included!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://academic-demo.netlify.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Check out the latest &lt;strong&gt;demo&lt;/strong&gt;&lt;/a&gt; of what you&amp;rsquo;ll get in less than 10 minutes, or 
&lt;a href=&#34;https://sourcethemes.com/academic/#expo&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;view the &lt;strong&gt;showcase&lt;/strong&gt;&lt;/a&gt; of personal, project, and business sites.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;👉 
&lt;a href=&#34;#install&#34;&gt;&lt;strong&gt;Get Started&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;📚 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;View the &lt;strong&gt;documentation&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;💬 
&lt;a href=&#34;https://discourse.gohugo.io&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;Ask a question&lt;/strong&gt; on the forum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;👥 
&lt;a href=&#34;https://spectrum.chat/academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Chat with the &lt;strong&gt;community&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;🐦 Twitter: 
&lt;a href=&#34;https://twitter.com/source_themes&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;@source_themes&lt;/a&gt; 
&lt;a href=&#34;https://twitter.com/GeorgeCushen&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;@GeorgeCushen&lt;/a&gt; 
&lt;a href=&#34;https://twitter.com/search?q=%23MadeWithAcademic&amp;amp;src=typd&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;#MadeWithAcademic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;💡 
&lt;a href=&#34;https://github.com/gcushen/hugo-academic/issues&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Request a &lt;strong&gt;feature&lt;/strong&gt; or report a &lt;strong&gt;bug&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;⬆️ &lt;strong&gt;Updating?&lt;/strong&gt; View the 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/update/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Update Guide&lt;/a&gt; and 
&lt;a href=&#34;https://sourcethemes.com/academic/updates/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Release Notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;❤️ &lt;strong&gt;Support development&lt;/strong&gt; of Academic:
&lt;ul&gt;
&lt;li&gt;☕️ 
&lt;a href=&#34;https://paypal.me/cushen&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;Donate a coffee&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;💵 
&lt;a href=&#34;https://www.patreon.com/cushen&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Become a backer on &lt;strong&gt;Patreon&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;🖼️ 
&lt;a href=&#34;https://www.redbubble.com/people/neutreno/works/34387919-academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Decorate your laptop or journal with an Academic &lt;strong&gt;sticker&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;👕 
&lt;a href=&#34;https://academic.threadless.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Wear the &lt;strong&gt;T-shirt&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;👩‍💻 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/contribute/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;Contribute&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;figure id=&#34;figure-academic-is-mobile-first-with-a-responsive-design-to-ensure-that-your-site-looks-stunning-on-every-device&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;https://raw.githubusercontent.com/gcushen/hugo-academic/master/academic.png&#34; data-caption=&#34;Academic is mobile first with a responsive design to ensure that your site looks stunning on every device.&#34;&gt;


  &lt;img src=&#34;https://raw.githubusercontent.com/gcushen/hugo-academic/master/academic.png&#34; alt=&#34;&#34;  &gt;
&lt;/a&gt;


  &lt;figcaption&gt;
    Academic is mobile first with a responsive design to ensure that your site looks stunning on every device.
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Key features:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Page builder&lt;/strong&gt; - Create &lt;em&gt;anything&lt;/em&gt; with 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/page-builder/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;widgets&lt;/strong&gt;&lt;/a&gt; and 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;elements&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edit any type of content&lt;/strong&gt; - Blog posts, publications, talks, slides, projects, and more!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create content&lt;/strong&gt; in 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/a&gt;, 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/jupyter/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;Jupyter&lt;/strong&gt;&lt;/a&gt;, or 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/install/#install-with-rstudio&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;RStudio&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plugin System&lt;/strong&gt; - Fully customizable 
&lt;a href=&#34;https://sourcethemes.com/academic/themes/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;color&lt;/strong&gt; and &lt;strong&gt;font themes&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Display Code and Math&lt;/strong&gt; - Code highlighting and 
&lt;a href=&#34;https://en.wikibooks.org/wiki/LaTeX/Mathematics&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;LaTeX math&lt;/a&gt; supported&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrations&lt;/strong&gt; - 
&lt;a href=&#34;https://analytics.google.com&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Google Analytics&lt;/a&gt;, 
&lt;a href=&#34;https://disqus.com&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Disqus commenting&lt;/a&gt;, Maps, Contact Forms, and more!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Beautiful Site&lt;/strong&gt; - Simple and refreshing one-page design&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Industry-Leading SEO&lt;/strong&gt; - Help get your website found on search engines and social media&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Media Galleries&lt;/strong&gt; - Display your images and videos with captions in a customizable gallery&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mobile Friendly&lt;/strong&gt; - Look amazing on every screen with a mobile-friendly version of your site&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-language&lt;/strong&gt; - 15+ language packs including English, 中文, and Português&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-user&lt;/strong&gt; - Each author gets their own profile page&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy Pack&lt;/strong&gt; - Assists with GDPR&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stand Out&lt;/strong&gt; - Bring your site to life with animation, parallax backgrounds, and scroll effects&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One-Click Deployment&lt;/strong&gt; - No servers. No databases. Only files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;themes&#34;&gt;Themes&lt;/h2&gt;
&lt;p&gt;Academic comes with &lt;strong&gt;automatic day (light) and night (dark) mode&lt;/strong&gt; built in. Alternatively, visitors can choose their preferred mode: click the sun/moon icon in the top right of the 
&lt;a href=&#34;https://academic-demo.netlify.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Demo&lt;/a&gt; to see it in action! Day/night mode can also be disabled by the site admin in &lt;code&gt;params.toml&lt;/code&gt;.&lt;/p&gt;
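&lt;p&gt;As a minimal sketch, the setting might look like this in &lt;code&gt;params.toml&lt;/code&gt;. The option name &lt;code&gt;day_night&lt;/code&gt; is an assumption based on recent Academic releases, so check the file shipped with your version:&lt;/p&gt;
&lt;pre class=&#34;toml&#34;&gt;&lt;code&gt;# Assumed option name; verify it against your own copy of params.toml.
# Set to false to disable day/night mode switching for visitors.
day_night = false&lt;/code&gt;&lt;/pre&gt;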
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/themes/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Choose a stunning &lt;strong&gt;theme&lt;/strong&gt; and &lt;strong&gt;font&lt;/strong&gt;&lt;/a&gt; for your site. Themes are fully 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/customization/#custom-theme&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;customizable&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;ecosystem&#34;&gt;Ecosystem&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;
&lt;a href=&#34;https://github.com/sourcethemes/academic-admin&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Academic Admin&lt;/a&gt;:&lt;/strong&gt; An admin tool to import publications from BibTeX or import assets for an offline site&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;
&lt;a href=&#34;https://github.com/sourcethemes/academic-scripts&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Academic Scripts&lt;/a&gt;:&lt;/strong&gt; Scripts to help migrate content to new versions of Academic&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;install&#34;&gt;Install&lt;/h2&gt;
&lt;p&gt;Choose one of the following four installation methods:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/install/#install-with-web-browser&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;strong&gt;one-click install using your web browser (recommended)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/install/#install-with-git&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;install on your computer using &lt;strong&gt;Git&lt;/strong&gt; with the Command Prompt/Terminal app&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/install/#install-with-zip&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;install on your computer by downloading the &lt;strong&gt;ZIP files&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/install/#install-with-rstudio&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;install on your computer with &lt;strong&gt;RStudio&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/get-started/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;personalize and deploy your new site&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;updating&#34;&gt;Updating&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/update/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;View the Update Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Feel free to &lt;em&gt;star&lt;/em&gt; the project on 
&lt;a href=&#34;https://github.com/gcushen/hugo-academic/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GitHub&lt;/a&gt; to help keep track of 
&lt;a href=&#34;https://sourcethemes.com/academic/updates&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;updates&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;license&#34;&gt;License&lt;/h2&gt;
&lt;p&gt;Copyright 2016-present 
&lt;a href=&#34;https://georgecushen.com&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;George Cushen&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Released under the 
&lt;a href=&#34;https://github.com/gcushen/hugo-academic/blob/master/LICENSE.md&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;MIT&lt;/a&gt; license.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>An example journal article</title>
      <link>/publication/journal-article/</link>
      <pubDate>Tue, 01 Sep 2015 00:00:00 +0000</pubDate>
      <guid>/publication/journal-article/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Cite&lt;/em&gt; button above to demo the feature that lets visitors import publication metadata into their reference management software.
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Hello R Markdown</title>
      <link>/post/2015-07-23-r-rmarkdown/</link>
      <pubDate>Thu, 23 Jul 2015 21:13:14 -0500</pubDate>
      <guid>/post/2015-07-23-r-rmarkdown/</guid>
      <description>


&lt;div id=&#34;r-markdown&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R Markdown&lt;/h1&gt;
&lt;p&gt;This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see &lt;a href=&#34;http://rmarkdown.rstudio.com&#34; class=&#34;uri&#34;&gt;http://rmarkdown.rstudio.com&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can embed an R code chunk like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
fit &amp;lt;- lm(dist ~ speed, data = cars)
fit
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;including-plots&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Including Plots&lt;/h1&gt;
&lt;p&gt;You can also embed plots. See Figure &lt;a href=&#34;#fig:pie&#34;&gt;1&lt;/a&gt; for an example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar = c(0, 1, 0, 1))
pie(
  c(280, 60, 20),
  c(&amp;#39;Sky&amp;#39;, &amp;#39;Sunny side of pyramid&amp;#39;, &amp;#39;Shady side of pyramid&amp;#39;),
  col = c(&amp;#39;#0292D8&amp;#39;, &amp;#39;#F7EA39&amp;#39;, &amp;#39;#C4B632&amp;#39;),
  init.angle = -50, border = NA
)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span id=&#34;fig:pie&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;/post/2015-07-23-r-rmarkdown_files/figure-html/pie-1.png&#34; alt=&#34;A fancy pie chart.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: A fancy pie chart.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An example conference paper</title>
      <link>/publication/conference-paper/</link>
      <pubDate>Mon, 01 Jul 2013 00:00:00 +0000</pubDate>
      <guid>/publication/conference-paper/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Cite&lt;/em&gt; button above to demo the feature that lets visitors import publication metadata into their reference management software.
  &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click the &lt;em&gt;Slides&lt;/em&gt; button above to demo Academic&amp;rsquo;s Markdown slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Supplementary notes can be added here, including 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code and math&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Example Talk</title>
      <link>/talk/example/</link>
      <pubDate>Sun, 01 Jun 2003 13:00:00 +0000</pubDate>
      <guid>/talk/example/</guid>
      <description>&lt;div class=&#34;alert alert-note&#34;&gt;
  &lt;div&gt;
    Click on the &lt;strong&gt;Slides&lt;/strong&gt; button above to view the built-in slides feature.
  &lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Slides can be added in a few ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Create&lt;/strong&gt; slides using Academic&amp;rsquo;s 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;&lt;em&gt;Slides&lt;/em&gt;&lt;/a&gt; feature and link them using the &lt;code&gt;slides&lt;/code&gt; parameter in the front matter of the talk file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload&lt;/strong&gt; an existing slide deck to &lt;code&gt;static/&lt;/code&gt; and link it using the &lt;code&gt;url_slides&lt;/code&gt; parameter in the front matter of the talk file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embed&lt;/strong&gt; your slides (e.g. Google Slides) or presentation video on this page using 
&lt;a href=&#34;https://sourcethemes.com/academic/docs/writing-markdown-latex/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;shortcodes&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
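&lt;p&gt;A minimal sketch of the first two options, written as TOML front matter. The deck name &lt;code&gt;example&lt;/code&gt; and the file path are hypothetical placeholders, and how these values are resolved may depend on your Academic version:&lt;/p&gt;
&lt;pre class=&#34;toml&#34;&gt;&lt;code&gt;+++
title = &amp;quot;Example Talk&amp;quot;

# Option 1: link a deck created with the Slides feature (hypothetical deck name).
slides = &amp;quot;example&amp;quot;

# Option 2: link a deck uploaded under static/ (hypothetical path).
# url_slides = &amp;quot;files/example-deck.pdf&amp;quot;
+++&lt;/code&gt;&lt;/pre&gt;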
&lt;p&gt;Further talk details can easily be added to this page using &lt;em&gt;Markdown&lt;/em&gt; and $\rm \LaTeX$ math code.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>