<html>
<head>
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/white.css">
<link rel="stylesheet" href="lib/css/zenburn.css">
<style>
.reveal h1 {
text-transform: none;
line-height: 1;
}
.reveal ul {
margin: 0;
}
.reveal li {
list-style-type: none;
}
.reveal p {
margin: 0;
margin-bottom: 0.5em;
}
.reveal pre {
box-shadow: none;
}
.reveal pre code {
padding: 25px;
}
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1>Patterns and selections</h1>
</section>
<section>
<p>Because so much of the web is templated and generated from data stored in databases, it's also <strong>predictable</strong> in its structure.</p>
</section>
<section>
<p>For example, take WordPress, the popular blogging content management system. At its most basic, it's designed to take blog posts stored in a database and display them. It does this by running a "loop" over all the posts to be displayed.</p>
</section>
<section>
<p>That loop looks like this:</p>
<img src="img/Screen Shot 2018-07-19 at 2.27.07 AM.png" alt="Screenshot of the WordPress loop">
</section>
<section>
<p>So every post ends up with essentially the same HTML structure.</p>
</section>
<section>
<p>Before we dive into that structure, let's start with something a bit more basic: text itself.</p>
</section>
<section>
<p>Just as HTML has structure, so does text itself (thanks, linguistics!): it's made up of letters, digits, spaces, word boundaries and so on.</p>
</section>
<section>
<p>A long time ago, someone figured out we could create a sort of "language" (it's not really a language) to search and query text using these formal characteristics. That's what we call <strong>regular expressions</strong>, or <strong>regex</strong> for short.</p>
</section>
<section>
<blockquote>Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.</blockquote>
</section>
<section>
<img style="height: 90%;" src="img/Screen Shot 2018-07-19 at 2.30.22 AM.png" alt="">
</section>
<section>
<p>Regex is often <strong>hard</strong> to write and difficult to read, so many people hate it. But it's powerful.</p>
</section>
<section>
<p>Here's an example of a regular expression (please bear with me!):</p>
<img src="img/Screen Shot 2018-07-19 at 2.38.22 AM.png" alt="Screenshot of an example regular expression">
</section>
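<section>
<p>As a small worked illustration (a sketch, not the only way to do it), the JavaScript below uses a regular expression to pull the date out of a filename. The input is one of the screenshot filenames from these slides; the variable names are purely illustrative.</p>

```javascript
// Input: a screenshot filename in the style used by these slides.
const filename = 'Screen Shot 2018-07-19 at 2.30.22 AM.png';

// \d matches one digit; {4} and {2} say how many digits to expect.
// Each pair of parentheses "captures" one piece so we can pull it out later.
const datePattern = /(\d{4})-(\d{2})-(\d{2})/;

// match() returns null when nothing matches, so fall back to an empty array.
const match = filename.match(datePattern);
const [whole, year, month, day] = match || [];

console.log(whole);            // → 2018-07-19
console.log(year, month, day); // → 2018 07 19
```

<p>The pattern describes the <strong>shape</strong> of the text (four digits, a hyphen, two digits, a hyphen, two digits) rather than any specific date, so it would match the date in the other screenshot filenames too.</p>
</section>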
<section>
<p>Let's write a few of our own together.</p>
<p><a href="https://regexone.com/lesson/introduction_abcs">https://regexone.com/lesson/introduction_abcs</a></p>
</section>
<section>
<p>Next up: selecting DOM nodes!</p>
</section>
<section>
<p>In the same way that we can use regex as a set of instructions to select text, we can use a certain sequence of text to select elements in the DOM.</p>
</section>
<section>
<p>There are two common ways of doing this: selectors and XPath.</p>
<ul>
<li><p>CSS selectors are far and away the more common of the two, because they're a bit easier to read and use the same syntax as stylesheets and JavaScript's <code>querySelectorAll</code>.</p></li>
<li><p>XPath looks a lot more like regex, and can be annoying to get right.</p></li>
</ul>
</section>
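<section>
<p>To make the comparison concrete, here is a minimal sketch of the same selection written both ways. The helper names and the <code>headline</code> class are hypothetical; <code>querySelectorAll</code> and <code>document.evaluate</code> are standard browser APIs.</p>

```javascript
// Collect the text of every node matching a CSS selector.
// `root` is anything with querySelectorAll; in the browser, pass `document`.
function textsFor(root, selector) {
  // querySelectorAll returns a NodeList; Array.from copies it and maps each node
  return Array.from(root.querySelectorAll(selector), node => node.textContent.trim());
}

// The same idea with XPath, using the browser's document.evaluate().
function textsForXPath(doc, xpath) {
  // 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
  const result = doc.evaluate(xpath, doc, null, 7, null);
  return Array.from(
    { length: result.snapshotLength },
    (_, i) => result.snapshotItem(i).textContent.trim()
  );
}

// In a browser console (the class name here is hypothetical):
//   textsFor(document, 'h2.headline')
//   textsForXPath(document, '//h2[contains(@class, "headline")]')
```

<p>Note that the XPath version is only roughly equivalent: <code>contains()</code> matches any class attribute that includes the substring, so it can select more elements than the CSS selector does.</p>
</section>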
<section>
<p>Both are useful and, luckily for us, modern browsers do much of the work: inspect an element in the developer tools and you can copy a selector or XPath for it straight from the right-click menu.</p>
</section>
<section>
<p>Let's use Dev Tools to look at some markup on The Globe and Mail's homepage and see if we notice some patterns.</p>
</section>
<section>
<p>Now that we've noticed a few, let's try to grab the text of every headline on The Globe's homepage.</p>
</section>
<section>
<pre>
<code data-trim>
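// forEach() returns undefined and a NodeList has no map(),
// so copy the matches into an array before collecting their text.
let data = Array.from(
  document.querySelectorAll('.o-card__hed-text')
).map(d => d.textContent);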
console.log(data);
</code>
</pre>
</section>
<section>
<p>With those four lines of code, we've just written our first scraper.</p>
</section>
<section>
<p>Exercise time: let's all pick a website, identify the structure and write a basic <code>document.querySelectorAll</code> query together to grab some nodes.</p>
</section>
<section>
<p>Lunch time!</p>
</section>
<section>
<p><strong>Previous section:</strong> <a href="./part-2-basics-of-markup.html">Part 2: The basics of markup</a></p>
<p><strong>Next section:</strong> <a href="./part-4-writing-your-first-scraper.html">Part 4: Writing your first scraper with rvest</a></p>
</section>
</div>
</div>
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
Reveal.initialize({
dependencies: [
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
]
});
</script>
</body>
</html>