<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Part 5: Offline document scraping</title>
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/white.css">
<link rel="stylesheet" href="lib/css/zenburn.css">
<style>
.reveal h1 {
text-transform: none;
line-height: 1;
}
.reveal ul {
margin: 0;
}
.reveal li {
list-style-type: none;
}
.reveal p {
margin: 0;
margin-bottom: 0.5em;
}
.reveal pre {
box-shadow: none;
}
.reveal pre code {
padding: 25px;
}
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1>Offline document scraping</h1>
</section>
<section>
<p>Phew! This is the last section of the day.</p>
</section>
<section>
<p>While scraping can get you a long way, sometimes there's no way around having to work with a traditional document, such as a PDF.</p>
</section>
<section>
<p>I'm a fan of the hybrid approach — build a scraper to bulk download files, then analyze them in a different tool. Let's try out bulk downloading now.</p>
</section>
<section>
<p><strong>pdf.R</strong></p>
</section>
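<section>
<p>pdf.R isn't reproduced on this slide, but a bulk download in R often looks roughly like the sketch below. The index URL and CSS selector here are placeholders, not the ones from the workshop script.</p>
<pre><code class="r">
# Hypothetical sketch: download every PDF linked from an index page.
# The URL and selector are placeholders, not the workshop's real ones.
library(rvest)

index_url <- "https://example.com/reports"  # placeholder index page
page <- read_html(index_url)

pdf_urls <- page %>%
  html_nodes("a[href$='.pdf']") %>%         # links that end in .pdf
  html_attr("href") %>%
  xml2::url_absolute(index_url)             # resolve relative links

dir.create("pdfs", showWarnings = FALSE)

for (pdf_url in pdf_urls) {
  dest <- file.path("pdfs", basename(pdf_url))
  if (!file.exists(dest)) {
    download.file(pdf_url, dest, mode = "wb")  # binary mode for PDFs
    Sys.sleep(1)                               # be polite to the server
  }
}
</code></pre>
</section>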
<section>
<p>I've used methods like this to download thousands of files at once. It can get pretty intense!</p>
</section>
<section>
<p>But once you've got your files, how do you extract information from them? Let's talk through the options.</p>
</section>
<section>
<h3>Tabula</h3>
</section>
<section>
<h3>Tesseract, pdfplumber, docs2csv</h3>
</section>
<section>
<h3>Adobe Acrobat</h3>
</section>
<section>
<p>Let's use Tabula and Acrobat to try to extract some data from the PDFs we just downloaded.</p>
</section>
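<section>
<p>If you'd rather stay in R for the Tabula step, the tabulizer package wraps the same extraction engine. A minimal sketch, with a placeholder file name:</p>
<pre><code class="r">
# Hypothetical sketch using tabulizer, an R wrapper around Tabula's engine.
# "pdfs/report.pdf" is a placeholder, not a file from the workshop.
library(tabulizer)

tables <- extract_tables("pdfs/report.pdf")  # one matrix per detected table
first_table <- as.data.frame(tables[[1]])    # convert for cleaning in R

write.csv(first_table, "report-table.csv", row.names = FALSE)
</code></pre>
</section>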
<section>
<p>That's it! We'll take a short break, then get down to writing our own scrapers.</p>
<img src="img/giphy (1).gif" alt="">
</section>
<section>
<p><strong>Previous section:</strong> <a href="./part-4-writing-your-first-scraper.html">Part 4: Writing your first scraper with rvest</a></p>
</section>
</div>
</div>
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
Reveal.initialize({
dependencies: [
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
]
});
</script>
</body>
</html>