<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Part 5: Offline document scraping</title>
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/white.css">
<link rel="stylesheet" href="lib/css/zenburn.css">
<style>
.reveal h1 {
text-transform: none;
line-height: 1;
}
.reveal ul {
margin: 0;
}
.reveal li {
list-style-type: none;
}
.reveal p {
margin: 0;
margin-bottom: 0.5em;
}
.reveal pre {
box-shadow: none;
}
.reveal pre code {
padding: 25px;
}
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1>Offline document scraping</h1>
</section>
<section>
<p>Phew! This is the last section of the day.</p>
</section>
<section>
<p>While scraping can get you a long way, sometimes there's no way around having to work with a traditional document, such as a PDF.</p>
</section>
<section>
<p>I'm a fan of the hybrid approach — build a scraper to bulk download files, then analyze them in a different tool. Let's try out bulk downloading now.</p>
</section>
<section>
<p><strong>pdf.R</strong></p>
</section>
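<section>
<p>pdf.R isn't reproduced on this slide, but a bulk download in R often looks roughly like the sketch below. The index URL and CSS selector here are placeholders, not the ones from the workshop script.</p>
<pre><code class="r">
# Hypothetical sketch: download every PDF linked from an index page.
# The URL and selector are placeholders, not the workshop's real ones.
library(rvest)

index_url <- "https://example.com/reports"  # placeholder index page
page <- read_html(index_url)

pdf_urls <- page %>%
  html_nodes("a[href$='.pdf']") %>%         # links that end in .pdf
  html_attr("href") %>%
  xml2::url_absolute(index_url)             # resolve relative links

dir.create("pdfs", showWarnings = FALSE)

for (pdf_url in pdf_urls) {
  dest <- file.path("pdfs", basename(pdf_url))
  if (!file.exists(dest)) {
    download.file(pdf_url, dest, mode = "wb")  # binary mode for PDFs
    Sys.sleep(1)                               # be polite to the server
  }
}
</code></pre>
</section>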
<section>
<p>I've used methods like this to download thousands of files at once. It can get pretty intense!</p>
</section>
<section>
<p>But once you've got your files, how do you extract information from them? Let's talk through the options.</p>
</section>
<section>
<h3>Tabula</h3>
</section>
<section>
<h3>Tesseract, pdfplumber, docs2csv</h3>
</section>
<section>
<h3>Adobe Acrobat</h3>
</section>
<section>
<p>Let's use Tabula and Acrobat to try to extract some data from the PDFs we just downloaded.</p>
</section>
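<section>
<p>If you'd rather stay in R for the Tabula step, the tabulizer package wraps the same extraction engine. A minimal sketch, with a placeholder file name:</p>
<pre><code class="r">
# Hypothetical sketch using tabulizer, an R wrapper around Tabula's engine.
# "pdfs/report.pdf" is a placeholder, not a file from the workshop.
library(tabulizer)

tables <- extract_tables("pdfs/report.pdf")  # one matrix per detected table
first_table <- as.data.frame(tables[[1]])    # convert for cleaning in R

write.csv(first_table, "report-table.csv", row.names = FALSE)
</code></pre>
</section>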
<section>
<p>That's it! We'll take a short break, then get down to writing our own scrapers.</p>
<img src="img/giphy (1).gif" alt="">
</section>
<section>
<p><strong>Previous section:</strong> <a href="./part-4-writing-your-first-scraper.html">Part 4: Writing your first scraper with rvest</a></p>
</section>
</div>
</div>
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
Reveal.initialize({
dependencies: [
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
]
});
</script>
</body>
</html>