\chapter{Sharding with Facebook}
\index{Facebook}\index{social media}Facebook,
with billions of users and a profound, worldwide impact
on people, social interaction, commerce, and institutions,
is a major force far beyond its beginnings
as a way for college students to socialize online.
Having so many users poses tremendous technical challenges,
and Facebook's success is partly due to
the ability of its engineers to deal with these problems.
Many important technical innovations
have made the Facebook website possible.
Some of them are discussed in this chapter.
\section{The Technical Challenge}
To understand the challenges facing Facebook,
let's consider just one of its main features.
Facebook users can post frequent updates of their doings,
and they can view the postings of their friends through online connections
accessible from ubiquitous devices such as laptops, tablets, and cell phones.
Making this possible requires two lists of information for each user:
(1)~a list of the user's friends and
(2)~a list of the user's status updates.
It's not unusual for these lists to have hundreds of items,
so Facebook has to make hundreds of billions of items
quickly accessible online to billions of people.
That's this year. Next year, it may be trillions of items.
Traditionally, this data would be stored in a \index{database}database
using two tables, one for the status \index{update, database}updates
and one for the friends.
Figure~\ref{facebook-tables} (page \pageref{facebook-tables})
shows what these tables could look like.
Database software inserts, deletes, and modifies
rows on these tables as they evolve,
and supports sophisticated \index{query}queries that can
retrieve information from the tables.
For example, Facebook could use the following query to determine
what status updates to display when a specific user, say John, logs in:
\begin{code}
\begin{verbatim}
SELECT s.User, s.Time, s.Status
FROM Friends f, Statuses s
WHERE f.User='John'
AND f.Friend = s.User
ORDER BY s.Time DESC
\end{verbatim}
\end{code}
This approach would work very well if Facebook had a small number of users,
but it does not scale to billions of users.
The problem is that a traditional database cannot process
this query quickly enough to keep up with a continuous
flood of online demand.
Facebook is not the only company that has this problem.
For example, \index{Amazon}\index{store front}Amazon
offers millions of items for sale.
It encourages customers to write reviews for each item, and it keeps a history
of purchases made by its customers so it can keep them informed about the status of orders,
make suggestions for other purchases, and provide other services.
If the information were stored in traditional database tables,
retrieving it would take too long.
In fact, this is a problem faced by any
\index{Web 2.0 company}Web 2.0 organization.\footnote{A
Web 2.0 organization is one that allows large numbers
of users to produce content for its website.}
The web content that their users produce
can take the form of passages created directly by users,
such as Facebook updates or Amazon reviews, or it can be content created
indirectly, such as Amazon purchase recommendations.
When a Web 2.0 company is successful,
its users produce much more content than traditional databases can handle.
\begin{figure}
\begin{center}
\begin{tabular}[t]{ll}
\hline
\multicolumn{2}{c}{\textsc{Friends}} \\
\hline
\multicolumn{1}{c}{\textbf{User}} & \multicolumn{1}{c}{\textbf{Friend}} \\
\hline
John & Sally \\
John & Mary \\
Mary & Sally \\
Sally & David \\
\hline
\end{tabular}
\hspace{.5in}
\begin{tabular}[t]{llp{1in}}
\hline
\multicolumn{3}{c}{\textsc{Statuses}} \\
\hline
\multicolumn{1}{c}{\textbf{User}} & \multicolumn{1}{c}{\textbf{Time}} & \multicolumn{1}{c}{\textbf{Status}} \\
\hline
John & Apr 21, 2011, 10:27 am & \raggedright Checked into Starbucks in Norman. \tabularnewline
Sally & Apr 21, 2011, 10:29 am & \raggedright Saw \emph{Battle: Los Angeles} last night. What a waste! \tabularnewline
Mary & Apr 21, 2011, 10:32 am & \raggedright Is anybody going to the carnival this weekend? \tabularnewline
Sally & Apr 21, 2011, 10:33 am & \raggedright Looks like the fires are getting closer to our house. Thinking about evacuating. \tabularnewline
\hline
\end{tabular}
\end{center}
\index{status update, database}\index{update, database}
\caption{Tables for Facebook-style status updates.}
\label{facebook-tables}
\end{figure}
\section{Stopgap Remedies}
\subsection{Caching}
Web 2.0 companies cannot use traditional databases
because, for example, it would take too long to retrieve information needed
to display a user's welcome screen.
One way to address this problem is to limit the number of queries that the
database needs to carry out.
For example, if many users check their welcome screens
several times within a span of a few minutes,
the computer can remember the results of previous inquiries
instead of asking the database to
retrieve each welcome screen every time a user asks for it.
That's called \emph{caching} the data.
A \index{cache}cache is a key/value store
in which data (values) are associated with keys
that facilitate locating the data,
like the search-tree keys of chapter \ref{ch:search-trees}
or the hash keys of chapter \ref{ch:hash-tables}.
When a query arrives, the database first checks
the cache to see if the information is already there.
If it is, the cached result is simply reused.
Otherwise, the database performs the query against its underlying store
and puts the results in the cache so they can be reused later.
A cache is much smaller than the underlying store,
so data in the cache is frequently flushed and replaced with other active data.
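To make the pattern concrete, here is a minimal sketch of a read-through cache in Python.
It is only an illustration, not Facebook's actual code:
the cache is an ordinary dictionary, and the hypothetical \texttt{run\_query} function
stands in for a call to the underlying database.
\begin{code}
\begin{verbatim}
# A minimal read-through cache: check the cache first, fall back to
# the database, and remember the result for next time.
cache = {}                    # key/value store (here, a plain dict)

def run_query(sql):
    # Stand-in for a real database call; returns rows for the query.
    print("hitting the database for:", sql)
    return [("Sally", "Saw Battle: Los Angeles last night.")]

def cached_query(sql):
    if sql in cache:          # already answered recently?
        return cache[sql]     # reuse the stored result
    result = run_query(sql)   # otherwise ask the database ...
    cache[sql] = result       # ... and keep the answer around
    return result

cached_query("SELECT ... WHERE User='John'")  # goes to the database
cached_query("SELECT ... WHERE User='John'")  # served from the cache
\end{verbatim}
\end{code}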
Caching is useful when three conditions are met.
First, putting information in the cache and retrieving it
must be faster than ordinary database operations.
Much faster, so there is a significant gain
when the results are in the cache to make up for the delay when they're not.
Second, interactions with the database must frequently repeat the same transactions,
so that results stored in the cache are often reused.
Otherwise, caching data is a waste of time because it will
have been flushed from the cache before a second or third request arrives.
Third, \index{retrieval, database}retrieval
rather than update must be the dominant type of database transaction.
Updates can make the data in the cache inconsistent with information in the database,
forcing it to be retrieved again from the underlying store,
just as if there were no cache at all.
If it's not retrievals but updates that are the dominant transaction,
caching is a waste of time and effort.
Caching is used throughout the web.
It is especially successful in storefront applications,
where database queries often concern details about a particular product.
Storefronts offer hundreds of thousands of products,
but only a handful of them are really popular.
So, there are frequent requests for the same information,
which makes the cache effective.
Caching worked well for Amazon, at least before product reviews
and recommendations became prevalent on its product pages.
But caching cannot work well for Facebook.
Users look at their welcome pages on an individual basis,
and they make frequent postings. Retrieval of information
does not dominate the pattern of transactions,
and it is not the case that many incoming requests
are for retrieval of the same data, over and over,
the way it is with a storefront operation.
Besides that, one update on Facebook can trigger changes in many pages of the website.
This is typical in Web 2.0 applications,
and this need for frequent updates of customized information
for every user eliminates the advantages of caching.
Another solution is needed.
\subsection{Sharding}
\index{sharding}\emph{Sharding} splits a database into many different databases.
For example, John's Facebook friends and status updates may be stored in machine J, whereas
Mary's friends and status updates are stored in machine M.
Machines like J and M that store just a portion of the data are called \emph{shards}.
Because the database is stored on many different computers,
no one computer has to shoulder the entire load.
That's the upside.
The downside is that it makes it harder
to automate transactions with the database.
Programmers have to specify how to distribute
individual queries across all of the many computers
that may be involved in resolving the query.
To see how this might work,
suppose the computer needs to generate John's welcome page.
The first step might be to find John's friends,
which can be done by executing a query on machine J.
\begin{code}
\begin{verbatim}
SELECT f.Friend
FROM Friends f
WHERE f.User='John'
\end{verbatim}
\end{code}
The query returns John's friends, Sally and Mary.
The next step is to find Sally's and Mary's status updates.
That leads to the following queries:
\begin{code}
\begin{verbatim}
SELECT s.Time, s.Status
FROM Statuses s
WHERE s.User='Sally'
ORDER BY s.Time DESC
SELECT s.Time, s.Status
FROM Statuses s
WHERE s.User='Mary'
ORDER BY s.Time DESC
\end{verbatim}
\end{code}
The first query should run on machine S,
whereas the second query should run on machine M.
The final step is to merge the results of the two queries
into a single list, keeping it in reverse chronological order.
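In code, this scatter-and-merge step might look like the following Python sketch.
It is only an illustration, not Facebook's implementation:
the hypothetical \texttt{query\_shard} function stands in for sending one of the
SQL queries above to a single shard.
\begin{code}
\begin{verbatim}
# Hypothetical per-shard lookup: returns (time, user, status) rows
# for one user's updates from the shard that stores that user.
def query_shard(shard, user):
    return shard.get(user, [])

# Each shard holds the statuses of the users assigned to it.
shard_s = {"Sally": [("10:33", "Sally", "Fires getting closer..."),
                     ("10:29", "Sally", "Saw Battle: Los Angeles...")]}
shard_m = {"Mary":  [("10:32", "Mary", "Anybody going to the carnival?")]}

# Scatter: one query per friend, sent to that friend's shard.
rows = query_shard(shard_s, "Sally") + query_shard(shard_m, "Mary")

# Gather: merge into a single list in reverse chronological order.
rows.sort(key=lambda row: row[0], reverse=True)
for time, user, status in rows:
    print(time, user, status)
\end{verbatim}
\end{code}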
Queries like this show one of the shortcomings of sharding.
Each query retrieves results from only one table on one shard,
because the related \index{record, database}records
are not necessarily stored in the same shard.
In this particular example, the information needed to answer the query was
distributed across shards J, S, and M.
So the program had to collect the information
from all the different sources and then combine it.
That makes sharding more complicated
than keeping all the information in one place,
as in a traditional database.
Sharding also suffers from uneven distribution.
What if one of the shards ends up with too many records?
That shard would need to be split into pieces.
The system could, for example,
split shard M into two shards, Ma--Mp and Mq--Mz.
That seems like a good idea, but in practice splitting shards
is complicated, because the software on every computer that accesses the data
must be updated so that it can find the shard
containing the information it is looking for.
\section{The Cassandra Solution}
Faced with these difficulties, Facebook engineers developed a solution that
retained the benefits of sharding while avoiding some of its drawbacks.
The goal was to make it easy to split a shard into multiple pieces
and to hide the complexity of sharding from the software.
\index{Cassandra}Cassandra, the solution they devised, combines features
from the Dynamo project at Amazon and the BigTable project at Google.
From Dynamo, Cassandra borrows the idea of a replication ring,
and from BigTable, a data model.
Cassandra's data model groups records into different tables.
Each record in a table is identified with a key.
The key must be unique in a given table,
but the same key may be used in different tables.
Each record consists of one or more name/value pairs, called columns, and
different records in a Cassandra table may have different columns.
For example, John's friends may be stored in a record like the one shown in figure~\ref{users-table}.
The important thing is that a program can retrieve
all of John's friends by requesting the single record with ID John.
Once John's friends are known, it is necessary to retrieve their status updates.
This can be done by looking for the records in the status table that have the appropriate record IDs.
Figure~\ref{status-table} (page \pageref{status-table})
shows what the status table could look like.
The table structures we have presented assume that all of the fields fit in a single record.
That is, we assume that a single record can hold all of a user's friends or all of a user's status updates.
Cassandra records are designed to hold thousands of fields,
which is enough for most users, but not for the heaviest users of Facebook.
To deal with the heaviest users, Facebook can reuse the numbering idea
behind the column names at the level of record IDs:
to support an arbitrary number of friends or status updates,
the values can be spread across multiple records,
with IDs such as John1, John2, and so on.
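One rough way to picture this data model is as a dictionary of records,
each of which is itself a dictionary of named fields.
The Python sketch below is only an illustration, not Cassandra's actual interface;
the numbered column names and the spill-over record IDs are the ones used in the discussion above,
and the extra friends in the John1 record are made up for the example.
\begin{code}
\begin{verbatim}
# Friends table: each record is keyed by a record ID and holds
# numbered columns; different records may have different columns.
friends = {
    "John":  {"Friend1": "Sally", "Friend2": "Mary"},
    "Mary":  {"Friend1": "Sally"},
    "Sally": {"Friend1": "David"},
}

# For a very heavy user, the same numbering idea spills the data
# across several records: John1, John2, and so on.
friends["John1"] = {"Friend1": "Ann", "Friend2": "Bob"}

# Retrieving all of John's friends means reading the John record
# (and, for heavy users, any continuation records such as John1).
all_friends = list(friends["John"].values())
all_friends += list(friends.get("John1", {}).values())
print(all_friends)
\end{verbatim}
\end{code}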
\begin{figure}
\begin{center}
\begin{tabular}[t]{lll}
\hline
\multicolumn{3}{c}{\textsc{Friends}} \\
\hline
\multicolumn{1}{c}{\textbf{Record ID}} & \multicolumn{1}{c}{\textbf{Friend1}} & \multicolumn{1}{c}{\textbf{Friend2}} \\
\hline
John & Sally & Mary \\
Mary & Sally & \\
Sally & David & \\
\hline
\end{tabular}
\end{center}
\caption{Storing friends lists in Cassandra.}
\label{users-table}
\end{figure}
\begin{figure}
\begin{center}
\begin{tabular}[t]{lp{.8in}p{.8in}p{1in}p{1in}}
\hline
\multicolumn{5}{c}{\textsc{Statuses}} \\
\hline
\multicolumn{1}{c}{\textbf{Record ID}} & \multicolumn{1}{c}{\textbf{Time1}} & \multicolumn{1}{c}{\textbf{Time2}}
& \multicolumn{1}{c}{\textbf{Status1}} & \multicolumn{1}{c}{\textbf{Status2}} \\
\hline
Sally & \raggedright Apr 21, 2011,\hfill\break10:29 am & \raggedright Apr 21, 2011,\hfill\break10:33 am &
\raggedright Saw \emph{Battle: Los Angeles} last night. What a waste! &
\raggedright Looks like the fires are getting closer to our house. Thinking about evacuating. \tabularnewline
\hline
\end{tabular}
\end{center}
\caption{Storing status updates in Cassandra.}
\label{status-table}
\end{figure}
The upshot is that the workflow for retrieving information from Cassandra
is similar to the workflow for sharding, but with a major difference:
the queries that are generated do not need to be aware
of which shard contains the information they need.
In fact, Cassandra relies on sharding both for performance and for replication.
The key innovation is that the shards are arranged in a ring.
For simplicity, assume that the shards are labeled A, B, C, \dots, Z.
The ring arrangement means that each shard is connected to the next,
and eventually the last shard is connected to the first.
For example, A could be connected to B, B to C, and so on, until Z is connected back to A.
Records are assigned to shards in an order determined by a hash function
that computes a hash value from the key of a record.
The hash value is used to select a shard: each shard is responsible for
the hash values from its own label up to, but not including,
the label of the next shard around the ring.
In the example, hash values from A up to B are mapped to shard A,
those from B up to C to shard B, and so on;
shard Z is responsible for the values from Z onward, wrapping around to A.
The hash function and the mapping of hash values to shards are known to every shard,
so a program can ask any of the machines to retrieve a given value.
If the machine does not have the value,
it can determine the shard containing the value
and forward the request to that shard.
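The following Python sketch illustrates this lookup scheme under the assumptions above,
namely single-letter shard labels and a simple hash;
it shows the idea, not Cassandra's actual code.
\begin{code}
\begin{verbatim}
import bisect
import hashlib

# Shard labels arranged around the ring in sorted order:
# the full alphabet A through Z, as in the example.
ring = [chr(ord("A") + i) for i in range(26)]

def hash_value(key):
    # Map a record key to a two-letter position such as "MQ",
    # so that hash values can fall between shard labels.
    n = int(hashlib.md5(key.encode()).hexdigest(), 16) % (26 * 26)
    return chr(ord("A") + n // 26) + chr(ord("A") + n % 26)

def shard_for(key):
    # A shard owns the hash values from its own label up to (but not
    # including) the next label; bisect finds the label at or before
    # the value, and values past "Z" belong to the last shard.
    return ring[bisect.bisect_right(ring, hash_value(key)) - 1]

# Any machine can compute the owning shard for a key and forward
# the request there if it is not the owner itself.
for user in ("John", "Sally", "Mary"):
    print(user, "->", shard_for(user))
\end{verbatim}
\end{code}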
The ring makes it easier to rebalance the shards
in case one of them becomes too large.
Suppose, for example, shard B gets too large.
To balance it, a new shard, say BM, is created and inserted between B and C.
During the insertion process, B sends to shard BM all of its records
whose hash values fall between BM and C,
keeping for itself the records with hash values between B and BM.
When this process completes, all shards are notified of the new
shard's existence and shard BM joins the ring.
This can happen without the knowledge of any programs
that are retrieving data from the system.
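Continuing the sketch above, the following shows which records change hands when a new shard
labeled BM is inserted between B and C;
the hash values and record contents are made up for the example.
\begin{code}
\begin{verbatim}
import bisect

# Shard B currently owns the hash values from "B" up to (but not
# including) "C", shown here as a small dictionary from hash values
# to records (contents elided).
ring = ["A", "B", "C", "D"]
records_on_b = {"BD": "...", "BK": "...", "BQ": "...", "BW": "..."}

def split_shard(ring, records, new_label):
    bisect.insort(ring, new_label)  # the new shard joins the ring
    # Records at or past the new label now belong to the new shard.
    handoff = {h: r for h, r in records.items() if h >= new_label}
    for h in handoff:
        del records[h]              # the old shard keeps the rest
    return handoff

moved = split_shard(ring, records_on_b, "BM")
print(sorted(moved))          # ['BQ', 'BW'] are sent to shard BM
print(sorted(records_on_b))   # ['BD', 'BK'] stay on shard B
print(ring)                   # ['A', 'B', 'BM', 'C', 'D']
\end{verbatim}
\end{code}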
Cassandra also uses the ring for \index{replication, database}replication.
Records that are stored in shard A are also stored in B and C,
the next two shards around the ring.
This is important because computers and disks can fail;
if shard A fails, two more copies of its data survive.
Replication can also improve performance during spikes:
if shard A becomes busy, shards B and C can take over some of the load.
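The placement of replicas can be sketched in the same way;
again, this illustrates the idea rather than Cassandra's actual code.
\begin{code}
\begin{verbatim}
# Each shard's records are copied to the next two shards around
# the ring, wrapping from the last shard back to the first.
ring = ["A", "B", "C", "D", "E"]

def replica_shards(ring, shard, copies=2):
    i = ring.index(shard)
    return [ring[(i + k) % len(ring)] for k in range(1, copies + 1)]

print(replica_shards(ring, "A"))   # ['B', 'C']
print(replica_shards(ring, "D"))   # ['E', 'A']  (wraps around)
\end{verbatim}
\end{code}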
Replication complicates the splitting of shards.
When shard B is split into B and BM, for example, this also affects shards C and D
because they store replicated records for shard B.
The replica placement now needs to be restructured so that B's records are replicated in shards BM and C.
Moreover, C and D should replicate the records for shard BM.
This means that shards C and D need
to participate in the insertion of BM into the ring.
Shard C needs to know that some of the records
it replicated for B are now associated with BM instead.
Shard D needs to know that some of the records it was replicating
on behalf of B can now be forgotten, and the rest need to be associated with BM.
Finally, the new shard BM needs to receive not just its share
of B's records but all of B's records so that it can replicate B's remaining data.
That's a lot of data movement, and it might be surprising that it is worth the trouble.
Part of the reason it works is that Facebook has the luxury of not having
to get all the data right all the time.
If people don't see all the latest postings of their friends until
an hour or even several hours after they are posted, it's no big deal.
Users may be a little out of sync for a while, but the data is not time-critical
on a minute-by-minute basis.
\section{Summary}
Web 2.0 applications pose many scaling challenges,
and these challenges go beyond what traditional database systems can handle.
So, leading Web 2.0 companies have developed custom solutions,
and the idea of sharding plays a prominent role in some of those solutions.
Cassandra, Facebook's solution, successfully met the challenge, and
fortunately, Facebook decided to make \index{Cassandra}Cassandra available to the
programming community as an open-source project.
Programmers can download Cassandra (\url{http://cassandra.apache.org})
and use it to develop new software.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "book"
%%% End: