Hopefully you completed part one of the introduction. This is part two. My aim in this episode is to start making a basic Key->CRDT store. But before that, a quick recap of what we did so far.
So you followed part one and it worked. I haven't seen a single mail to the list, so I assume either no-one tried it, or no-one failed. Pretending it's the latter is the only way I can go on writing this.
So what do you have on your machine? When you ran that command:
make devrel
and subsequently joined the nodes into a cluster, what happened? What is a node? A node of what?
You made the most basic riak_core application possible. Riak Core is the distributed systems goop that Riak and other Basho products are built on. It provides the abstractions, and basic services needed to run a clustered application.
Riak Core applications are built around something called the ring. At its simplest, the ring is a zero to 2^160-1 integer hashspace. This space is split into partitions, each an equal range of the hash space. A partition has an ID, or index, which is the maximum hashed key it is responsible for. We use a hash function to assign work (or storage): we hash a key to an integer on the ring, and the next partition greater than that integer owns the key.
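To make that concrete, here's a back-of-the-envelope sketch you can paste into an erl shell. This is illustrative arithmetic only, not riak_core's API, and it assumes the default ring_creation_size of 64 that devrel gives you:
RingSize = 64.                    %% devrel's default ring_creation_size
Inc = (1 bsl 160) div RingSize.   %% the width of one partition
%% The next partition index above a hashed key owns it, wrapping at the top:
Owner = fun(Hash) -> (((Hash div Inc) + 1) * Inc) rem (1 bsl 160) end.
The big integers you'll see from ping below are exact multiples of Inc.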
The claim algorithm assigns partitions to nodes. When you add or remove nodes, claim causes a number of partitions to move automatically to their new homes on a different node. When you ran ping in the last tutorial:
rel/crdtdb/bin/crdtdb console
(crdtdb@127.0.0.1)1> crdtdb:ping().
{pong,433883298582611803841718934712646521460354973696}
(crdtdb@127.0.0.1)2> crdtdb:ping().
{pong,1050454301831586472458898473514828420377701515264}
(crdtdb@127.0.0.1)3> crdtdb:ping().
{pong,433883298582611803841718934712646521460354973696}
(crdtdb@127.0.0.1)4> crdtdb:ping().
{pong,1438665674247607560106752257205091097473808596992}
(crdtdb@127.0.0.1)5> crdtdb:ping().
{pong,296867520082839655260123481645494988367611297792}
That integer is the partition that answered the ping with a pong. Riak Core is responsible for keeping a balanced number of partitions on each node.
As well as being a conceptual thing, the ring is a data structure that is gossiped between all Riak Core application nodes. It contains information about which node is responsible for each index, or partition, of the ring: a routing table, in DHT speak. This means a request can hit any node and find its way to the right node in one hop.
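If you're curious, you can peek at the gossiped ring from any node's console. A quick sketch (both calls are real riak_core functions; the output depends on your cluster):
{ok, Ring} = riak_core_ring_manager:get_my_ring().
riak_core_ring:all_owners(Ring). %% [{PartitionIndex, OwnerNode}, ...]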
A node is usually a server, virtual or physical, but as you've seen from make devrel, many application nodes can be run on a single host. A node is just an instance of an Erlang release. Nodes host partitions, and partitions manifest as vnodes.
A virtual node, or vnode, is the Erlang process that is responsible for a partition of the ring. vnodes are the workers. Riak Core manages the vnode processes for the partitions hosted by a node, but in order for our Riak Core application to do anything useful, we need to provide it with vnode code.
A vnode is also an Erlang behaviour. If you're used to an OO language, a behaviour is a little like an interface, or even an abstract class, or both. A behaviour tells us which callback functions we need to write in order to fit into the Riak Core framework and have our vnode be part of a Riak Core application.
If you want lots of details about all the callbacks we need to implement to make our vnode, you should have a look at Ryan Zezeski's Try-Try-Try blog post. I've cribbed from it liberally here, but plan to go into less detail.
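For orientation, the top of our vnode module declares that behaviour. The rebar template from part one already stubbed out all the callbacks, so we only need to flesh out the interesting ones:
-module(crdtdb_vnode).
-behaviour(riak_core_vnode).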
CRDTDB is a database for CRDTs. It's a Key/Value store. Our vnode is going to manage the storage and retrieval of CRDTs on disk.
Luckily Erlang comes with a really simple way to store Erlang terms on disk, called dets, and we'll use that for now. It has a 2GB per-file limit, but we're not making a real database; they're hard™. Later maybe we'll refactor to use LevelDB or Bitcask.
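If you've not met dets before, here's a thirty-second taster you can run in a plain erl shell (the table name and file path are made up for the example):
{ok, T} = dets:open_file(demo, [{file, "/tmp/demo.dets"}, {type, set}]).
ok = dets:insert(T, {<<"key">>, <<"value">>}).
[{<<"key">>, <<"value">>}] = dets:lookup(T, <<"key">>).
ok = dets:close(T).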
And since we're going to store CRDTs, for now we can use riak_dt.
OK. In case you don't like typing, I've pushed this code to the SyncFree github account. You can clone the project and make rel if you just want to read/run the code.
First we need riak_dt, so edit rebar.config to include:
{deps, [
    {lager, "2.0", {git, "git://github.com/basho/lager", {tag, "2.0.0"}}},
    {riak_core, ".*", {git, "git://github.com/basho/riak_core", {tag, "develop"}}},
    {riak_dt, ".*", {git, "git://github.com/basho/riak_dt", {tag, "develop"}}}
]}.
And run
./rebar get-deps
to pull the library down from github. Also add riak_dt to rel/reltool.config, in the first list section, right before crdtdb.
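Your generated reltool.config will differ in detail; the part that matters is the application list in the rel tuple, which should end up looking something like this sketch:
{sys, [
       %% ...
       {rel, "crdtdb", "1",
        [kernel,
         stdlib,
         sasl,
         %% ...whatever else your template listed...
         riak_dt,  %% <- the new entry
         crdtdb
        ]},
       %% ...
      ]}.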
Really quick sidebar: setting up Emacs for Erlang is easy. With that done, let's open up src/crdtdb_vnode.erl and get started. I don't want to make this an Erlang tutorial, so I may skip over things in a confusing way; if you get that sensation, please mail the list.
init/1 is called by Riak Core when a vnode process starts up. It is passed the index, or partition number, as an argument. In our vnode init we need to open our dets file and store the handle to it in the vnode's state.
Add a dets field to the process's state record:
-record(state, {partition, dets}).
And open the dets file in the init/1 fun:
init([Partition]) ->
    %% One dets file per partition, named by the partition's index
    File = filename:join(app_helper:get_env(riak_core, platform_data_dir),
                         integer_to_list(Partition)),
    {ok, Dets} = dets:open_file(Partition, [{file, File}, {type, set}]),
    {ok, #state{partition=Partition, dets=Dets}}.
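(Assuming the stock release config, platform_data_dir points into the node's data directory, so once a node has started you should find one dets file per hosted partition there, named by partition index.)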
When the vnode finishes, Riak Core will call the terminate/2 function. We should close our dets table here:
terminate(_Reason, #state{dets=Dets}) ->
    dets:close(Dets).
vnodes handle commands. Think of vnode commands as the API of the vnode. What does our vnode want to do? Get CRDTs and store CRDTs. Let's add these functions. First, export them:
%% CRDT API
-export([get/2, put/3]).
Add a handy macro:
-define(SYNC(PrefList, Command),
        riak_core_vnode_master:sync_command(PrefList, Command, crdtdb_vnode_master)).
The vnode master is responsible for dispatching commands to vnodes. sync_command is what it sounds like: send the command as a synchronous, blocking call, and wait for the result.
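(sync_command has an asynchronous sibling, riak_core_vnode_master:command/3, but a blocking call will do for a first pass.)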
Then the function bodies:
%% @doc Get the CRDT stored under Key
get(Preflist, Key) ->
    ?SYNC(Preflist, {get, Key}).

%% @doc Put the given CRDT as the value under Key
put(Preflist, Key, CRDT) ->
    ?SYNC(Preflist, {put, Key, CRDT}).
These are just API sugar; they simply package the arguments up for the vnode master to dispatch. Don't worry about what the Preflist argument is for now, we'll get to it.
For every command our vnode handles we need to add a handle_command/3 clause. The vnode master will call this callback with the arguments we gave it above, and it will call it at each vnode in the Preflist it is given. Let's do that now.
Get is simple: just read the CRDT off the disk and return it.
%% Get the CRDT stored at Key
handle_command({get, Key}, _Sender, State=#state{dets=Dets}) ->
    Reply = case dets:lookup(Dets, Key) of
                [] -> {error, notfound};
                [{Key, CRDT}] -> {ok, CRDT};
                Error -> Error
            end,
    {reply, Reply, State};
Put might be a bit more complex. Since we're using CRDTs, we don't want to worry about conflicts, interleaving writes, and things like that. So let's make the most of CRDT-magic and merge state with what is on disk already when we put. This basically means: read what is on disk, merge with it, write the result.
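As a reminder of why that's safe: merge is commutative, associative, and idempotent, so merging with whatever happens to be on disk can never lose an update. A quick shell sketch with an ORSWOT:
{ok, A} = riak_dt_orswot:update({add, <<"x">>}, a, riak_dt_orswot:new()).
{ok, B} = riak_dt_orswot:update({add, <<"y">>}, b, riak_dt_orswot:new()).
M = riak_dt_orswot:merge(A, B).
M = riak_dt_orswot:merge(B, A).  %% order doesn't matter
M = riak_dt_orswot:merge(M, M).  %% merging again changes nothing
[<<"x">>,<<"y">>] = riak_dt_orswot:value(M).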
What if the type of CRDT we want to write is different to that on disk? Bad luck! We can fix that later. We do need to know the type of thing we are storing, though, so that we can merge it. To that end I've decided that when we say CRDT in this app, we mean the record #crdt{mod, value}, which I've added to the file crdt.hrl:
-record(crdt, {mod, value}).
OK, the put command handling code:
handle_command({put, Key, #crdt{mod=Mod, value=Val}=CRDT}, _Sender, State=#state{dets=Dets}) ->
    Reply = case dets:lookup(Dets, Key) of
                [] ->
                    dets:insert(Dets, {Key, CRDT});
                [{Key, #crdt{mod=Mod, value=LocalValue}}] ->
                    Merged = Mod:merge(Val, LocalValue),
                    dets:insert(Dets, {Key, #crdt{mod=Mod, value=Merged}});
                [{Key, #crdt{mod=Other}}] ->
                    {error, {type_conflict, Mod, Other}};
                Error ->
                    Error
            end,
    {reply, Reply, State};
Lots of opportunity to refactor the common code between get and put, but meh. Later.
What we have is the absolute bare minimum. No replication to multiple nodes for fault-tolerance / low-latency, no put-and-return-value, just a very basic Key->CRDT store. We're going to add replication, handoff, and all that good stuff next time.
We should write tests too, right? Sure. I'm not going to now, though. But it would be nice to run this, to see if it sort of works. Let's add some code to the internal client module (the one we used for ping last time) for these two new commands. I'm going to gloss over some important concepts, as this has been long; we'll leave them until next time.
Now open src/crdtdb.erl. This is an internal client. It will be used by any external API endpoints (like HTTP, Protocol Buffers, MsgPack etc.) that we end up making.
Riak Core still sports some vestigial links to Riak KV. It expects all "keys" to be a two-tuple of {Bucket::binary(), Key::binary()}. To keep things simple we'll store all our data in one "bucket":
-define(BUCKET, <<"crdtdb">>).
Let's add the get function:
get(Key) ->
    DocIdx = riak_core_util:chash_key({?BUCKET, Key}),
    PrefList = riak_core_apl:get_primary_apl(DocIdx, 1, crdtdb),
    [{IndexNode, _Type}] = PrefList,
    case crdtdb_vnode:get(IndexNode, Key) of
        {ok, #crdt{mod=M, value=V}} ->
            {ok, {M, V}};
        E ->
            E
    end.
Tutorial number 3 will be all about what this snippet does: what a PrefList is and what we use it for. For now, be content that we're hashing the key to a value that we use to determine where the data should live.
Put is mighty similar:
put(Key, Mod, Value) ->
    DocIdx = riak_core_util:chash_key({?BUCKET, Key}),
    PrefList = riak_core_apl:get_primary_apl(DocIdx, 1, crdtdb),
    [{IndexNode, _Type}] = PrefList,
    crdtdb_vnode:put(IndexNode, Key, #crdt{mod=Mod, value=Value}).
If all that is done, run ./rebar compile. Assuming it actually compiles, you should be able to make a node with make rel and start it up:
rel/crdtdb/bin/crdtdb console
This starts an Erlang shell running the database. Now let's try it:
Eshell V5.10.3 (abort with ^G)
(crdtdb@127.0.0.1)1> crdtdb:get(<<"foo">>). %% We haven't stored it yet!
{error,notfound}
%% Create an ORSWOT with a single member
(crdtdb@127.0.0.1)2> {ok, S} = riak_dt_orswot:update({add, <<"bryan">>}, actor1, riak_dt_orswot:new()).
{ok,{[{actor1,1}],[{<<"bryan">>,[{actor1,1}]}]}}
(crdtdb@127.0.0.1)3> crdtdb:put(<<"foo">>, riak_dt_orswot, S). %% Store the orswot
ok
(crdtdb@127.0.0.1)4> {ok, {M, V}} = crdtdb:get(<<"foo">>). %% Fetch it
{ok,{riak_dt_orswot,{[{actor1,1}],
                     [{<<"bryan">>,[{actor1,1}]}]}}}
(crdtdb@127.0.0.1)5> M:value(V). %% Get the "value"
[<<"bryan">>]
Great. Now lets "concurrently" write to that same key:
(crdtdb@127.0.0.1)6> {ok, S2} = riak_dt_orswot:update({add, <<"bob">>}, actor2, riak_dt_orswot:new()).
{ok,{[{actor2,1}],[{<<"bob">>,[{actor2,1}]}]}}
(crdtdb@127.0.0.1)7> crdtdb:put(<<"foo">>, riak_dt_orswot, S2).
ok
(crdtdb@127.0.0.1)8> {ok, {M, V2}} = crdtdb:get(<<"foo">>).
{ok,{riak_dt_orswot,{[{actor1,1},{actor2,1}],
                     [{<<"bob">>,[{actor2,1}]},{<<"bryan">>,[{actor1,1}]}]}}}
(crdtdb@127.0.0.1)9> M:value(V2).
[<<"bob">>,<<"bryan">>]
Note the different actor and the empty starting set. What happens if we write a counter to <<"foo">>?
(crdtdb@127.0.0.1)10> {ok, C} = riak_dt_pncounter:new(actor1, 100).
{ok,[{actor1,100,0}]}
(crdtdb@127.0.0.1)11> crdtdb:put(<<"foo">>, riak_dt_pncounter, C).
{error,{type_conflict,riak_dt_pncounter,riak_dt_orswot}}
Fair enough, but I'm sure we can do better.
That was a big chunk. We covered "what is Riak Core" to a certain extent; at least we looked at the ring and partitions. Then we made a vnode and added some very basic database functionality. So we have a distributed database. We're using consistent hashing and the ring to distribute data around the database. We're storing CRDTs, so concurrent writes are no problem. But we're not replicating: if we lose a node, we lose that portion of the keyspace, and that's bad.
Next time we'll cover preference lists, replication, quorums, and handoff. Until then, good luck, and get in touch if you're stuck.