forked from besquared/redis-datastructures
-
Notifications
You must be signed in to change notification settings - Fork 1
rgrieselhuber/redis-datastructures
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
PartitionedTable A partitioned table stores all of its rows into 'buckets' which are defined by specifying a set of fields to partition on. Exact matches on any subset of the partitioned fields can then be executed quickly using redis' glob capability. The performance is good when using the 'load' command since it saves on having to marshal many small objects by bulk marshaling an entire set of rows at one time. This property makes partitioned tables best for bulk loading and reading. Preliminary Benchmarks class ConversionRates < PartitionedTable::Base name "benchmark_conversion_rates" fields :datasets, :versions, :conversions, :hours, :totals, :successes partitions :datasets, :hours end Mac Mini OS X 10.5 Leopard 1.66 GHz Intel Core Duo (2MB L2 Cache) 2GB 667 MHz DDR2 SDRAM Ruby 1.8.6 patchlevel 114 Redis-rb Loading 50,000 rows 1.280000 0.070000 1.350000 ( 1.429340) -------------------------------------------------- Finding all purchases Length: 16667 0.120000 0.020000 0.140000 ( 0.175329) -------------------------------------------------- Finding all purchases during hour 100 Length: 70 0.000000 0.000000 0.000000 ( 0.001119) -------------------------------------------------- Finding purchases for 336 hours Length: 7835 0.070000 0.030000 0.100000 ( 0.214801) -------------------------------------------------- Finding 336 hours of data Length: 23504 0.250000 0.070000 0.320000 ( 0.501853) -------------------------------------------------- Finding all data Length: 50000 0.510000 0.050000 0.560000 ( 0.673403) -------------------------------------------------- After building about 10 different versions of in memory and redis backed data structures I need to write down my overall thinking regarding this. The main questions are 1) What is the purpose? 2) When and how will the data be loaded? 3) When and how will the data be accessed? 1) What is the purpose? My initial goal was something like an "OLAP cube" which would allow me to pivot around certain dimensions and display their related measures or other quantities related to them. 2) When and how will the data be loaded? The data is not operational, it is historical. This is built to support advanced reporting interfaces. The data is loaded offline on a schedule and is generally not streamed in real time in any way. This means that load times are less important than read times. It also means there will be a lot of data, distribution of keys is important in order to take advantage of the distributed nature of key value stores. Structures that put all of the data into one list are not a good choice. 3) When and how will the data be accessed? The data is accessed in "real time", people are seeing loaders, it needs to be fast. The less requests we have to do the better because network latency is our enemy here and our data backend is written in C. The less processing of data we have to do to get our results in a usable state the better. Given these requirements there are a few conclusions 1) Do things in bulk where it's possible It's faster to deal with a 1Kb string than to deal with 1024 1 byte strings 2) Do things with the least number of keys that still provide "distribution" Pretend every key lookup is lost revenue, this is close to the truth 3) Do things directly rather than with indirection The benefits of indirection are often overshadowed by the overhead involved in multi-stage processes, especially across networks. A corollary to this is do things in one place only. ------------------------------------------------------ The partitioned storage strategy is as good as we can do No duplication of data. The partitioning splits the set of rows into non-overlapping subsets. This ensures that we aren't wasting space in our in-memory system. It also ensures that we don't waste load or read time on duplicate data. Partitioning creates *just* the right amount of keys. The same as the cross product of all of the values of the fields we are partitioning. For instance if we are partitioning on 3 fields each with a cardinality of 100 then the total number of keys we'll create is 1,000,000. This means lookups can be distributed as soon as we have a single partition field. Exact matching across all partitions in constant time. If we're looking for rows and we know the values of all of our partitions ahead of time then we can get the entire set of rows we care about in just 1 key look up. This also means that if we're looking for two exact matches we can do this in 2 key lookups. This is as good as we can get without duplicating data or creating more indirection to store multiple keys with a single row. In this sense, the partitioned storage strategy is optimal for many cases where exact matches are necessary. Range matches can be synthesized using exact matches by simply asking for one exact match for each value in some specified range of values.
About
A set of data structures I find useful built using redis as a backend
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published