Unlike many programming languages, R does not implement a native hash table class. The typical workaround is to use environment
s, taking advantage of the fact that these objects are, by default, internally hashed:
EE <- new.env(hash = TRUE) # equivalent to new.env()
set.seed(123)
list2env(
setNames(
as.list(rnorm(26)),
LETTERS
),
envir = EE
)
EE[["A"]]
# [1] -0.5604756
EE[["D"]]
# [1] 0.07050839
EE[["Z"]]
# [1] -1.686693
In many situations, this is a fine solution - lookups are reasonably fast, and environment
s are highly flexible, allowing one to store virtually any type of R object (functions, lists, other environments, etc.). However, one of the major downsides to using envinronment
s as hash tables is the inability to work with vector arguments:
EE[[c("A", "B")]]
# Error in EE[[c("A", "B")]] :
# wrong arguments for subsetting an environment
EE[c("A", "B")]
# Error in EE[c("A", "B")] :
# object of type 'environment' is not subsettable
This is unfortunate, and somewhat surprising, considering most operations in R have vectorized semantics.
library(hashmap)
set.seed(123)
(HH <- hashmap(LETTERS, rnorm(26)))
## (character) => (numeric)
## [Z] => [-1.686693]
## [Y] => [-0.625039]
## [R] => [-1.966617]
## [X] => [-0.728891]
## [Q] => [+0.497850]
## [P] => [+1.786913]
## [...] => [...]
HH[[c("A", "B")]]
# [1] -0.5604756 -0.2301775
It is important to note that unlike the environment
-based solution, hashmap
does NOT offer the flexibilty to store arbitrary types of objects. Any combination of the following atomic vector types is currently permitted:
- keys
integer
numeric
character
Date
POSIXct
- values
logical
integer
numeric
character
complex
Date
POSIXct
What hashmap
may lack in terms of flexibility it makes up for in two important areas: performance and ease-of-use. Let's begin with the latter by looking at some basic examples.
-
A
Hashmap
is created by passing a vector of keys and a vector of values tohashmap
:set.seed(123) H <- hashmap(letters[1:10], rnorm(10)) H ## (character) => (numeric) ## [j] => [-0.445662] ## [i] => [-0.686853] ## [h] => [-1.265061] ## [g] => [+0.460916] ## [e] => [+0.129288] ## [d] => [+0.070508] ## [...] => [...]
-
If the lengths of the two vectors are not equal, the longer object is truncated to the length of its counterpart, and a warning is issued:
hashmap(letters[1:5], 1:3) ## (character) => (integer) ## [c] => [3] ## [b] => [2] ## [a] => [1] # Warning message: # In new_CppObject_xp(fields$.module, fields$.pointer, ...) : # length(keys) != length(values)! hashmap(letters[1:3], 1:5) ## (character) => (integer) ## [c] => [3] ## [b] => [2] ## [a] => [1] # Warning message: # In new_CppObject_xp(fields$.module, fields$.pointer, ...) : # length(keys) != length(values)!
-
Value lookup can be performed by passing a vector of lookup keys to either of
[[
or$find
:H[["a"]] # [1] -0.5604756 H$find("b") # [1] -0.2301775 H[[c("a", "c")]] # [1] -0.5604756 1.5587083 H$find(c("b", "d")) # [1] -0.23017749 0.07050839
-
For non-existant lookup keys,
NA
is returned:H[[c("a", "A", "b")]] # [1] -0.5604756 NA -0.2301775
-
Use
$has_key
to check for the existance of individual keys, or$has_keys
for a vector of keys:H$has_key("a") # [1] TRUE H$has_key("A") # [1] FALSE H$has_keys(c("a", "A", "b", "B")) # [1] TRUE FALSE TRUE FALSE
-
Modification of key-value pairs is done using either of
[[<-
or$insert
. For non-existing keys, a new key-value pair will be inserted. For existing keys, the previous value will be overwritten:H[[c("a", "x")]] # [1] -0.5604756 NA H[[c("a", "x")]] <- c(1.5, 26.5) H[[c("a", "x")]] # [1] 1.5 26.5 H$insert(c("a", "y", "z"), c(100, 200, 300)) H[[c("a", "y", "z")]] # [1] 100 200 300
-
To remove elements from the hash table, pass a vector of keys to
$erase
, which will delete entries for matched elements, and do nothing otherwise:H$has_keys(c("y", "Y", "z", "Z")) # [1] TRUE FALSE TRUE FALSE H$erase(c("y", "Y", "z", "Z")) H$has_keys(c("y", "Y", "z", "Z")) # [1] FALSE FALSE FALSE FALSE
-
Use
$size
to check the number of key-value pairs,$empty
to check if the hash table is empty, and$clear
to delete all existing entries:H$size() # [1] 11 H$empty() # [1] FALSE H$clear() H$empty() # [1] TRUE H$size() # [1] 0 H ## [empty Hashmap]
-
$keys
and$values
return every key and value, respectively, and$data
returns a named vector of values, using the keys as names:H[[c("A", "B", "C")]] <- 1:3 H$keys() # [1] "C" "B" "A" H$values() # [1] 3 2 1 H$data() # C B A # 3 2 1
-
By default, only the first 6 key-value pairs of a
Hashmap
are printed, where[...] => [...]
indicates that additional entries exist but are not displayed. This can be adjusted viaoptions()
:getOption("hashmap.max.print") # [1] 6 H ## (character) => (numeric) ## [C] => [+3.000000] ## [B] => [+2.000000] ## [A] => [+1.000000] H[[letters[1:10]]] <- rnorm(10) H ## (character) => (numeric) ## [j] => [-0.472791] ## [i] => [+0.701356] ## [h] => [-1.966617] ## [g] => [+0.497850] ## [e] => [-0.555841] ## [d] => [+0.110683] ## [...] => [...] options(hashmap.max.print = 15) H ## (character) => (numeric) ## [j] => [-0.472791] ## [i] => [+0.701356] ## [h] => [-1.966617] ## [g] => [+0.497850] ## [e] => [-0.555841] ## [d] => [+0.110683] ## [c] => [+0.400772] ## [f] => [+1.786913] ## [b] => [+0.359814] ## [a] => [+1.224082] ## [C] => [+3.000000] ## [B] => [+2.000000] ## [A] => [+1.000000]
The following is a simple test comparing the performance of an environment
object against hashmap
for
- Construction of the hash table
- Vectorized key lookup
An overview of results in presented here, but the full code to reproduce the test is in assets/benchmark.R. All of the examples use a one million element character vector for keys, and a one million element numeric vector for values.
Hash table construction was rather slow for the environment, despite my best moderate efforts to devise a fast solution, so expressions were only evaluated 25 times:
microbenchmark::microbenchmark(
"Hash" = hashmap(Keys, Values),
"Env" = env_hash(Keys, Values),
times = 25L
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# Hash 946.3524 1287.771 1784.404 1639.788 2243.93 3315.194 25 a
# Env 11724.2705 13218.521 14071.874 13685.929 15178.27 16516.216 25 b
Next, a lookup of all 1000 keys:
E <- env_hash(Keys, Values)
H <- hashmap(Keys, Values)
all.equal(env_find(Lookup, E), H[[Lookup]])
# [1] TRUE
microbenchmark::microbenchmark(
"Hash" = H[[Lookup]],
"Env" = env_find(Lookup, E),
times = 500L
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# Hash 314.182 738.98 804.5154 799.7065 858.3895 3013.285 500 a
# Env 12291.671 12651.12 13020.3816 12740.1735 12919.7355 67220.784 500 b
And finally, a comparison of key-lookups for vectors of various sizes, plotted below on the linear and logarithmic scales, where data points represent median evaluation time of 200 runs for the given expression:
The benchmark was conducted on a laptop running Ubuntu 14.04, with the following specs,
$ lscpu && printf "\n\n" && free -h
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 69
Stepping: 1
CPU MHz: 759.000
BogoMIPS: 4589.34
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
total used free shared buffers cached
Mem: 7.7G 5.6G 2.1G 333M 499M 2.5G
-/+ buffers/cache: 2.6G 5.1G
Swap: 0B 0B 0B
in the following R session:
R version 3.2.4 Revised (2016-03-16 r70336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 hashmap_0.0.0.9000 ggvis_0.4.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4.1 rstudioapi_0.3.1 knitr_1.11 magrittr_1.5
[5] munsell_0.4.2 colorspace_1.2-6 xtable_1.8-2 R6_2.1.1
[9] plyr_1.8.3 dplyr_0.4.3 tools_3.2.4 parallel_3.2.4
[13] grid_3.2.4 gtable_0.1.2 DBI_0.3.1 htmltools_0.3.5
[17] yaml_2.1.13 lazyeval_0.1.10 assertthat_0.1 digest_0.6.8
[21] shiny_0.13.2 ggplot2_2.0.0 microbenchmark_1.4-2.1 codetools_0.2-14
[25] mime_0.4 rmarkdown_0.8.1 scales_0.3.0 jsonlite_0.9.17
[29] httpuv_1.3.3 chron_2.3-47
The stable release of hashmap
can be installed from CRAN:
install.packages("hashmap")
The current development version can be installed from GitHub with devtools
:
if (!"devtools" %in% installed.packages()[,1]) {
install.packages("devtools")
}
devtools::install_github("nathan-russell/hashmap")