Some additions to RHIPE for lm and glm. No care has been taken to optimize the regression (e.g. full model matrices are sent, no zeroes are eliminated). Also no checks on numerical accuracy. Merely a proof of concept.
See airline-regresion.R for an example. Note, we use contr.term
here, so
Intercept corresponds to day of week equal to 0 and hour of day equal to 0.
This is approximately 2GB with 59MM rows. The original airline data set has 123MM rows but I removed missing and any row with zero or negative delay.
Across a 72 core (7 task nodes, 2 with 4 cores, 4 with 8 and 2 with 16, dedicated namenode and jobtracker, Hadoop 0.20.2) cluster this took 1 1/2 minutes.
rs.add <- rhlm(arr.delay~dow+hod
,data=inputfile
,factor=list(list("dow",0:6), list("hod",0:23))
,mapred=list(rhipe_map_buff_size=10,mapred.max.split.size=67108864)
)
> rs.add
Estimate Std..Error t.value Pr...t..
(Intercept) 41.57959881 0.26992876 154.0391598 0.000000e+00
dow1 2.25532523 0.05894302 38.2628033 0.000000e+00
dow2 0.39874580 0.05928325 6.7261121 1.742580e-11
dow3 0.01652647 0.05862491 0.2819019 7.780188e-01
dow4 -0.30361517 0.05775607 -5.2568533 1.465416e-07
dow5 0.09743531 0.05748294 1.6950301 9.006972e-02
dow6 2.25813041 0.06240923 36.1826337 1.150564e-286
hod1 -4.68711607 0.44073100 -10.6348682 2.051303e-26
hod2 11.76238833 0.64715417 18.1755583 8.064288e-74
hod3 41.34138295 1.16878974 35.3711036 4.783786e-274
hod4 -20.94283612 0.62557009 -33.4780008 1.013119e-245
hod5 -26.11634530 0.30644857 -85.2226053 0.000000e+00
hod6 -21.71581855 0.27752754 -78.2474355 0.000000e+00
hod7 -20.53026943 0.27492050 -74.6771136 0.000000e+00
hod8 -17.65746553 0.27431319 -64.3697275 0.000000e+00
hod9 -16.23069045 0.27454418 -59.1186823 0.000000e+00
hod10 -14.97250794 0.27480785 -54.4835523 0.000000e+00
hod11 -13.51810139 0.27440600 -49.2631403 0.000000e+00
hod12 -12.23558233 0.27403494 -44.6497164 0.000000e+00
hod13 -11.01274704 0.27382891 -40.2176198 0.000000e+00
hod14 -7.93572622 0.27420976 -28.9403492 3.723338e-184
hod15 -6.19400939 0.27390189 -22.6139710 3.161734e-113
hod16 -5.93969444 0.27371627 -21.7001878 2.045267e-104
hod17 -4.50968685 0.27323515 -16.5047832 3.390596e-61
hod18 -0.26748642 0.27361694 -0.9775945 3.282749e-01
hod19 0.68584884 0.27414700 2.5017558 1.235792e-02
hod20 7.42978011 0.27558638 26.9598954 4.377370e-160
hod21 15.40831252 0.27932810 55.1620561 0.000000e+00
hod22 31.22943519 0.29114067 107.2657939 0.000000e+00
hod23 19.66198549 0.32810199 59.9264443 0.000000e+00
> attr(rs,"stats")
sigmahat r.sq df n
1.172748e+02 8.283200e-03 5.692781e+07 5.692798e+07
This is from a VoIP data set of 850MM rows and 14GB. traffic.rate is a continuous variable and rm.site is a factor with 27 levels. Since I’m not sure of the levels, rhlm precomputes it. Across the cluster it took ~10 minutes.
> rs2.int <- rhlm(jitter ~ traffic.rate+I(traffic.rate^2)+I)(traffic.rate^3+rm.site
,data="/voip/modified.jitter.traffic.rate.database/"
,type='map'
,factor="rm.site"
,transform=function(a){
a$traffic.rate <- a$traffic.rate/1e6
a
}
)
> rs2.int
Estimate Std..Error t.value Pr...t..
(Intercept) 3.303944e-01 7.421201e-05 4452.033981 0.000000e+00
traffic.rate 7.545127e-03 3.940215e-05 191.490253 0.000000e+00
I(traffic.rate^2) 1.099279e-04 6.390472e-06 17.201837 2.572496e-66
I(traffic.rate^3) -2.599681e-05 3.208562e-07 -81.023261 0.000000e+00
rm.siteMilan 1.084781e-01 4.715206e-05 2300.602363 0.000000e+00
rm.siteTampa 9.753336e-02 2.349133e-05 4151.887064 0.000000e+00
rm.siteSlough 6.397481e-02 4.563022e-05 1402.027281 0.000000e+00
rm.siteCopenhagen 8.709813e-02 4.354471e-05 2000.199833 0.000000e+00
rm.siteBuenos Aires 1.403013e-01 2.969850e-05 4724.188126 0.000000e+00
rm.siteAnaheim -1.024560e-02 2.027993e-05 -505.208557 0.000000e+00
rm.siteAmsterdam 7.453912e-02 3.414165e-05 2183.231038 0.000000e+00
rm.siteKansas City 7.001764e-02 2.202731e-04 317.867494 0.000000e+00
rm.siteBrussels 1.226566e-02 2.309235e-04 53.115666 0.000000e+00
rm.siteSacramento -1.065480e-02 4.069750e-05 -261.804721 0.000000e+00
rm.siteFrankfurt 7.986056e-02 2.617881e-05 3050.580770 0.000000e+00
rm.sitePhiladelphia -1.944151e-04 2.457940e-05 -7.909676 2.580592e-15
rm.siteIndianapolis 2.736086e-02 6.450638e-04 42.415742 0.000000e+00
rm.siteSouthfield 5.257719e-02 2.097359e-04 250.682851 0.000000e+00
rm.siteDenver 6.978296e-02 1.148105e-04 607.809977 0.000000e+00
rm.siteRochester 3.051194e-02 5.571686e-05 547.624868 0.000000e+00
rm.siteBoston 1.449916e-01 1.107167e-04 1309.572807 0.000000e+00
rm.siteDallas 5.534420e-02 6.789292e-05 815.168941 0.000000e+00
rm.siteZurich 7.821611e-02 1.054484e-04 741.747377 0.000000e+00
rm.siteChicago 3.766479e-02 2.488500e-04 151.355383 0.000000e+00
rm.sitePSTN 5.510437e-02 9.899227e-06 5566.532801 0.000000e+00
rm.siteSeattle 2.577840e-02 5.004408e-05 515.113866 0.000000e+00
rm.siteParis 5.831724e-02 5.456954e-05 1068.677558 0.000000e+00
rm.siteAtlanta 4.792869e-02 2.570511e-05 1864.558610 0.000000e+00
rm.siteWashington DC 5.903017e-02 3.285202e-05 1796.850326 0.000000e+00
rm.siteDocklands 3.703563e-02 2.031110e-05 1823.418148 0.000000e+00
... truncated ...