saptarshiguha/RHIPE-Additions

About

Some additions to RHIPE for lm and glm. The regression has not been optimized: full model matrices are sent and zeroes are not eliminated. Numerical accuracy has not been checked either. This is merely a proof of concept.
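To make the idea concrete, here is a minimal sketch of one common way to fit a linear model over chunks of rows: each chunk contributes its cross-products X'X and X'y, and these are summed before solving. This is an assumption for illustration in base R, not necessarily what rhlm does internally; the data, the chunking and all variable names are made up.

```r
# Sketch only: least squares from per-chunk cross-products (base R, toy data).
set.seed(1)
d <- data.frame(x = rnorm(200),
                g = factor(sample(0:2, 200, replace = TRUE), levels = 0:2))
d$y <- 1 + 2 * d$x + as.integer(as.character(d$g)) + rnorm(200)

# Split the rows into chunks, standing in for input splits on a cluster.
chunks <- split(d, rep(1:4, each = 50))

# "Map" step: each chunk builds its model matrix and its cross-products.
pieces <- lapply(chunks, function(ch) {
  X <- model.matrix(y ~ x + g, data = ch)
  list(XtX = crossprod(X), Xty = crossprod(X, ch$y))
})

# "Reduce" step: cross-products are additive over rows, so sum them.
XtX <- Reduce(`+`, lapply(pieces, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(pieces, `[[`, "Xty"))

# Solve the normal equations for the coefficients.
beta <- drop(solve(XtX, Xty))
```

On this toy data beta agrees with coef(lm(y ~ x + g, data = d)). Solving the normal equations directly is less numerically stable than the QR decomposition lm uses, which is one reason the lack of accuracy checks matters.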

Example

See airline-regresion.R for an example. Note that we use contr.treatment here, so the Intercept corresponds to day of week equal to 0 and hour of day equal to 0.
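The effect of treatment contrasts on the intercept can be checked with base R alone. The sketch below assumes dow and hod are coded 0:6 and 0:23 as in the call further down; the four-row data frame is made up for illustration.

```r
# Toy rows with the same factor coding as the airline example.
d <- data.frame(
  dow = factor(c(0, 1, 6, 0), levels = 0:6),
  hod = factor(c(0, 0, 23, 5), levels = 0:23)
)

# Treatment contrasts drop the first level of each factor as the baseline.
X <- model.matrix(~ dow + hod, data = d,
                  contrasts.arg = list(dow = "contr.treatment",
                                       hod = "contr.treatment"))

ncol(X)           # 1 intercept + 6 dow columns + 23 hod columns
colnames(X)[1:3]  # "(Intercept)" "dow1" "dow2"
X[1, ]            # the dow = 0, hod = 0 row: only the intercept is non-zero
```

Every other coefficient is therefore a difference from the dow = 0, hod = 0 baseline.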

The input is approximately 2GB with 59MM rows. The original airline data set has 123MM rows, but I removed rows with missing values and any row with zero or negative delay.

Across a 72-core cluster (8 task nodes: 2 with 4 cores, 4 with 8 and 2 with 16; dedicated namenode and jobtracker; Hadoop 0.20.2) this took 1 1/2 minutes.

rs.add <- rhlm(arr.delay~dow+hod
           ,data=inputfile
           ,factor=list(list("dow",0:6), list("hod",0:23))
           ,mapred=list(rhipe_map_buff_size=10,mapred.max.split.size=67108864)
           )

> rs.add
                Estimate Std..Error     t.value      Pr...t..
(Intercept)  41.57959881 0.26992876 154.0391598  0.000000e+00
dow1          2.25532523 0.05894302  38.2628033  0.000000e+00
dow2          0.39874580 0.05928325   6.7261121  1.742580e-11
dow3          0.01652647 0.05862491   0.2819019  7.780188e-01
dow4         -0.30361517 0.05775607  -5.2568533  1.465416e-07
dow5          0.09743531 0.05748294   1.6950301  9.006972e-02
dow6          2.25813041 0.06240923  36.1826337 1.150564e-286
hod1         -4.68711607 0.44073100 -10.6348682  2.051303e-26
hod2         11.76238833 0.64715417  18.1755583  8.064288e-74
hod3         41.34138295 1.16878974  35.3711036 4.783786e-274
hod4        -20.94283612 0.62557009 -33.4780008 1.013119e-245
hod5        -26.11634530 0.30644857 -85.2226053  0.000000e+00
hod6        -21.71581855 0.27752754 -78.2474355  0.000000e+00
hod7        -20.53026943 0.27492050 -74.6771136  0.000000e+00
hod8        -17.65746553 0.27431319 -64.3697275  0.000000e+00
hod9        -16.23069045 0.27454418 -59.1186823  0.000000e+00
hod10       -14.97250794 0.27480785 -54.4835523  0.000000e+00
hod11       -13.51810139 0.27440600 -49.2631403  0.000000e+00
hod12       -12.23558233 0.27403494 -44.6497164  0.000000e+00
hod13       -11.01274704 0.27382891 -40.2176198  0.000000e+00
hod14        -7.93572622 0.27420976 -28.9403492 3.723338e-184
hod15        -6.19400939 0.27390189 -22.6139710 3.161734e-113
hod16        -5.93969444 0.27371627 -21.7001878 2.045267e-104
hod17        -4.50968685 0.27323515 -16.5047832  3.390596e-61
hod18        -0.26748642 0.27361694  -0.9775945  3.282749e-01
hod19         0.68584884 0.27414700   2.5017558  1.235792e-02
hod20         7.42978011 0.27558638  26.9598954 4.377370e-160
hod21        15.40831252 0.27932810  55.1620561  0.000000e+00
hod22        31.22943519 0.29114067 107.2657939  0.000000e+00
hod23        19.66198549 0.32810199  59.9264443  0.000000e+00

> attr(rs.add,"stats")
    sigmahat         r.sq           df            n 
1.172748e+02 8.283200e-03 5.692781e+07 5.692798e+07 
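For readers who want to see where such numbers can come from, the sketch below derives a coefficient table and the four summary quantities from totals that can each be accumulated chunk by chunk (X'X, X'y, y'y, the sum of y, and n). It assumes sigmahat, r.sq and df have their usual meanings (residual standard error, R-squared, residual degrees of freedom); the source does not spell this out, and the toy data and variable names are invented.

```r
# Sketch only: regression summary from additive totals (base R, toy data).
set.seed(2)
n <- 300
d <- data.frame(x = runif(n),
                g = factor(sample(0:3, n, replace = TRUE), levels = 0:3))
d$y <- 5 - 3 * d$x + rnorm(n)

X <- model.matrix(y ~ x + g, data = d)

# Totals that are sums over rows, so they could be added up across chunks.
XtX <- crossprod(X)
Xty <- crossprod(X, d$y)
yty <- sum(d$y^2)
sy  <- sum(d$y)

beta <- drop(solve(XtX, Xty))
df   <- n - ncol(X)               # residual degrees of freedom
rss  <- yty - sum(beta * Xty)     # residual sum of squares
tss  <- yty - sy^2 / n            # total sum of squares about the mean
sigmahat <- sqrt(rss / df)
r.sq     <- 1 - rss / tss

se   <- sigmahat * sqrt(diag(solve(XtX)))
tval <- beta / se
tab  <- cbind(Estimate = beta, `Std. Error` = se, `t value` = tval,
              `Pr(>|t|)` = 2 * pt(-abs(tval), df))
```

On the toy data tab matches coef(summary(lm(y ~ x + g, data = d))). Computing rss as a difference of two large totals can lose precision when the fit explains little of the variation, so treat this as an illustration rather than a numerically careful method.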

The next example is from a VoIP data set of 850MM rows and 14GB. traffic.rate is a continuous variable and rm.site is a factor with 27 levels. Since I'm not sure of the levels, only the factor name is given and rhlm precomputes them. Across the cluster it took ~10 minutes.

> rs2.int <- rhlm(jitter ~ traffic.rate+I(traffic.rate^2)+I(traffic.rate^3)+rm.site
                ,data="/voip/modified.jitter.traffic.rate.database/"
                ,type='map'
                ,factor="rm.site"
                ,transform=function(a){
                  a$traffic.rate <- a$traffic.rate/1e6
                  a
                }
                )

> rs2.int
                          Estimate   Std..Error     t.value     Pr...t..
(Intercept)           3.303944e-01 7.421201e-05 4452.033981 0.000000e+00
traffic.rate          7.545127e-03 3.940215e-05  191.490253 0.000000e+00
I(traffic.rate^2)     1.099279e-04 6.390472e-06   17.201837 2.572496e-66
I(traffic.rate^3)    -2.599681e-05 3.208562e-07  -81.023261 0.000000e+00
rm.siteMilan          1.084781e-01 4.715206e-05 2300.602363 0.000000e+00
rm.siteTampa          9.753336e-02 2.349133e-05 4151.887064 0.000000e+00
rm.siteSlough         6.397481e-02 4.563022e-05 1402.027281 0.000000e+00
rm.siteCopenhagen     8.709813e-02 4.354471e-05 2000.199833 0.000000e+00
rm.siteBuenos Aires   1.403013e-01 2.969850e-05 4724.188126 0.000000e+00
rm.siteAnaheim       -1.024560e-02 2.027993e-05 -505.208557 0.000000e+00
rm.siteAmsterdam      7.453912e-02 3.414165e-05 2183.231038 0.000000e+00
rm.siteKansas City    7.001764e-02 2.202731e-04  317.867494 0.000000e+00
rm.siteBrussels       1.226566e-02 2.309235e-04   53.115666 0.000000e+00
rm.siteSacramento    -1.065480e-02 4.069750e-05 -261.804721 0.000000e+00
rm.siteFrankfurt      7.986056e-02 2.617881e-05 3050.580770 0.000000e+00
rm.sitePhiladelphia  -1.944151e-04 2.457940e-05   -7.909676 2.580592e-15
rm.siteIndianapolis   2.736086e-02 6.450638e-04   42.415742 0.000000e+00
rm.siteSouthfield     5.257719e-02 2.097359e-04  250.682851 0.000000e+00
rm.siteDenver         6.978296e-02 1.148105e-04  607.809977 0.000000e+00
rm.siteRochester      3.051194e-02 5.571686e-05  547.624868 0.000000e+00
rm.siteBoston         1.449916e-01 1.107167e-04 1309.572807 0.000000e+00
rm.siteDallas         5.534420e-02 6.789292e-05  815.168941 0.000000e+00
rm.siteZurich         7.821611e-02 1.054484e-04  741.747377 0.000000e+00
rm.siteChicago        3.766479e-02 2.488500e-04  151.355383 0.000000e+00
rm.sitePSTN           5.510437e-02 9.899227e-06 5566.532801 0.000000e+00
rm.siteSeattle        2.577840e-02 5.004408e-05  515.113866 0.000000e+00
rm.siteParis          5.831724e-02 5.456954e-05 1068.677558 0.000000e+00
rm.siteAtlanta        4.792869e-02 2.570511e-05 1864.558610 0.000000e+00
rm.siteWashington DC  5.903017e-02 3.285202e-05 1796.850326 0.000000e+00
rm.siteDocklands      3.703563e-02 2.031110e-05 1823.418148 0.000000e+00
... truncated ...
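The two ideas in this example, rescaling through a transform function and working out factor levels before fitting, can be sketched in base R. This is a guess at the general approach under stated assumptions, not rhlm's actual code: the chunks, the helper name rescale and the alphabetical level order are all hypothetical (the output above is clearly not in alphabetical order).

```r
# Hypothetical chunks standing in for pieces of the VoIP input.
chunks <- list(
  data.frame(traffic.rate = c(1e6, 2e6), rm.site = c("Milan", "Tampa"),
             stringsAsFactors = FALSE),
  data.frame(traffic.rate = c(3e6, 4e6), rm.site = c("Tampa", "Paris"),
             stringsAsFactors = FALSE)
)

# Same shape as the transform argument in the call above.
rescale <- function(a) {
  a$traffic.rate <- a$traffic.rate / 1e6
  a
}

# Pass 1: collect the distinct levels seen in any chunk.
site.levels <- sort(Reduce(union, lapply(chunks, function(ch) unique(ch$rm.site))))

# Pass 2: build each chunk's model matrix against the shared level set,
# so every chunk produces the same columns in the same order.
mats <- lapply(chunks, function(ch) {
  ch <- rescale(ch)
  ch$rm.site <- factor(ch$rm.site, levels = site.levels)
  model.matrix(~ traffic.rate + I(traffic.rate^2) + I(traffic.rate^3) + rm.site,
               data = ch)
})

colnames(mats[[1]])
```

Without the shared level set, a chunk that happens to miss a site would produce a narrower model matrix, and results from different chunks could not be combined.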
