
Address Patrick's comments #7

Closed · wants to merge 213 commits

Commits on Mar 10, 2014

  1. SPARK-977 Added Python RDD.zip function

    This was raised earlier as part of apache/incubator-spark#486.
    
    Author: Prabin Banka <prabin.banka@imaginea.com>
    
    Closes apache#76 from prabinb/python-api-zip and squashes the following commits:
    
    b1a31a0 [Prabin Banka] Added Python RDD.zip function
    Prabin Banka authored and mateiz committed Mar 10, 2014
    Commit: e1e09e0
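
    A minimal PySpark usage sketch of the new zip, assuming a live SparkContext `sc` and hypothetical input values:

    ```python
    # Zip two RDDs element-wise into key-value pairs; both sides must have
    # the same partitioning and per-partition element counts for zip to work.
    x = sc.parallelize([1, 2, 3])
    y = sc.parallelize(["a", "b", "c"])
    print(x.zip(y).collect())  # [(1, 'a'), (2, 'b'), (3, 'c')]
    ```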
  2. [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted)
    
    Author: jyotiska <jyotiska123@gmail.com>
    
    Closes apache#34 from jyotiska/pyspark_code and squashes the following commits:
    
    c9439be [jyotiska] replaced dict with namedtuple
    a6bf4cd [jyotiska] added callsite info for context.py
    jyotiska authored and mateiz committed Mar 10, 2014
    Commit: f551898
  3. SPARK-1168, Added foldByKey to pyspark.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:
    
    db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
    ScrapCodes authored and mateiz committed Mar 10, 2014
    Commit: a59419c
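
    A usage sketch of the new foldByKey, assuming a live SparkContext `sc` and hypothetical pairs:

    ```python
    # foldByKey merges the values for each key using a zero value and an
    # associative function, within and then across partitions.
    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    print(sorted(pairs.foldByKey(0, lambda a, b: a + b).collect()))
    # [('a', 3), ('b', 3)]
    ```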
  4. SPARK-1205: Clean up callSite/origin/generator.

    This patch removes the `generator` field and simplifies + documents
    the tracking of callsites.
    
    There are two places where we care about call sites, when a job is
    run and when an RDD is created. This patch retains both of those
    features but does a slight refactoring and renaming to make things
    less confusing.
    
    There was another feature of an RDD called the `generator`, which was
    by default the user class in which the RDD was created. This was
    used exclusively in the JobLogger. It has been subsumed by the ability
    to name a job group. The job logger can later be refactored to
    read the job group directly (this will require some work), but for now
    this just preserves the default logged value of the user class.
    I'm not sure any users ever used the ability to override this.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#106 from pwendell/callsite and squashes the following commits:
    
    fc1d009 [Patrick Wendell] Compile fix
    e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite
    62e77ef [Patrick Wendell] Review feedback
    576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.
    pwendell committed Mar 10, 2014
    Commit: 2a51617

Commits on Mar 11, 2014

  1. SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster"
    
    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#118 from sryza/sandy-spark-1211 and squashes the following commits:
    
    d4001c7 [Sandy Ryza] SPARK-1211. In ApplicationMaster, set spark.master system property to "yarn-cluster"
    sryza authored and pwendell committed Mar 11, 2014
    Commit: 2a2c964
  2. SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues
    
    This patch removes Ganglia integration from the default build. It
    allows users willing to link against LGPL code to use Ganglia
    by adding build flags or linking against a new Spark artifact called
    spark-ganglia-lgpl.
    
    This brings Spark in line with the Apache policy on LGPL code
    enumerated here:
    
    https://www.apache.org/legal/3party.html#options-optional
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#108 from pwendell/ganglia and squashes the following commits:
    
    326712a [Patrick Wendell] Responding to review feedback
    5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
    pwendell committed Mar 11, 2014
    Commit: 16788a6

Commits on Mar 12, 2014

  1. SPARK-1064

    This reopens PR 649 from incubator-spark against the new repo
    
    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#102 from sryza/sandy-spark-1064 and squashes the following commits:
    
    270e490 [Sandy Ryza] Handle different application classpath variables in different versions
    88b04e0 [Sandy Ryza] SPARK-1064. Make it possible to run on YARN without bundling Hadoop jars in Spark assembly
    sryza authored and pwendell committed Mar 12, 2014
    Commit: 2409af9
  2. Spark-1163, Added missing Python RDD functions

    Author: prabinb <prabin.banka@imaginea.com>
    
    Closes apache#92 from prabinb/python-api-rdd and squashes the following commits:
    
    51129ca [prabinb] Added missing Python RDD functions. Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
    prabinb authored and pwendell committed Mar 12, 2014
    Commit: af7f2f1
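
    A sketch of the new doctest-style usage, assuming a live SparkContext `sc`; the exact repr output is an assumption:

    ```python
    rdd = sc.parallelize([1, 2])
    # The new StorageLevel.__repr__ renders something along the lines of
    # StorageLevel(False, False, False, 1) for an un-persisted RDD.
    print(rdd.getStorageLevel())
    ```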
  3. [SPARK-1232] Fix the hadoop 0.23 yarn build

    Author: Thomas Graves <tgraves@apache.org>
    
    Closes apache#127 from tgravescs/SPARK-1232 and squashes the following commits:
    
    c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build
    tgravescs authored and pwendell committed Mar 12, 2014
    Commit: c8c59b3
  4. [SPARK-1233] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes apache#129 from tgravescs/SPARK-1233 and squashes the following commits:
    
    85ff5a6 [Thomas Graves] Fix running hadoop 0.23 due to java.lang.NoSuchFieldException: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH
    tgravescs authored and aarondav committed Mar 12, 2014
    Commit: b5162f4
  5. Fix #SPARK-1149 Bad partitioners can cause Spark to hang

    Author: liguoqiang <liguoqiang@rd.tuan800.com>
    
    Closes apache#44 from witgo/SPARK-1149 and squashes the following commits:
    
    3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149
    8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
    3dad595 [liguoqiang] review comment
    e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149
    b0d5c07 [liguoqiang] review comment
    d0a6005 [liguoqiang] review comment
    3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
    ac006a3 [liguoqiang] code Formatting
    3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149
    adc443e [liguoqiang] partitions check  bugfix
    928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner
    db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149
    1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149
    3348619 [liguoqiang] Optimize performance for partitions check
    61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149
    e68210a [liguoqiang] add partition index check to submitJob
    3a65903 [liguoqiang] make the code more readable
    6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang
    liguoqiang authored and pwendell committed Mar 12, 2014
    Commit: 5d1ec64
  6. SPARK-1162 Added top in python.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:
    
    ece1fa4 [Prashant Sharma] Added top in python.
    ScrapCodes authored and mateiz committed Mar 12, 2014
    Commit: b8afe30
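
    A minimal sketch of the new top, assuming a live SparkContext `sc` and hypothetical values:

    ```python
    # top(n) returns the n largest elements, in descending order.
    print(sc.parallelize([10, 4, 2, 12, 3]).top(3))  # [12, 10, 4]
    ```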

Commits on Mar 13, 2014

  1. SPARK-1160: Deprecate toArray in RDD

    https://spark-project.atlassian.net/browse/SPARK-1160
    
    reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."
    
    In this patch, I deprecated the method and changed the source files using it by replacing toArray with collect() directly.
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes apache#105 from CodingCat/SPARK-1060 and squashes the following commits:
    
    286f163 [CodingCat] deprecate in JavaRDDLike
    ee17b4e [CodingCat] add message and since
    2ff7319 [CodingCat] deprecate toArray in RDD
    CodingCat authored and aarondav committed Mar 13, 2014
    Commit: 9032f7c
  2. Fix example bug: compile error

    Author: jianghan <jianghan@xiaomi.com>
    
    Closes apache#132 from pooorman/master and squashes the following commits:
    
    54afbe0 [jianghan] Fix example bug: compile error
    jianghan authored and pwendell committed Mar 13, 2014
    Commit: 31a7040
  3. hot fix for PR105 - change to Java annotation

    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes apache#133 from CodingCat/SPARK-1160-2 and squashes the following commits:
    
    6607155 [CodingCat] hot fix for PR105 - change to Java annotation
    CodingCat authored and aarondav committed Mar 13, 2014
    Commit: 6bd2eaa
  4. SPARK-1019: pyspark RDD take() throws an NPE

    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#112 from pwendell/pyspark-take and squashes the following commits:
    
    daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE
    pwendell committed Mar 13, 2014
    Commit: 4ea23db
  5. [SPARK-1237, 1238] Improve the computation of YtY for implicit ALS

    Computing YtY can be implemented using BLAS's DSPR operations instead of generating y_i y_i^T and then combining them. The latter generates many k-by-k matrices. On the movielens data, this change improves the performance by 10-20%. The algorithm remains the same, verified by computing RMSE on the movielens data.
    
    To compare the results, I also added an option to set a random seed in ALS.
    
    JIRA:
    1. https://spark-project.atlassian.net/browse/SPARK-1237
    2. https://spark-project.atlassian.net/browse/SPARK-1238
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#131 from mengxr/als and squashes the following commits:
    
    ed00432 [Xiangrui Meng] minor changes
    d984623 [Xiangrui Meng] minor changes
    2fc1641 [Xiangrui Meng] remove commented code
    4c7cde2 [Xiangrui Meng] allow specifying a random seed in ALS
    200bef0 [Xiangrui Meng] optimize computeYtY and updateBlock
    mengxr authored and rxin committed Mar 13, 2014
    Commit: e4e8d8f
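
    A NumPy sketch of the identity being exploited, hypothetical and much simplified; real BLAS DSPR applies each rank-1 update in place to a packed upper-triangular buffer rather than forming full matrices:

    ```python
    import numpy as np

    def yty(Y):
        # YtY equals the sum over rows of the rank-1 terms y * y^T.
        # Folding each row into one running accumulator avoids generating
        # a separate k-by-k matrix per row, which is the point of the change.
        k = Y.shape[1]
        acc = np.zeros((k, k))
        for y in Y:
            acc += np.outer(y, y)  # DSPR would update only the upper triangle, in place
        return acc

    Y = np.random.rand(1000, 10)        # hypothetical factor matrix
    assert np.allclose(yty(Y), Y.T @ Y)  # same result as the direct product
    ```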
  6. SPARK-1183. Don't use "worker" to mean executor

    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#120 from sryza/sandy-spark-1183 and squashes the following commits:
    
    5066a4a [Sandy Ryza] Remove "worker" in a couple comments
    0bd1e46 [Sandy Ryza] Remove --am-class from usage
    bfc8fe0 [Sandy Ryza] Remove am-class from doc and fix yarn-alpha
    607539f [Sandy Ryza] Address review comments
    74d087a [Sandy Ryza] SPARK-1183. Don't use "worker" to mean executor
    sryza authored and pwendell committed Mar 13, 2014
    Commit: 6983732
  7. SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225.

    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#113 from rxin/jetty9 and squashes the following commits:
    
    867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file.
    d7c97ca [Reynold Xin] Return the correctly bound port.
    d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
    rxin authored and pwendell committed Mar 13, 2014
    Commit: ca4bf8c

Commits on Mar 14, 2014

  1. [bugfix] wrong client arg, should use executor-cores

    The client arg is wrong; it should be executor-cores. It causes the executor to fail to start when executor-cores is specified.
    
    Author: Tianshuo Deng <tdeng@twitter.com>
    
    Closes apache#138 from tsdeng/bugfix_wrong_client_args and squashes the following commits:
    
    304826d [Tianshuo Deng] wrong client arg, should use executor-cores
    tsdeng authored and pwendell committed Mar 14, 2014
    Commit: 181b130
  2. Fix serialization of MutablePair. Also provide an interface for easy updating.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#141 from marmbrus/mutablePair and squashes the following commits:
    
    f5c4783 [Michael Armbrust] Change function name to update
    8bfd973 [Michael Armbrust] Fix serialization of MutablePair.  Also provide an interface for easy updating.
    marmbrus authored and rxin committed Mar 14, 2014
    Commit: e19044c

Commits on Mar 15, 2014

  1. SPARK-1254. Consolidate, order, and harmonize repository declarations in Maven/SBT builds
    
    This suggestion addresses a few minor suboptimalities with how repositories are handled.
    
    1) Use HTTPS consistently to access repos, instead of HTTP
    
    2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.)
    
    3) Update SBT build to match Maven build in this regard
    
    4) Update SBT build to not refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#145 from srowen/SPARK-1254 and squashes the following commits:
    
    42f9bfc [Sean Owen] Use HTTPS for repos; consolidate repos in parent in order to put optional Cloudera repo last; harmonize SBT build repos with Maven; remove snapshot repos from SBT build which weren't in Maven
    srowen authored and pwendell committed Mar 15, 2014
    Commit: 97e4459

Commits on Mar 16, 2014

  1. SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle.
    
    This is more general than simply passing a string name and leaves more room for performance optimizations.
    
    Note that this is technically an API breaking change in the following two ways:
    1. The shuffle serializer specification in ShuffleDependency now requires an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark.
    2. Serializers in Spark from now on are required to be serializable.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#149 from rxin/serializer and squashes the following commits:
    
    5acaccd [Reynold Xin] Properly call serializer's constructors.
    2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency.
    7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.
    rxin authored and pwendell committed Mar 16, 2014
    Commit: f5486e9

Commits on Mar 17, 2014

  1. SPARK-1240: handle the case of empty RDD when takeSample

    https://spark-project.atlassian.net/browse/SPARK-1240
    
    It seems that the current implementation does not handle the empty RDD case when running takeSample.
    
    In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when it's an empty RDD; also in sample(), I add a check for an invalid fraction value.
    
    In the test case, I also add several lines for this case
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes apache#135 from CodingCat/SPARK-1240 and squashes the following commits:
    
    fef57d4 [CodingCat] fix the same problem in PySpark
    36db06b [CodingCat] create new test cases for takeSample from an empty red
    810948d [CodingCat] further fix
    a40e8fb [CodingCat] replace if with require
    ad483fd [CodingCat] handle the case with empty RDD when take sample
    CodingCat authored and mateiz committed Mar 17, 2014
    Commit: dc96546
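
    A sketch of the behavior after the fix, assuming a live SparkContext `sc` (the explicit seed argument follows the PySpark takeSample signature of the time):

    ```python
    # Sampling from an empty RDD now returns an empty list instead of failing,
    # and an invalid fraction inside sample() is rejected via require().
    print(sc.parallelize([]).takeSample(False, 5, 1))  # []
    ```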
  2. SPARK-1244: Throw exception if map output status exceeds frame size

    This is a very small change on top of @andrewor14's patch in apache#147.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#152 from pwendell/akka-frame and squashes the following commits:
    
    e5fb3ff [Patrick Wendell] Reversing test order
    393af4c [Patrick Wendell] Small improvement suggested by Andrew Or
    8045103 [Patrick Wendell] Breaking out into two tests
    2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size
    c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular
    281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests
    pwendell committed Mar 17, 2014
    Commit: 796977a

Commits on Mar 18, 2014

  1. [Spark-1261] add instructions for running python examples to doc overview page
    
    Author: Diana Carroll <dcarroll@cloudera.com>
    
    Closes apache#162 from dianacarroll/SPARK-1261 and squashes the following commits:
    
    14ac602 [Diana Carroll] typo in python example text
    5121e3e [Diana Carroll] Add explanation of how to run Python examples to main doc overview page
    Diana Carroll authored and mateiz committed Mar 18, 2014
    Commit: 087eedc
  2. Spark 1246 add min max to stat counter

    Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.
    
    Author: Dan McClary <dan.mcclary@gmail.com>
    
    Closes apache#144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
    
    fd3fd4b [Dan McClary] fixed  error, updated test
    82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
    5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
    21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
    1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
    a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
    ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
    1e7056d [Dan McClary] added underscore to getBucket
    37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
    29981f2 [Dan McClary] fixed indentation on doctest comment
    eaf89d9 [Dan McClary] added correct doctest for histogram
    4916016 [Dan McClary] added histogram method, added max and min to statscounter
    dwmclary authored and mateiz committed Mar 18, 2014
    Commit: e3681f2
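
    A sketch of the additions, assuming a live SparkContext `sc` and that the StatCounter accessors are exposed as methods as in pyspark.statcounter:

    ```python
    rdd = sc.parallelize([1.0, 4.0, 2.0])
    print(rdd.min(), rdd.max())      # 1.0 4.0  (new RDD methods)
    stats = rdd.stats()              # StatCounter now tracks min and max too
    print(stats.min(), stats.max())  # 1.0 4.0
    ```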
  3. Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."

    This reverts commit ca4bf8c.
    
    Jetty 9 requires JDK7, which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the meantime, to ease some pain, let's revert this. Sorry for not catching this during the initial review. cc/ @rxin
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#167 from pwendell/jetty-revert and squashes the following commits:
    
    811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
    pwendell authored and rxin committed Mar 18, 2014
    Commit: e7423d4
  4. SPARK-1102: Create a saveAsNewAPIHadoopDataset method

    https://spark-project.atlassian.net/browse/SPARK-1102
    
    Create a saveAsNewAPIHadoopDataset method
    
    By @mateiz: "Right now RDDs can only be saved as files using the new Hadoop API, not as "datasets" with no filename and just a JobConf. See http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ for an example of how you have to give a bogus filename. For the old Hadoop API, we have saveAsHadoopDataset."
    
    Author: CodingCat <zhunansjtu@gmail.com>
    
    Closes apache#12 from CodingCat/SPARK-1102 and squashes the following commits:
    
    6ba0c83 [CodingCat] add test cases for saveAsHadoopDataSet (new&old API)
    a8d11ba [CodingCat] style fix.........
    95a6929 [CodingCat] code clean
    7643c88 [CodingCat] change the parameter type back to Configuration
    a8583ee [CodingCat] Create a saveAsNewAPIHadoopDataset method
    CodingCat authored and mateiz committed Mar 18, 2014
    Commit: 2fa26ec
  5. Update copyright year in NOTICE to 2014

    Author: Matei Zaharia <matei@databricks.com>
    
    Closes apache#174 from mateiz/update-notice and squashes the following commits:
    
    47fc1a5 [Matei Zaharia] Update copyright year in NOTICE to 2014
    mateiz authored and pwendell committed Mar 18, 2014
    Commit: 79e547f
  6. [SPARK-1260]: faster construction of features with intercept

    The current implementation uses `Array(1.0, features: _*)` to construct a new array with intercept. This is not efficient for big arrays because `Array.apply` uses a for loop that iterates over the arguments. `Array.+:` is a better choice here.
    
    Also, I don't see a reason to set initial weights to ones. So I set them to zeros.
    
    JIRA: https://spark-project.atlassian.net/browse/SPARK-1260
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#161 from mengxr/sgd and squashes the following commits:
    
    b5cfc53 [Xiangrui Meng] set default weights to zeros
    a1439c2 [Xiangrui Meng] faster construction of features with intercept
    mengxr authored and rxin committed Mar 18, 2014
    Commit: e108b9a
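
    The same idea in NumPy terms (the commit itself is Scala): prepend the intercept entry once as a bulk operation rather than rebuilding the array element by element. Values are hypothetical:

    ```python
    import numpy as np

    features = np.array([0.5, 1.2, -0.3])               # hypothetical feature vector
    with_intercept = np.concatenate(([1.0], features))  # single bulk prepend of the bias term
    print(with_intercept)                               # [ 1.   0.5  1.2 -0.3]
    ```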

Commits on Mar 19, 2014

  1. [SPARK-1266] persist factors in implicit ALS

    In implicit ALS computation, the user or product factor is used twice in each iteration. Caching can certainly help accelerate the computation. I saw the running time decreased by ~70% for implicit ALS on the movielens data.
    
    I also made the following changes:
    
    1. Change `YtYb` type from `Broadcast[Option[DoubleMatrix]]` to `Option[Broadcast[DoubleMatrix]]`, so we don't need to broadcast None in explicit computation.
    
    2. Mark methods `computeYtY`, `unblockFactors`, `updateBlock`, and `updateFeatures` private. Users do not need those methods.
    
    3. Materialize the final matrix factors before returning the model. It allows us to clean up other cached RDDs before returning the model. I do not have a better solution here, so I use `RDD.count()`.
    
    JIRA: https://spark-project.atlassian.net/browse/SPARK-1266
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#165 from mengxr/als and squashes the following commits:
    
    c9676a6 [Xiangrui Meng] add a comment about the last products.persist
    d3a88aa [Xiangrui Meng] change implicitPrefs match to if ... else ...
    63862d6 [Xiangrui Meng] persist factors in implicit ALS
    mengxr authored and mateiz committed Mar 19, 2014
    Commit: f9d8a83
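
    A sketch of the materialize-before-cleanup idiom described in point 3, assuming a live SparkContext `sc`; the RDDs are hypothetical stand-ins for the factor matrices:

    ```python
    intermediate = sc.parallelize(range(1000)).cache()
    factors = intermediate.map(lambda x: x * 2).persist()
    factors.count()           # RDD.count() as the action that forces materialization
    intermediate.unpersist()  # the upstream cache can now be released safely
    ```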
  2. Fix SPARK-1256: Master web UI and Worker web UI return a 404 error

    Author: witgo <witgo@qq.com>
    
    Closes apache#150 from witgo/SPARK-1256 and squashes the following commits:
    
    08044a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1256
    c99b030 [witgo] Fix SPARK-1256
    witgo authored and pwendell committed Mar 19, 2014
    Commit: cc2655a
  3. Bundle tachyon: SPARK-1269

    This should all work as expected with the current version of the tachyon tarball (0.4.1)
    
    Author: Nick Lanham <nick@afternight.org>
    
    Closes apache#137 from nicklan/bundle-tachyon and squashes the following commits:
    
    2eee15b [Nick Lanham] Put back in exec, start tachyon first
    738ba23 [Nick Lanham] Move tachyon out of sbin
    f2f9bc6 [Nick Lanham] More checks for tachyon script
    111e8e1 [Nick Lanham] Only try tachyon operations if tachyon script exists
    0561574 [Nick Lanham] Copy over web resources so web interface can run
    4dc9809 [Nick Lanham] Update to tachyon 0.4.1
    0a1a20c [Nick Lanham] Add scripts using tachyon tarball
    nicklan authored and pwendell committed Mar 19, 2014
    Commit: a18ea00
  4. bugfix: Wrong "Duration" in "Active Stages" in stages page

    If a stage which has completed once loses parts of its data, it will be resubmitted. At this time, it appears that stage.completionTime > stage.submissionTime.
    
    Author: shiyun.wxm <shiyun.wxm@taobao.com>
    
    Closes apache#170 from BlackNiuza/duration_problem and squashes the following commits:
    
    a86d261 [shiyun.wxm] tow space indent
    c0d7b24 [shiyun.wxm] change the style
    3b072e1 [shiyun.wxm] fix scala style
    f20701e [shiyun.wxm] bugfix: "Duration" in "Active Stages" in stages page
    BlackNiuza authored and rxin committed Mar 19, 2014
    Commit: d55ec86
  5. SPARK-1203 fix saving to hdfs from yarn

    Author: Thomas Graves <tgraves@apache.org>
    
    Closes apache#173 from tgravescs/SPARK-1203 and squashes the following commits:
    
    4fd5ded [Thomas Graves] adding import
    964e3f7 [Thomas Graves] SPARK-1203 fix saving to hdfs from yarn
    tgravescs committed Mar 19, 2014
    Commit: 6112270
  6. Bugfixes/improvements to scheduler

    This moves PR#517 of apache-incubator-spark to apache-spark.
    
    Author: Mridul Muralidharan <mridul@gmail.com>
    
    Closes apache#159 from mridulm/master and squashes the following commits:
    
    5ff59c2 [Mridul Muralidharan] Change property in suite also
    167fad8 [Mridul Muralidharan] Address review comments
    9bda70e [Mridul Muralidharan] Address review comments, akwats add to failedExecutors
    270d841 [Mridul Muralidharan] Address review comments
    fa5d9f1 [Mridul Muralidharan] Bugfixes/improvements to scheduler : PR apache#517
    mridulm authored and mateiz committed Mar 19, 2014
    Commit: ab747d3
  7. [SPARK-1132] Persisting Web UI through refactoring the SparkListener interface
    
    The fleeting nature of the Spark Web UI has long been a problem reported by many users: The existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext, and cannot be instantiated independently from it. To solve this, some state must be saved to persistent storage while the application is still running.
    
    The approach taken by this PR involves persisting the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface because existing events (1) maintain deep references, making de/serialization difficult, and (2) do not encode all the information displayed on the UI. In this design, each existing listener for the UI (e.g. ExecutorsListener) maintains state that can be fully constructed from SparkListenerEvents. This state is then supplied to the parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand.
    
    This PR introduces two important classes: the **EventLoggingListener**, and the **ReplayListenerBus**. In a live application, SparkUI registers an EventLoggingListener with the SparkContext in addition to the existing listeners. Over the course of the application, this listener serializes and logs all events to persisted storage. Then, after the application has finished, the SparkUI can be revived by replaying all the logged events to the existing UI listeners through the ReplayListenerBus.
    
    This feature is currently integrated with the Master Web UI, which optionally rebuilds a SparkUI from event logs as soon as the corresponding application finishes.
    
    More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome.
    
    Author: Andrew Or <andrewor14@gmail.com>
    Author: andrewor14 <andrewor14@gmail.com>
    
    Closes apache#42 from andrewor14/master and squashes the following commits:
    
    e5f14fa [Andrew Or] Merge github.com:apache/spark
    a1c5cd9 [Andrew Or] Merge github.com:apache/spark
    b8ba817 [Andrew Or] Remove UI from map when removing application in Master
    83af656 [Andrew Or] Scraps and pieces (no functionality change)
    222adcd [Andrew Or] Merge github.com:apache/spark
    124429f [Andrew Or] Clarify LiveListenerBus behavior + Add tests for new behavior
    f80bd31 [Andrew Or] Simplify static handler and BlockManager status update logic
    9e14f97 [Andrew Or] Moved around functionality + renamed classes per Patrick
    6740e49 [Andrew Or] Fix comment nits
    650eb12 [Andrew Or] Add unit tests + Fix bugs found through tests
    45fd84c [Andrew Or] Remove now deprecated test
    c5c2c8f [Andrew Or] Remove list of (TaskInfo, TaskMetrics) from StageInfo
    3456090 [Andrew Or] Address Patrick's comments
    bf80e3d [Andrew Or] Imports, comments, and code formatting, once again (minor)
    ac69ec8 [Andrew Or] Fix test fail
    d801d11 [Andrew Or] Merge github.com:apache/spark (major)
    dc93915 [Andrew Or] Imports, comments, and code formatting (minor)
    77ba283 [Andrew Or] Address Kay's and Patrick's comments
    b6eaea7 [Andrew Or] Treating SparkUI as a handler of MasterUI
    d59da5f [Andrew Or] Avoid logging all the blocks on each executor
    d6e3b4a [Andrew Or] Merge github.com:apache/spark
    ca258a4 [Andrew Or] Master UI - add support for reading compressed event logs
    176e68e [Andrew Or] Fix deprecated message for JavaSparkContext (minor)
    4f69c4a [Andrew Or] Master UI - Rebuild SparkUI on application finish
    291b2be [Andrew Or] Correct directory in log message "INFO: Logging events to <dir>"
    1ba3407 [Andrew Or] Add a few configurable options to event logging
    e375431 [Andrew Or] Add new constructors for SparkUI
    18b256d [Andrew Or] Refactor out event logging and replaying logic from UI
    bb4c503 [Andrew Or] Use a more mnemonic path for logging
    aef411c [Andrew Or] Fix bug: storage status was not reflected on UI in the local case
    03eda0b [Andrew Or] Fix HDFS flush behavior
    36b3e5d [Andrew Or] Add HDFS support for event logging
    cceff2b [andrewor14] Fix 100 char format fail
    2fee310 [Andrew Or] Address Patrick's comments
    2981d61 [Andrew Or] Move SparkListenerBus out of DAGScheduler + Clean up
    5d2cec1 [Andrew Or] JobLogger: ID -> Id
    0503e4b [Andrew Or] Fix PySpark tests + remove sc.clearFiles/clearJars
    4d2fb0c [Andrew Or] Fix format fail
    faa113e [Andrew Or] General clean up
    d47585f [Andrew Or] Clean up FileLogger
    472fd8a [Andrew Or] Fix a couple of tests
    996d7a2 [Andrew Or] Reflect RDD unpersist on UI
    7b2f811 [Andrew Or] Guard against TaskMetrics NPE + Fix tests
    d1f4285 [Andrew Or] Migrate from lift-json to json4s-jackson
    28019ca [Andrew Or] Merge github.com:apache/spark
    bbe3501 [Andrew Or] Embed storage status and RDD info in Task events
    6631c02 [Andrew Or] More formatting changes, this time mainly for Json DSL
    70e7e7a [Andrew Or] Formatting changes
    e9e1c6d [Andrew Or] Move all JSON de/serialization logic to JsonProtocol
    d646df6 [Andrew Or] Completely decouple SparkUI from SparkContext
    6814da0 [Andrew Or] Explicitly register each UI listener rather than through some magic
    64d2ce1 [Andrew Or] Fix BlockManagerUI bug by introducing new event
    4273013 [Andrew Or] Add a gateway SparkListener to simplify event logging
    904c729 [Andrew Or] Fix another major bug
    5ac906d [Andrew Or] Mostly naming, formatting, and code style changes
    3fd584e [Andrew Or] Fix two major bugs
    f3fc13b [Andrew Or] General refactor
    4dfcd22 [Andrew Or] Merge git://git.apache.org/incubator-spark into persist-ui
    b3976b0 [Andrew Or] Add functionality of reconstructing a persisted UI from SparkContext
    8add36b [Andrew Or] JobProgressUI: Add JSON functionality
    d859efc [Andrew Or] BlockManagerUI: Add JSON functionality
    c4cd480 [Andrew Or] Also deserialize new events
    8a2ebe6 [Andrew Or] Fix bugs for EnvironmentUI and ExecutorsUI
    de8a1cd [Andrew Or] Serialize events both to and from JSON (rather than just to)
    bf0b2e9 [Andrew Or] ExecutorUI: Serialize events rather than arbitary executor information
    bb222b9 [Andrew Or] ExecutorUI: render completely from JSON
    dcbd312 [Andrew Or] Add JSON Serializability for all SparkListenerEvent's
    10ed49d [Andrew Or] Merge github.com:apache/incubator-spark into persist-ui
    8e09306 [Andrew Or] Use JSON for ExecutorsUI
    e3ae35f [Andrew Or] Merge github.com:apache/incubator-spark
    3ddeb7e [Andrew Or] Also privatize fields
    090544a [Andrew Or] Privatize methods
    13920c9 [Andrew Or] Update docs
    bd5a1d7 [Andrew Or] Typo: phyiscal -> physical
    287ef44 [Andrew Or] Avoid reading the entire batch into memory; also simplify streaming logic
    3df7005 [Andrew Or] Merge branch 'master' of github.com:andrewor14/incubator-spark
    a531d2e [Andrew Or] Relax assumptions on compressors and serializers when batching
    164489d [Andrew Or] Relax assumptions on compressors and serializers when batching
    andrewor14 authored and pwendell committed Mar 19, 2014
    Commit: 79d07d6
  8. Added doctest for map function in rdd.py

    Doctest added for map in rdd.py
    
    Author: Jyotiska NK <jyotiska123@gmail.com>
    
    Closes apache#177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:
    
    a38527f [Jyotiska NK] Added doctest for map function in rdd.py
    jyotiska authored and mateiz committed Mar 19, 2014
    Commit: 67fa71c
  9. SPARK-1099: Spark's local mode should probably respect spark.cores.max by default
    
    This is for JIRA: https://spark-project.atlassian.net/browse/SPARK-1099
    And this is what I do in this patch (also commented in the JIRA) @aarondav
    
    This is really a behavioral change, so I do this with great caution and welcome any review advice:
    
    1. I change how the "MASTER=local" pattern creates LocalBackend. In the past, we passed 1 core to it; now it uses a default number of cores. The reason is that when someone uses spark-shell to start local mode, the REPL uses this "MASTER=local" pattern by default, so if one also specifies cores on the spark-shell command line, it all goes through here. Passing 1 core here is therefore not suitable given our change.
    2. In LocalBackend, the "totalCores" variable is fetched following a different rule (in the past it just took a user-passed core count, like 1 in the "MASTER=local" pattern or 2 in the "MASTER=local[2]" pattern). The rules:
    a. The second argument of LocalBackend's constructor, indicating cores, has a default value of Int.MaxValue. If the user didn't pass it, its value is Int.MaxValue.
    b. In getMaxCores, we first compare that value to Int.MaxValue. If it's not equal, we assume the user has passed their desired value, so we just use it.
    c. If b is not satisfied, we get cores from spark.cores.max and the real logical core count from Runtime. If the count specified by spark.cores.max is bigger than the logical cores, we use the logical cores; otherwise we use spark.cores.max.
    3. In SparkContextSchedulerCreationSuite's test("local") case, the assertion is modified from 1 to the logical core count, because the "MASTER=local" pattern now uses default values.
    
    Author: qqsun8819 <jin.oyj@alibaba-inc.com>
    
    Closes apache#110 from qqsun8819/local-cores and squashes the following commits:
    
    731aefa [qqsun8819] 1 LocalBackend not change 2 In SparkContext do some process to the cores and pass it to original LocalBackend constructor
    78b9c60 [qqsun8819] 1 SparkContext MASTER=local pattern use default cores instead of 1 to construct LocalBackEnd , for use of spark-shell and cores specified in cmd line 2 some test case change from local to local[1]. 3 SparkContextSchedulerCreationSuite test spark.cores.max config in local pattern
    6ae1ee8 [qqsun8819] Add a static function in LocalBackEnd to let it use spark.cores.max specified cores when no cores are passed to it
    qqsun8819 authored and aarondav committed Mar 19, 2014
    Commit: 1678931

Commits on Mar 20, 2014

  1. Revert "SPARK-1099:Spark's local mode should probably respect spark.c…

    …ores.max by default"
    
    This reverts commit 1678931. Jenkins was not run for this PR.
    aarondav committed Mar 20, 2014
    Commit: ffe272d
  2. Principal Component Analysis

    # Principal Component Analysis
    
    Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the coefficients return matrix contains coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.
    
    ## Testing
    Tests included:
     * All principal components
     * Only top k principal components
     * Dense SVD tests
     * Dense/sparse matrix tests
    
    The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html
    
    ## Documentation
    Added to mllib-guide.md
    
    ## Example Usage
    Added to examples directory under SparkPCA.scala
    
    Author: Reza Zadeh <rizlar@gmail.com>
    
    Closes apache#88 from rezazadeh/sparkpca and squashes the following commits:
    
    e298700 [Reza Zadeh] reformat using IDE
    3f23271 [Reza Zadeh] documentation and cleanup
    b025ab2 [Reza Zadeh] documentation
    e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals
    3787bb4 [Reza Zadeh] stylin
    c6ecc1f [Reza Zadeh] docs
    aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense
    56975b0 [Reza Zadeh] docs
    2df9bde [Reza Zadeh] docs update
    8fb0015 [Reza Zadeh] rcond documentation
    dbf7797 [Reza Zadeh] correct argument number
    a9f1f62 [Reza Zadeh] documentation
    4ce6caa [Reza Zadeh] style changes
    9a56a02 [Reza Zadeh] use rcond relative to larget svalue
    120f796 [Reza Zadeh] housekeeping
    156ff78 [Reza Zadeh] string comprehension
    2e1cf43 [Reza Zadeh] rename rcond
    ea223a6 [Reza Zadeh] many style changes
    f4002d7 [Reza Zadeh] more docs
    bd53c7a [Reza Zadeh] proper accumulator
    a8b5ecf [Reza Zadeh] Don't use for loops
    0dc7980 [Reza Zadeh] filter zeros in sparse
    6115610 [Reza Zadeh] More documentation
    36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation
    bc4599f [Reza Zadeh] configurable rcond
    86f7515 [Reza Zadeh] compute per parition, use while
    09726b3 [Reza Zadeh] more style changes
    4195e69 [Reza Zadeh] private, accumulator
    17002be [Reza Zadeh] style changes
    4ba7471 [Reza Zadeh] style change
    f4982e6 [Reza Zadeh] Use dense matrix in example
    2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops
    72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean
    f807be9 [Reza Zadeh] fix typo
    2d7ccde [Reza Zadeh] Array interface for dense svd and pca
    cd290fa [Reza Zadeh] provide RDD[Array[Double]] support
    398d123 [Reza Zadeh] style change
    55abbfa [Reza Zadeh] docs fix
    ef29644 [Reza Zadeh] bad chnage undo
    472566e [Reza Zadeh] all files from old pr
    555168f [Reza Zadeh] initial files
    rezazadeh authored and mateiz committed Mar 20, 2014
    Commit: 66a03e5
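
    A NumPy sketch of the computation the commit message describes (center the data, take the SVD, keep the top-k right singular vectors); this is illustrative, not the MLlib code:

    ```python
    import numpy as np

    def pca_coefficients(X, k):
        # X is m-by-n (rows = observations); the result is n-by-k, with
        # columns in descending order of component variance.
        Xc = X - X.mean(axis=0)                            # center each column
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # SVD of centered data
        return Vt[:k].T

    X = np.random.rand(100, 5)            # hypothetical data matrix
    print(pca_coefficients(X, 2).shape)   # (5, 2)
    ```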
  3. [Hot Fix apache#42] Do not stop SparkUI if bind() is not called

    This is a bug fix for apache#42 (79d07d6).
    
    In Master, we do not bind() each SparkUI because we do not want to start a server for each finished application. However, when we remove the associated application, we call stop() on the SparkUI, which throws an assertion failure.
    
    This fix ensures we don't call stop() on a SparkUI that was never bind()'ed.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#188 from andrewor14/ui-fix and squashes the following commits:
    
    94a925f [Andrew Or] Do not stop SparkUI if bind() is not called
    andrewor14 authored and pwendell committed Mar 20, 2014
    Commit: ca76423

Commits on Mar 21, 2014

  1. SPARK-1251 Support for optimizing and executing structured queries

    This pull request adds support to Spark for working with structured data using a simple SQL dialect, HiveQL and a Scala Query DSL.
    
    *This is being contributed as a new __alpha component__ to Spark and does not modify Spark core or other components.*
    
    The code is broken into three primary components:
     - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
     - Execution (sql/core) - A query planner / execution engine for translating Catalyst’s logical query plans into Spark RDDs.  This component also includes a new public interface, SqlContext, that allows users to execute SQL or structured scala queries against existing RDDs and Parquet files.
     - Hive Metastore Support (sql/hive) - An extension of SqlContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes.  There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
    
    A more complete design of this new component can be found in [the associated JIRA](https://spark-project.atlassian.net/browse/SPARK-1251).
    
    [An updated version of the Spark documentation, including API Docs for all three sub-components,](http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also available for review.
    
    With this PR comes support for inferring the schema of existing RDDs that contain case classes.  Using this information, developers can now express structured queries that are automatically compiled into RDD operations.
    
    ```scala
    // Define the schema using a case class.
    case class Person(name: String, age: Int)
    val people: RDD[Person] =
      sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).toInt))
    
    // The following is the same as 'SELECT name FROM people WHERE age >= 10 && age <= 19'
    val teenagers = people.where('age >= 10).where('age <= 19).select('name).toRdd
    ```
    
    RDDs can also be registered as Tables, allowing SQL queries to be written over them.
    ```scala
    people.registerAsTable("people")
    val teenagers = sql("SELECT name FROM people WHERE age >= 10 && age <= 19")
    ```
    
    The results of queries are themselves RDDs and support standard RDD operations:
    ```scala
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
    ```
    
    Finally, with the optional Hive support, users can read and write data located in existing Apache Hive deployments using HiveQL.
    ```scala
    sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    sql("LOAD DATA LOCAL INPATH 'src/main/resources/kv1.txt' INTO TABLE src")
    
    // Queries are expressed in HiveQL
    sql("SELECT key, value FROM src").collect().foreach(println)
    ```
    
    ## Relationship to Shark
    
    Unlike Shark, Spark SQL does not act as a drop in replacement for Hive or the HiveServer. Instead this new feature is intended to make it easier for Spark developers to run queries over structured data, using either SQL or the query DSL. After this sub-project graduates from Alpha status it will likely become a new optimizer/backend for the Shark project.
    
    Author: Michael Armbrust <michael@databricks.com>
    Author: Yin Huai <huaiyin.thu@gmail.com>
    Author: Reynold Xin <rxin@apache.org>
    Author: Lian, Cheng <rhythm.mail@gmail.com>
    Author: Andre Schumacher <andre.schumacher@iki.fi>
    Author: Yin Huai <huai@cse.ohio-state.edu>
    Author: Timothy Chen <tnachen@gmail.com>
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    Author: Timothy Chen <tnachen@apache.org>
    Author: Henry Cook <henry.m.cook+github@gmail.com>
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes #146 from marmbrus/catalyst and squashes the following commits:
    
    458bd1b [Michael Armbrust] Update people.txt
    0d638c3 [Michael Armbrust] Typo fix from @ash211.
    bdab185 [Michael Armbrust] Address another round of comments: * Doc examples can now copy/paste into spark-shell. * SQLContext is serializable * Minor parser bugs fixed * Self-joins of RDDs now handled correctly. * Removed deprecated examples * Removed deprecated parquet docs * Made more of the API private * Copied all the DSLQuery tests and rewrote them as SQLQueryTests
    778299a [Michael Armbrust] Fix some old links to spark-project.org
    fead0b6 [Michael Armbrust] Create a new RDD type, SchemaRDD, that is now the return type for all SQL operations.  This improves the old API by reducing the number of implicits that are required, and avoids throwing away schema information when returning an RDD to the user.  This change also makes it slightly less verbose to run language integrated queries.
    fee847b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into catalyst, integrating changes to serialization for ShuffledRDD.
    48a99bc [Michael Armbrust] Address first round of feedback.
    461581c [Michael Armbrust] Blacklist test that depends on JVM specific rounding behaviour
    adcf1a4 [Henry Cook] Update sql-programming-guide.md
    9dffbfa [Michael Armbrust] Style fixes. Add downloading of test cases to jenkins.
    6978dd8 [Michael Armbrust] update docs, add apache license
    1d0eb63 [Michael Armbrust] update changes with spark core
    e5e1d6b [Michael Armbrust] Remove travis configuration.
    c2efad6 [Michael Armbrust] First draft of SQL documentation.
    013f62a [Michael Armbrust] Fix documentation / code style.
    c01470f [Michael Armbrust] Clean up example
    2f22454 [Michael Armbrust] WIP: Parquet example.
    ce8073b [Michael Armbrust] clean up implicits.
    f7d992d [Michael Armbrust] Naming / spelling.
    9eb0294 [Michael Armbrust] Bring expressions implicits into SqlContext.
    d2d9678 [Michael Armbrust] Make sure hive isn't in the assembly jar.  Create a separate, optional Hive assembly that is used when present.
    8b35e0a [Michael Armbrust] address feedback, work on DSL
    5d71074 [Michael Armbrust] Merge pull request #62 from AndreSchumacher/parquet_file_fixes
    f93aa39 [Andre Schumacher] Better handling of path names in ParquetRelation
    1a4bbd9 [Michael Armbrust] Merge pull request #60 from marmbrus/maven
    3386e4f [Michael Armbrust] Merge pull request #58 from AndreSchumacher/parquet_fixes
    3447c3e [Michael Armbrust] Don't override the metastore / warehouse in non-local/test hive context.
    7233a74 [Michael Armbrust] initial support for maven builds
    f0ba39e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into maven
    7386a9f [Michael Armbrust] Initial example programs using spark sql.
    aeaef54 [Andre Schumacher] Removing unnecessary Row copying and reverting some changes to MutableRow
    7ca4b4e [Andre Schumacher] Improving checks in Parquet tests
    5bacdc0 [Andre Schumacher] Moving towards mutable rows inside ParquetRowSupport
    54637ec [Andre Schumacher] First part of second round of code review feedback
    c2a658d [Michael Armbrust] Merge pull request #55 from marmbrus/mutableRows
    ba28849 [Michael Armbrust] code review comments.
    d994333 [Michael Armbrust] Remove copies before shuffle, this required changing the default shuffle serialization.
    9049cf0 [Michael Armbrust] Extend MutablePair interface to support easy syntax for in-place updates.  Also add a constructor so that it can be serialized out-of-the-box.
    959bdf0 [Michael Armbrust] Don't silently swallow all KryoExceptions, only the one that indicates the end of a stream.
    d371393 [Michael Armbrust] Add a framework for dealing with mutable rows to reduce the number of object allocations that occur in the critical path.
    c9f8fb3 [Michael Armbrust] Merge pull request #53 from AndreSchumacher/parquet_support
    3c3f962 [Michael Armbrust] Fix a bug due to array reuse.  This will need to be revisited after we merge the mutable row PR.
    7d0f13e [Michael Armbrust] Update parquet support with master.
    9d419a6 [Michael Armbrust] Merge remote-tracking branch 'catalyst/catalystIntegration' into parquet_support
    0040ae6 [Andre Schumacher] Feedback from code review
    1ce01c7 [Michael Armbrust] Merge pull request #56 from liancheng/unapplySeqForRow
    70e489d [Cheng Lian] Fixed a spelling typo
    6d315bb [Cheng Lian] Added Row.unapplySeq to extract fields from a Row object.
    8d5da5e [Michael Armbrust] modify compute-classpath.sh to include datanucleus jars explicitly
    99e61fb [Michael Armbrust] Merge pull request #51 from marmbrus/expressionEval
    7b9d142 [Michael Armbrust] Update travis to increase permgen size.
    da9afbd [Michael Armbrust] Add byte wrappers for hive UDFS.
    6fdefe6 [Michael Armbrust] Port sbt improvements from master.
    296fe50 [Michael Armbrust] Address review feedback.
    d7fbc3a [Michael Armbrust] Several performance enhancements and simplifications of the expression evaluation framework.
    3bda72d [Andre Schumacher] Adding license banner to new files
    3ac9eb0 [Andre Schumacher] Rebasing to new main branch
    c863bed [Andre Schumacher] Codestyle checks
    61e3bfb [Andre Schumacher] Adding WriteToFile operator and rewriting ParquetQuerySuite
    3321195 [Andre Schumacher] Fixing one import in ParquetQueryTests.scala
    3a0a552 [Andre Schumacher] Reorganizing Parquet table operations
    18fdc44 [Andre Schumacher] Reworking Parquet metadata in relation and adding CREATE TABLE AS for Parquet tables
    75262ee [Andre Schumacher] Integrating operations on Parquet files into SharkStrategies
    f347273 [Andre Schumacher] Adding ParquetMetaData extraction, fixing schema projection
    6a6bf98 [Andre Schumacher] Added column projections to ParquetTableScan
    0f17d7b [Andre Schumacher] Rewriting ParquetRelation tests with RowWriteSupport
    a11e364 [Andre Schumacher] Adding Parquet RowWriteSupport
    6ad05b3 [Andre Schumacher] Moving ParquetRelation to spark.sql core
    eb0e521 [Andre Schumacher] Fixing package names and other problems that came up after the rebase
    99a9209 [Andre Schumacher] Expanding ParquetQueryTests to cover all primitive types
    b33e47e [Andre Schumacher] First commit of Parquet import of primitive column types
    c334386 [Michael Armbrust] Initial support for generating schema's based on case classes.
    608a29e [Michael Armbrust] Add hive as a repl dependency
    7413ac2 [Michael Armbrust] make test downloading quieter.
    4d57d0e [Michael Armbrust] Fix test execution on travis.
    5f2963c [Michael Armbrust] naming and continuous compilation fixes.
    f5e7492 [Michael Armbrust] Add Apache license.  Make naming more consistent.
    3ac9416 [Michael Armbrust] Merge support for working with schema-ed RDDs using catalyst in as a spark subproject.
    2225431 [Michael Armbrust] Merge pull request #48 from marmbrus/minorFixes
    d393d2a [Michael Armbrust] Review Comments: Add comment to map that adds a sub query.
    24eaa79 [Michael Armbrust] fix > 100 chars
    6e04e5b [Michael Armbrust] Add insertIntoTable to the DSL.
    df88f01 [Michael Armbrust] add a simple test for aggregation
    18a861b [Michael Armbrust] Correctly convert nested products into nested rows when turning scala data into catalyst data.
    b922511 [Michael Armbrust] Fix insertion of nested types into hive tables.
    5fe7de4 [Michael Armbrust] Move table creation out of rule into a separate function.
    a430895 [Michael Armbrust] Planning for logical Repartition operators.
    532dd37 [Michael Armbrust] Allow the local warehouse path to be specified.
    4905b2b [Michael Armbrust] Add more efficient TopK that avoids global sort for logical Sort => StopAfter.
    8c01c24 [Michael Armbrust] Move definition of Row out of execution to top level sql package.
    c9116a6 [Michael Armbrust] Add combiner to avoid NPE when spark performs external aggregation.
    29effad [Michael Armbrust] Include alias in attributes that are produced by overridden tables.
    9990ec7 [Michael Armbrust] Merge pull request #28 from liancheng/columnPruning
    f22df3a [Michael Armbrust] Merge pull request #37 from yhuai/SerDe
    cf4db59 [Lian, Cheng] Added golden answers for PruningSuite
    54f165b [Lian, Cheng] Fixed spelling typo in two golden answer file names
    2682f72 [Lian, Cheng] Merge remote-tracking branch 'origin/master' into columnPruning
    c5a4fab [Lian, Cheng] Merge branch 'master' into columnPruning
    f670c8c [Yin Huai] Throw a NotImplementedError for unsupported clauses in a CTAS query.
    128a9f8 [Yin Huai] Minor changes.
    017872c [Yin Huai] Remove stats20 from whitelist.
    a1a4776 [Yin Huai] Update comments.
    feb022c [Yin Huai] Partitioning key should be case insensitive.
    555fb1d [Yin Huai] Correctly set the extension for a text file.
    d00260b [Yin Huai] Strips backticks from partition keys.
    334aace [Yin Huai] New golden files.
    a40d6d6 [Yin Huai] Loading the static partition specified in an INSERT INTO/OVERWRITE query.
    428aff5 [Yin Huai] Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
    eea75c5 [Yin Huai] Correctly set codec.
    45ffb86 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
    e089627 [Yin Huai] Code style.
    563bb22 [Yin Huai] Set compression info in FileSinkDesc.
    35c9a8a [Michael Armbrust] Merge pull request #46 from marmbrus/reviewFeedback
    bdab5ed [Yin Huai] Add a TODO for loading data into partitioned tables.
    5495fab [Yin Huai] Remove cloneRecords which is no longer needed.
    1596e1b [Yin Huai] Cleanup imports to make IntelliJ happy.
    3bb272d [Michael Armbrust] move org.apache.spark.sql package.scala to the correct location.
    8506c17 [Michael Armbrust] Address review feedback.
    3cb4f2e [Michael Armbrust] Merge pull request #45 from tnachen/master
    9ad474d [Michael Armbrust] Merge pull request #44 from marmbrus/sampling
    566fd66 [Timothy Chen] Whitelist tests and add support for Binary type
    69adf72 [Yin Huai] Set cloneRecords to false.
    a9c3188 [Timothy Chen] Fix udaf struct return
    346f828 [Yin Huai] Move SharkHadoopWriter to the correct location.
    59e37a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
    ed3a1d1 [Yin Huai] Load data directly into Hive.
    7f206b5 [Michael Armbrust] Add support for hive TABLESAMPLE PERCENT.
    b6de691 [Michael Armbrust] Merge pull request #43 from liancheng/fixMakefile
    1f6260d [Lian, Cheng] Fixed package name and test suite name in Makefile
    5ae010f [Michael Armbrust] Merge pull request #42 from markhamstra/non-ascii
    678341a [Mark Hamstra] Replaced non-ascii text
    887f928 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SerDeNew
    1f7d00a [Reynold Xin] Merge pull request #41 from marmbrus/splitComponents
    7588a57 [Michael Armbrust] Break into 3 major components and move everything into the org.apache.spark.sql package.
    bc9a12c [Michael Armbrust] Move hive test files.
    5720d2b [Lian, Cheng] Fixed comment typo
    f0c3742 [Lian, Cheng] Refactored PhysicalOperation
    f235914 [Lian, Cheng] Test case udf_regex and udf_like need BooleanWritable registered
    cf691df [Lian, Cheng] Added the PhysicalOperation to generalize ColumnPrunings
    2407a21 [Lian, Cheng] Added optimized logical plan to debugging output
    a7ad058 [Michael Armbrust] Merge pull request #40 from marmbrus/includeGoldens
    9329820 [Michael Armbrust] add golden answer files to repository
    dce0593 [Michael Armbrust] move golden answer to the source code directory.
    964368f [Michael Armbrust] Merge pull request #39 from marmbrus/lateralView
    7785ee6 [Michael Armbrust] Tighten visibility based on comments.
    341116c [Michael Armbrust] address comments.
    0e6c1d7 [Reynold Xin] Merge pull request #38 from yhuai/parseDBNameInCTAS
    2897deb [Michael Armbrust] fix scaladoc
    7123225 [Yin Huai] Correctly parse the db name and table name in INSERT queries.
    b376d15 [Michael Armbrust] fix newlines at EOF
    5cc367c [Michael Armbrust] use berkeley instead of cloudbees
    ff5ea3f [Michael Armbrust] new golden
    db92adc [Michael Armbrust] more tests passing. clean up logging.
    740febb [Michael Armbrust] Tests for tgfs.
    0ce61b0 [Michael Armbrust] Docs for GenericHiveUdtf.
    ba8897f [Michael Armbrust] Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
    dd00b7e [Michael Armbrust] initial implementation of generators.
    ea76cf9 [Michael Armbrust] Add NoRelation to planner.
    bea4b7f [Michael Armbrust] Add SumDistinct.
    016b489 [Michael Armbrust] fix typo.
    acb9566 [Michael Armbrust] Correctly type attributes of CTAS.
    8841eb8 [Michael Armbrust] Rename Transform -> ScriptTransformation.
    02ff8e4 [Yin Huai] Correctly parse the db name and table name in a CTAS query.
    5e4d9b4 [Michael Armbrust] Merge pull request #35 from marmbrus/smallFixes
    5479066 [Reynold Xin] Merge pull request #36 from marmbrus/partialAgg
    8017afb [Michael Armbrust] fix copy paste error.
    dc6353b [Michael Armbrust] turn off deprecation
    cab1a84 [Michael Armbrust] Fix PartialAggregate inheritance.
    883006d [Michael Armbrust] improve tests.
    32b615b [Michael Armbrust] add override to asPartial.
    e1999f9 [Yin Huai] Use Deserializer and Serializer instead of AbstractSerDe.
    f94345c [Michael Armbrust] fix doc link
    d8cb805 [Michael Armbrust] Implement partial aggregation.
    ccdb07a [Michael Armbrust] Fix bug where averages of strings are turned into sums of strings.  Remove a blank line.
    b4be6a5 [Michael Armbrust] better logging when applying rules.
    67128b8 [Reynold Xin] Merge pull request #30 from marmbrus/complex
    cb57459 [Michael Armbrust] blacklist machine specific test.
    2f27604 [Michael Armbrust] Address comments / style errors.
    389525d [Michael Armbrust] update golden, blacklist mr.
    e3c10bd [Michael Armbrust] update whitelist.
    44d343c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into complex
    42ec4af [Michael Armbrust] improve complex type support in hive udfs/udafs.
    ab5bff3 [Michael Armbrust] Support for get item of map types.
    1679554 [Michael Armbrust] add toString for if and IS NOT NULL.
    ab9a131 [Michael Armbrust] when UDFs fail they should return null.
    25288d0 [Michael Armbrust] Implement [] for arrays and maps.
    e7933e9 [Michael Armbrust] fix casting bug when working with fractional expressions.
    010accb [Michael Armbrust] add tinyint to metastore type parser.
    7a0f543 [Michael Armbrust] Avoid propagating types from unresolved nodes.
    ac9d7de [Michael Armbrust] Resolve *s in Transform clauses.
    692a477 [Michael Armbrust] Support for wrapping arrays to be written into hive tables.
    92e4158 [Reynold Xin] Merge pull request #32 from marmbrus/tooManyProjects
    9c06778 [Michael Armbrust] fix serialization issues, add JavaStringObjectInspector.
    72a003d [Michael Armbrust] revert regex change
    7661b6c [Michael Armbrust] blacklist machine-specific tests
    aa430e7 [Michael Armbrust] Update .travis.yml
    e4def6b [Michael Armbrust] set dataType for HiveGenericUdfs.
    5e54aa6 [Michael Armbrust] quotes for struct field names.
    bbec500 [Michael Armbrust] update test coverage, new golden
    3734a94 [Michael Armbrust] only quote string types.
    3f9e519 [Michael Armbrust] use names w/ boolean args
    5b3d2c8 [Michael Armbrust] implement distinct.
    5b33216 [Michael Armbrust] work on decimal support.
    2c6deb3 [Michael Armbrust] improve printing compatibility.
    35a70fb [Michael Armbrust] multi-letter field names.
    a9388fb [Michael Armbrust] printing for map types.
    c3feda7 [Michael Armbrust] use toArray.
    c654f19 [Michael Armbrust] Support for list and maps in hive table scan.
    cf8d992 [Michael Armbrust] Use built in functions for creating temp directory.
    1579eec [Michael Armbrust] Only cast unresolved inserts.
    6420c7c [Michael Armbrust] Memoize the ordinal in the GetField expression.
    da7ae9d [Michael Armbrust] Add boolean writable that was breaking udf_regexp test.  Not sure how this was passing before...
    6709441 [Michael Armbrust] Evaluation for accessing nested fields.
    dc6463a [Michael Armbrust] Support for resolving access to nested fields using "." notation.
    d670e41 [Michael Armbrust] Print nested fields like hive does.
    efa7217 [Michael Armbrust] Support for reading structs in HiveTableScan.
    9c22b4e [Michael Armbrust] Support for parsing nested types.
    82163e3 [Michael Armbrust] special case handling of partitionKeys when casting insert into tables
    ea6f37f [Michael Armbrust] fix style.
    7845364 [Michael Armbrust] deactivate concurrent test.
    b649c20 [Michael Armbrust] fix test logging / caching.
    1590568 [Michael Armbrust] add log4j.properties
    19bfd74 [Michael Armbrust] store hive output in circular buffer
    dfb67aa [Michael Armbrust] add test case
    cb775ac [Michael Armbrust] get rid of SharkContext singleton
    2de89d0 [Michael Armbrust] Merge pull request #13 from tnachen/master
    63003e9 [Michael Armbrust] Fix spacing.
    41b41f3 [Michael Armbrust] Only cast unresolved inserts.
    6eb5960 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into udafs
    5b7afd8 [Michael Armbrust] Merge pull request #10 from yhuai/exchangeOperator
    b1151a8 [Timothy Chen] Fix load data regex
    8e0931f [Michael Armbrust] Cast to avoid using deprecated hive API.
    e079f2b [Timothy Chen] Add GenericUDAF wrapper and HiveUDAFFunction
    45b334b [Yin Huai] fix comments
    235cbb4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
    fc67b50 [Yin Huai] Check for a Sort operator with the global flag set instead of an Exchange operator with a RangePartitioning.
    6015f93 [Michael Armbrust] Merge pull request #29 from rxin/style
    271e483 [Michael Armbrust] Update build status icon.
    d3a3d48 [Michael Armbrust] add testing to travis
    807b2d7 [Michael Armbrust] check style and publish docs with travis
    d20b565 [Michael Armbrust] fix if style
    bce024d [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into style Disable if brace checking as it errors in single line functional cases unlike the style guide.
    d91e276 [Michael Armbrust] Remove dependence on HIVE_HOME for running tests.  This was done by moving all the hive query test (from branch-0.12) and data files into src/test/hive.  These are used by default when HIVE_HOME is not set.
    f47c2f6 [Yin Huai] set outputPartitioning in BroadcastNestedLoopJoin
    41bbee6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
    7e24436 [Reynold Xin] Removed dependency on JDK 7 (nio.file).
    5c1e600 [Reynold Xin] Added hash code implementation for AttributeReference
    7213a2c [Reynold Xin] style fix for Hive.scala.
    08e4d05 [Reynold Xin] First round of style cleanup.
    605255e [Reynold Xin] Added scalastyle checker.
    61e729c [Lian, Cheng] Added ColumnPrunings strategy and test cases
    2486fb7 [Lian, Cheng] Fixed spelling
    8ee41be [Lian, Cheng] Minor refactoring
    ebb56fa [Michael Armbrust] add travis config
    4c89d6e [Reynold Xin] Merge pull request #27 from marmbrus/moreTests
    d4f539a [Michael Armbrust] blacklist mr and user specific tests.
    677eb07 [Michael Armbrust] Update test whitelist.
    5dab0bc [Michael Armbrust] Merge pull request #26 from liancheng/serdeAndPartitionPruning
    c263c84 [Michael Armbrust] Only push predicates into partitioned table scans.
    ab77882 [Michael Armbrust] upgrade spark to RC5.
    c98ede5 [Lian, Cheng] Response to comments from @marmbrus
    83d4520 [Yin Huai] marmbrus's comments
    70994a3 [Lian, Cheng] Revert unnecessary Scaladoc changes
    9ebff47 [Yin Huai] remove unnecessary .toSeq
    e811d1a [Yin Huai] markhamstra's comments
    4802f69 [Yin Huai] The outputPartitioning of a UnaryNode inherits its child's outputPartitioning by default. Also, update the logic in AddExchange to avoid unnecessary shuffling operations.
    040fbdf [Yin Huai] AddExchange is the only place to add Exchange operators.
    9fb357a [Yin Huai] use getSpecifiedDistribution to create Distribution. ClusteredDistribution and OrderedDistribution do not take Nil as input expressions.
    e9347fc [Michael Armbrust] Remove broken scaladoc links.
    99c6707 [Michael Armbrust] upgrade spark
    57799ad [Lian, Cheng] Added special treatment for HiveVarchar in InsertIntoHiveTable
    cb49af0 [Lian, Cheng] Fixed Scaladoc links
    4e5e4d4 [Lian, Cheng] Added PreInsertionCasts to do necessary casting before insertion
    111ffdc [Lian, Cheng] More comments and minor reformatting
    9e0d840 [Lian, Cheng] Added partition pruning optimization
    761bbb8 [Lian, Cheng] Generalized BindReferences to run against any query plan
    04eb5da [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
    9dd3b26 [Michael Armbrust] Fix scaladoc.
    6f44cac [Lian, Cheng] Made TableReader & HadoopTableReader private to catalyst
    7c92a41 [Lian, Cheng] Added Hive SerDe support
    ce5fdd6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
    2957f31 [Yin Huai] addressed comments on PR
    907db68 [Michael Armbrust] Space after while.
    04573a0 [Reynold Xin] Merge pull request #24 from marmbrus/binaryCasts
    4e50679 [Reynold Xin] Merge pull request #25 from marmbrus/rowOrderingWhile
    5bc1dc2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into exchangeOperator
    be1fff7 [Michael Armbrust] Replace foreach with while in RowOrdering. Fixes #23
    fd084a4 [Michael Armbrust] implement casts binary <=> string.
    0b31176 [Michael Armbrust] Merge pull request #22 from rxin/type
    548e479 [Yin Huai] merge master into exchangeOperator and fix code style
    5b11db0 [Reynold Xin] Added Void to Boolean type widening.
    9e3d989 [Reynold Xin] Made HiveTypeCoercion.WidenTypes more clear.
    9bb1979 [Reynold Xin] Merge pull request #19 from marmbrus/variadicUnion
    a2beb38 [Michael Armbrust] Merge pull request #21 from liancheng/fixIssue20
    b20a4d4 [Lian, Cheng] Fix issue #20
    6d6cb58 [Michael Armbrust] add source links that point to github to the scala doc.
    4285962 [Michael Armbrust] Remove temporary test cases
    167162f [Michael Armbrust] more merge errors, cleanup.
    e170ccf [Michael Armbrust] Improve documentation and remove some spurious changes that were introduced by the merge.
    6377d0b [Michael Armbrust] Drop empty files, fix if ().
    c0b0e60 [Michael Armbrust] cleanup broken doc links.
    330a88b [Michael Armbrust] Fix bugs in AddExchange.
    4f345f2 [Michael Armbrust] Remove SortKey, use RowOrdering.
    043e296 [Michael Armbrust] Make physical union nodes variadic.
    ece15e1 [Michael Armbrust] update unit tests
    5c89d2e [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into exchangeOperator Fix deprecated use of combineValuesByKey. Get rid of test where the answer is dependent on the plan execution width.
    9804eb5 [Michael Armbrust] upgrade spark
    053a371 [Michael Armbrust] Merge pull request #15 from marmbrus/orderedRow
    5ab18be [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into orderedRow
    ca2ff68 [Michael Armbrust] Merge pull request #17 from marmbrus/unionTypes
    bf9161c [Michael Armbrust] Merge pull request #18 from marmbrus/noSparkAgg
    563053f [Michael Armbrust] Address @rxin's comments.
    6537c66 [Michael Armbrust] Address @rxin's comments.
    2a76fc6 [Michael Armbrust] add notes from @rxin.
    685bfa1 [Michael Armbrust] fix spelling
    69ed98f [Michael Armbrust] Output a single row for empty Aggregations with no grouping expressions.
    7859a86 [Michael Armbrust] Remove SparkAggregate.  It's kinda broken and breaks RDD lineage.
    fc22e01 [Michael Armbrust] whitelist newly passing union test.
    3f547b8 [Michael Armbrust] Add support for widening types in unions.
    53b95f8 [Michael Armbrust] coercion should not occur until children are resolved.
    b892e32 [Michael Armbrust] Union is not resolved until the types match up.
    95ab382 [Michael Armbrust] Use resolved instead of custom function.  This is better because some nodes override the notion of resolved.
    81a109d [Michael Armbrust] fix link.
    f143f61 [Michael Armbrust] Implement sampling.  Fixes a flaky test where the JVM notices that RAND as a Comparison method "violates its general contract!"
    6cd442b [Michael Armbrust] Use numPartitions variable, fix grammar.
    c800798 [Michael Armbrust] Add build status icon.
    0cf5a75 [Michael Armbrust] Merge pull request #16 from marmbrus/filterPushDown
    05d3a0d [Michael Armbrust] Refactor to avoid serializing ordering details with every row.
    f2fdd77 [Michael Armbrust] fix required distribution for aggregate.
    658866e [Michael Armbrust] Pull back in changes made by @yhuai eliminating CoGroupedLocallyRDD.scala
    583a337 [Michael Armbrust] break apart distribution and partitioning.
    e8d41a9 [Michael Armbrust] Merge remote-tracking branch 'yin/exchangeOperator' into exchangeOperator
    0ff8be7 [Michael Armbrust] Cleanup spurious changes and fix doc links.
    73c70de [Yin Huai] add a first set of unit tests for data properties.
    fbfa437 [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into filterPushDown Minor doc improvements.
    2b9d80f [Yin Huai] initial commit of adding exchange operators to physical plans.
    fcbc03b [Michael Armbrust] Fix if ().
    7b9080c [Michael Armbrust] Create OrderedRow class to allow ordering to be used by multiple operators.
    b4adb0f [Michael Armbrust] Merge pull request #14 from marmbrus/castingAndTypes
    b2a1ec5 [Michael Armbrust] add comment on how using numeric implicitly complicates spark serialization.
    e286d20 [Michael Armbrust] address code review comments.
    80d0681 [Michael Armbrust] fix scaladoc links.
    de0c248 [Michael Armbrust] Print the executed plan in SharkQuery toString.
    3413e61 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
    404d552 [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
    fb84ae4 [Michael Armbrust] Refactor DataProperty into Distribution.
    2abb0bc [Michael Armbrust] better debug messages, use exists.
    098dfc4 [Michael Armbrust] Implement Long sorting again.
    60f3a9a [Michael Armbrust] More aggregate functions out of the aggregate class to make things more readable.
    a1ef62e [Michael Armbrust] Print the executed plan in SharkQuery toString.
    dfce426 [Michael Armbrust] Add mapChildren and withNewChildren methods to TreeNode.
    037a2ed [Michael Armbrust] Better exception when unbound attributes make it to evaluation.
    ec90620 [Michael Armbrust] Support for Sets as arguments to TreeNode classes.
    b21f803 [Michael Armbrust] Merge pull request #11 from marmbrus/goldenGen
    83adb9d [Yin Huai] add DataProperty
    5a26292 [Michael Armbrust] Rules to bring casting more inline with Hive semantics.
    f0e0161 [Michael Armbrust] Move numeric types into DataTypes simplifying evaluator.  This can probably also be used for codegen...
    6d2924d [Michael Armbrust] add support for If. Not integrated in HiveQL yet.
    ccc4dbf [Michael Armbrust] Add optimization rule to simplify casts.
    058ec15 [Michael Armbrust] handle more writeables.
    ffa9f25 [Michael Armbrust] blacklist some more MR tests.
    aa2239c [Michael Armbrust] filter test lines containing Owner:
    f71a325 [Michael Armbrust] Update golden jar.
    a3003ae [Michael Armbrust] Update makefile to use better sharding support.
    568d150 [Michael Armbrust] Updates to white/blacklist.
    8351f25 [Michael Armbrust] Add an ignored test to remind us we don't do empty aggregations right.
    c4104ec [Michael Armbrust] Numerous improvements to testing infrastructure.  See comments for details.
    09c6300 [Michael Armbrust] Add nullability information to StructFields.
    5460b2d [Michael Armbrust] load srcpart by default.
    3695141 [Michael Armbrust] Lots of parser improvements.
    965ac9a [Michael Armbrust] Add expressions that allow access into complex types.
    3ba53c9 [Michael Armbrust] Output type suffixes on AttributeReferences.
    8777489 [Michael Armbrust] Initial support for operators that allow the user to specify partitioning.
    e57f97a [Michael Armbrust] more decimal/null support.
    e1440ed [Michael Armbrust] Initial support for function specific type conversions.
    1814ed3 [Michael Armbrust] use childrenResolved function.
    f2ec57e [Michael Armbrust] Begin supporting decimal.
    6924e6e [Michael Armbrust] Handle NullTypes when resolving HiveUDFs
    7fcfa8a [Michael Armbrust] Initial support for parsing unspecified partition parameters.
    d0124f3 [Michael Armbrust] Correctly type null literals.
    b65626e [Michael Armbrust] Initial support for parsing BigDecimal.
    a90efda [Michael Armbrust] utility function for outputting string stacktraces.
    7102f33 [Michael Armbrust] methods with side-effects should use ().
    3ccaef7 [Michael Armbrust] add renaming TODO.
    bc282c7 [Michael Armbrust] fix bug in getNodeNumbered
    c8e89d5 [Michael Armbrust] memoize inputSet calculation.
    6aefa46 [Michael Armbrust] Skip folding literals.
    a72e540 [Michael Armbrust] Add IN operator.
    04f885b [Michael Armbrust] literals are only non-nullable if they are not null.
    35d2948 [Michael Armbrust] correctly order partition and normal attributes in hive relation output.
    12fd52d [Michael Armbrust] support for sorting longs.
    0606520 [Michael Armbrust] drop old comment.
    859200a [Michael Armbrust] support for reading more types from the metastore.
    1fedd18 [Michael Armbrust] coercion from null to numeric types
    71e902d [Michael Armbrust] fix test cases.
    cc06b6c [Michael Armbrust] Merge remote-tracking branch 'databricks/master' into interviewAnswer
    8a8b521 [Reynold Xin] Merge pull request #8 from marmbrus/testImprovment
    86355a6 [Michael Armbrust] throw error if there are unexpected join clauses.
    c5842d2 [Michael Armbrust] don't throw an error when a select clause outputs multiple copies of the same attribute.
    0e975ea [Michael Armbrust] parse bucket sampling as percentage sampling
    a92919d [Michael Armbrust] add alter view as to native commands
    f58d5a5 [Michael Armbrust] support for parsing SELECT DISTINCT
    f0faa26 [Michael Armbrust] add sample and distinct operators.
    ef7b943 [Michael Armbrust] add metastore support for float
    e9f4588 [Michael Armbrust] fix > 100 char.
    755b229 [Michael Armbrust] blacklist some ddl tests.
    9ae740a [Michael Armbrust] blacklist more tests that require MR.
    4cfc11a [Michael Armbrust] more test coverage.
    0d9d56a [Michael Armbrust] add more native commands to parser
    78d730d [Michael Armbrust] Load src test table on RESET.
    8364ec2 [Michael Armbrust] whitelist all possible partition values.
    b01468d [Michael Armbrust] support path rewrites when the query begins with a comment.
    4c6b454 [Michael Armbrust] add option for recomputing the cached golden answer when tests fail.
    4c5fb0f [Michael Armbrust] makefile target for building new whitelist.
    4b6fed8 [Michael Armbrust] support for parsing both DESTINATION and INSERT_INTO.
    516481c [Michael Armbrust] Ignore requests to explain native commands.
    68aa2e6 [Michael Armbrust] Stronger type for Token extractor.
    ca4ea26 [Michael Armbrust] Support for parsing UDF(*).
    1aafea3 [Michael Armbrust] Configure partition whitelist in TestShark reset.
    9627616 [Michael Armbrust] Use current database as default database.
    9b02b44 [Michael Armbrust] Fix spelling error. Add failFast mode.
    6f64cee [Michael Armbrust] don't line wrap string literal
    eafaeed [Michael Armbrust] add type documentation
    f54c94c [Michael Armbrust] make golden answers file a test dependency
    5362365 [Michael Armbrust] push conditions into join
    0d2388b [Michael Armbrust] Point at databricks hosted scaladoc.
    73b29cd [Michael Armbrust] fix bad casting
    9aa06c5 [Michael Armbrust] Merge pull request #7 from marmbrus/docFixes
    7eff191 [Michael Armbrust] link all the expression names.
    83227e4 [Michael Armbrust] fix scaladoc list syntax, add docs for some rules
    9de6b74 [Michael Armbrust] fix language feature and deprecation warnings.
    0b1960a [Michael Armbrust] Fix broken scala doc links / warnings.
    b1acb36 [Michael Armbrust] Merge pull request #3 from yhuai/evalauteLiteralsInExpressions
    01c00c2 [Michael Armbrust] new golden
    5c14857 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    b749b51 [Michael Armbrust] Merge pull request #5 from marmbrus/testCaching
    66adceb [Michael Armbrust] Merge pull request #6 from marmbrus/joinWork
    1a393da [Yin Huai] folded -> foldable
    1e964ea [Yin Huai] update
    a43d41c [Michael Armbrust] more tests passing!
    8ca38d0 [Michael Armbrust] begin support for varchar / binary types.
    ab8bbd1 [Michael Armbrust] parsing % operator
    c16c8b5 [Michael Armbrust] case insensitive checking for hooks in tests.
    3a90a5f [Michael Armbrust] simpler output when running a single test from the commandline.
    5332fee [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    367fb9e [Yin Huai] update
    0cd5cc6 [Michael Armbrust] add BIGINT cast parsing
    61b266f [Michael Armbrust] comment for eliminate subqueries.
    d72a5a2 [Michael Armbrust] add long to literal factory object.
    b3bd15f [Michael Armbrust] blacklist more mr requiring tests.
    e06fd38 [Michael Armbrust] black list map reduce tests.
    8e7ce30 [Michael Armbrust] blacklist some env specific tests.
    6250cbd [Michael Armbrust] Do not exit on test failure
    b22b220 [Michael Armbrust] also look for cached hive test answers on the classpath.
    b6e4899 [Yin Huai] formatting
    e75c90d [Reynold Xin] Merge pull request #4 from marmbrus/hive12
    5fabbec [Michael Armbrust] ignore partitioned scan test. scan seems to be working but there is some error about the table already existing?
    9e190f5 [Michael Armbrust] drop unneeded ()
    68b58c1 [Michael Armbrust] drop a few more tests.
    b0aa400 [Michael Armbrust] update whitelist.
    c99012c [Michael Armbrust] skip tests with hooks
    db00ebf [Michael Armbrust] more types for hive udfs
    dbc3678 [Michael Armbrust] update ghpages repo
    138f53d [Yin Huai] addressed comments and added a space after the defining keyword of every control structure.
    6f954ee [Michael Armbrust] export the hadoop classpath when starting sbt, required to invoke hive during tests.
    46bf41b [Michael Armbrust] add a makefile for priming the test answer cache in parallel.  usage: "make -j 8 -i"
    8d47ed4 [Yin Huai] comment
    2795f05 [Yin Huai] comment
    e003728 [Yin Huai] move OptimizerSuite to the package of catalyst.optimizer
    2941d3a [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    0bd1688 [Yin Huai] update
    6a7bd75 [Michael Armbrust] fix partition column delimiter configuration.
    e942da1 [Michael Armbrust] Begin upgrade to Hive 0.12.0.
    b8cd7e3 [Michael Armbrust] Merge pull request #7 from rxin/moreclean
    52864da [Reynold Xin] Added executeCollect method to SharkPlan.
    f0e1cbf [Reynold Xin] Added resolved lazy val to LogicalPlan.
    b367e36 [Reynold Xin] Replaced the use of ??? with UnsupportedOperationException.
    38124bd [Yin Huai] formatting
    2924468 [Yin Huai] add two tests for testing pre-order and post-order tree traversal, respectively
    555d839 [Reynold Xin] More cleaning ...
    d48d0e1 [Reynold Xin] Code review feedback.
    aa2e694 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    5c421ac [Reynold Xin] Imported SharkEnv, SharkContext, and HadoopTableReader to remove Shark dependency.
    479e055 [Reynold Xin] A set of minor changes, including: - import order - limit some lines to 100 character wide - inline code comment - more scaladocs - minor spacing (i.e. add a space after if)
    da16e45 [Reynold Xin] Merge pull request #3 from rxin/packagename
    e36caf5 [Reynold Xin] Renamed Rule.name to Rule.ruleName since name is used too frequently in the code base and is shadowed often by local scope.
    72426ed [Reynold Xin] Rename shark2 package to execution.
    0892153 [Reynold Xin] Merge pull request #2 from rxin/packagename
    e58304a [Reynold Xin] Merge pull request #1 from rxin/gitignore
    3f9fee1 [Michael Armbrust] rewrite push filter through join optimization.
    c6527f5 [Reynold Xin] Moved the test src files into the catalyst directory.
    c9777d8 [Reynold Xin] Put all source files in a catalyst directory.
    019ea74 [Reynold Xin] Updated .gitignore to include IntelliJ files.
    80ca4be [Timothy Chen] Address comments
    0079392 [Michael Armbrust] support for multiple insert commands in a single query
    75b5a01 [Michael Armbrust] remove space.
    4283400 [Timothy Chen] Add limited predicate push down
    e547e50 [Michael Armbrust] implement First.
    e77c9b6 [Michael Armbrust] more work on unique join.
    c795e06 [Michael Armbrust] improve star expansion
    a26494e [Michael Armbrust] allow aliases to have qualifiers
    d078333 [Michael Armbrust] remove extra space
    a75c023 [Michael Armbrust] implement Coalesce
    3a018b6 [Michael Armbrust] fix up docs.
    ab6f67d [Michael Armbrust] import the string "null" as actual null.
    5377c04 [Michael Armbrust] don't call dataType until checking if children are resolved.
    191ce3e [Michael Armbrust] analyze rewrite test query.
    60b1526 [Michael Armbrust] don't call dataType until checking if children are resolved.
    2ab5a32 [Michael Armbrust] stop using uberjar as it has its own set of issues.
    e42f75a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into HEAD
    c086a35 [Michael Armbrust] docs, spacing
    c4060e4 [Michael Armbrust] cleanup
    3b85462 [Michael Armbrust] more tests passing
    bcfc8c5 [Michael Armbrust] start supporting partition attributes when inserting data.
    c944a95 [Michael Armbrust] First aggregate expression.
    1e28311 [Michael Armbrust] make tests execute in alpha order again
    a287481 [Michael Armbrust] spelling
    8492548 [Michael Armbrust] beginning of UNIQUEJOIN parsing.
    a6ab6c7 [Michael Armbrust] add !=
    4529594 [Michael Armbrust] draft of coalesce
    70f253f [Michael Armbrust] more tests passing!
    7349e7b [Michael Armbrust] initial support for test thrift table
    d3c9305 [Michael Armbrust] fix > 100 char line
    93b64b0 [Michael Armbrust] load test tables that are args to "DESCRIBE"
    06b2aba [Michael Armbrust] don't be case sensitive when fixing load paths
    6355d0e [Michael Armbrust] match actual return type of count with expected
    cda43ab [Michael Armbrust] don't throw an exception when one of the join tables is empty.
    fd4b096 [Michael Armbrust] fix casing of null strings as well.
    4632695 [Michael Armbrust] support for metastore bigint
    67b88cf [Michael Armbrust] more verbose debugging of evaluation return types
    c680e0d [Michael Armbrust] Failed string => number conversion should return null.
    2326be1 [Michael Armbrust] make getClauses case insensitive.
    dac2786 [Michael Armbrust] correctly handle null values when going from string to numeric types.
    045ac4b [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    fb5ddfd [Michael Armbrust] move ViewExamples to examples/
    83833e8 [Michael Armbrust] more tests passing!
    47c98d6 [Michael Armbrust] add query tests for like and hash.
    1724c16 [Michael Armbrust] clear lines that contain last updated times.
    cfd6bbc [Michael Armbrust] Quick skipping of tests that we can't even parse.
    9b2642b [Michael Armbrust] make the blacklist support regexes
    1d50af6 [Michael Armbrust] more datatypes, fix nonserializable instance variables in udfs
    910e33e [Michael Armbrust] basic support for building an assembly jar.
    d55bb52 [Michael Armbrust] add local warehouse/metastore to gitignore.
    495d9dc [Michael Armbrust] Add an expression for when we decide to support LIKE natively instead of using the HIVE udf.
    65f4e69 [Michael Armbrust] remove incorrect comments
    0831a3c [Michael Armbrust] support for parsing some operator udfs.
    6c27aa7 [Michael Armbrust] more cast parsing.
    43db061 [Michael Armbrust] significant generalization of hive udf functionality.
    3fe24ec [Michael Armbrust] better implementation of 3vl in Evaluate, fix some > 100 char lines.
    e5690a6 [Michael Armbrust] add BinaryType
    adab892 [Michael Armbrust] Clear out functions that are created during tests when reset is called.
    d408021 [Michael Armbrust] support for printing out arrays in the output in the same form as hive (e.g., [e1, e1]).
    8d5f504 [Michael Armbrust] Example of schema RDD using scala's dynamic trait, resulting in a more standard ORM style of usage.
    21f0d91 [Michael Armbrust] Simple example of schemaRdd with scala filter function.
    0daaa0e [Michael Armbrust] Promote booleans that appear in comparisons.
    2b70abf [Michael Armbrust] true and false literals.
    ef8b0a5 [Michael Armbrust] more tests.
    14d070f [Michael Armbrust] add support for correctly extracting partition keys.
    0afbe73 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    69a0bd4 [Michael Armbrust] promote strings in predicates with number too.
    3946e31 [Michael Armbrust] don't build strings unless assertion fails.
    90c453d [Michael Armbrust] more tests passing!
    6e6417a [Michael Armbrust] correct handling of nulls in boolean logic and sorting.
    8000504 [Michael Armbrust] Improve type coercion.
    9087152 [Michael Armbrust] fix toString of Not.
    58b111c [Michael Armbrust] fix bad scaladoc tag.
    d5c05c6 [Michael Armbrust] For now, ignore the big data benchmark tests when the data isn't there.
    ac6376d [Michael Armbrust] Split out general shark query execution driver from test harness.
    1d0ae1e [Michael Armbrust] Switch from IndexedSeq[Any] to Row interface that will allow us unboxed access to primitive types.
    d873b2b [Yin Huai] Remove numbers associated with test cases.
    8545675 [Yin Huai] Merge remote-tracking branch 'upstream/master' into evalauteLiteralsInExpressions
    b34a9eb [Michael Armbrust] Merge branch 'master' into filterPushDown
    d1e7b8e [Michael Armbrust] Update README.md
    c8b1553 [Michael Armbrust] Update README.md
    9307ef9 [Michael Armbrust] update list of passing tests.
    934c18c [Michael Armbrust] Filter out non-deterministic lines when comparing test answers.
    a045c9c [Michael Armbrust] SparkAggregate doesn't actually support sum right now.
    ae0024a [Yin Huai] update
    cf80545 [Yin Huai] Merge remote-tracking branch 'origin/evalauteLiteralsInExpressions' into evalauteLiteralsInExpressions
    21976ae [Yin Huai] update
    b4999fe [Yin Huai] Merge remote-tracking branch 'upstream/filterPushDown' into evalauteLiteralsInExpressions
    dedbf0c [Yin Huai] support Boolean literals
    eaac9e2 [Yin Huai] explain the limitation of the current EvaluateLiterals
    37817b5 [Yin Huai] add a comment to EvaluateLiterals.
    468667f [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is (see the sketch after this commit list).
    b1d1843 [Michael Armbrust] more work on big data benchmark tests.
    cc9a957 [Michael Armbrust] support for creating test tables outside of TestShark
    7d7fa9f [Michael Armbrust] support for create table as
    5f54f03 [Michael Armbrust] parsing for ASC
    d42b725 [Michael Armbrust] Sum of strings requires cast
    34b30fa [Michael Armbrust] not all attributes need to be bound (e.g. output attributes that are contained in non-leaf operators.)
    81659cb [Michael Armbrust] implement transform operator.
    5cd76d6 [Michael Armbrust] break up the file based test case code for reuse
    1031b65 [Michael Armbrust] support for case insensitive resolution.
    320df04 [Michael Armbrust] add snapshot repo for databricks (has shark/spark snapshots)
    b6f083e [Michael Armbrust] support for publishing scala doc to github from sbt
    d9d18b4 [Michael Armbrust] debug logging implicit.
    669089c [Yin Huai] support Boolean literals
    ef3321e [Yin Huai] explain the limitation of the current EvaluateLiterals
    73a05fd [Yin Huai] add a comment to EvaluateLiterals.
    191eb7d [Yin Huai] First draft of literal evaluation in the optimization phase. TreeNode has been extended to support transform in the post order. So, for an expression, we can evaluate literal from the leaf nodes of this expression tree. For an attribute reference in the expression node, we just leave it as is.
    80039cc [Yin Huai] Merge pull request #1 from yhuai/master
    cbe1ca1 [Yin Huai] add explicit result type to the overloaded sideBySide
    5c518e4 [Michael Armbrust] fix bug in test.
    b50dd0e [Michael Armbrust] fix return type of overloaded method
    05679b7 [Michael Armbrust] download assembly jar for easy compiling during interview.
    8c60cc0 [Michael Armbrust] Update README.md
    03b9526 [Michael Armbrust] First draft of optimizer tests.
    f392755 [Michael Armbrust] Add flatMap to TreeNode
    6cbe8d1 [Michael Armbrust] fix bug in side by side, add support for working with unsplit strings
    15a53fc [Michael Armbrust] more generic sum calculation and better binding of grouping expressions.
    06749d0 [Michael Armbrust] add expression enumerations for query plan operators and recursive version of transform expression.
    4b0a888 [Michael Armbrust] implement string comparison and more casts.
    356b321 [Michael Armbrust] Update README.md
    3776395 [Michael Armbrust] Update README.md
    304d17d [Michael Armbrust] Create README.md
    b7d8be0 [Michael Armbrust] more tests passing.
    b82481f [Michael Armbrust] add todo comment.
    02e6dee [Michael Armbrust] add another test that breaks the harness to the blacklist.
    cc5efe3 [Michael Armbrust] First draft of broadcast nested loop join with full outer support.
    c43a259 [Michael Armbrust] comments
    15ff448 [Michael Armbrust] better error message when a dsl test throws an exception
    76ec650 [Michael Armbrust] fix join conditions
    e10df99 [Michael Armbrust] Create new expr ids for local relations that exist more than once in a query plan.
    91573a4 [Michael Armbrust] initial type promotion
    e2ef4a5 [Michael Armbrust] logging
    e43dc1e [Michael Armbrust] add string => int cast evaluation
    f1f7e96 [Michael Armbrust] fix incorrect generation of join keys
    2b27230 [Michael Armbrust] add depth based subtree access
    0f6279f [Michael Armbrust] broken tests.
    389bc0b [Michael Armbrust] support for partitioned columns in output.
    12584f4 [Michael Armbrust] better errors for missing clauses. support for matching multiple clauses with the same name.
    b67a225 [Michael Armbrust] better errors when types don't match up.
    9e74808 [Michael Armbrust] add children resolved.
    6d03ce9 [Michael Armbrust] defaults for unresolved relation
    2469b00 [Michael Armbrust] skip nodes with unresolved children when doing coercions
    be5ae2c [Michael Armbrust] better resolution logging
    cb7b5af [Michael Armbrust] views example
    420e05b [Michael Armbrust] more tests passing!
    6916c63 [Michael Armbrust] Reading from partitioned hive tables.
    a1245f9 [Michael Armbrust] more tests passing
    956e760 [Michael Armbrust] extended explain
    5f14c35 [Michael Armbrust] more test tables supported
    175c43e [Michael Armbrust] better errors for parse exceptions
    480ade5 [Michael Armbrust] don't use partial cached results.
    8a9d21c [Michael Armbrust] fix evaluation
    7aee69c [Michael Armbrust] parsing for joins, boolean logic
    7fcf480 [Michael Armbrust] test for and logic
    3ea9b00 [Michael Armbrust] don't use simpleString if there are no new lines.
    6902490 [Michael Armbrust] fix boolean logic evaluation
    4d5eba7 [Michael Armbrust] add more dsl for expression arithmetic and boolean logic
    8b2a2ee [Michael Armbrust] more tests passing!
    ad1f3b4 [Michael Armbrust] toString for null literals
    a5c0a1b [Michael Armbrust] more test harness improvements: * regex whitelist * side by side answer comparison (still needs formatting work)
    60ec19d [Michael Armbrust] initial support for udfs
    c45b440 [Michael Armbrust] support for is (not) null and boolean logic
    7f4a1dc [Michael Armbrust] add NoRelation logical operator
    72e183b [Michael Armbrust] support for null values in tree node args.
    ad596d2 [Michael Armbrust] add sc to Union's otherCopyArgs
    e5c9d1a [Michael Armbrust] use nonEmpty
    dcc4fe1 [Michael Armbrust] support for src1 test table.
    c78b587 [Michael Armbrust] casting.
    75c3f3f [Michael Armbrust] add support for logging with scalalogging.
    da2c011 [Michael Armbrust] make it more obvious when results are being truncated.
    96b73ba [Michael Armbrust] more docs in TestShark
    18524fd [Michael Armbrust] add method to SharkSqlQuery for directly executing the same query on hive.
    e6d063b [Michael Armbrust] more join tests.
    664c1c3 [Michael Armbrust] make parsing of function names case insensitive.
    0967d4e [Michael Armbrust] fix hardcoded path to hiveDevHome.
    1a6db68 [Michael Armbrust] spelling
    7638cb4 [Michael Armbrust] simple join execution with dsl tests.  no hive tests yet.
    859d4c9 [Michael Armbrust] better argString printing of nested trees.
    fc53615 [Michael Armbrust] add same instance comparisons for tree nodes.
    a026e6b [Michael Armbrust] move out hive specific operators
    fff4d1c [Michael Armbrust] add simple query execution debugging
    e2120ab [Michael Armbrust] sorting for strings
    da06eb6 [Michael Armbrust] Parsing for sortby and joins
    9eb5c5e [Michael Armbrust] override equality in Attribute references to compare exprId.
    8eb2460 [Michael Armbrust] add system property to override whitelist.
    88124bb [Michael Armbrust] make strategy evaluation lazy.
    74a3a21 [Michael Armbrust] implement outputSet
    d25b171 [Michael Armbrust] Add AND and OR expressions
    67f0a4a [Michael Armbrust] dsl improvements: string to attribute, subquery, unionAll
    12acf0a [Michael Armbrust] add .DS_Store for macs
    f7da6ce [Michael Armbrust] add agg with grouping expr in select test
    36805b3 [Michael Armbrust] pull out and improve aggregation
    75613e1 [Michael Armbrust] better evaluations failure messages.
    4789a35 [Michael Armbrust] weaken type since it's hard to create pure references.
    e89dd36 [Michael Armbrust] no newline for online trees
    d0590d4 [Michael Armbrust] include stack trace for catalyst failures.
    081c0d9 [Michael Armbrust] more generic computation of agg functions.
    31af3a0 [Michael Armbrust] fail when clauses are unhandled in the parser
    ecd45b2 [Michael Armbrust] Add more passing tests.
    97d5419 [Michael Armbrust] fix alignment.
    565cc13 [Michael Armbrust] make the canary query optional.
    a95e65c [Michael Armbrust] support for resolving qualified attribute references.
    e1dfa0c [Michael Armbrust] better error reporting for comparison tests when hive works but catalyst fails.
    4640a0b [Michael Armbrust] handle test tables when database is specified.
    bef12e3 [Michael Armbrust] Add Subquery node and trivial optimizer to remove it after analysis.
    fec5158 [Michael Armbrust] add hive / idea files to .gitignore
    3f97ffe [Michael Armbrust] Rename Hive => HiveQl
    656b836 [Michael Armbrust] Support for parsing select clause aliases.
    3ca7414 [Michael Armbrust] StopAfter needs otherCopyArgs.
    3ffde66 [Michael Armbrust] When the child of an alias is unresolved it should return an unresolved attribute instead of throwing an exception.
    8cbef8a [Michael Armbrust] spelling
    aa8c37c [Michael Armbrust] Better toString for SortOrder
    1bb8b45 [Michael Armbrust] fix error message for UnresolvedExceptions
    a2e0327 [Michael Armbrust] add a bunch of tests.
    4a3e1ea [Michael Armbrust] docs and use shark for data loading.
    339bb8f [Michael Armbrust] better docs, Not support
    1d7b2d9 [Michael Armbrust] Add NaN conversions.
    46a2534 [Michael Armbrust] only run canary query on failure.
    8996066 [Michael Armbrust] remove protected from makeCopy
    53bcf41 [Michael Armbrust] testing improvements: * reset hive vars * delete indexes and tables * delete database * reset to use default database * record tests that pass
    04a372a [Michael Armbrust] add a flag for running all tests.
    3b2235b [Michael Armbrust] More general implementation of arithmetic.
    edd7795 [Michael Armbrust] More testing improvements: * Check that results match for native commands * Ensure explain commands can be planned * Cache hive "golden" results
    da6c577 [Michael Armbrust] add string <==> file utility functions.
    3adf5ca [Michael Armbrust] Initial support for groupBy and count.
    7bcd8a4 [Michael Armbrust] Improvements to comparison tests: * Sort answer when query doesn't contain an order by. * Display null values the same as Hive. * Print full query results in easy to read format when they differ.
    a52e7c9 [Michael Armbrust] Transform children that are present in sequences of the product.
    d66ba7e [Michael Armbrust] drop printlns.
    88f2efd [Michael Armbrust] Add sum / count distinct expressions.
    05adedc [Michael Armbrust] rewrite relative paths when loading data in TestShark
    07784b3 [Michael Armbrust] add support for rewriting paths and running 'set' commands.
    b8a9910 [Michael Armbrust] quote tests passing.
    8e5e267 [Michael Armbrust] handle aliased select expressions.
    4286a96 [Michael Armbrust] drop debugging println
    ac34aeb [Michael Armbrust] proof of concept for hive ast transformations.
    2238b00 [Michael Armbrust] better error when a makeCopy function fails due to incorrect arguments
    ff1eab8 [Michael Armbrust] start trying to make insert into hive table more general.
    74a6337 [Michael Armbrust] use fastEquals when doing transformations.
    1184a23 [Michael Armbrust] add native test for escapes.
    b972b18 [Michael Armbrust] create BaseRelation class
    fa6bce9 [Michael Armbrust] implement union
    6391a87 [Michael Armbrust] count aggregate.
    d47c317 [Michael Armbrust] add unary minus, more tests passing.
    c7114e4 [Michael Armbrust] first draft of star expansion.
    044c43d [Michael Armbrust] better support for numeric literal parsing.
    1d0f072 [Michael Armbrust] use native drop table as it doesn't appear to fail when the "table" is actually a view.
    61503c5 [Michael Armbrust] add cached toRdd
    2036883 [Michael Armbrust] skip explain queries when testing.
    ebac4b1 [Michael Armbrust] fix bug in sort reference calculation
    ca0dee0 [Michael Armbrust] docs.
    1ee0471 [Michael Armbrust] string literal parsing.
    357278b [Michael Armbrust] add limit support
    9b3e479 [Michael Armbrust] creation of string literals.
    02efa30 [Michael Armbrust] alias evaluation
    cb68b33 [Michael Armbrust] parsing for random sample in hive ql.
    126dd36 [Michael Armbrust] include query plans in failure output
    bb59ae9 [Michael Armbrust] doc fixes
    7e68286 [Michael Armbrust] fix confusing naming
    768bb25 [Michael Armbrust] handle errors in shark query toString
    829c3ce [Michael Armbrust] Auto loading of test data on demand. Add reset method to test shark.  Make test shark a singleton to avoid weirdness with the hive metastore.
    ad02e41 [Michael Armbrust] comment jdo dependency
    7bc89fe [Michael Armbrust] add collect to TreeNode.
    438cf74 [Michael Armbrust] create explicit treeString function in addition to toString override. docs.
    09679ee [Michael Armbrust] fix bug in TreeNode foreach
    2930b27 [Michael Armbrust] more specific name for del query tests.
    8842549 [Michael Armbrust] docs.
    da81f81 [Michael Armbrust] Implementation and tests for simple AVG query in Hive SQL.
    a8969b9 [Michael Armbrust] Factor out hive query comparison test framework.
    1a7efb0 [Michael Armbrust] specialize spark aggregate for global aggregations.
    a36dd9a [Michael Armbrust] evaluation of > for other data types.
    cae729b [Michael Armbrust] remove unnecessary lazy vals.
    d8e12af [Michael Armbrust] docs
    3a60d67 [Michael Armbrust] implement average, placeholder for count
    f05c106 [Michael Armbrust] checkAnswer handles single row results.
    2730534 [Michael Armbrust] implement inputSet
    a9aa79d [Michael Armbrust] debugging for sort exec
    8bec3c9 [Michael Armbrust] better tree makeCopy when there are two constructors.
    554b4b2 [Michael Armbrust] BoundAttribute pretty printing.
    754f5fa [Michael Armbrust] dsl for setting nullability
    a206d7a [Michael Armbrust] clean up query tests.
    84ad6ef [Michael Armbrust] better sort implementation and tests.
    de24923 [Michael Armbrust] add double type.
    9611a2c [Michael Armbrust] literal creation for doubles.
    7358313 [Michael Armbrust] sort order returns child type.
    b544715 [Michael Armbrust] implement eval for rand, and > for doubles
    7013bad [Michael Armbrust] asc, desc should work for expressions and unresolved attributes (symbols)
    1c1a35e [Michael Armbrust] add simple Rand expression.
    3ca51de [Michael Armbrust] add orderBy to dsl
    7ae41ab [Michael Armbrust] more literal implicit conversions
    b18b675 [Michael Armbrust] First cut at native query tests for shark.
    d392e29 [Michael Armbrust] add toRdd implicit conversion for logical plans in TestShark.
    5eac895 [Michael Armbrust] better error when descending is specified.
    2b16f86 [Michael Armbrust] add todo
    e527bb8 [Michael Armbrust] remove arguments to binary predicate constructor as they seem to break serialization
    9dde3c8 [Michael Armbrust] add project and filter operations.
    ad9037b [Michael Armbrust] Add support for local relations.
    6227143 [Michael Armbrust] evaluation of Equals.
    7526290 [Michael Armbrust] BoundReference should also be an Attribute.
    bd33e26 [Michael Armbrust] more documentation
    5de0ea3 [Michael Armbrust] Move all shark specific into a separate package.  Lots of documentation improvements.
    0ae292b [Michael Armbrust] implement calculation of sort expressions.
    9fd5011 [Michael Armbrust] First cut at expression evaluation.
    6259e3a [Michael Armbrust] cleanup
    787e5a2 [Michael Armbrust] use fastEquals
    f90da36 [Michael Armbrust] better printing of optimization exceptions
    b05dd67 [Michael Armbrust] Application of rules to fixed point.
    bb2e0db [Michael Armbrust] pretty print for literals.
    1ec3287 [Michael Armbrust] Add extractor for IntegerLiterals.
    d3a3687 [Michael Armbrust] add fastEquals
    2b4935b [Michael Armbrust] set sbt.version explicitly
    46dfd7f [Michael Armbrust] first cut at checking answer for HiveCompatability tests.
    c79f2fd [Michael Armbrust] insert operator should return an empty rdd.
    14c22ec [Michael Armbrust] implement sorting when the sort expression is the first attribute of the input.
    ae7b4c3 [Michael Armbrust] remove implicit dependencies.  now compiles without copying things into lib/ manually.
    84082f9 [Michael Armbrust] add sbt binaries and scripts
    15371a8 [Michael Armbrust] First draft of simple Hive DDL parser.
    063bf44 [Michael Armbrust] Periods should end all comments.
    e1f7f4c [Michael Armbrust] Remove "NativePlaceholder" hack.
    ed3633e [Michael Armbrust] start consolidating Hive/Shark specific code. first hive compatibility test case passing!
    b34a770 [Michael Armbrust] Add data sink strategy, make strategy application a little more robust.
    e7174ec [Michael Armbrust] fix schema, add docs, make helper method protected.
    26f410a [Michael Armbrust] physical traits should extend PhysicalPlan.
    dc72469 [Michael Armbrust] beginning of hive compatibility testing framework.
    0763490 [Michael Armbrust] support for hive native command pass-through.
    d8a924f [Michael Armbrust] scaladoc
    29a7163 [Michael Armbrust] Insert into hive table physical operator.
    633cebc [Michael Armbrust] better error message when there is no appropriate planning strategy.
    59ac444 [Michael Armbrust] add unary expression
    3aa1b28 [Michael Armbrust] support for table names in the form 'database.tableName'
    665f7d0 [Michael Armbrust] add logical nodes for hive data sinks.
    64d2923 [Michael Armbrust] Add classes for representing sorts.
    f72b7ce [Michael Armbrust] first trivial end to end query execution.
    5c7d244 [Michael Armbrust] first draft of references implementation.
    7bff274 [Michael Armbrust] point at new shark.
    c7cd57f [Michael Armbrust] docs for util function.
    910811c [Michael Armbrust] check each item of the sequence
    ef21a0b [Michael Armbrust] line up comments.
    4b765d5 [Michael Armbrust] docs, drop println
    6f9bafd [Michael Armbrust] empty output for unresolved relation to avoid exception in resolution.
    a703c49 [Michael Armbrust] this order works better until fixed point is implemented.
    ec1d7c0 [Michael Armbrust] Simple attribute resolution.
    069df02 [Michael Armbrust] parsing binary predicates
    a1cf754 [Michael Armbrust] add joins and equality.
    3f5bc98 [Michael Armbrust] add optiq to sbt.
    54f3460 [Michael Armbrust] initial optiq parsing.
    d9161ce [Michael Armbrust] add join operator
    1e423eb [Michael Armbrust] placeholders in LogicalPlan, docs
    24ef6fb [Michael Armbrust] toString for alias.
    ae7d776 [Michael Armbrust] add nullability changing function
    d49dc02 [Michael Armbrust] scaladoc for named exprs
    7c45dd7 [Michael Armbrust] pretty printing of trees.
    78e34bf [Michael Armbrust] simple git ignore.
    7ba19be [Michael Armbrust] First draft of interface to hive metastore.
    7e7acf0 [Michael Armbrust] physical placeholder.
    1c11136 [Michael Armbrust] first draft of error handling / plans for debugging.
    3766a41 [Michael Armbrust] rearrange utility functions.
    7fb3d5e [Michael Armbrust] docs and equality improvements.
    45da47b [Michael Armbrust] flesh out plans and expressions a little. first cut at named expressions.
    002d4d4 [Michael Armbrust] default to no alias.
    be25003 [Michael Armbrust] add repl initialization to sbt.
    0608a00 [Michael Armbrust] tighten public interface
    a1a8b38 [Michael Armbrust] test that ids don't change for no-op transforms.
    daa71ca [Michael Armbrust] foreach, maps, and scaladoc
    6a158cb [Michael Armbrust] simple transform working.
    db0299f [Michael Armbrust] basic analysis of relations minus transform function.
    f74c4ee [Michael Armbrust] parsing a simple query.
    08e4f57 [Michael Armbrust] upgrade scala, include shark.
    d3c6404 [Michael Armbrust] initial commit
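
    The evalauteLiteralsInExpressions commits above describe evaluating literal subexpressions during optimization via a post-order transform over the tree. A minimal, self-contained sketch of that idea follows; the Expr, Literal, Attribute, and Add classes are simplified stand-ins for Catalyst's, not the actual implementation:

    ```scala
    // Toy stand-ins for Catalyst's expression tree (simplified for illustration).
    sealed trait Expr {
      // Post-order ("transform up"): rewrite children first, then this node.
      def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
        val withNewChildren = this match {
          case Add(l, r) => Add(l.transformUp(rule), r.transformUp(rule))
          case leaf      => leaf
        }
        rule.applyOrElse(withNewChildren, identity[Expr])
      }
    }
    case class Literal(value: Int) extends Expr
    case class Attribute(name: String) extends Expr // left as-is by the rule
    case class Add(left: Expr, right: Expr) extends Expr

    object EvaluateLiterals {
      // Fold a subtree only when all of its children are already literals;
      // anything containing an attribute reference is left untouched.
      def apply(e: Expr): Expr = e.transformUp {
        case Add(Literal(a), Literal(b)) => Literal(a + b)
      }
    }

    // EvaluateLiterals(Add(Add(Literal(1), Literal(2)), Attribute("x")))
    // => Add(Literal(3), Attribute("x"))
    ```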
    marmbrus authored and rxin committed Mar 21, 2014
    9aadcff
  2. Fix maven jenkins: Add explicit init for required tables in SQLQuerySuite
    
    Sorry! I added this test at the last minute and failed to run it in maven as well.
    
    Note that this will probably not be sufficient to actually fix the maven jenkins build, as that does not use the dev/run-tests scripts.  We will need to configure it to also run dev/download-hive-tests.sh.  The other option would be to check in the tests as I suggested in the original PR. (I can do this if we agree it's the right thing to do.)
    
    Long term it would probably be a good idea to also have maven run some sort of test env setup script so that we can decouple the test environment from the jenkins configuration.
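
    As a rough illustration of the "explicit init" idea (a sketch, not the actual patch), a suite can create the tables it needs in beforeAll so it no longer depends on an external setup script; the createTestTable helper and table name below are hypothetical:

    ```scala
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    // Sketch only: explicitly initialize required test tables inside the suite
    // so any build (sbt or maven) that runs it gets a consistent environment.
    class SQLQuerySuiteSketch extends FunSuite with BeforeAndAfterAll {
      override def beforeAll(): Unit = {
        createTestTable("testData") // hypothetical helper, not Spark's API
      }

      // Stand-in for whatever mechanism the real suite uses to load a table.
      private def createTestTable(name: String): Unit =
        println(s"initializing table: $name")

      test("query against explicitly initialized table") {
        assert(true) // placeholder assertion
      }
    }
    ```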
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#191 from marmbrus/fixMaven and squashes the following commits:
    
    3366e37 [Michael Armbrust] Fix maven jenkins: Add explicit init for required tables in SQLQuerySuite
    marmbrus authored and rxin committed Mar 21, 2014
    Commit: e09139d
  3. Add hive test files to repository. Remove download script.

    This PR removes our test dependence on files hosted at Berkeley by checking the test queries and answers into the repository.  This should also fix the maven Jenkins build.
    
    I realize this is a *giant* commit.  But size-wise it's actually pretty small.  We are only looking at ~1.2Mb compressed (~30Mb uncompressed).  Given that we already have a ~80Mb file permanently added to the spark code lineage, I do not think that this will change the developer experience significantly.
    
    Furthermore, I think it is good engineering practice to consider such test support files as "code", since changes to them would indicate a change in functionality.  These files were only excluded from the initial PR as I wanted the diff to be readable.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#199 from marmbrus/hiveTestFiles and squashes the following commits:
    
    b9b9b17 [Michael Armbrust] Add hive test files to repository.  Remove download script.
    marmbrus authored and pwendell committed Mar 21, 2014
    Commit: 7e17fe6
  4. SPARK-1279: Fix improper use of SimpleDateFormat

    `SimpleDateFormat` is not thread-safe. Some places use the same SimpleDateFormat object across multiple threads without any safeguard. This can cause the Web UI to display incorrect dates.
    
    This PR creates a new `SimpleDateFormat` every time one is necessary. Another solution is using `ThreadLocal` to store a `SimpleDateFormat` in each thread. If this PR impacts the performance, I can change to the latter one.
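
    For illustration, here is a minimal Scala sketch of the `ThreadLocal` alternative mentioned above (the names are illustrative, not this PR's code):

    ```scala
    import java.text.SimpleDateFormat
    import java.util.Date

    // Each thread lazily gets its own SimpleDateFormat, so no two threads
    // ever share (and corrupt) the same formatter instance.
    object DateFormats {
      private val format = new ThreadLocal[SimpleDateFormat] {
        override protected def initialValue(): SimpleDateFormat =
          new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
      }

      def formatDate(date: Date): String = format.get().format(date)
    }
    ```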
    
    Author: zsxwing <zsxwing@gmail.com>
    
    Closes apache#179 from zsxwing/SPARK-1278 and squashes the following commits:
    
    21fabd3 [zsxwing] SPARK-1278: Fix improper use of SimpleDateFormat
    zsxwing authored and pwendell committed Mar 21, 2014
    Commit: 2c0aa22
  5. Make SQL keywords case-insensitive

    This is a bit of a hack that allows all variations of a keyword, but it still seems to produce valid error messages and such.
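
    As one minimal sketch of the idea (illustrative only, not the actual Spark SQL parser): match each keyword with a case-insensitive regex and normalize the result so the rest of the grammar sees a single form.

    ```scala
    import java.util.regex.Pattern

    import scala.util.parsing.combinator.RegexParsers

    // Accept any casing of a keyword, then normalize to uppercase.
    object CaseInsensitiveKeywords extends RegexParsers {
      def keyword(kw: String): Parser[String] =
        ("(?i)" + Pattern.quote(kw)).r ^^ (_.toUpperCase)

      def select: Parser[String] = keyword("SELECT")
    }

    // e.g. parse(CaseInsensitiveKeywords.select, "sElEcT") succeeds with "SELECT"
    ```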
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes apache#193 from mateiz/case-insensitive-sql and squashes the following commits:
    
    0ee4ace [Matei Zaharia] Removed unnecessary `+ ""`
    e3ed773 [Matei Zaharia] Make SQL keywords case-insensitive
    mateiz authored and rxin committed Mar 21, 2014
    Commit: dab5439
  6. Add asCode function for dumping raw tree representations.

     Intended only for use by Catalyst developers.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#200 from marmbrus/asCode and squashes the following commits:
    
    7e8c1d9 [Michael Armbrust] Add asCode function for dumping raw tree representations.  Intended only for use by Catalyst developers.
    marmbrus authored and rxin committed Mar 21, 2014
    Commit: d780983

Commits on Mar 22, 2014

  1. Fix to Stage UI to display numbers on progress bar

    Fixes an issue on Stage UI to display numbers on progress bar which are today hidden behind the progress bar div. Please refer to the attached images to see the issue.
    ![screen shot 2014-03-21 at 4 48 46 pm](https://f.cloud.github.com/assets/563652/2489083/8c127e80-b153-11e3-807c-048ebd45104b.png)
    ![screen shot 2014-03-21 at 4 49 00 pm](https://f.cloud.github.com/assets/563652/2489084/8c12cf5c-b153-11e3-8747-9d93ff6fceb4.png)
    
    Author: Emtiaz Ahmed <emtiazahmed@gmail.com>
    
    Closes apache#201 from emtiazahmed/master and squashes the following commits:
    
    a7964fe [Emtiaz Ahmed] Fix to Stage UI to display numbers on progress bar
    emtiazahmed authored and aarondav committed Mar 22, 2014
    Commit: 646e554

Commits on Mar 23, 2014

  1. SPARK-1254. Supplemental fix for HTTPS on Maven Central

    It seems that HTTPS does not necessarily work on Maven Central, as it does not today at least. Back to HTTP. Both builds work from a clean repo.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#209 from srowen/SPARK-1254Fix and squashes the following commits:
    
    bb7be47 [Sean Owen] Revert to HTTP for Maven Central repo, as it seems HTTPS does not necessarily work
    srowen authored and pwendell committed Mar 23, 2014
    Commit: abf6714
  2. [SPARK-1292] In-memory columnar representation for Spark SQL

    This PR is rebased from the Catalyst repository, and contains the first version of in-memory columnar representation for Spark SQL. Compression support is not included yet and will be added later in a separate PR.
    
    Author: Cheng Lian <lian@databricks.com>
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes apache#205 from liancheng/memColumnarSupport and squashes the following commits:
    
    99dba41 [Cheng Lian] Restricted new objects/classes to `private[sql]'
    0892ad8 [Cheng Lian] Addressed ScalaStyle issues
    af1ad5e [Cheng Lian] Fixed some minor issues introduced during rebasing
    0dbf2fb [Cheng Lian] Make necessary renaming due to rebase
    a162d4d [Cheng Lian] Removed the unnecessary InMemoryColumnarRelation class
    9bcae4b [Cheng Lian] Added Apache license
    220ee1e [Cheng Lian] Added table scan operator for in-memory columnar support.
    c701c7a [Cheng Lian] Using SparkSqlSerializer for generic object SerDe causes error, made a workaround
    ed8608e [Cheng Lian] Added implicit conversion from DataType to ColumnType
    b8a645a [Cheng Lian] Replaced KryoSerializer with an updated SparkSqlSerializer
    b6c0a49 [Cheng Lian] Minor test suite refactoring
    214be73 [Cheng Lian] Refactored BINARY and GENERIC to reduce duplicate code
    da2f4d5 [Cheng Lian] Added Apache license
    dbf7a38 [Cheng Lian] Added ColumnAccessor and test suite, refactored ColumnBuilder
    c01a177 [Cheng Lian] Added column builder classes and test suite
    f18ddc6 [Cheng Lian] Added ColumnTypes and test suite
    2d09066 [Cheng Lian] Added KryoSerializer
    34f3c19 [Cheng Lian] Added TypeTag field to all NativeTypes
    acc5c48 [Cheng Lian] Added Hive test files to .gitignore
    liancheng authored and pwendell committed Mar 23, 2014
    Commit: 57a4379
  3. Fixed coding style issues in Spark SQL

    This PR addresses various coding style issues in Spark SQL, including but not limited to those mentioned by @mateiz in PR apache#146.
    
    As this PR affects lots of source files and may cause potential conflicts, it would be better to merge this as soon as possible *after* PR apache#205 (In-memory columnar representation for Spark SQL) is merged.
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes apache#208 from liancheng/fixCodingStyle and squashes the following commits:
    
    fc2b528 [Cheng Lian] Merge branch 'master' into fixCodingStyle
    b531273 [Cheng Lian] Fixed coding style issues in sql/hive
    0b56f77 [Cheng Lian] Fixed coding style issues in sql/core
    fae7b02 [Cheng Lian] Addressed styling issues mentioned by @marmbrus
    9265366 [Cheng Lian] Fixed coding style issues in sql/core
    3dcbbbd [Cheng Lian] Fixed relative package imports for package catalyst
    liancheng authored and pwendell committed Mar 23, 2014
    Commit: 8265dc7

Commits on Mar 24, 2014

  1. [SPARK-1212] Adding sparse data support and update KMeans

    Continue our discussions from https://github.com/apache/incubator-spark/pull/575
    
    This PR is WIP because it depends on a SNAPSHOT version of breeze.
    
    Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.
    
    @fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.
    
    I'm going to update this PR to include:
    
    1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time. (See the sketch after this list.)
    
    2. Some numbers about the performance.
    
    3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!
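
    As a sketch of the fast distance computation in item 1 (illustrative, not MLlib's code): with squared norms precomputed, `||a - b||^2` expands to `||a||^2 + ||b||^2 - 2 a^T b`, so each pair costs only one dot product.

    ```scala
    import breeze.linalg.{norm, DenseVector}

    def fastSquaredDistance(
        a: DenseVector[Double], aNormSq: Double,
        b: DenseVector[Double], bNormSq: Double): Double = {
      val d = aNormSq + bNormSq - 2.0 * (a dot b)
      math.max(d, 0.0) // rounding can push the result slightly below zero
    }

    // Precompute each vector's squared norm once, then reuse it for every pair.
    val a = DenseVector(1.0, 2.0, 3.0)
    val b = DenseVector(4.0, 5.0, 6.0)
    val distSq = fastSquaredDistance(a, math.pow(norm(a), 2), b, math.pow(norm(b), 2))
    // distSq == 27.0
    ```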
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#117 from mengxr/sparse-kmeans and squashes the following commits:
    
    67b368d [Xiangrui Meng] fix SparseVector.toArray
    5eda0de [Xiangrui Meng] update NOTICE
    67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
    1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
    9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
    226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
    238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
    b28ba2f [Xiangrui Meng] add toArray to Vector
    e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
    72bde33 [Xiangrui Meng] clean up code for distance computation
    712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
    27858e4 [Xiangrui Meng] update breeze version to 0.7
    07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
    6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
    42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
    d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
    42b4e50 [Xiangrui Meng] line feed at the end
    a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
    3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
    0107e19 [Xiangrui Meng] update NOTICE
    87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
    0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
    f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
    ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
    4e7d5ca [Xiangrui Meng] minor style update
    07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
    mengxr authored and mateiz committed Mar 24, 2014
    Commit: 80c2968
  2. SPARK-1144 Added license and RAT to check licenses.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#125 from ScrapCodes/rat-integration and squashes the following commits:
    
    64f7c7d [Prashant Sharma] added license headers.
    fcf28b1 [Prashant Sharma] Review feedback.
    c0648db [Prashant Sharma] SPARK-1144 Added license and RAT to check licenses.
    ScrapCodes authored and pwendell committed Mar 24, 2014
    Commit: 21109fb
  3. Commit 56db8a2 (no commit message shown)

Commits on Mar 25, 2014

  1. SPARK-1294 Fix resolution of uppercase field names using a HiveContext.

    Fixing this bug required the following:
     - Creation of a new logical node that converts a schema to lowercase.
     - Generalization of the subquery eliding rule to also elide this new node
     - Fixing of several places where too tight assumptions were made on the types of `InsertIntoTable` children.
     - I also removed an API that was left in by accident that exposed catalyst data structures, and fix the logic that pushes down filters into hive tables scans to correctly compare attribute references.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#202 from marmbrus/upperCaseFieldNames and squashes the following commits:
    
    15e5265 [Michael Armbrust] Support for resolving mixed case fields from a reflected schema using HiveQL.
    5aa5035 [Michael Armbrust] Remove API that exposes internal catalyst data structures.
    9d99cb6 [Michael Armbrust] Attributes should be compared using exprId, not TreeNode.id.
    marmbrus authored and pwendell committed Mar 25, 2014
    Commit: 8043b7b
  2. SPARK-1094 Support MiMa for reporting binary compatibility across versions.
    
    This adds some changes on top of the initial work by @ScrapCodes in apache#20:
    
    The goal here is to do automated checking of Spark commits to determine whether they break binary compatibility.
    
    1. Special case for inner classes of package-private objects.
    2. Made tools classes accessible when running `spark-class`.
    3. Made some declared types in MLLib more general.
    4. Various other improvements to exclude-generation script.
    5. In-code documentation.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    Author: Prashant Sharma <prashant.s@imaginea.com>
    Author: Prashant Sharma <scrapcodes@gmail.com>
    
    Closes apache#207 from pwendell/mima and squashes the following commits:
    
    22ae267 [Patrick Wendell] New binary changes after upmerge
    6c2030d [Patrick Wendell] Merge remote-tracking branch 'apache/master' into mima
    3666cf1 [Patrick Wendell] Minor style change
    0e0f570 [Patrick Wendell] Small fix and removing directory listings
    647c547 [Patrick Wendell] Reveiw feedback.
    c39f3b5 [Patrick Wendell] Some enhancements to binary checking.
    4c771e0 [Prashant Sharma] Added a tool to generate mima excludes and also adapted build to pick automatically.
    b551519 [Prashant Sharma] adding a new exclude after rebasing with master
    651844c [Prashant Sharma] Support MiMa for reporting binary compatibility accross versions.
    pwendell committed Mar 25, 2014
    Commit: dc126f2
  3. SPARK-1128: set hadoop task properties when constructing HadoopRDD

    https://spark-project.atlassian.net/browse/SPARK-1128
    
    The task properties are not set when constructing HadoopRDD in the current implementation, which may limit implementations that depend on
    
    ```
    mapred.tip.id
    mapred.task.id
    mapred.task.is.map
    mapred.task.partition
    mapred.job.id
    ```
    
    This patch also contains a small fix in createJobID (SparkHadoopWriter.scala), where the current implementation does not actually use the time parameter.
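
    A minimal sketch of stamping these properties onto a `JobConf` per task (identifiers are illustrative, not this PR's exact code):

    ```scala
    import java.text.SimpleDateFormat
    import java.util.Date

    import org.apache.hadoop.mapred.{JobConf, TaskAttemptID}

    // Stamp the standard task properties onto the JobConf before handing it to
    // an InputFormat, so user code reading mapred.tip.id etc. sees real values.
    def configureTask(conf: JobConf, jobId: Int, splitId: Int, attemptId: Int): Unit = {
      val jtIdentifier = new SimpleDateFormat("yyyyMMddHHmm").format(new Date())
      val attempt = new TaskAttemptID(jtIdentifier, jobId, true /* isMap */, splitId, attemptId)
      conf.set("mapred.tip.id", attempt.getTaskID.toString)
      conf.set("mapred.task.id", attempt.toString)
      conf.setBoolean("mapred.task.is.map", true)
      conf.setInt("mapred.task.partition", splitId)
      conf.set("mapred.job.id", attempt.getJobID.toString)
    }
    ```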
    
    Author: CodingCat <zhunansjtu@gmail.com>
    Author: Nan Zhu <CodingCat@users.noreply.github.com>
    
    Closes apache#101 from CodingCat/SPARK-1128 and squashes the following commits:
    
    ed0980f [CodingCat] make SparkHiveHadoopWriter belongs to spark package
    5b1ad7d [CodingCat] move SparkHiveHadoopWriter to org.apache.spark package
    258f92c [CodingCat] code cleanup
    af88939 [CodingCat] update the comments and permission of SparkHadoopWriter
    9bd1fe3 [CodingCat] move configuration for jobConf to HadoopRDD
    b7bdfa5 [Nan Zhu] style fix
    a3153a8 [Nan Zhu] style fix
    c3258d2 [CodingCat] set hadoop task properties while using InputFormat
    CodingCat authored and aarondav committed Mar 25, 2014
    Commit: 5140598
  4. Unify the logic for column pruning, projection, and filtering of table scans.
    
    This removes duplicated logic, dead code and casting when planning parquet table scans and hive table scans.
    
    Other changes:
     - Fix tests now that we are doing a better job of column pruning (i.e., since pruning predicates are applied before we even start scanning tuples, columns required by these predicates do not need to be included in the output of the scan unless they are also included in the final output of this logical plan fragment).
     - Add rule to simplify trivial filters.  This was required to avoid `WHERE false` from getting pushed into table scans, since `HiveTableScan` (reasonably) refuses to apply partition pruning predicates to non-partitioned tables.
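
    A toy sketch of such a rule over a simplified plan type (not Catalyst's actual classes):

    ```scala
    // A Filter whose condition is literally false can be replaced by an empty
    // relation before it ever reaches a table scan.
    sealed trait Plan
    case class Scan(table: String) extends Plan
    case class Filter(condition: Expr, child: Plan) extends Plan
    case class EmptyRelation() extends Plan

    sealed trait Expr
    case class Literal(value: Boolean) extends Expr

    def simplifyTrivialFilters(plan: Plan): Plan = plan match {
      case Filter(Literal(false), _)    => EmptyRelation()                 // WHERE false: nothing survives
      case Filter(Literal(true), child) => simplifyTrivialFilters(child)   // WHERE true: no-op
      case Filter(c, child)             => Filter(c, simplifyTrivialFilters(child))
      case other                        => other
    }
    ```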
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#213 from marmbrus/strategyCleanup and squashes the following commits:
    
    48ce403 [Michael Armbrust] Move one more bit of parquet stuff into the core SQLContext.
    834ce08 [Michael Armbrust] Address comments.
    0f2c6f5 [Michael Armbrust] Unify the logic for column pruning, projection, and filtering of table scans for both Hive and Parquet relations.  Fix tests now that we are doing a better job of column pruning.
    marmbrus authored and pwendell committed Mar 25, 2014
    Commit: b637f2d
  5. SPARK-1286: Make usage of spark-env.sh idempotent

    Various spark scripts load spark-env.sh. This can cause growth of any variables that may be appended to (SPARK_CLASSPATH, SPARK_REPL_OPTS) and it makes the precedence order for options specified in spark-env.sh less clear.
    
    One use-case for the latter is that we want to set options from the command-line of spark-shell, but these options will be overridden by subsequent loading of spark-env.sh. If we were to load the spark-env.sh first and then set our command-line options, we could guarantee correct precedence order.
    
    Note that we use SPARK_CONF_DIR if available to support the sbin/ scripts, which always set this variable from sbin/spark-config.sh. Otherwise, we default to the ../conf/ as usual.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#184 from aarondav/idem and squashes the following commits:
    
    e291f91 [Aaron Davidson] Use "private" variables in load-spark-env.sh
    8da8360 [Aaron Davidson] Add .sh extension to load-spark-env.sh
    93a2471 [Aaron Davidson] SPARK-1286: Make usage of spark-env.sh idempotent
    aarondav committed Mar 25, 2014
    Commit: 007a733
  6. Add more hive compatibility tests to whitelist

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#220 from marmbrus/moreTests and squashes the following commits:
    
    223ec35 [Michael Armbrust] Blacklist machine specific test
    9c966cc [Michael Armbrust] add more hive compatability tests to whitelist
    marmbrus authored and rxin committed Mar 25, 2014
    Commit: 134ace7
  7. SPARK-1316. Remove use of Commons IO

    (This follows from a side point on SPARK-1133, in discussion of the PR: apache#164 )
    
    Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark `Utils.scala` class.
    
    Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#226 from srowen/SPARK-1316 and squashes the following commits:
    
    21efef3 [Sean Owen] Remove use of Commons IO
    srowen authored and rxin committed Mar 25, 2014
    Commit: 71d4ed2
  8. SPARK-1319: Fix scheduler to account for tasks using > 1 CPUs.

    Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.
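
    An illustrative sketch of the accounting (not the scheduler's real code): the constant lives in one place, and an offer only accepts tasks while it has at least that many CPUs free.

    ```scala
    class CpuAccounting(conf: Map[String, String]) {
      // Single source of truth for how many CPUs one task consumes.
      val CPUS_PER_TASK: Int = conf.getOrElse("spark.task.cpus", "1").toInt

      // How many tasks fit on an executor offering `availableCpus` CPUs.
      def tasksThatFit(availableCpus: Int): Int = availableCpus / CPUS_PER_TASK
    }

    // e.g. with spark.task.cpus = 2, an offer of 5 CPUs runs at most 2 tasks:
    val acct = new CpuAccounting(Map("spark.task.cpus" -> "2"))
    assert(acct.tasksThatFit(5) == 2)
    ```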
    
    Thanks @kayousterhout for the design discussion
    
    Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
    
    Closes apache#219 from shivaram/multi-cpus and squashes the following commits:
    
    5c7d685 [Shivaram Venkataraman] Don't pass availableCpus to TaskSetManager
    260e4d5 [Shivaram Venkataraman] Add a check for non-zero CPUs in TaskSetManager
    73fcf6f [Shivaram Venkataraman] Add documentation for spark.task.cpus
    647bc45 [Shivaram Venkataraman] Fix scheduler to account for tasks using > 1 CPUs. Move CPUS_PER_TASK to TaskSchedulerImpl as the value is a constant and use it in both Mesos and CoarseGrained scheduler backends.
    shivaram authored and kayousterhout committed Mar 25, 2014
    Commit: f8111ea
  9. Avoid Option while generating call site

    This is an update on apache#180, which changes the solution from blacklisting "Option.scala" to avoiding the Option code path while generating the call path.
    
    Also includes a unit test to prevent this issue in the future, and some minor refactoring.
    
    Thanks @witgo for reporting this issue and working on the initial solution!
    
    Author: witgo <witgo@qq.com>
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#222 from aarondav/180 and squashes the following commits:
    
    f74aad1 [Aaron Davidson] Avoid Option while generating call site & add unit tests
    d2b4980 [witgo] Modify the position of the filter
    1bc22d7 [witgo] Fix Stage.name return "apply at Option.scala:120"
    witgo authored and pwendell committed Mar 25, 2014
    Commit: 8237df8

Commits on Mar 26, 2014

  1. Initial experimentation with Travis CI configuration

    This is not intended to replace Jenkins immediately, and Jenkins will remain the CI of reference for merging pull requests in the near term.  Long term, it is possible that Travis will give us better integration with github, so we are investigating its use.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#230 from marmbrus/travis and squashes the following commits:
    
    93f9a32 [Michael Armbrust] Add Apache license to .travis.yml
    d7c0e78 [Michael Armbrust] Initial experimentation with Travis CI configuration
    marmbrus authored and pwendell committed Mar 26, 2014
    Commit: 4f7d547
  2. SPARK-1321 Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation
    
    Also updated the documentation for top and takeOrdered.
    
    On my simple test of sorting 100 million (Int, Int) tuples using Spark, Guava's top k implementation (in Ordering) is much faster than the BoundedPriorityQueue implementation for roughly sorted input (10 - 20X faster), and still faster for purely random input (2 - 5X).
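
    For reference, a minimal sketch of taking the k largest elements of an iterator with Guava's `Ordering` (illustrative, not this PR's exact code):

    ```scala
    import scala.collection.JavaConverters._

    import com.google.common.collect.{Ordering => GuavaOrdering}

    // Guava keeps only k candidates as it scans, instead of maintaining a
    // full priority queue over the whole input.
    def topK[T](input: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] =
      GuavaOrdering.from[T](ord).greatestOf(input.asJava, k).asScala.toSeq

    // e.g. topK(Iterator(5, 1, 9, 3, 7), 2) == Seq(9, 7)
    ```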
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#229 from rxin/takeOrdered and squashes the following commits:
    
    0d11844 [Reynold Xin] Use Guava's top k implementation rather than our BoundedPriorityQueue based implementation. Also updated the documentation for top and takeOrdered.
    rxin committed Mar 26, 2014
    Commit: b859853
  3. SPARK-1322, top in pyspark should sort result in descending order.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits:
    
    f316266 [Prashant Sharma] Minor change in comment.
    58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
    ScrapCodes authored and pwendell committed Mar 26, 2014
    Commit: a0853a3
  4. Unified package definition format in Spark SQL

    According to discussions in comments of PR apache#208, this PR unifies package definition format in Spark SQL.
    
    Some broken links in ScalaDoc and typos detected along the way are also fixed.
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes apache#225 from liancheng/packageDefinition and squashes the following commits:
    
    75c47b3 [Cheng Lian] Fixed file line length
    4f87968 [Cheng Lian] Unified package definition format in Spark SQL
    liancheng authored and pwendell committed Mar 26, 2014
    Commit: 345825d

Commits on Mar 27, 2014

  1. [SQL] Un-ignore a test that is now passing.

    Add golden answer for aforementioned test.
    
    Also, fix golden test generation from sbt/sbt by setting the classpath correctly.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#244 from marmbrus/partTest and squashes the following commits:
    
    37a33c9 [Michael Armbrust] Un-ignore a test that is now passing, add golden answer for aforementioned test.  Fix golden test generation from sbt/sbt.
    marmbrus authored and pwendell committed Mar 27, 2014
    Commit: 32cbdfd
  2. [SQL] Add a custom serializer for maps since they do not have a no-arg constructor.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#243 from marmbrus/mapSer and squashes the following commits:
    
    54045f7 [Michael Armbrust] Add a custom serializer for maps since they do not have a no-arg constructor.
    marmbrus authored and pwendell committed Mar 27, 2014
    Commit: e15e574
  3. SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS

    /cc @aarondav and @andrewor14
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#231 from pwendell/ui-binding and squashes the following commits:
    
    e8025f8 [Patrick Wendell] SPARK-1324: SparkUI Should Not Bind to SPARK_PUBLIC_DNS
    pwendell committed Mar 27, 2014
    Commit: be6d96c
  4. Spark 1095 : Adding explicit return types to all public methods

    Excluded those that are self-evident and the cases that are discussed in the mailing list.
    
    Author: NirmalReddy <nirmal_reddy2000@yahoo.com>
    Author: NirmalReddy <nirmal.reddy@imaginea.com>
    
    Closes apache#168 from NirmalReddy/Spark-1095 and squashes the following commits:
    
    ac54b29 [NirmalReddy] import misplaced
    8c5ff3e [NirmalReddy] Changed syntax of unit returning methods
    02d0778 [NirmalReddy] fixed explicit types in all the other packages
    1c17773 [NirmalReddy] fixed explicit types in core package
    NirmalReddy authored and pwendell committed Mar 27, 2014
    Commit: 3e63d98
  5. SPARK-1325. The maven build error for Spark Tools

    This is just a slight variation on apache#234 and alternative suggestion for SPARK-1325. `scala-actors` is not necessary. `SparkBuild.scala` should be updated to reflect the direct dependency on `scala-reflect` and `scala-compiler`. And the `repl` build, which has the same dependencies, should also be consistent between Maven / SBT.
    
    Author: Sean Owen <sowen@cloudera.com>
    Author: witgo <witgo@qq.com>
    
    Closes apache#240 from srowen/SPARK-1325 and squashes the following commits:
    
    25bd7db [Sean Owen] Add necessary dependencies scala-reflect and scala-compiler to tools. Update repl dependencies, which are similar, to be consistent between Maven / SBT in this regard too.
    srowen authored and pwendell committed Mar 27, 2014
    Commit: 1fa48d9
  6. [SPARK-1327] GLM needs to check addIntercept for intercept and weights

    GLM needs to check addIntercept for intercept and weights. The current implementation always uses the first weight as the intercept. Added a test for training without adding the intercept.
    
    JIRA: https://spark-project.atlassian.net/browse/SPARK-1327
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#236 from mengxr/glm and squashes the following commits:
    
    bcac1ac [Xiangrui Meng] add two tests to ensure {Lasso, Ridge}.setIntercept will throw an exceptions
    a104072 [Xiangrui Meng] remove protected to be compatible with 0.9
    0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
    d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
    mengxr authored and tdas committed Mar 27, 2014
    Commit: d679843
  7. Cut down the granularity of travis tests.

    This PR amortizes the cost of downloading all the jars and compiling core across more test cases.  In one anecdotal run this change takes the cumulative time down from ~80 minutes to ~40 minutes.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#255 from marmbrus/travis and squashes the following commits:
    
    506b22d [Michael Armbrust] Cut down the granularity of travis tests so we can amortize the cost of compilation.
    marmbrus authored and pwendell committed Mar 27, 2014
    Commit: 5b2d863
  8. SPARK-1330 removed extra echo from comput_classpath.sh

    Remove the extra echo, which prevents spark-class from working. Note that I did not update the comment above it (which is also wrong), because I'm not sure what it should do.
    
    Should hive only be included if explicitly built with sbt hive/assembly or should sbt assembly build it?
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes apache#241 from tgravescs/SPARK-1330 and squashes the following commits:
    
    b10d708 [Thomas Graves] SPARK-1330 removed extra echo from comput_classpath.sh
    tgravescs committed Mar 27, 2014
    Commit: 426042a
  9. SPARK-1335. Also increase perm gen / code cache for scalatest when invoked via Maven build
    
    I am observing build failures when the Maven build reaches tests in the new SQL components. (I'm on Java 7 / OSX 10.9). The failure is the usual complaint from Scala, that it's out of permgen space, or that the JIT is out of code cache space.
    
    I see that various build scripts increase these both for SBT. This change simply adds these settings to scalatest's arguments. Works for me and seems a bit more consistent.
    
    (I also snuck in cures for new build warnings from new scaladoc. Felt too trivial for a new PR, although it's separate. Just something I also saw while examining the build output.)
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#253 from srowen/SPARK-1335 and squashes the following commits:
    
    c0f2d31 [Sean Owen] Appease scalastyle with a newline at the end of the file
    a02679c [Sean Owen] Fix scaladoc errors due to missing links, which are generating build warnings, from some recent doc changes. We apparently can't generate links outside the module.
    b2c6a09 [Sean Owen] Add perm gen, code cache settings to scalatest, mirroring SBT settings elsewhere, which allows tests to complete in at least one environment where they are failing. (Also removed a duplicate -Xms setting elsewhere.)
    srowen authored and pwendell committed Mar 27, 2014
    Commit: 53953d0
  10. [SPARK-1268] Adding XOR and AND-NOT operations to spark.util.collection.BitSet
    
    Symmetric difference (xor) in particular is useful for computing some distance metrics (e.g. Hamming). Unit tests added.
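
    A standalone sketch of the xor-based Hamming distance (plain word arrays rather than Spark's private BitSet):

    ```scala
    // Combine words with ^ and count the bits set in exactly one of the two.
    def hammingDistance(a: Array[Long], b: Array[Long]): Int = {
      require(a.length == b.length, "bit sets must have the same number of words")
      var dist = 0
      var i = 0
      while (i < a.length) {
        dist += java.lang.Long.bitCount(a(i) ^ b(i))
        i += 1
      }
      dist
    }

    // e.g. hammingDistance(Array(0xFFL), Array(0x0FL)) == 4
    ```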
    
    Author: Petko Nikolov <nikolov@soundcloud.com>
    
    Closes apache#172 from petko-nikolov/bitset-imprv and squashes the following commits:
    
    451f28b [Petko Nikolov] fixed style mistakes
    5beba18 [Petko Nikolov] rm outer loop in andNot test
    0e61035 [Petko Nikolov] conform to spark style; rm redundant asserts; more unit tests added; use arraycopy instead of loop
    d53cdb9 [Petko Nikolov] rm incidentally added space
    4e1df43 [Petko Nikolov] adding xor and and-not to BitSet; unit tests added
    Petko Nikolov authored and rxin committed Mar 27, 2014
    Commit: 6f986f0

Commits on Mar 28, 2014

  1. [SPARK-1210] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor.
    
    Constructor of `org.apache.spark.executor.Executor` should not set context class loader of current thread, which is backend Actor's thread.
    
    Run the following code in local-mode REPL.
    
    ```
    scala> case class Foo(i: Int)
    scala> val ret = sc.parallelize((1 to 100).map(Foo), 10).collect
    ```
    
    This causes errors as follows:
    
    ```
    ERROR actor.OneForOneStrategy: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo;
    java.lang.ArrayStoreException: [L$line5.$read$$iwC$$iwC$$iwC$$iwC$Foo;
         at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
         at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870)
         at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:870)
         at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:859)
         at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:616)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```
    
    This is because the class loader used to deserialize the resulting `Foo` instances might differ from the backend Actor's, and the Actor's class loader should be the same as the Driver's.
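
    A minimal sketch of the safer pattern (generic Java serialization, not Spark's exact code): hand the deserializer an explicit class loader instead of mutating the calling thread's context loader.

    ```scala
    import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

    // Resolve classes against a caller-supplied loader (e.g. a REPL class
    // loader), leaving the actor thread's context class loader untouched.
    class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
        extends ObjectInputStream(in) {
      override protected def resolveClass(desc: ObjectStreamClass): Class[_] =
        Class.forName(desc.getName, false, loader)
    }

    def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
      val in = new LoaderAwareObjectInputStream(new ByteArrayInputStream(bytes), loader)
      try in.readObject().asInstanceOf[T] finally in.close()
    }
    ```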
    
    Author: Takuya UESHIN <ueshin@happy-camper.st>
    
    Closes apache#15 from ueshin/wip/wrongcontextclassloader and squashes the following commits:
    
    d79e8c0 [Takuya UESHIN] Change a parent class loader of ExecutorURLClassLoader.
    c6c09b6 [Takuya UESHIN] Add a test to collect objects of class defined in repl.
    43e0feb [Takuya UESHIN] Prevent ContextClassLoader of Actor from becoming ClassLoader of Executor.
    ueshin authored and pwendell committed Mar 28, 2014
    Commit: 3d89043
  2. Make sed do -i '' on OSX

    I don't have access to an OSX machine, so if someone could test this that would be great.
    
    Author: Nick Lanham <nick@afternight.org>
    
    Closes apache#258 from nicklan/osx-sed-fix and squashes the following commits:
    
    a6f158f [Nick Lanham] Also make mktemp work on OSX
    558fd6e [Nick Lanham] Make sed do -i '' on OSX
    nicklan authored and mateiz committed Mar 28, 2014
    Commit: 632c322
  3. SPARK-1096, a space after comment start style checker.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits:
    
    214135a [Prashant Sharma] Review feedback.
    5eba88c [Prashant Sharma] Fixed style checks for ///+ comments.
    e54b2f8 [Prashant Sharma] improved message, work around.
    83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version.
    810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker.
    ba33193 [Prashant Sharma] scala style as a project
    ScrapCodes authored and pwendell committed Mar 28, 2014
    Commit: 60abc25
  4. fix path for jar, make sed actually work on OSX

    Author: Nick Lanham <nick@afternight.org>
    
    Closes apache#264 from nicklan/make-distribution-fixes and squashes the following commits:
    
    172b981 [Nick Lanham] fix path for jar, make sed actually work on OSX
    nicklan authored and mateiz committed Mar 28, 2014
    Commit: 75d46be
  5. Commit 56cc7fb (no commit message shown)

Commits on Mar 29, 2014

  1. SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes apache#263 from tgravescs/SPARK-1345 and squashes the following commits:
    
    b43a2a0 [Thomas Graves] SPARK-1345 adding missing dependency on avro for hadoop 0.23 to the new sql pom files
    tgravescs authored and pwendell committed Mar 29, 2014
    Commit: 3738f24
  2. SPARK-1126. spark-app preliminary

    This is a starting version of the spark-app script for running compiled binaries against Spark.  It still needs tests and some polish.  The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.
    
    This leaves out the changes required for launching python scripts.  I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).
    
    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#86 from sryza/sandy-spark-1126 and squashes the following commits:
    
    d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
    e7315c6 [Sandy Ryza] Fix failing tests
    34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
    299ddca [Sandy Ryza] Fix scalastyle
    a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
    04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
    sryza authored and pwendell committed Mar 29, 2014
    Commit: 1617816
  3. Implement the RLike & Like in catalyst

    This PR includes:
    1) Unify the unit test for expression evaluation
    2) Add implementation of RLike & Like
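
    For illustration, a minimal sketch of evaluating SQL `LIKE` (not Catalyst's actual implementation, and ignoring escape characters): translate `%` to `.*` and `_` to `.`, quoting everything else literally.

    ```scala
    import java.util.regex.Pattern

    def likeToRegex(pattern: String): Pattern = {
      val sb = new StringBuilder
      pattern.foreach {
        case '%' => sb.append(".*")                      // any sequence of characters
        case '_' => sb.append(".")                       // any single character
        case c   => sb.append(Pattern.quote(c.toString)) // everything else is literal
      }
      Pattern.compile(sb.toString)
    }

    def like(input: String, pattern: String): Boolean =
      likeToRegex(pattern).matcher(input).matches()

    // e.g. like("SparkSQL", "Spark%") == true; like("abc", "a_c") == true
    ```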
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes apache#224 from chenghao-intel/string_expression and squashes the following commits:
    
    84f72e9 [Cheng Hao] fix bug in RLike/Like & Simplify the unit test
    aeeb1d7 [Cheng Hao] Simplify the implementation/unit test of RLike/Like
    319edb7 [Cheng Hao] change to spark code style
    91cfd33 [Cheng Hao] add implementation for rlike/like
    2c8929e [Cheng Hao] Update the unit test for expression evaluation
    chenghao-intel authored and rxin committed Mar 29, 2014
    Commit: af3746c

Commits on Mar 30, 2014

  1. [SPARK-1186] : Enrich the Spark Shell to support additional arguments.

    Enrich the Spark Shell functionality to support the following options.
    
    ```
    Usage: spark-shell [OPTIONS]
    
    OPTIONS:
        -h  --help              : Print this help information.
        -c  --cores             : The maximum number of cores to be used by the Spark Shell.
        -em --executor-memory   : The memory used by each executor of the Spark Shell, the number
                                  is followed by m for megabytes or g for gigabytes, e.g. "1g".
        -dm --driver-memory     : The memory used by the Spark Shell, the number is followed
                                  by m for megabytes or g for gigabytes, e.g. "1g".
        -m  --master            : A full string that describes the Spark Master, defaults to "local"
                                  e.g. "spark://localhost:7077".
        --log-conf              : Enables logging of the supplied SparkConf as INFO at start of the
                                  Spark Context.
    
    e.g.
        spark-shell -m spark://localhost:7077 -c 4 -dm 512m -em 2g
    ```
    
    **Note**: this commit reflects the changes applied to _master_ based on [5d98cfc].
    
    [ticket: SPARK-1186] : Enrich the Spark Shell to support additional arguments.
                            https://spark-project.atlassian.net/browse/SPARK-1186
    
    Author      : bernardo.gomezpalcio@gmail.com
    
    Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
    
    Closes apache#116 from berngp/feature/enrich-spark-shell and squashes the following commits:
    
    c5f455f [Bernardo Gomez Palacio] [SPARK-1186] : Enrich the Spark Shell to support additional arguments.
    berngp authored and aarondav committed Mar 30, 2014
    Commit: fda86d8
  2. Don't swallow all kryo errors, only those that indicate we are out of data.
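
    A minimal sketch of the distinction (assuming Kryo reports running out of input as a "Buffer underflow" `KryoException`): only that case is treated as end-of-stream; every other Kryo failure propagates.

    ```scala
    import java.io.EOFException

    import com.esotericsoftware.kryo.KryoException

    def readNext[T](read: () => T): T =
      try read() catch {
        // Out of data: translate to the EOF the caller expects. Any other
        // KryoException fails the guard and propagates unchanged.
        case e: KryoException if e.getMessage != null &&
            e.getMessage.toLowerCase.contains("buffer underflow") =>
          throw new EOFException
      }
    ```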
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#142 from marmbrus/kryoErrors and squashes the following commits:
    
    9c72d1f [Michael Armbrust] Make the test more future proof.
    78f5a42 [Michael Armbrust] Don't swallow all kryo errors, only those that indicate we are out of data.
    marmbrus authored and rxin committed Mar 30, 2014
    Commit: 92b8395
  3. [SQL] SPARK-1354 Fix self-joins of parquet relations

    @AndreSchumacher, please take a look.
    
    https://spark-project.atlassian.net/browse/SPARK-1354
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#269 from marmbrus/parquetJoin and squashes the following commits:
    
    4081e77 [Michael Armbrust] Create new instances of Parquet relation when multiple copies are in a single plan.
    marmbrus authored and rxin committed Mar 30, 2014
    Commit: 2861b07
  4. SPARK-1336 Reducing the output of run-tests script.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    Author: Prashant Sharma <scrapcodes@gmail.com>
    
    Closes apache#262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits:
    
    87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast.
    811170f [Prashant Sharma] Reducing the ouput of run-tests script.
    ScrapCodes authored and pwendell committed Mar 30, 2014
    Commit: df1b9f7
  5. [SPARK-1354][SQL] Add tableName as a qualifier for SimpleCatelogy

    Fix attributes being left unresolved when a query uses a table name as a qualifier in a SQLContext with SimpleCatalog; for details please see [SPARK-1354](https://issues.apache.org/jira/browse/SPARK-1354?jql=project%20%3D%20SPARK).
    
    Author: jerryshao <saisai.shao@intel.com>
    
    Closes apache#272 from jerryshao/qualifier-fix and squashes the following commits:
    
    7950170 [jerryshao] Add tableName as a qualifier for SimpleCatelogy
    jerryshao authored and pwendell committed Mar 30, 2014
    Commit: 95d7d2a
  6. SPARK-1352 - Comment style single space before ending */ check.

    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#261 from ScrapCodes/comment-style-check2 and squashes the following commits:
    
    6cde61e [Prashant Sharma] comment style space before ending */ check.
    ScrapCodes authored and pwendell committed Mar 30, 2014
    Commit: d666053

Commits on Mar 31, 2014

  1. SPARK-1352: Improve robustness of spark-submit script

    1. Better error messages when required arguments are missing.
    2. Support for unit testing cases where presented arguments are invalid.
    3. Bug fix: Only use environment variables when they are set (otherwise this will cause an NPE).
    4. A verbose mode to aid debugging.
    5. Visibility of several variables is set to private.
    6. Deprecation warning for existing scripts.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#271 from pwendell/spark-submit and squashes the following commits:
    
    9146def [Patrick Wendell] SPARK-1352: Improve robustness of spark-submit script
    pwendell committed Mar 31, 2014
    Commit: 841721e
  2. [SQL] Rewrite join implementation to allow streaming of one relation.

    Before we were materializing everything in memory.  This also uses the projection interface so it will be easier to plug in code gen (it's ported from that branch).
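
    A toy sketch of the streamed hash join idea over plain tuples (not Catalyst rows): build a hash table on one side, then stream the other side through it.

    ```scala
    // Only the build side is materialized; the stream side is consumed lazily.
    def hashJoin[K, L, R](
        buildSide: Iterable[(K, L)],
        streamSide: Iterator[(K, R)]): Iterator[(K, (L, R))] = {
      val table: Map[K, Seq[L]] =
        buildSide.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2).toSeq }
      streamSide.flatMap { case (k, r) =>
        table.getOrElse(k, Seq.empty).iterator.map(l => (k, (l, r)))
      }
    }

    // e.g. hashJoin(Seq(1 -> "a"), Iterator(1 -> "x", 2 -> "y")).toList
    //      == List((1, ("a", "x")))
    ```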
    
    @rxin @liancheng
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#250 from marmbrus/hashJoin and squashes the following commits:
    
    1ad873e [Michael Armbrust] Change hasNext logic back to the correct version.
    8e6f2a2 [Michael Armbrust] Review comments.
    1e9fb63 [Michael Armbrust] style
    bc0cb84 [Michael Armbrust] Rewrite join implementation to allow streaming of one relation.
    marmbrus authored and rxin committed Mar 31, 2014
    Commit: 5731af5
  3. SPARK-1365 [HOTFIX] Fix RateLimitedOutputStream test

    This test needs to be fixed. It currently depends on Thread.sleep() having exact-timing
    semantics, which is not a valid assumption.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#277 from pwendell/rate-limited-stream and squashes the following commits:
    
    6c0ff81 [Patrick Wendell] SPARK-1365: Fix RateLimitedOutputStream test
    pwendell committed Mar 31, 2014
    Commit: 33b3c2a
  4. Commit 93f1c69 (no commit message shown)

Commits on Apr 1, 2014

  1. SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg"

    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#279 from sryza/sandy-spark-1376 and squashes the following commits:
    
    d8aebfa [Sandy Ryza] SPARK-1376. In the yarn-cluster submitter, rename "args" option to "arg"
    sryza authored and Mridul Muralidharan committed Apr 1, 2014
    Commit: 564f1c1
  2. [SPARK-1377] Upgrade Jetty to 8.1.14v20131031

    Previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#280 from andrewor14/jetty-upgrade and squashes the following commits:
    
    dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade
    e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031
    andrewor14 authored and pwendell committed Apr 1, 2014
    Commit: 94fe7fd
  3. [Hot Fix apache#42] Persisted RDD disappears on storage page if re-used

    If a previously persisted RDD is re-used, its information disappears from the Storage page.
    
    This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite any existing information regarding that RDD with a fresh one, whether or not the information for the RDD already exists.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#281 from andrewor14/ui-storage-fix and squashes the following commits:
    
    408585a [Andrew Or] Fix storage UI bug
    andrewor14 authored and pwendell committed Apr 1, 2014
    Commit: ada310a
  4. Added basic stats to the StreamingUI and refactored the UI to a Page to make it easier to transition to using SparkUI later.
    tdas committed Apr 1, 2014
    Commit: 4d86e98
  5. Commit db27bad (no commit message shown)
  6. [SQL] SPARK-1372 Support for caching and uncaching tables in a SQLContext.
    
    This doesn't yet support different databases in Hive (though you can probably work around this by calling `USE <dbname>`).  However, given the time constraints for 1.0 I think it's probably worth including this now and extending the functionality in the next release.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#282 from marmbrus/cacheTables and squashes the following commits:
    
    83785db [Michael Armbrust] Support for caching and uncaching tables in a SQLContext.
    marmbrus authored and rxin committed Apr 1, 2014
    Commit: f5c418d
  7. Added Apache licenses.

    tdas committed Apr 1, 2014
    Commit: aef4dd5

Commits on Apr 2, 2014

  1. [SPARK-1342] Scala 2.10.4

    Just a Scala version increment
    
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes apache#259 from markhamstra/scala-2.10.4 and squashes the following commits:
    
    fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
    markhamstra authored and mateiz committed Apr 2, 2014
    Commit: 764353d
  2. [Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call
    
    see comments on Pull Request apache#38
    (I couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)
    
    Author: Diana Carroll <dcarroll@cloudera.com>
    
    Closes apache#227 from dianacarroll/spark-1134 and squashes the following commits:
    
    ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
    b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
    0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
    Diana Carroll authored and mateiz committed Apr 2, 2014
    Commit: afb5ea6
  3. Revert "[Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call"
    
    This reverts commit afb5ea6.
    mateiz committed Apr 2, 2014
    Commit: 45df912
  4. MLI-1 Decision Trees

    Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.
    
    Key features:
    + Supports binary classification and regression
    + Supports gini, entropy and variance for information gain calculation
    + Supports both continuous and categorical features
    
    The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:
    
    1. Level-wise training to reduce passes over the entire dataset.
    2. Bin-wise split calculation to reduce computation overhead.
    3. Aggregation over partitions before combining to reduce communication overhead.
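
    For illustration, a small sketch of the gini impurity behind the information gain calculation (not MLlib's implementation):

    ```scala
    // Gini impurity: 1 minus the sum of squared class proportions.
    def gini(classCounts: Array[Long]): Double = {
      val total = classCounts.sum.toDouble
      if (total == 0) 0.0
      else 1.0 - classCounts.map { c => val p = c / total; p * p }.sum
    }

    // Information gain of a split = parent impurity minus the weighted
    // average impurity of the children.
    def gain(parent: Array[Long], left: Array[Long], right: Array[Long]): Double = {
      val n = parent.sum.toDouble
      gini(parent) - (left.sum / n) * gini(left) - (right.sum / n) * gini(right)
    }
    ```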
    
    Author: Manish Amde <manish9ue@gmail.com>
    Author: manishamde <manish9ue@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#79 from manishamde/tree and squashes the following commits:
    
    1e8c704 [Manish Amde] remove numBins field in the Strategy class
    7d54b4f [manishamde] Merge pull request apache#4 from mengxr/dtree
    f536ae9 [Xiangrui Meng] another pass on code style
    e1dd86f [Manish Amde] implementing code style suggestions
    62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing
    201702f [Manish Amde] making some more methods private
    f963ef5 [Manish Amde] making methods private
    c487e6a [manishamde] Merge pull request #1 from mengxr/dtree
    24500c5 [Xiangrui Meng] minor style updates
    4576b64 [Manish Amde] documentation and for to while loop conversion
    ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins
    632818f [Manish Amde] removing threshold for classification predict method
    2116360 [Manish Amde] removing dummy bin calculation for categorical variables
    6068356 [Manish Amde] ensuring num bins is always greater than max number of categories
    62c2562 [Manish Amde] fixing comment indentation
    ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions
    d1ef4f6 [Manish Amde] more documentation
    794ff4d [Manish Amde] minor improvements to docs and style
    eb8fcbe [Manish Amde] minor code style updates
    cd2c2b4 [Manish Amde] fixing code style based on feedback
    63e786b [Manish Amde] added multiple train methods for java compatability
    d3023b3 [Manish Amde] adding more docs for nested methods
    84f85d6 [Manish Amde] code documentation
    9372779 [Manish Amde] code style: max line lenght <= 100
    dd0c0d7 [Manish Amde] minor: some docs
    0dd7659 [manishamde] basic doc
    5841c28 [Manish Amde] unit tests for categorical features
    f067d68 [Manish Amde] minor cleanup
    c0e522b [Manish Amde] updated predict and split threshold logic
    b09dc98 [Manish Amde] minor refactoring
    6b7de78 [Manish Amde] minor refactoring and tests
    d504eb1 [Manish Amde] more tests for categorical features
    dbb7ac1 [Manish Amde] categorical feature support
    6df35b9 [Manish Amde] regression predict logic
    53108ed [Manish Amde] fixing index for highest bin
    e23c2e5 [Manish Amde] added regression support
    c8f6d60 [Manish Amde] adding enum for feature type
    b0e3e76 [Manish Amde] adding enum for feature type
    154aa77 [Manish Amde] enums for configurations
    733d6dd [Manish Amde] fixed tests
    02c595c [Manish Amde] added command line parsing
    98ec8d5 [Manish Amde] tree building and prediction logic
    b0eb866 [Manish Amde] added logic to handle leaf nodes
    80e8c66 [Manish Amde] working version of multi-level split calculation
    4798aae [Manish Amde] added gain stats class
    dad0afc [Manish Amde] decison stump functionality working
    03f534c [Manish Amde] some more tests
    0012a77 [Manish Amde] basic stump working
    8bca1e2 [Manish Amde] additional code for creating intermediate RDD
    92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested.
    cd53eae [Manish Amde] skeletal framework
    manishamde authored and mateiz committed Apr 2, 2014
    8b3045c
  5. Remove * from test case golden filename.

    @rxin mentioned this might cause issues on windows machines.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#297 from marmbrus/noStars and squashes the following commits:
    
    263122a [Michael Armbrust] Remove * from test case golden filename.
    marmbrus authored and rxin committed Apr 2, 2014
    ea9de65
  6. Renamed stageIdToActiveJob to jobIdToActiveJob.

    This data structure was misused and, as a result, later renamed to an incorrect name.
    
    This data structure seems to have gotten into this tangled state as a result of @henrydavidge using the stageID instead of the job Id to index into it and later @andrewor14 renaming the data structure to reflect this misunderstanding.
    
    This patch renames it and removes an incorrect indexing into it.  The incorrect indexing into it meant that the code added by @henrydavidge to warn when a task size is too large (added here apache@5757993) was not always executed; this commit fixes that.
    
    Author: Kay Ousterhout <kayousterhout@gmail.com>
    
    Closes apache#301 from kayousterhout/fixCancellation and squashes the following commits:
    
    bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.
    kayousterhout authored and pwendell committed Apr 2, 2014
    11973a7
  7. [SPARK-1385] Use existing code for JSON de/serialization of BlockId

    `BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type.
    With this PR, the de/serialization of BlockIds will go through the first (older) code path.
    
    (Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`)
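
    For illustration, a minimal sketch of the round trip the older code path provides; `BlockId.apply` and `name` are the real entry points, while the surrounding JSON string is just an assumption about the wire format:

    ```scala
    import org.apache.spark.storage.{BlockId, RDDBlockId}

    // Serialize: a BlockId renders itself as a unique string name.
    val id: BlockId = RDDBlockId(0, 3)
    val json = s"""{"Block ID": "${id.name}"}"""  // name is e.g. "rdd_0_3"

    // Deserialize: BlockId.apply reconstructs the typed BlockId via regex matching.
    val parsed: BlockId = BlockId("rdd_0_3")
    assert(parsed == id)
    ```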
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#289 from andrewor14/blockid-json and squashes the following commits:
    
    409d226 [Andrew Or] Simplify JSON de/serialization for BlockId
    andrewor14 authored and aarondav committed Apr 2, 2014
    de8eefa
  8. Do not re-use objects in the EdgePartition/EdgeTriplet iterators.

    This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact in my measurements. It also simplifies the code. As far as I can tell, the object re-use was nothing but premature optimization.
    
    I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?
    
    This is an example benchmark I did:
    
    test("benchmark") {
      val builder = new EdgePartitionBuilder[Int]
      for (i <- (1 to 10000000)) {
        builder.add(i.toLong, i.toLong, i)
      }
      val p = builder.toEdgePartition
      p.map(_.attr + 1).iterator.toList
    }
    
    It ran for 10 seconds both before and after this change.
    
    Author: Daniel Darabos <darabos.daniel@gmail.com>
    
    Closes apache#276 from darabos/spark-1188 and squashes the following commits:
    
    574302b [Daniel Darabos] Restore "manual" copying in EdgePartition.map(Iterator). Add comment to discourage novices like myself from trying to simplify the code.
    4117a64 [Daniel Darabos] Revert EdgePartitionSuite.
    4955697 [Daniel Darabos] Create a copy of the Edge objects in EdgeRDD.compute(). This avoids exposing the object re-use, while still enables the more efficient behavior for internal code.
    4ec77f8 [Daniel Darabos] Add comments about object re-use to the affected functions.
    2da5e87 [Daniel Darabos] Restore object re-use in EdgePartition.
    0182f2b [Daniel Darabos] Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.
    c55f52f [Daniel Darabos] Tests that reproduce the problems from SPARK-1188.
    darabos authored and rxin committed Apr 2, 2014
    7823633
  9. [SPARK-1371][WIP] Compression support for Spark SQL in-memory columna…

    …r storage
    
    JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)
    
    (Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)
    
    This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:
    
    *   `CompressionScheme`
    
        Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:
    
        * `RunLengthEncoding`
        * `DictionaryEncoding`
    
        Algorithms to be implemented include:
    
        * `BooleanBitSet`
        * `IntDelta`
        * `LongDelta`
    
    *   `CompressibleColumnBuilder`
    
        A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. The `CompressionScheme` with the lowest compression ratio is chosen for each column, based on statistical information gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` achieves a compression ratio better than 80%, the column is left uncompressed to save CPU time.
    
        The memory layout of the final byte buffer is shown below:
    
        ```
         .--------------------------- Column type ID (4 bytes)
         |   .----------------------- Null count N (4 bytes)
         |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
         |   |   |     .------------- Compression scheme ID (4 bytes)
         |   |   |     |   .--------- Compressed non-null elements
         V   V   V     V   V
        +---+---+-----+---+---------+
        |   |   | ... |   | ... ... |
        +---+---+-----+---+---------+
         \-----------/ \-----------/
            header         body
        ```
    
    *   `CompressibleColumnAccessor`
    
        A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.
    
    *   `ColumnStats`
    
        Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.
    
        Strictly speaking, `ColumnStats`-related code is not part of the compression support. It's included in this PR to validate the row-based API design (which is used to avoid boxing/unboxing costs whenever possible).
    
    A major refactoring change since PR apache#205 is:
    
    * Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.
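
    As an illustration of the Encoder/Decoder split, here is a minimal, self-contained sketch of run-length encoding over an Int column. This is plain Scala, independent of the actual `CompressionScheme` trait (whose exact signatures are not shown in this description):

    ```scala
    // Run-length encode a sequence of Ints as (value, runLength) pairs.
    def encode(values: Seq[Int]): Seq[(Int, Int)] =
      values.foldLeft(List.empty[(Int, Int)]) {
        case ((v, n) :: tail, x) if v == x => (v, n + 1) :: tail  // extend the current run
        case (runs, x)                     => (x, 1) :: runs      // start a new run
      }.reverse

    // Decode back to the original sequence.
    def decode(runs: Seq[(Int, Int)]): Seq[Int] =
      runs.flatMap { case (v, n) => Seq.fill(n)(v) }

    assert(decode(encode(Seq(1, 1, 1, 2, 2, 3))) == Seq(1, 1, 1, 2, 2, 3))
    ```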
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes apache#285 from liancheng/memColumnarCompression and squashes the following commits:
    
    ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
    d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
    5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
    c298b76 [Cheng Lian] Test suites refactored
    2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
    211331c [Cheng Lian] WIP: in-memory columnar compression support
    85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
    liancheng authored and pwendell committed Apr 2, 2014
    1faa579
  10. StopAfter / TopK related changes

    1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
    2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API.
    3. Avoid breaking lineage in Limit.
    4. Added a bunch of overrides to execution/basicOperators.scala.
    
    @marmbrus @liancheng
    
    Author: Reynold Xin <rxin@apache.org>
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#233 from rxin/limit and squashes the following commits:
    
    13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
    92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
    4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
    87b7d37 [Reynold Xin] Use the proper serializer in limit.
    9b79246 [Reynold Xin] Updated doc for Limit.
    47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
    231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a bunch of override's to execution/basicOperators.scala.
    rxin authored and pwendell committed Apr 2, 2014
    ed730c9
  11. [SPARK-1212, Part II] Support sparse data in MLlib

    In PR apache#117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:
    
    1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
    2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
    3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
    4. Add libSVMFile to MLContext.
    5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
    6. Gradient computation no longer creates temp vectors.
    7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.
    
    TODO:
    1. ~~Use axpy when possible.~~
    2. ~~Optimize Naive Bayes.~~
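
    For reference, a small sketch of the new vector-based API (the imports follow MLlib's `linalg` and `regression` packages; treat the exact paths as assumptions):

    ```scala
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Dense and sparse data now share one representation: LabeledPoint(Double, Vector).
    val dense  = LabeledPoint(1.0, Vectors.dense(0.5, 0.0, 2.0))
    // Sparse: vector of size 3 with non-zeros at indices 0 and 2.
    val sparse = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(0.5, 2.0)))
    ```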
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#245 from mengxr/vector and squashes the following commits:
    
    eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
    c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
    11999c7 [Xiangrui Meng] Merge branch 'master' into vector
    f7da54b [Xiangrui Meng] add minSplits to libSVMFile
    da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
    493f26f [Xiangrui Meng] Merge branch 'master' into vector
    7c1bc01 [Xiangrui Meng] add a TODO to NB
    b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
    b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
    4addc50 [Xiangrui Meng] merge master
    4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
    f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
    d088552 [Xiangrui Meng] use static constructor for MLContext
    6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
    3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
    0f8759b [Xiangrui Meng] minor updates to NB
    b11659c [Xiangrui Meng] style update
    78c4671 [Xiangrui Meng] add libSVMFile to MLContext
    f0fe616 [Xiangrui Meng] add a test for sparse linear regression
    44733e1 [Xiangrui Meng] use in-place gradient computation
    e981396 [Xiangrui Meng] use axpy in Updater
    db808a1 [Xiangrui Meng] update JavaLR example
    befa592 [Xiangrui Meng] passed scala/java tests
    75c83a4 [Xiangrui Meng] passed test compile
    1859701 [Xiangrui Meng] passed compile
    834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
    135ab72 [Xiangrui Meng] merge glm
    0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
    d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
    3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
    mengxr authored and mateiz committed Apr 2, 2014
    9c65fa7
  12. Refactoring the UI interface to add flexibility

    This commit introduces three (abstract) classes: WebUI, UITab, and UIPage.
    The top of the hierarchy is the WebUI, which contains many tabs and pages.
    Each tab in turn contains many pages.
    
    When a UITab is attached to a WebUI, the WebUI creates a handler for each
    of the tab's pages. Similarly, when a UIPage is attached to a WebUI, its
    handler is created. The server in WebUI is then ready to be bound to a host
    and a port.
    
    This commit also breaks down a couple of unnecessarily large files by
    moving certain classes to their own files.
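
    A minimal sketch of the hierarchy this describes (the class names match the commit text; the members and signatures are illustrative assumptions):

    ```scala
    import scala.collection.mutable.ArrayBuffer

    // One WebUI owns many tabs; each tab owns many pages, and attaching a
    // tab or a page to the WebUI registers one handler per page.
    abstract class UIPage(val prefix: String) {
      def render(request: Map[String, String]): String  // the page's HTML
    }

    abstract class UITab(val prefix: String) {
      val pages = ArrayBuffer[UIPage]()
    }

    abstract class WebUI(host: String, port: Int) {
      private val handlers = ArrayBuffer[(String, UIPage)]()

      def attachPage(page: UIPage): Unit = handlers.append((page.prefix, page))
      def attachTab(tab: UITab): Unit = tab.pages.foreach(attachPage)
      def bind(): Unit  // bind the underlying server to host:port
    }
    ```
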
    andrewor14 committed Apr 2, 2014
    7d57444

Commits on Apr 3, 2014

  1. [SQL] SPARK-1364 Improve datatype and test coverage for ScalaReflecti…

    …on schema inference.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#293 from marmbrus/reflectTypes and squashes the following commits:
    
    f54e8e8 [Michael Armbrust] Improve datatype and test coverage for ScalaReflection schema inference.
    marmbrus authored and pwendell committed Apr 3, 2014
    47ebea5
  2. cd000b0
  3. a37ad4f
  4. ed25dfc
  5. [SPARK-1398] Removed findbugs jsr305 dependency

    Should be a painless upgrade, and does offer some significant advantages should we want to leverage FindBugs more during the 1.0 lifecycle. http://findbugs.sourceforge.net/findbugs2.html
    
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes apache#307 from markhamstra/findbugs and squashes the following commits:
    
    99f2d09 [Mark Hamstra] Removed unnecessary findbugs jsr305 dependency
    markhamstra authored and pwendell committed Apr 3, 2014
    92a86b2
  6. Spark parquet improvements

    A few improvements to the Parquet support for SQL queries:
    - A ParquetRelation is now backed by a directory instead of individual files, which simplifies importing data
      from other sources
    - The InsertIntoParquetTable operation now supports switching between overwriting and appending (at least in
      HiveQL)
    - Tests now use the new API
    - Parquet logging can be set to WARNING level (the default)
    - Default compression for Parquet files (GZIP, as in parquet-mr)
    
    Author: Andre Schumacher <andre.schumacher@iki.fi>
    
    Closes apache#195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:
    
    54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
    AndreSchumacher authored and rxin committed Apr 3, 2014
    fbebaed
  7. [SPARK-1360] Add Timestamp Support for SQL

    This PR includes:
    1) Add the new data type Timestamp
    2) Add more data type casting, based on Hive's rules
    3) Fix a bug of missing data types in both parsers (HiveQl & SQLParser).
    
    Author: Cheng Hao <hao.cheng@intel.com>
    
    Closes apache#275 from chenghao-intel/timestamp and squashes the following commits:
    
    df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
    24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
    fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
    d0d1919 [Cheng Hao] Add more data type for scala reflection
    3259808 [Cheng Hao] Add the new Golden files
    3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
    54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
    9cb505c [Cheng Hao] Fix issues according to PR comments
    e529168 [Cheng Hao] Fix bug of converting from String
    6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
    8a1d4d6 [Cheng Hao] Add DataType for SqlParser
    ce4385e [Cheng Hao] Add TimestampType Support
    chenghao-intel authored and rxin committed Apr 3, 2014
    5d1feda
  8. Minor style updates.

    tdas committed Apr 3, 2014
    53be2c5
  9. 61358e3
  10. Spark 1162 Implemented takeOrdered in pyspark.

    Since Python does not have a max-heap library, and the usual tricks (such as inverting values) do not work for all cases, we have our own implementation of a max heap.
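
    The underlying technique, sketched here in Scala for concreteness (the PR's actual implementation is in Python and is not reproduced here): keep a bounded max-heap of the N smallest elements seen so far, evicting the largest kept element whenever a smaller one arrives.

    ```scala
    import scala.collection.mutable.PriorityQueue

    // Return the num smallest elements of xs in ascending order. The heap's
    // top is the largest kept element, so eviction is O(log num) per item.
    def takeOrdered(xs: Iterator[Int], num: Int): Seq[Int] = {
      val heap = PriorityQueue.empty[Int]  // max-heap under the natural ordering
      xs.foreach { x =>
        if (heap.size < num) heap.enqueue(x)
        else if (x < heap.head) { heap.dequeue(); heap.enqueue(x) }
      }
      heap.dequeueAll.reverse  // dequeueAll yields largest-first for a max-heap
    }

    takeOrdered(Iterator(5, 1, 4, 2, 3), 3)  // Seq(1, 2, 3)
    ```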
    
    Author: Prashant Sharma <prashant.s@imaginea.com>
    
    Closes apache#97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:
    
    35f86ba [Prashant Sharma] code review
    2b1124d [Prashant Sharma] fixed tests
    e8a08e2 [Prashant Sharma] Code review comments.
    49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
    ScrapCodes authored and mateiz committed Apr 3, 2014
    c1ea3af
  11. [SQL] SPARK-1333 First draft of java API

    WIP: Some work remains...
     * [x] Hive support
     * [x] Tests
     * [x] Update docs
    
    Feedback welcome!
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#248 from marmbrus/javaSchemaRDD and squashes the following commits:
    
    b393913 [Michael Armbrust] @srowen 's java style suggestions.
    f531eb1 [Michael Armbrust] Address matei's comments.
    33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
    822f626 [Michael Armbrust] improve docs.
    ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
    0b859c8 [Michael Armbrust] First draft of java API.
    marmbrus authored and mateiz committed Apr 3, 2014
    b8f5341
  12. [SPARK-1134] Fix and document passing of arguments to IPython

    This is based on @dianacarroll's previous pull request apache#227, and @JoshRosen's comments on apache#38. Since we do want to allow passing arguments to IPython, this does the following:
    * It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see ipython/ipython#5226, but no released version has that fix.)
    * If you run `pyspark` with `IPYTHON=1`, it passes your command-line arguments to it. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook`.
    * The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.
    
    This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes ipython/ipython#5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.
    
    @JoshRosen you should probably take the final call on this.
    
    Author: Diana Carroll <dcarroll@cloudera.com>
    
    Closes apache#294 from mateiz/spark-1134 and squashes the following commits:
    
    747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
    Diana Carroll authored and mateiz committed Apr 3, 2014
    a599e43
  13. Allow adding tabs to SparkUI dynamically + add example

    An example of how this is done is in org.apache.spark.ui.FooTab. Run
    it through bin/spark-class to see what it looks like (which should
    more or less match your expectations...).
    andrewor14 committed Apr 3, 2014
    9a48fa1
  14. 0d61ee8
  15. [BUILD FIX] Fix compilation of Spark SQL Java API.

    The JavaAPI and the Parquet improvements PRs didn't conflict, but broke the build.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#316 from marmbrus/hotFixJavaApi and squashes the following commits:
    
    0b84c2d [Michael Armbrust] Fix compilation of Spark SQL Java API.
    marmbrus authored and mateiz committed Apr 3, 2014
    d94826b
  16. 8f7323b
  17. Fix jenkins from giving the green light to builds that don't compile.

     Adding `| grep` swallows the non-zero return code from sbt failures. See [here](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13735/consoleFull) for a Jenkins run that fails to compile, but still gets a green light.
    
    Note the [BUILD FIX] commit isn't actually part of this PR, but GitHub is out of date.
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#317 from marmbrus/fixJenkins and squashes the following commits:
    
    7c77ff9 [Michael Armbrust] Remove output filter that was swallowing non-zero exit codes for test failures.
    marmbrus authored and rxin committed Apr 3, 2014
    9231b01

Commits on Apr 4, 2014

  1. Revert "[SPARK-1398] Removed findbugs jsr305 dependency"

    This reverts commit 92a86b2.
    pwendell committed Apr 4, 2014
    33e6361
  2. SPARK-1337: Application web UI garbage collects newest stages

    Simple fix...
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#320 from pwendell/stage-clean-up and squashes the following commits:
    
    29be62e [Patrick Wendell] SPARK-1337: Application web UI garbage collects newest stages instead old ones
    pwendell committed Apr 4, 2014
    Configuration menu
    Copy the full SHA
    ee6e9e7 View commit details
    Browse the repository at this point in the history
  3. SPARK-1350. Always use JAVA_HOME to run executor container JVMs.

    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#313 from sryza/sandy-spark-1350 and squashes the following commits:
    
    bb6d187 [Sandy Ryza] SPARK-1350. Always use JAVA_HOME to run executor container JVMs.
    sryza authored and tgravescs committed Apr 4, 2014
    Configuration menu
    Copy the full SHA
    7f32fd4 View commit details
    Browse the repository at this point in the history
  4. SPARK-1404: Always upgrade spark-env.sh vars to environment vars

    This was broken when spark-env.sh was made idempotent: the idempotence check is an environment variable, but the spark-env.sh variables may not have been exported as environment variables.
    
    Tested in zsh, bash, and sh.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#310 from aarondav/SPARK-1404 and squashes the following commits:
    
    c3406a5 [Aaron Davidson] Add extra export in spark-shell
    6a0e340 [Aaron Davidson] SPARK-1404: Always upgrade spark-env.sh vars to environment vars
    aarondav authored and pwendell committed Apr 4, 2014
    Configuration menu
    Copy the full SHA
    01cf4c4 View commit details
    Browse the repository at this point in the history
  5. [SPARK-1133] Add whole text files reader in MLlib

    Here is a pointer to the former [PR164](apache#164).
    
    This pull request is for the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which brings a new whole-files reader API to MLlib.
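
    A usage sketch of the reader (the squash list below notes the API moved to Spark core; the path here is a placeholder):

    ```scala
    // Each small file becomes a single (path, fullContents) record,
    // unlike textFile, which yields one record per line.
    val files: org.apache.spark.rdd.RDD[(String, String)] =
      sc.wholeTextFiles("hdfs://namenode:8020/data/small-files")

    files.foreach { case (path, contents) =>
      println(s"$path has ${contents.length} characters")
    }
    ```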
    
    Author: Xusen Yin <yinxusen@gmail.com>
    
    Closes apache#252 from yinxusen/whole-files-input and squashes the following commits:
    
    7191be6 [Xusen Yin] refine comments
    0af3faf [Xusen Yin] add JavaAPI test
    01745ee [Xusen Yin] fix deletion error
    cc97dca [Xusen Yin] move whole text file API to Spark core
    d792cee [Xusen Yin] remove the typo character "+"
    6bdf2c2 [Xusen Yin] test for small local file system block size
    a1f1e7e [Xusen Yin] add two extra spaces
    28cb0fe [Xusen Yin] add whole text files reader
    yinxusen authored and mateiz committed Apr 4, 2014
    Configuration menu
    Copy the full SHA
    f1fa617 View commit details
    Browse the repository at this point in the history
  6. SPARK-1375. Additional spark-submit cleanup

    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#278 from sryza/sandy-spark-1375 and squashes the following commits:
    
    5fbf1e9 [Sandy Ryza] SPARK-1375. Additional spark-submit cleanup
    sryza authored and pwendell committed Apr 4, 2014
    Configuration menu
    Copy the full SHA
    16b8308 View commit details
    Browse the repository at this point in the history
  7. Don't create SparkContext in JobProgressListenerSuite.

    This reduces the time of the test from 11 seconds to 20 milliseconds.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#324 from pwendell/job-test and squashes the following commits:
    
    868d9eb [Patrick Wendell] Don't create SparkContext in JobProgressListenerSuite.
    pwendell authored and rxin committed Apr 4, 2014
    Configuration menu
    Copy the full SHA
    a02b535 View commit details
    Browse the repository at this point in the history

Commits on Apr 5, 2014

  1. [SPARK-1198] Allow pipes tasks to run in different sub-directories

    This works as is on Linux/Mac/etc. but doesn't cover Windows. Here I use ln -sf for symlinks; putting this up for comments on that. Do we perhaps want to create some classes for running shell commands (Linux vs. Windows)? Is there some other way we want to do this? I assume we are still supporting JDK 1.6?
    
    Also, should I update the Java API for pipes to allow this parameter?
    
    Author: Thomas Graves <tgraves@apache.org>
    
    Closes apache#128 from tgravescs/SPARK1198 and squashes the following commits:
    
    abc1289 [Thomas Graves] remove extra tag in pom file
    ba23fc0 [Thomas Graves] Add support for symlink on windows, remove commons-io usage
    da4b221 [Thomas Graves] Merge branch 'master' of https://github.com/tgravescs/spark into SPARK1198
    61be271 [Thomas Graves] Fix file name filter
    6b783bd [Thomas Graves] style fixes
    1ab49ca [Thomas Graves] Add support for running pipe tasks is separate directories
    tgravescs authored and mateiz committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    198892f View commit details
    Browse the repository at this point in the history
  2. [SQL] Minor fixes.

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#315 from marmbrus/minorFixes and squashes the following commits:
    
    b23a15d [Michael Armbrust] fix scaladoc
    11062ac [Michael Armbrust] Fix registering "SELECT *" queries as tables and caching them.  As some tests for this and self-joins.
    3997dc9 [Michael Armbrust] Move Row extractor to catalyst.
    208bf5e [Michael Armbrust] More idiomatic naming of DSL functions. * subquery => as * for join condition => on, i.e., `r.join(s, condition = 'a == 'b)` =>`r.join(s, on = 'a == 'b)`
    87211ce [Michael Armbrust] Correctly handle self joins of in-memory cached tables.
    69e195e [Michael Armbrust] Change != to !== in the DSL since != will always translate to != on Any.
    01f2dd5 [Michael Armbrust] Correctly assign aliases to tables in SqlParser.
    marmbrus authored and rxin committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    d956cc2 View commit details
    Browse the repository at this point in the history
  3. SPARK-1414. Python API for SparkContext.wholeTextFiles

    Also clarified the comment that each file has to fit in memory.
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes apache#327 from mateiz/py-whole-files and squashes the following commits:
    
    9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
    mateiz committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    60e18ce View commit details
    Browse the repository at this point in the history
  4. Add test utility for generating Jar files with compiled classes.

    This was requested by a few different people and may be generally
    useful, so I'd like to contribute this and not block on a different
    PR for it to get in.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#326 from pwendell/class-loader-test-utils and squashes the following commits:
    
    ff3e88e [Patrick Wendell] Add test utility for generating Jar files with compiled classes.
    pwendell committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    5f3c1bb View commit details
    Browse the repository at this point in the history
  5. [SPARK-1419] Bumped parent POM to apache 14

    Keeping up-to-date with the parent, which includes some bugfixes.
    
    Author: Mark Hamstra <markhamstra@gmail.com>
    
    Closes apache#328 from markhamstra/Apache14 and squashes the following commits:
    
    3f19975 [Mark Hamstra] Bumped parent POM to apache 14
    markhamstra authored and pwendell committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    1347ebd View commit details
    Browse the repository at this point in the history
  6. SPARK-1305: Support persisting RDD's directly to Tachyon

    This moves PR#468 of apache-incubator-spark to apache-spark:
    "Adding an option to persist Spark RDD blocks into Tachyon."
    
    Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
    Author: RongGu <gurongwalker@gmail.com>
    
    Closes apache#158 from RongGu/master and squashes the following commits:
    
    72b7768 [Haoyuan Li] merge master
    9f7fa1b [Haoyuan Li] fix code style
    ae7834b [Haoyuan Li] minor cleanup
    a8b3ec6 [Haoyuan Li] merge master branch
    e0f4891 [Haoyuan Li] better check offheap.
    55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
    7cd4600 [RongGu] remove some logic code for tachyonstore's replication
    51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
    8adfcfa [RongGu] address arron's comment on inTachyonSize
    120e48a [RongGu] changed the root-level dir name in Tachyon
    5cc041c [Haoyuan Li] address aaron's comments
    9b97935 [Haoyuan Li] address aaron's comments
    d9a6438 [Haoyuan Li] fix for pspark
    77d2703 [Haoyuan Li] change python api.git status
    3dcace4 [Haoyuan Li] address matei's comments
    91fa09d [Haoyuan Li] address patrick's comments
    589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
    64348b2 [Haoyuan Li] update conf docs.
    ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
    619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
    be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
    49cc724 [Haoyuan Li] update docs with off_headp option
    4572f9f [RongGu] reserving the old apply function API of StorageLevel
    04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
    c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
    76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
    e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
    fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
    939e467 [Haoyuan Li] 0.4.1-thrift from maven central
    86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
    16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
    eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
    bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
    6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
    d827250 [RongGu] fix JsonProtocolSuie test failure
    716e93b [Haoyuan Li] revert the version
    ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
    2825a13 [RongGu] up-merging to the current master branch of the apache spark
    6a22c1a [Haoyuan Li] fix scalastyle
    8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
    77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
    1dcadf9 [Haoyuan Li] typo
    bf278fa [Haoyuan Li] fix python tests
    e82909c [Haoyuan Li] minor cleanup
    776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
    8859371 [Haoyuan Li] various minor fixes and clean up
    e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
    fcaeab2 [Haoyuan Li] address Aaron's comment
    e554b1e [Haoyuan Li] add python code
    47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
    dc8ef24 [Haoyuan Li] add old storelevel constructor
    e01a271 [Haoyuan Li] update tachyon 0.4.1
    8011a96 [RongGu] fix a brought-in mistake in StorageLevel
    70ca182 [RongGu] a bit change in comment
    556978b [RongGu] fix the scalastyle errors
    791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
    haoyuan authored and pwendell committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    b50ddfd View commit details
    Browse the repository at this point in the history
  7. [SQL] SPARK-1366 Consistent sql function across different types of SQ…

    …LContexts
    
    Now users who want to use HiveQL should explicitly say `hiveql` or `hql`.
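
    For illustration, a sketch assuming a `SQLContext` named `sqlContext` and a `HiveContext` named `hiveContext` (the method names come from this description and the squash list; the queries are placeholders):

    ```scala
    // `sql` now always uses the Spark SQL parser...
    val recent = sqlContext.sql("SELECT * FROM logs WHERE ts > 100")

    // ...while HiveQL must be requested explicitly on a HiveContext.
    val counts = hiveContext.hql("SELECT key, count(*) FROM src GROUP BY key")
    ```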
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#319 from marmbrus/standardizeSqlHql and squashes the following commits:
    
    de68d0e [Michael Armbrust] Fix sampling test.
    fbe4a54 [Michael Armbrust] Make `sql` always use spark sql parser, users of hive context can now use hql or hiveql to run queries using HiveQL instead.
    marmbrus authored and rxin committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    8de038e View commit details
    Browse the repository at this point in the history
  8. small fix ( proogram -> program )

    Author: Prabeesh K <prabsmails@gmail.com>
    
    Closes apache#331 from prabeesh/patch-3 and squashes the following commits:
    
    9399eb5 [Prabeesh K] small fix(proogram -> program)
    prabeesh authored and rxin committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    0acc7a0 View commit details
    Browse the repository at this point in the history
  9. HOTFIX for broken CI, by SPARK-1336

    Learnt that `set -o pipefail` is very useful.
    
    Author: Prashant Sharma <prashant.s@imaginea.com>
    Author: Prashant Sharma <scrapcodes@gmail.com>
    
    Closes apache#321 from ScrapCodes/hf-SPARK-1336 and squashes the following commits:
    
    9d22bc2 [Prashant Sharma] added comment why echo -e q exists.
    f865951 [Prashant Sharma] made error to match with word boundry so errors does not match. This is there to make sure build fails if provided SparkBuild has compile errors.
    7fffdf2 [Prashant Sharma] Removed a stray line.
    97379d8 [Prashant Sharma] HOTFIX for broken CI, by SPARK-1336
    ScrapCodes authored and pwendell committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    7c18428 View commit details
    Browse the repository at this point in the history
  10. Remove the getStageInfo() method from SparkContext.

    This method exposes the Stage objects, which are
    private to Spark and should not be exposed to the
    user.
    
    This method was added in apache@01d77f3; ccing @squito here in case there's a good reason to keep this!
    
    Author: Kay Ousterhout <kayousterhout@gmail.com>
    
    Closes apache#308 from kayousterhout/remove_public_method and squashes the following commits:
    
    2e2f009 [Kay Ousterhout] Remove the getStageInfo() method from SparkContext.
    kayousterhout authored and mateiz committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    2d0150c View commit details
    Browse the repository at this point in the history
  11. [SPARK-1371] fix computePreferredLocations signature to not depend on…

    … underlying implementation
    
    Changed the signature to use Map and Set rather than mutable HashMap and HashSet.
    
    Author: Mridul Muralidharan <mridulm80@apache.org>
    
    Closes apache#302 from mridulm/master and squashes the following commits:
    
    df747af [Mridul Muralidharan] Address review comments
    17e2907 [Mridul Muralidharan] fix computePreferredLocations signature to not depend on underlying implementation
    Mridul Muralidharan authored and mateiz committed Apr 5, 2014
    Configuration menu
    Copy the full SHA
    6e88583 View commit details
    Browse the repository at this point in the history

Commits on Apr 6, 2014

  1. Fix for PR apache#195 for Java 6

    Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility. See PR apache#195
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#334 from srowen/FixPR195ForJava6 and squashes the following commits:
    
    f92fbd3 [Sean Owen] Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility
    srowen authored and pwendell committed Apr 6, 2014
    Configuration menu
    Copy the full SHA
    890d63b View commit details
    Browse the repository at this point in the history
  2. SPARK-1421. Make MLlib work on Python 2.6

    The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.
    
    Author: Matei Zaharia <matei@databricks.com>
    
    Closes apache#335 from mateiz/mllib-python-2.6 and squashes the following commits:
    
    f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
    a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
    mateiz committed Apr 6, 2014
    Configuration menu
    Copy the full SHA
    0b85516 View commit details
    Browse the repository at this point in the history
  3. Fix SPARK-1420 The maven build error for Spark Catalyst

    Author: witgo <witgo@qq.com>
    
    Closes apache#333 from witgo/SPARK-1420 and squashes the following commits:
    
    902519e [witgo] add dependency scala-reflect to catalyst
    witgo authored and pwendell committed Apr 6, 2014
    Configuration menu
    Copy the full SHA
    7012ffa View commit details
    Browse the repository at this point in the history
  4. [SPARK-1259] Make RDD locally iterable
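
    A usage sketch of the method the squash list settles on, `toLocalIterator` (my understanding of the semantics: partitions are shipped to the driver one at a time, so only one partition is held in driver memory):

    ```scala
    // Iterate over an RDD on the driver without collect()-ing everything at once.
    val rdd = sc.parallelize(1 to 1000, 10)
    rdd.toLocalIterator.take(3).foreach(println)  // 1, 2, 3
    ```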

    Author: Egor Pakhomov <pahomov.egor@gmail.com>
    
    Closes apache#156 from epahomov/SPARK-1259 and squashes the following commits:
    
    8ec8f24 [Egor Pakhomov] Make to local iterator shorter
    34aa300 [Egor Pakhomov] Fix toLocalIterator docs
    08363ef [Egor Pakhomov] SPARK-1259 from toLocallyIterable to toLocalIterator
    6a994eb [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
    8be3dcf [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
    33ecb17 [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
    epahomov authored and pwendell committed Apr 6, 2014
    Configuration menu
    Copy the full SHA
    e258e50 View commit details
    Browse the repository at this point in the history

Commits on Apr 7, 2014

  1. SPARK-1387. Update build plugins, avoid plugin version warning, centr…

    …alize versions
    
    Another handful of small build changes to organize and standardize a bit, and avoid warnings:
    
    - Update Maven plugin versions for good measure
    - Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
    - Use variables to define versions across dependencies where they should move in lock step
    - ... and make this consistent between Maven/SBT
    
    OK, I also updated the JIRA URL while I was at it here.
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes apache#291 from srowen/SPARK-1387 and squashes the following commits:
    
    461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
    c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
    srowen authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    856c50f View commit details
    Browse the repository at this point in the history
  2. SPARK-1349: spark-shell gets its own command history

    Currently, spark-shell shares its command history with scala repl.
    
    This fix is simply a modification of the default FileBackedHistory file setting:
    https://github.com/scala/scala/blob/master/src/repl/scala/tools/nsc/interpreter/session/FileBackedHistory.scala#L77
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#267 from aarondav/repl and squashes the following commits:
    
    f9c62d2 [Aaron Davidson] SPARK-1349: spark-shell gets its own command history separate from scala repl
    aarondav authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    7ce52c4 View commit details
    Browse the repository at this point in the history
  3. SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging

    Previously, we based our decision about including datanucleus jars on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This means that a typical and previously supported pathway would start using hive jars.
    
    This patch has the following features/bug fixes:
    
    - Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
    - Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
    - assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
    - avoid adding log message in compute-classpath.sh to the classpath :)
    
    Still TODO before mergeable:
    - We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
    - Spark SQL documentation updates.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#237 from aarondav/master and squashes the following commits:
    
    5dc4329 [Aaron Davidson] Typo fixes
    dd4f298 [Aaron Davidson] Doc update
    dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
    a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
    aarondav authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    4106558 View commit details
    Browse the repository at this point in the history
  4. SPARK-1154: Clean up app folders in worker nodes

    This is a fix for [SPARK-1154](https://issues.apache.org/jira/browse/SPARK-1154).   The issue is that worker nodes fill up with a huge number of app-* folders after some time.  This change adds a periodic cleanup task which asynchronously deletes app directories older than a configurable TTL.
    
    Two new configuration parameters have been introduced:
      spark.worker.cleanup_interval
      spark.worker.app_data_ttl
    
    This change does not include moving the downloads of application jars to a location outside of the work directory.  We will address that if we have time, but that potentially involves caching so it will come either as part of this PR or a separate PR.
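
    A configuration sketch using the parameter names quoted above (the values are hypothetical, the units are an assumption, and the squash list below notes the keys were renamed during review, so check the committed docs for the final names):

    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.worker.cleanup_interval", "1800")  // run the cleanup task every 30 minutes
      .set("spark.worker.app_data_ttl", "604800")    // delete app dirs untouched for 7 days
    ```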
    
    Author: Evan Chan <ev@ooyala.com>
    Author: Kelvin Chu <kelvinkwchu@yahoo.com>
    
    Closes apache#288 from velvia/SPARK-1154-cleanup-app-folders and squashes the following commits:
    
    0689995 [Evan Chan] CR from @aarondav - move config, clarify for standalone mode
    9f10d96 [Evan Chan] CR from @pwendell - rename configs and add cleanup.enabled
    f2f6027 [Evan Chan] CR from @andrewor14
    553d8c2 [Kelvin Chu] change the variable name to currentTimeMillis since it actually tracks in seconds
    8dc9cb5 [Kelvin Chu] Fixed a bug in Utils.findOldFiles() after merge.
    cb52f2b [Kelvin Chu] Change the name of findOldestFiles() to findOldFiles()
    72f7d2d [Kelvin Chu] Fix a bug of Utils.findOldestFiles(). file.lastModified is returned in milliseconds.
    ad99955 [Kelvin Chu] Add unit test for Utils.findOldestFiles()
    dc1a311 [Evan Chan] Don't recompute current time with every new file
    e3c408e [Evan Chan] Document the two new settings
    b92752b [Evan Chan] SPARK-1154: Add a periodic task to clean up app directories
    Evan Chan authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    1440154 View commit details
    Browse the repository at this point in the history
  5. SPARK-1431: Allow merging conflicting pull requests

    Sometimes, if there is a small conflict, it's nice to be able to just
    manually fix it up rather than have another round trip (RTT) with the contributor.
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#342 from pwendell/merge-conflicts and squashes the following commits:
    
    cdce61a [Patrick Wendell] SPARK-1431: Allow merging conflicting pull requests
    pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    87d0928 View commit details
    Browse the repository at this point in the history
  6. [SQL] SPARK-1371 Hash Aggregation Improvements

    Given:
    ```scala
    case class Data(a: Int, b: Int)
    val rdd =
      sparkContext
        .parallelize(1 to 200)
        .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
    rdd.registerAsTable("data")
    cacheTable("data")
    ```
    Before:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    16795.567ms
    SELECT a, SUM(b) FROM data GROUP BY a
    7536.436ms
    SELECT SUM(b) FROM data
    10954.1ms
    ```
    
    After:
    ```
    SELECT COUNT(*) FROM data:[10000000]
    1372.175ms
    SELECT a, SUM(b) FROM data GROUP BY a
    2070.446ms
    SELECT SUM(b) FROM data
    958.969ms
    ```
    
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#295 from marmbrus/hashAgg and squashes the following commits:
    
    ec63575 [Michael Armbrust] Add comment.
    d0495a9 [Michael Armbrust] Use scaladoc instead.
    b4a6887 [Michael Armbrust] Address review comments.
    a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
    7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
    5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
    6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
    92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
    1279df2 [Michael Armbrust] Increase default number of partitions.
    marmbrus authored and rxin committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    accd099 View commit details
    Browse the repository at this point in the history
  7. [SQL] SPARK-1427 Fix toString for SchemaRDD NativeCommands.

    Author: Michael Armbrust <michael@databricks.com>
    
    Closes apache#343 from marmbrus/toStringFix and squashes the following commits:
    
    37198fe [Michael Armbrust] Fix toString for SchemaRDD NativeCommands.
    marmbrus authored and rxin committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    b5bae84 View commit details
    Browse the repository at this point in the history
  8. SPARK-1432: Make sure that all metadata fields are properly cleaned

    While working on spark-1337 with @pwendell, we noticed that not all of the metadata maps in JobProgessListener were being properly cleaned. This could lead to a (hypothetical) memory leak issue should a job run long enough. This patch aims to address the issue.
    
    Author: Davis Shepherd <davis@conviva.com>
    
    Closes apache#338 from dgshep/master and squashes the following commits:
    
    a77b65c [Davis Shepherd] In the contex of SPARK-1337: Make sure that all metadata fields are properly cleaned
    Davis Shepherd authored and pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    a3c51c6 View commit details
    Browse the repository at this point in the history
  9. [sql] Rename Expression.apply to eval for better readability.

    Also used this opportunity to add a bunch of overrides and make some members private.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#340 from rxin/eval and squashes the following commits:
    
    a7c7ca7 [Reynold Xin] Fixed conflicts in merge.
    9069de6 [Reynold Xin] Merge branch 'master' into eval
    3ccc313 [Reynold Xin] Merge branch 'master' into eval
    1a47e10 [Reynold Xin] Renamed apply to eval for generators and added a bunch of override's.
    ea061de [Reynold Xin] Rename Expression.apply to eval for better readability.
    rxin committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    83f2a2f View commit details
    Browse the repository at this point in the history
  10. SPARK-1252. On YARN, use container-log4j.properties for executors

    container-log4j.properties is a file that YARN provides so that containers can have log4j.properties distinct from that of the NodeManagers.
    
    Logs now go to syslog, and stderr and stdout just have the process's standard err and standard out.
    
    I tested this on pseudo-distributed clusters for both yarn (Hadoop 2.2) and yarn-alpha (Hadoop 0.23.7).
    
    Author: Sandy Ryza <sandy@cloudera.com>
    
    Closes apache#148 from sryza/sandy-spark-1252 and squashes the following commits:
    
    c0043b8 [Sandy Ryza] Put log4j.properties file under common
    55823da [Sandy Ryza] Add license headers to new files
    10934b8 [Sandy Ryza] Add log4j-spark-container.properties and support SPARK_LOG4J_CONF
    e74450b [Sandy Ryza] SPARK-1252. On YARN, use container-log4j.properties for executors
    sryza authored and tgravescs committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    9dd8b91 View commit details
    Browse the repository at this point in the history
  11. HOTFIX: Disable actor input stream test.

    This test makes incorrect assumptions about the behavior of Thread.sleep().
    
    Author: Patrick Wendell <pwendell@gmail.com>
    
    Closes apache#347 from pwendell/stream-tests and squashes the following commits:
    
    10e09e0 [Patrick Wendell] HOTFIX: Disable actor input stream.
    pwendell committed Apr 7, 2014
    Configuration menu
    Copy the full SHA
    2a2ca48 View commit details
    Browse the repository at this point in the history
  12. SPARK-1099: Introduce local[*] mode to infer number of cores

    This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.
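
    A sketch of the two modes side by side (`local[*]` sizes the scheduler to the machine's cores; plain `local` keeps the old single-core behavior):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    // Use all available cores -- the new default for spark-shell and pyspark.
    val scAll = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("demo"))

    // Exactly one core, as before, for code that relies on the old behavior.
    // val scOne = new SparkContext(new SparkConf().setMaster("local").setAppName("demo"))
    ```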
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#182 from aarondav/110 and squashes the following commits:
    
    a88294c [Aaron Davidson] Rebased changes for new spark-shell
    a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
    aarondav authored and pwendell committed Apr 7, 2014
    0307db0
  13. Remove outdated comment

    andrewor14 committed Apr 7, 2014
    c78c92d

Commits on Apr 8, 2014

  1. [sql] Rename execution/aggregates.scala to Aggregate.scala, and added a …

    …bunch of private[this] to variables.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#348 from rxin/aggregate and squashes the following commits:
    
    f4bc36f [Reynold Xin] Rename execution/aggregates.scala Aggregate.scala, and added a bunch of private[this] to variables.
    rxin committed Apr 8, 2014
    14c9238
  2. Removed the default eval implementation from Expression, and added a …

    …bunch of override's in classes I touched.
    
    It is more robust not to provide a default implementation for Expressions.
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#350 from rxin/eval-default and squashes the following commits:
    
    0a83b8f [Reynold Xin] Removed the default eval implementation from Expression, and added a bunch of override's in classes I touched.
    rxin committed Apr 8, 2014
    55dfd5d
  3. Added eval for Rand (without any support for user-defined seed).

    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#349 from rxin/rand and squashes the following commits:
    
    fd11322 [Reynold Xin] Added eval for Rand (without any support for user-defined seed).
    rxin committed Apr 8, 2014
    31e6fff
  4. Change timestamp cast semantics. When cast to numeric types, return t…

    …he unix time in seconds (instead of millis).
    
    @marmbrus @chenghao-intel
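
    A worked example of the new behavior, with an illustrative helper standing in for the cast:

    ```scala
    import java.sql.Timestamp

    object TimestampCastSketch {
      // New semantics: casting a timestamp to a numeric type yields Unix time
      // in seconds (fractional part preserved), not milliseconds.
      def toSeconds(t: Timestamp): Double = t.getTime / 1000.0

      def main(args: Array[String]): Unit = {
        val t = new Timestamp(1396915200500L) // 500 ms past a whole second
        println(toSeconds(t))                 // 1.3969152005E9 seconds, not 1.3969152005E12
      }
    }
    ```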
    
    Author: Reynold Xin <rxin@apache.org>
    
    Closes apache#352 from rxin/timestamp-cast and squashes the following commits:
    
    18aacd3 [Reynold Xin] Fixed precision for double.
    2adb235 [Reynold Xin] Change timestamp cast semantics. When cast to numeric types, return the unix time in seconds (instead of millis).
    rxin committed Apr 8, 2014
    f27e56a
  5. [SPARK-1402] Added 3 more compression schemes

    JIRA issue: [SPARK-1402](https://issues.apache.org/jira/browse/SPARK-1402)
    
    This PR provides 3 more compression schemes for Spark SQL in-memory columnar storage:
    
    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`
    
    Now there are 6 compression schemes in total, including the no-op `PassThrough` scheme.
    
    Also fixed a bug in PR apache#286: not all compression schemes were added as available schemes when accessing an in-memory column, so `ColumnAccessor` threw an exception when a column was compressed with an unrecognised scheme.
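
    A toy sketch of the delta-encoding idea behind a scheme like `IntDelta`: store each value as the difference from its predecessor, so slowly-varying columns compress to small numbers. The real codecs work on byte buffers and fall back to the raw value when a delta overflows; this shows only the core idea:

    ```scala
    object IntDeltaSketch {
      def encode(values: Array[Int]): Array[Int] = {
        var prev = 0
        values.map { v => val d = v - prev; prev = v; d }
      }

      def decode(deltas: Array[Int]): Array[Int] = {
        var acc = 0
        deltas.map { d => acc += d; acc }
      }

      def main(args: Array[String]): Unit = {
        val col = Array(1000, 1001, 1003, 1006)
        val enc = encode(col)                  // Array(1000, 1, 2, 3): mostly tiny values
        println(decode(enc).sameElements(col)) // true: lossless round trip
      }
    }
    ```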
    
    Author: Cheng Lian <lian.cs.zju@gmail.com>
    
    Closes apache#330 from liancheng/moreCompressionSchemes and squashes the following commits:
    
    1d037b8 [Cheng Lian] Fixed SPARK-1436: in-memory column byte buffer must be able to be accessed multiple times
    d7c0e8f [Cheng Lian] Added test suite for IntegralDelta (IntDelta & LongDelta)
    3c1ad7a [Cheng Lian] Added test suite for BooleanBitSet, refactored other test suites
    44fe4b2 [Cheng Lian] Refactored CompressionScheme, added 3 more compression schemes.
    liancheng authored and rxin committed Apr 8, 2014
    0d0493f
  6. [SPARK-1103] Automatic garbage collection of RDD, shuffle and broadca…

    …st data
    
    This PR allows Spark to automatically clean up metadata and data related to persisted RDDs, shuffles and broadcast variables when the corresponding RDDs, shuffles and broadcast variables fall out of scope from the driver program. This is still a work in progress as broadcast cleanup has not been implemented.
    
    **Implementation Details**
    A new class `ContextCleaner` is responsible for cleaning up all of this state. It is instantiated as part of a `SparkContext`. The RDD and ShuffleDependency classes have an overridden `finalize()` method that gets called whenever their instances go out of scope. The `finalize()` method enqueues the object’s identifier (i.e. RDD ID, shuffle ID, etc.) with the `ContextCleaner`, which is a very short and cheap operation and should not significantly affect the garbage collection mechanism. The `ContextCleaner`, on a different thread, performs the cleanup, whose details are given below.
    
    *RDD cleanup:*
    `ContextCleaner` calls `RDD.unpersist()` to clean up persisted RDDs. Regarding metadata, the DAGScheduler automatically cleans up all metadata related to an RDD after all jobs have completed. Only `SparkContext.persistentRDDs` keeps strong references to persisted RDDs. The `TimeStampedHashMap` used for that has been replaced by a `TimeStampedWeakValueHashMap`, which keeps only weak references to the RDDs, allowing them to be garbage collected.
    
    *Shuffle cleanup:*
    A new BlockManager message, `RemoveShuffle(<shuffle ID>)`, asks the `BlockManagerMaster` and all currently active `BlockManager`s to delete the disk blocks related to the shuffle ID. `ContextCleaner` cleans up shuffle data using this message and also cleans up the metadata in the `MapOutputTracker` of the driver. The `MapOutputTracker` at the workers, which caches the shuffle metadata, maintains a `BoundedHashMap` to limit the amount of shuffle information it caches. Refetching the shuffle information from the driver is not too costly.
    
    *Broadcast cleanup:*
    To be done. [This PR](https://github.com/apache/incubator-spark/pull/543/) adds a mechanism for explicit cleanup of broadcast variables. `Broadcast.finalize()` will enqueue its own ID with the ContextCleaner, and that PR’s mechanism will be used to unpersist the broadcast data.
    
    *Other cleanup:*
    `ShuffleMapTask` and `ResultTask` cached tasks and used TTL-based cleanup (via `TimeStampedHashMap`), so nothing was cleaned up if the TTL was not set. Instead, they now use a `BoundedHashMap` to keep a limited amount of map output information. The cost of repopulating the cache when necessary is very small.
    
    **Current state of implementation**
    Implemented RDD and shuffle cleanup. Things left to be done:
    - Cleanup of broadcast variables.
    - Automatic cleanup of keys with empty weak refs as values in `TimeStampedWeakValueHashMap`
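
    The changelog below notes that the finalizer approach was later replaced with a `ReferenceQueue`. A minimal standalone sketch of that pattern (names are illustrative, not Spark's actual ContextCleaner):

    ```scala
    import java.lang.ref.{ReferenceQueue, WeakReference}
    import java.util.concurrent.ConcurrentHashMap

    object CleanerSketch {
      sealed trait CleanupTask
      final case class CleanRDD(rddId: Int) extends CleanupTask

      private val queue = new ReferenceQueue[AnyRef]
      // Hold strong refs to the WeakReferences themselves (not their referents),
      // otherwise the references could be collected before being enqueued.
      private val refs = new ConcurrentHashMap[WeakReference[AnyRef], CleanupTask]

      def registerForCleanup(obj: AnyRef, task: CleanupTask): Unit =
        refs.put(new WeakReference(obj, queue), task)

      private val cleaner = new Thread {
        setDaemon(true) // never keeps the driver JVM alive
        override def run(): Unit = while (true) {
          val ref = queue.remove() // blocks until some referent is garbage collected
          Option(refs.remove(ref)).foreach {
            case CleanRDD(id) => println(s"unpersisting RDD $id") // the real cleaner calls RDD.unpersist()
          }
        }
      }
      cleaner.start()
    }
    ```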
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    Author: Andrew Or <andrewor14@gmail.com>
    Author: Roman Pastukhov <ignatich@mail.ru>
    
    Closes apache#126 from tdas/state-cleanup and squashes the following commits:
    
    61b8d6e [Tathagata Das] Fixed issue with Tachyon + new BlockManager methods.
    f489fdc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    d25a86e [Tathagata Das] Fixed stupid typo.
    cff023c [Tathagata Das] Fixed issues based on Andrew's comments.
    4d05314 [Tathagata Das] Scala style fix.
    2b95b5e [Tathagata Das] Added more documentation on Broadcast implementations, specially which blocks are told about to the driver. Also, fixed Broadcast API to hide destroy functionality.
    41c9ece [Tathagata Das] Added more unit tests for BlockManager, DiskBlockManager, and ContextCleaner.
    6222697 [Tathagata Das] Fixed bug and adding unit test for removeBroadcast in BlockManagerSuite.
    104a89a [Tathagata Das] Fixed failing BroadcastSuite unit tests by introducing blocking for removeShuffle and removeBroadcast in BlockManager*
    a430f06 [Tathagata Das] Fixed compilation errors.
    b27f8e8 [Tathagata Das] Merge pull request #3 from andrewor14/cleanup
    cd72d19 [Andrew Or] Make automatic cleanup configurable (not documented)
    ada45f0 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup
    a2cc8bc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    c5b1d98 [Andrew Or] Address Patrick's comments
    a6460d4 [Andrew Or] Merge github.com:apache/spark into cleanup
    762a4d8 [Tathagata Das] Merge pull request #1 from andrewor14/cleanup
    f0aabb1 [Andrew Or] Correct semantics for TimeStampedWeakValueHashMap + add tests
    5016375 [Andrew Or] Address TD's comments
    7ed72fb [Andrew Or] Fix style test fail + remove verbose test message regarding broadcast
    634a097 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup
    7edbc98 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into state-cleanup
    8557c12 [Andrew Or] Merge github.com:apache/spark into cleanup
    e442246 [Andrew Or] Merge github.com:apache/spark into cleanup
    88904a3 [Andrew Or] Make TimeStampedWeakValueHashMap a wrapper of TimeStampedHashMap
    fbfeec8 [Andrew Or] Add functionality to query executors for their local BlockStatuses
    34f436f [Andrew Or] Generalize BroadcastBlockId to remove BroadcastHelperBlockId
    0d17060 [Andrew Or] Import, comments, and style fixes (minor)
    c92e4d9 [Andrew Or] Merge github.com:apache/spark into cleanup
    f201a8d [Andrew Or] Test broadcast cleanup in ContextCleanerSuite + remove BoundedHashMap
    e95479c [Andrew Or] Add tests for unpersisting broadcast
    544ac86 [Andrew Or] Clean up broadcast blocks through BlockManager*
    d0edef3 [Andrew Or] Add framework for broadcast cleanup
    ba52e00 [Andrew Or] Refactor broadcast classes
    c7ccef1 [Andrew Or] Merge branch 'bc-unpersist-merge' of github.com:ignatich/incubator-spark into cleanup
    6c9dcf6 [Tathagata Das] Added missing Apache license
    d2f8b97 [Tathagata Das] Removed duplicate unpersistRDD.
    a007307 [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    620eca3 [Tathagata Das] Changes based on PR comments.
    f2881fd [Tathagata Das] Changed ContextCleaner to use ReferenceQueue instead of finalizer
    e1fba5f [Tathagata Das] Style fix
    892b952 [Tathagata Das] Removed use of BoundedHashMap, and made BlockManagerSlaveActor cleanup shuffle metadata in MapOutputTrackerWorker.
    a7260d3 [Tathagata Das] Added try-catch in context cleaner and null value cleaning in TimeStampedWeakValueHashMap.
    e61daa0 [Tathagata Das] Modifications based on the comments on PR 126.
    ae9da88 [Tathagata Das] Removed unnecessary TimeStampedHashMap from DAGScheduler, added try-catches in finalize() methods, and replaced ArrayBlockingQueue with LinkedBlockingQueue to avoid blocking in Java's finalizing thread.
    cb0a5a6 [Tathagata Das] Fixed docs and styles.
    a24fefc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
    8512612 [Tathagata Das] Changed TimeStampedHashMap to use WrappedJavaHashMap.
    e427a9e [Tathagata Das] Added ContextCleaner to automatically clean RDDs and shuffles when they fall out of scope. Also replaced TimeStampedHashMap to BoundedHashMaps and TimeStampedWeakValueHashMap for the necessary hashmap behavior.
    80dd977 [Roman Pastukhov] Fix for Broadcast unpersist patch.
    1e752f1 [Roman Pastukhov] Added unpersist method to Broadcast.
    tdas authored and pwendell committed Apr 8, 2014
    11eabbe
  7. [SPARK-1331] Added graceful shutdown to Spark Streaming

    The current version of StreamingContext.stop() directly kills all the data receivers (NetworkReceiver) without waiting for the data already received to be persisted and processed. This PR provides the fix. Now, when StreamingContext.stop() is called, the following sequence of steps will happen.
    1. The driver will send a stop signal to all the active receivers.
    2. Each receiver, when it gets a stop signal from the driver, first stops receiving more data, then waits for the thread that persists data blocks to the BlockManager to finish persisting all received data, and finally quits.
    3. After all the receivers have stopped, the driver will wait for the Job Generator and Job Scheduler to finish processing all the received data.
    
    It also fixes the semantics of StreamingContext.start and stop. It will throw appropriate errors and log warnings if stop() is called before start(), if stop() is called twice, and so on.
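
    A condensed sketch of that sequencing, using simplified illustrative traits in place of the real receiver and scheduler classes:

    ```scala
    object GracefulStopSketch {
      trait Receiver {
        def stopReceiving(): Unit     // step 1: stop accepting new data
        def awaitBlocksPushed(): Unit // step 2: wait for buffered data to reach the BlockManager
      }

      trait JobScheduler {
        def awaitReceivedDataProcessed(): Unit // step 3: drain jobs over already-received data
        def stop(): Unit
      }

      def stopGracefully(receivers: Seq[Receiver], scheduler: JobScheduler): Unit = {
        receivers.foreach(_.stopReceiving())
        receivers.foreach(_.awaitBlocksPushed())
        scheduler.awaitReceivedDataProcessed()
        scheduler.stop()
      }
    }
    ```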
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes apache#247 from tdas/graceful-shutdown and squashes the following commits:
    
    61c0016 [Tathagata Das] Updated MIMA binary check excludes.
    ae1d39b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into graceful-shutdown
    6b59cfc [Tathagata Das] Minor changes based on Andrew's comment on PR.
    d0b8d65 [Tathagata Das] Reduced time taken by graceful shutdown unit test.
    f55bc67 [Tathagata Das] Fix scalastyle
    c69b3a7 [Tathagata Das] Updates based on Patrick's comments.
    c43b8ae [Tathagata Das] Added graceful shutdown to Spark Streaming.
    tdas authored and pwendell committed Apr 8, 2014
    83ac9a4
  8. [SPARK-1396] Properly cleanup DAGScheduler on job cancellation.

    Previously, when jobs were cancelled, not all of the state in the
    DAGScheduler was cleaned up, leading to a slow memory leak in the
    DAGScheduler.  As we expose easier ways to cancel jobs, it's more
    important to fix these issues.
    
    This commit also fixes a second and less serious problem, which is that
    previously, when a stage failed, not all of the appropriate stages
    were cancelled.  See the "failure of stage used by two jobs" test
    for an example of this.  This just meant that extra work was done, and is
    not a correctness problem.
    
    This commit adds 3 tests.  “run shuffle with map stage failure” is
    a new test to more thoroughly test this functionality, and passes on
    both the old and new versions of the code.  “trivial job
    cancellation” fails on the old code because all state wasn’t cleaned
    up correctly when jobs were cancelled (we didn’t remove the job from
    resultStageToJob).  “failure of stage used by two jobs” fails on the
    old code because taskScheduler.cancelTasks wasn’t called for one of
    the stages (see test comments).
    
    This should be checked in before apache#246, which makes it easier to
    cancel stages / jobs.
    
    Author: Kay Ousterhout <kayousterhout@gmail.com>
    
    Closes apache#305 from kayousterhout/incremental_abort_fix and squashes the following commits:
    
    f33d844 [Kay Ousterhout] Mark review comments
    9217080 [Kay Ousterhout] Properly cleanup DAGScheduler on job cancellation.
    kayousterhout committed Apr 8, 2014
    6dc5f58
  9. Remove extra semicolon in import statement and unused import in Appli…

    …cationMaster
    
    Small nit cleanup to remove an extra semicolon and an unused import in Yarn's stable ApplicationMaster (it bothers me every time I see it).
    
    Author: Henry Saputra <hsaputra@apache.org>
    
    Closes apache#358 from hsaputra/nitcleanup_removesemicolon_import_applicationmaster and squashes the following commits:
    
    bffb685 [Henry Saputra] Remove extra semicolon in import statement and unused import in ApplicationMaster.scala
    hsaputra authored and rxin committed Apr 8, 2014
    3bc0548
  10. SPARK-1348 binding Master, Worker, and App Web UI to all interfaces

    Author: Kan Zhang <kzhang@apache.org>
    
    Closes apache#318 from kanzhang/SPARK-1348 and squashes the following commits:
    
    e625a5f [Kan Zhang] reverting the changes to startJettyServer()
    7a8084e [Kan Zhang] SPARK-1348 binding Master, Worker, and App Web UI to all interfaces
    kanzhang authored and pwendell committed Apr 8, 2014
    a8d86b0
  11. SPARK-1445: compute-classpath should not print error if lib_managed n…

    …ot found
    
    This was added to the check for the assembly jar but forgotten for the datanucleus jars.
    
    Author: Aaron Davidson <aaron@databricks.com>
    
    Closes apache#361 from aarondav/cc and squashes the following commits:
    
    8facc16 [Aaron Davidson] SPARK-1445: compute-classpath should not print error if lib_managed not found
    aarondav authored and pwendell committed Apr 8, 2014
    e25b593
  12. [SPARK-1397] Notify SparkListeners when stages fail or are cancelled.

    [I wanted to post this for folks to comment but it depends on (and thus includes the changes in) a currently outstanding PR, apache#305.  You can look at just the second commit: kayousterhout@93f08ba to see just the changes relevant to this PR]
    
    Previously, when stages fail or get cancelled, the SparkListener is only notified
    indirectly through the SparkListenerJobEnd, where we sometimes pass in a single
    stage that failed.  This worked before job cancellation, because jobs would only fail
    due to a single stage failure.  However, with job cancellation, multiple running stages
    can fail when a job gets cancelled.  Right now, this is not handled correctly, which
    results in stages that get stuck in the “Running Stages” window in the UI even
    though they’re dead.
    
    This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded
    event, and uses this event to tell SparkListeners when stages fail in addition to when
    they complete successfully.  This change is NOT publicly backward compatible for two
    reasons.  First, it changes the SparkListener interface.  We could alternately add a new event,
    SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted.  However,
    this is less consistent with the listener events for tasks / jobs ending, and will result in some
    code duplication for listeners (because failed and completed stages are handled in similar
    ways).  Note that I haven’t finished updating the JSON code to correctly handle the new event
    because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”).
    
    It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed()
    method to no longer include a stage that caused the failure.  I think this change should definitely
    stay, because with cancellation (as described above), a failure isn’t necessarily caused by a
    single stage.
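
    A minimal sketch of the listener-side effect of this change, using illustrative types rather than the final Spark API: one stage-ended event carries an optional failure reason, so the UI can retire failed and cancelled stages too:

    ```scala
    object StageEventSketch {
      final case class StageEnded(stageId: Int, failureReason: Option[String])

      trait Listener {
        def onStageEnded(event: StageEnded): Unit
      }

      object UiListener extends Listener {
        override def onStageEnded(event: StageEnded): Unit = event.failureReason match {
          case None         => println(s"stage ${event.stageId} completed")
          case Some(reason) => println(s"stage ${event.stageId} failed: $reason") // drop from "Running Stages"
        }
      }

      def main(args: Array[String]): Unit = {
        UiListener.onStageEnded(StageEnded(3, None))
        UiListener.onStageEnded(StageEnded(4, Some("job cancelled")))
      }
    }
    ```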
    
    Author: Kay Ousterhout <kayousterhout@gmail.com>
    
    Closes apache#309 from kayousterhout/stage_cancellation and squashes the following commits:
    
    5533ecd [Kay Ousterhout] Fixes in response to Mark's review
    320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.
    kayousterhout authored and pwendell committed Apr 8, 2014
    fac6085
  13. SPARK-1433: Upgrade Mesos dependency to 0.17.0

    Mesos 0.13.0 was released 6 months ago.
    Upgrade Mesos dependency to 0.17.0
    
    Author: Sandeep <sandeep@techaddict.me>
    
    Closes apache#355 from techaddict/mesos_update and squashes the following commits:
    
    f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
    techaddict authored and pwendell committed Apr 8, 2014
    12c077d

Commits on Apr 9, 2014

  1. Spark 1271: Co-Group and Group-By should pass Iterable[X]
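
    What the change means at a call site: grouped values now arrive as `Iterable[V]` rather than `Seq[V]`. A minimal sketch, assuming a Spark version with the pair-RDD implicits in scope:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object GroupByIterableSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
        // Grouped values are an Iterable, not a fully materialized Seq.
        val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()
        grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.sum}") }
        sc.stop()
      }
    }
    ```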

    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes apache#242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:
    
    f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
    77048f8 [Holden Karau] Fix merge up to master
    d3fe909 [Holden Karau] use toSeq instead
    7a092a3 [Holden Karau] switch resultitr to resultiterable
    eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
    c5075aa [Holden Karau] If guava 14 had iterables
    2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
    11e730c [Holden Karau] Fix streaming tests
    66b583d [Holden Karau] Fix the core test suite to compile
    4ed579b [Holden Karau] Refactor from iterator to iterable
    d052c07 [Holden Karau] Python tests now pass with iterator pandas
    3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
    cd1e81c [Holden Karau] Try and make pickling list iterators work
    c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
    88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
    a5ee714 [Holden Karau] oops, was checking wrong iterator
    e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
    ec8cc3e [Holden Karau] Fix test issues!
    4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
    fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
    ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
    b692868 [Holden Karau] Revert
    7e533f7 [Holden Karau] Fix the bug
    8a5153a [Holden Karau] Revert me, but we have some stuff to debug
    b4e86a9 [Holden Karau] Add a join based on the problem in SVD
    c4510e2 [Holden Karau] Revert this but for now put things in list pandas
    b4e0b1d [Holden Karau] Fix style issues
    71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
    b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
    37888ec [Holden Karau] core/tests now pass
    249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
    6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
    fe992fe [Holden Karau] hmmm try and fix up basic operation suite
    172705c [Holden Karau] Fix Java API suite
    caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
    88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
    4991af6 [Holden Karau] Fix some tests
    be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
    687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
    holdenk authored and pwendell committed Apr 9, 2014
    ce8ec54
  2. [SPARK-1434] [MLLIB] change labelParser from anonymous function to trait

    This is a patch to address @mateiz's comment in apache#245.
    
    MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like that, so I made a LabelParser trait and provided two implementations: binary and multiclass.
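
    A minimal sketch of the trait-based design; the object names follow the changelog below ("use singleton objects for label parsers") but are illustrative rather than the exact MLlib API:

    ```scala
    object LabelParserSketch {
      trait LabelParser extends Serializable {
        def parse(labelString: String): Double
      }

      // Maps any positive label to 1.0 and everything else to 0.0.
      object BinaryLabelParser extends LabelParser {
        override def parse(labelString: String): Double =
          if (labelString.toDouble > 0) 1.0 else 0.0
      }

      // Uses the label verbatim as a class index.
      object MulticlassLabelParser extends LabelParser {
        override def parse(labelString: String): Double = labelString.toDouble
      }

      def main(args: Array[String]): Unit = {
        println(BinaryLabelParser.parse("-1"))    // 0.0
        println(MulticlassLabelParser.parse("3")) // 3.0
      }
    }
    ```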
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#345 from mengxr/label-parser and squashes the following commits:
    
    ac44409 [Xiangrui Meng] use singleton objects for label parsers
    3b1a7c6 [Xiangrui Meng] add tests for label parsers
    c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton
    11c94e0 [Xiangrui Meng] add return types
    7f8eb36 [Xiangrui Meng] change labelParser from anonymous function to trait
    mengxr authored and pwendell committed Apr 9, 2014
    b9e0c93
  3. Spark-939: allow user jars to take precedence over spark jars

    I still need to do a small bit of refactoring (mostly the one Java file, which I'll switch back to a Scala file and use in both of the class loaders), but comments on other things I should do would be great.
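
    The core trick is a child-first class loader that consults the user's jars before delegating to the parent. A generic illustrative sketch (not Spark's actual executor class loader):

    ```scala
    import java.net.{URL, URLClassLoader}

    // Passing null as the URLClassLoader parent means lookups hit the
    // bootstrap loader and then this loader's own URLs (the user jars).
    class ChildFirstClassLoader(userJars: Array[URL], sparkLoader: ClassLoader)
      extends URLClassLoader(userJars, null) {

      override def loadClass(name: String, resolve: Boolean): Class[_] =
        try {
          super.loadClass(name, resolve) // user jars take precedence
        } catch {
          case _: ClassNotFoundException =>
            sparkLoader.loadClass(name)  // fall back to Spark's classpath
        }
    }
    ```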
    
    Author: Holden Karau <holden@pigscanfly.ca>
    
    Closes apache#217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits:
    
    cf0cac9 [Holden Karau] Fix the executorclassloader
    1955232 [Holden Karau] Fix long line in TestUtils
    8f89965 [Holden Karau] Fix tests for new class name
    7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader
    644719f [Holden Karau] User the class generator for the repl class loader tests too
    f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests
    204b199 [Holden Karau] Fix the generated classes
    9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes
    858aba2 [Holden Karau] Remove a bunch of test junk
    261aaee [Holden Karau] simplify executorurlclassloader a bit
    7a7bf5f [Holden Karau] CR feedback
    d4ae848 [Holden Karau] rewrite component into scala
    aa95083 [Holden Karau] CR feedback
    7752594 [Holden Karau] re-add https comment
    a0ef85a [Holden Karau] Fix style issues
    125ea7f [Holden Karau] Easier to just remove those files, we don't need them
    bb8d179 [Holden Karau] Fix issues with the repl class loader
    241b03d [Holden Karau] fix my rat excludes
    a343350 [Holden Karau] Update rat-excludes and remove a useless file
    d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it
    4919bf9 [Holden Karau] Fix parent calling class loader issue
    8a67302 [Holden Karau] Test are good
    9e2d236 [Holden Karau] It works comrade
    691ee00 [Holden Karau] It works ish
    dc4fe44 [Holden Karau] Does not depend on being in my home directory
    47046ff [Holden Karau] Remove bad import
    22d83cb [Holden Karau] Add a test suite for the executor url class loader suite
    7ef4628 [Holden Karau] Clean up
    792d961 [Holden Karau] Almost works
    16aecd1 [Holden Karau] Doesn't quite work
    8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options
    648b559 [Holden Karau] Both class loaders compile. Now for testing
    e1d9f71 [Holden Karau] One loader workers.
    holdenk authored and pwendell committed Apr 9, 2014
    fa0524f
  4. [SPARK-1390] Refactoring of matrices backed by RDDs

    This is to refactor the interfaces for matrices backed by RDDs. It would be better if we had a clear separation between local matrices and those backed by RDDs. Right now, we have
    
    1. `org.apache.spark.mllib.linalg.SparseMatrix`, which is a wrapper over an RDD of matrix entries, i.e., coordinate list format.
    2. `org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix`, which is a wrapper over RDD[Array[Double]], i.e. row-oriented format.
    
    We will see a naming collision when we introduce a local `SparseMatrix`, and the name `TallSkinnyDenseMatrix` is no longer exact if we switch from `RDD[Array[Double]]` to `RDD[Vector]`. It would be better to have "RDD" in the class name to suggest that operations may trigger jobs.
    
    The proposed names are (all under `org.apache.spark.mllib.linalg.rdd`):
    
    1. `RDDMatrix`: trait for matrices backed by one or more RDDs
    2. `CoordinateRDDMatrix`: wrapper of `RDD[(Long, Long, Double)]`
    3. `RowRDDMatrix`: wrapper of `RDD[Vector]` whose rows do not have special ordering
    4. `IndexedRowRDDMatrix`: wrapper of `RDD[(Long, Vector)]` whose rows are associated with indices
    
    The current code also introduces local matrices.
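
    A condensed sketch of the proposed shape, with `Seq` standing in for `RDD` so it is self-contained; the names follow the list above but the bodies are illustrative:

    ```scala
    object RDDMatrixSketch {
      final case class MatrixEntry(i: Long, j: Long, value: Double)
      type Vector = Array[Double]

      trait RDDMatrix {
        def numRows: Long
        def numCols: Long
      }

      // Coordinate-list format: one entry per nonzero element.
      class CoordinateRDDMatrix(entries: Seq[MatrixEntry]) extends RDDMatrix {
        def numRows: Long = entries.map(_.i).max + 1
        def numCols: Long = entries.map(_.j).max + 1
      }

      // Row-oriented format whose rows have no special ordering.
      class RowRDDMatrix(rows: Seq[Vector]) extends RDDMatrix {
        def numRows: Long = rows.size.toLong
        def numCols: Long = rows.headOption.map(_.length.toLong).getOrElse(0L)
      }
    }
    ```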
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#296 from mengxr/mat and squashes the following commits:
    
    24d8294 [Xiangrui Meng] fix for groupBy returning Iterable
    bfc2b26 [Xiangrui Meng] merge master
    8e4f1f5 [Xiangrui Meng] Merge branch 'master' into mat
    0135193 [Xiangrui Meng] address Reza's comments
    03cd7e1 [Xiangrui Meng] add pca/gram to IndexedRowMatrix add toBreeze to DistributedMatrix for test simplify tests
    b177ff1 [Xiangrui Meng] address Matei's comments
    be119fe [Xiangrui Meng] rename m/n to numRows/numCols for local matrix add tests for matrices
    b881506 [Xiangrui Meng] rename SparkPCA/SVD to TallSkinnyPCA/SVD
    e7d0d4a [Xiangrui Meng] move IndexedRDDMatrixRow to IndexedRowRDDMatrix
    0d1491c [Xiangrui Meng] fix test errors
    a85262a [Xiangrui Meng] rename RDDMatrixRow to IndexedRDDMatrixRow
    b8b6ac3 [Xiangrui Meng] Remove old code
    4cf679c [Xiangrui Meng] port pca to RowRDDMatrix, and add multiply and covariance
    7836e2f [Xiangrui Meng] initial refactoring of matrices backed by RDDs
    mengxr authored and pwendell committed Apr 9, 2014
    9689b66
  5. SPARK-1093: Annotate developer and experimental API's

    This patch marks some existing classes as private[spark] and adds two types of API annotations:
    - `EXPERIMENTAL API` = experimental user-facing module
    - `DEVELOPER API - UNSTABLE` = developer-facing API that might change
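
    Per the changelog below, these labels ended up as Scala annotations under `org.apache.spark.annotation`; a usage sketch with illustrative class names:

    ```scala
    import org.apache.spark.annotation.{DeveloperApi, Experimental}

    @Experimental // experimental user-facing module: may change or be removed in a minor release
    class ApproxCounter {
      def approxTotal: Long = 0L
    }

    @DeveloperApi // developer-facing API: exposed for advanced users, no stability guarantee
    trait CustomMetricsSource {
      def sourceName: String
    }
    ```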
    
    There is some discussion of the different mechanisms for doing this here:
    https://issues.apache.org/jira/browse/SPARK-1081
    
    I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility.
    
    A few notes here:
    - In the past we've been inconsistent with the visibility of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly create them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is that users can't subclass them.
    - Noted that compression and serialization formats don't have to be wire compatible across versions.
    - Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly.
    - Metrics sources are made private; users only interact with them through Spark's reflection
    
    Author: Patrick Wendell <pwendell@gmail.com>
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#274 from pwendell/private-apis and squashes the following commits:
    
    44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis
    042c803 [Patrick Wendell] spark.annotations -> spark.annotation
    bfe7b52 [Patrick Wendell] Adding experimental for approximate counts
    8d0c873 [Patrick Wendell] Warning in SparkEnv
    99b223a [Patrick Wendell] Cleaning up annotations
    e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations
    982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs
    a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations
    c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi
    0d48908 [Andrew Or] Comments and new lines (minor)
    f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug
    99192ef [Andrew Or] Dynamically add badges based on annotations
    824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs
    037755c [Patrick Wendell] Some changes after working with andrew or
    f7d124f [Patrick Wendell] Small fixes
    c318b24 [Patrick Wendell] Use CSS styles
    e4c76b9 [Patrick Wendell] Logging
    f390b13 [Patrick Wendell] Better visibility for workaround constructors
    d6b0afd [Patrick Wendell] Small change to existing constructor
    403ba52 [Patrick Wendell] Style fix
    870a7ba [Patrick Wendell] Work around for SI-8479
    7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD
    4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL
    c581dce [Patrick Wendell] Changes after building against Shark.
    8452309 [Patrick Wendell] Style fixes
    1ed27d2 [Patrick Wendell] Formatting and coloring of badges
    cd7a465 [Patrick Wendell] Code review feedback
    2f706f1 [Patrick Wendell] Don't use floats
    542a736 [Patrick Wendell] Small fixes
    cf23ec6 [Patrick Wendell] Marking GraphX as alpha
    d86818e [Patrick Wendell] Another naming change
    5a76ed6 [Patrick Wendell] More visibility clean-up
    42c1f09 [Patrick Wendell] Using better labels
    9d48cbf [Patrick Wendell] Initial pass
    pwendell committed Apr 9, 2014
    87bd1f9
  6. [SPARK-1357] [MLLIB] Annotate developer and experimental APIs

    Annotate developer and experimental APIs in MLlib.
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#298 from mengxr/api and squashes the following commits:
    
    13390e8 [Xiangrui Meng] Merge branch 'master' into api
    dc4cbb3 [Xiangrui Meng] mark distribute matrices experimental
    6b9f8e2 [Xiangrui Meng] add Experimental annotation
    8773d0d [Xiangrui Meng] add DeveloperApi annotation
    da31733 [Xiangrui Meng] update developer and experimental tags
    555e0fe [Xiangrui Meng] Merge branch 'master' into api
    ef1a717 [Xiangrui Meng] mark some constructors private add default parameters to JavaDoc
    00ffbcc [Xiangrui Meng] update tree API annotation
    0b674fa [Xiangrui Meng] mark decision tree APIs
    86b9e34 [Xiangrui Meng] one pass over APIs of GLMs, NaiveBayes, and ALS
    f21d862 [Xiangrui Meng] Merge branch 'master' into api
    2b133d6 [Xiangrui Meng] intial annotation of developer and experimental apis
    mengxr authored and pwendell committed Apr 9, 2014
    bde9cc1
  7. SPARK-1407 drain event queue before stopping event logger

    Author: Kan Zhang <kzhang@apache.org>
    
    Closes apache#366 from kanzhang/SPARK-1407 and squashes the following commits:
    
    cd0629f [Kan Zhang] code refactoring and adding test
    b073ee6 [Kan Zhang] SPARK-1407 drain event queue before stopping event logger
    kanzhang authored and pwendell committed Apr 9, 2014
    eb5f2b6

Commits on Apr 10, 2014

  1. [SPARK-1357 (fix)] remove empty line after :: DeveloperApi/Experiment…

    …al ::
    
    Remove the empty line after :: DeveloperApi/Experimental :: in comments to make the original doc show up in the preview of the generated HTML docs. Thanks @andrewor14!
    
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes apache#373 from mengxr/api and squashes the following commits:
    
    9c35bdc [Xiangrui Meng] remove the empty line after :: DeveloperApi/Experimental ::
    mengxr authored and pwendell committed Apr 10, 2014
    0adc932
  2. SPARK-729: Closures not always serialized at capture time

    [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured.  Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created).  There are a few possible approaches to solving this problem and this PR will discuss some of them.  The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing.
    
    ## What I did
    
    The basic approach I took depends on the work I did for apache#143, and so this PR is based atop that.  Specifically: apache#143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for a job involving that closure to be scheduled).  Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than by the line where the closure is used.
    
    Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in apache#143 creates a serialized closure as part of `ClosureCleaner.clean`.  `clean` currently modifies its argument, but the method in `SparkContext` that wraps it returns a value (a reference to the modified-in-place argument).  This branch modifies `ClosureCleaner.clean` so that it returns a value:  if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument; therefore it is returning a closure with an environment captured at cleaning time.  `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument.
    
    I've added tests for this behavior (777a1bc).  The pull request as it stands, given the changes in apache#143, is nearly trivial.  There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important.  I think this is a fine way to solve this problem, but it's not perfect.
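
    A standalone illustration of the serialize-then-deserialize trick: round-tripping a closure through Java serialization snapshots its free variables at cleaning time (the `clean` helper is illustrative, not ClosureCleaner itself):

    ```scala
    import java.io._

    object CaptureSketch {
      def clean[F <: AnyRef](f: F): F = {
        val buf = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buf)
        out.writeObject(f) // also fails fast if the closure is not serializable
        out.close()
        val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
        in.readObject().asInstanceOf[F] // environment captured as of this call
      }

      def main(args: Array[String]): Unit = {
        var factor = 2
        val times: Int => Int = x => x * factor
        val cleaned = clean(times)
        factor = 10
        println(times(3))   // 30: the original closure sees the mutated variable
        println(cleaned(3)) // 6: the cleaned copy kept the value captured by clean()
      }
    }
    ```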
    
    ## What we might want to do
    
    The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure.  If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later.  By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches.  Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like:
    
    ```scala
    package org.apache.spark.util
    
    import java.nio.ByteBuffer
    
    // A closure is in exactly one of three states, encoded in its type.
    sealed abstract class ClosureBox[T] { def func: T }
    // Freshly captured; not yet cleaned.
    final case class RawClosure[T](func: T) extends ClosureBox[T]
    // Cleaned, but not yet checked for serializability.
    final case class CleanedClosure[T](func: T) extends ClosureBox[T]
    // Cleaned and serialized; the buffer snapshots the captured environment.
    final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T]
    
    object ClosureBoxImplicits {
      implicit def closureBoxFromFunc[T <: AnyRef](fun: T): RawClosure[T] = new RawClosure[T](fun)
    }
    ```
    
    With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`.  A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer).  Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time).  Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate.  Tracking this sort of state in the type system seems like the right thing to do to me.
    
    ### Why we might not want to do that
    
    _It's pretty invasive._  Every function type used by every `RDD` subclass would have to change to reflect that they expected a `ClosureBox[T=>U]` instead of a `T=>U`.  This obscures what's going on and is not a little ugly.  Although I really like the idea of using the type system to enforce the clean-or-serialize once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions).
    
    _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._  It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future.  Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal.
    
    Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once.
    
    ## Other possibilities
    
    I felt like the serialize-and-deserialize approach was best due to its obvious simplicity.  But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`.  It might also be possible for `clean` to modify its argument in a way so that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness.
    
    I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh the advantage, as well as of approaches to this issue and to closure handling in general.
    
    Author: William Benton <willb@redhat.com>
    
    Closes apache#189 from willb/spark-729 and squashes the following commits:
    
    f4cafa0 [William Benton] Stylistic changes and cleanups
    b3d9c86 [William Benton] Fixed style issues in tests
    9b56ce0 [William Benton] Added array-element capture test
    97e9d91 [William Benton] Split closure-serializability failure tests
    12ef6e3 [William Benton] Skip proactive closure capture for runJob
    8ee3ee7 [William Benton] Predictable closure environment capture
    12c63a7 [William Benton] Added tests for variable capture in closures
    d6e8dd6 [William Benton] Don't check serializability of DStream transforms.
    4ecf841 [William Benton] Make proactive serializability checking optional.
    d8df3db [William Benton] Adds proactive closure-serializablilty checking
    21b4b06 [William Benton] Test cases for SPARK-897.
    d5947b3 [William Benton] Ensure assertions in Graph.apply are asserted.
    willb authored and mateiz committed Apr 10, 2014
    8ca3b2b
  3. Merge remote-tracking branch 'apache/master' into streaming-web-ui

    Conflicts:
    	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
    	streaming/src/main/scala/org/apache/spark/streaming/dstream/NetworkInputDStream.scala
    	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
    	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
    	streaming/src/main/scala/org/apache/spark/streaming/scheduler/NetworkInputTracker.scala
    tdas committed Apr 10, 2014
    3e986f8
  4. Merge pull request #2 from andrewor14/ui-refactor

    Refactor UI interface to allow dynamically adding tabs
    tdas committed Apr 10, 2014
    168fe86
  5. Merge branch 'streaming-web-ui' of github.com:tdas/spark into streami…

    …ng-web-ui
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerWebUI.scala
    	core/src/main/scala/org/apache/spark/ui/SparkUI.scala
    	core/src/main/scala/org/apache/spark/ui/storage/IndexPage.scala
    tdas committed Apr 10, 2014
    827e81a
  6. 1af239b
  7. 1c0bcef
  8. Fixed long line.

    tdas committed Apr 10, 2014
    fa760fe
  9. SPARK-1446: Spark examples should not do a System.exit

    Spark examples should exit nicely using the SparkContext.stop() method rather than System.exit.
    System.exit can cause issues like the one in SPARK-1407.
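
    A minimal sketch of the recommended pattern:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object ExampleApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ExampleApp").setMaster("local[*]"))
        try {
          println(sc.parallelize(1 to 100).sum())
        } finally {
          sc.stop() // releases resources and lets listeners such as the event logger drain
        }
      }
    }
    ```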
    
    Author: Sandeep <sandeep@techaddict.me>
    
    Closes apache#370 from techaddict/1446 and squashes the following commits:
    
    e9234cf [Sandeep] SPARK-1446: Spark examples should not do a System.exit Spark examples should exit nice using SparkContext.stop() method, rather than System.exit System.exit can cause issues like in SPARK-1407
    techaddict authored and pwendell committed Apr 10, 2014
    e55cc4b
  10. e6d4a74
  11. Fix SPARK-1413: Parquet messes up stdout and stdin when used in Spark…

    … REPL
    
    Author: witgo <witgo@qq.com>
    
    Closes apache#325 from witgo/SPARK-1413 and squashes the following commits:
    
    e57cd8e [witgo] use scala reflection to access and call the SLF4JBridgeHandler  methods
    45c8f40 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    5e35d87 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    0d5f819 [witgo] review commit
    45e5b70 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    fa69dcf [witgo] Merge branch 'master' into SPARK-1413
    3c98dc4 [witgo] Merge branch 'master' into SPARK-1413
    38160cb [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    ba09bcd [witgo] remove set the parquet log level
    a63d574 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    5231ecd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    3feb635 [witgo] parquet logger use parent handler
    fa00d5d [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    8bb6ffd [witgo] enableLogForwarding note fix
    edd9630 [witgo]  move to
    f447f50 [witgo] merging master
    5ad52bd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
    76670c1 [witgo] review commit
    70f3c64 [witgo] Fix SPARK-1413
    witgo authored and pwendell committed Apr 10, 2014
    a74fbbb
  12. [SPARK-1276] Add a HistoryServer to render persisted UI

    The new feature of event logging, introduced in apache#42, allows the user to persist the details of his/her Spark application to storage, and later replay these events to reconstruct an after-the-fact SparkUI.
    Currently, however, a persisted UI can only be rendered through the standalone Master. This greatly limits the use case of this new feature as many people also run Spark on Yarn / Mesos.
    
    This PR introduces a new entity called the HistoryServer, which, given a log directory, keeps track of all completed applications independently of a Spark Master. Unlike the Master, the HistoryServer need not be running while the application is still running. It is relatively lightweight in that it only maintains static information about applications and performs no scheduling.
    
    To quickly test it out, generate event logs with `spark.eventLog.enabled=true` and run `sbin/start-history-server.sh <log-dir-path>`. Your HistoryServer awaits on port 18080.
    
    Comments and feedback are most welcome.
    
    ---
    
    A few other changes introduced in this PR include refactoring the WebUI interface, which is beginning to have a lot of duplicate code now that we have added more functionality to it. Two new SparkListenerEvents have been introduced (SparkListenerApplicationStart/End) to keep track of application name and start/finish times. This PR also clarifies the semantics of the ReplayListenerBus introduced in apache#42.
    
    A potential TODO in the future (not part of this PR) is to render live applications in addition to just completed applications. This is useful when applications fail, a condition that our current HistoryServer does not handle unless the user manually signals application completion (by creating the APPLICATION_COMPLETION file). Handling live applications becomes significantly more challenging, however, because it is now necessary to render the same SparkUI multiple times. To avoid reading the entire log every time, which is inefficient, we must handle reading the log from where we previously left off, but this becomes fairly complicated because we must deal with the arbitrary behavior of each input stream.
    
    Author: Andrew Or <andrewor14@gmail.com>
    
    Closes apache#204 from andrewor14/master and squashes the following commits:
    
    7b7234c [Andrew Or] Finished -> Completed
    b158d98 [Andrew Or] Address Patrick's comments
    69d1b41 [Andrew Or] Do not block on posting SparkListenerApplicationEnd
    19d5dd0 [Andrew Or] Merge github.com:apache/spark
    f7f5bf0 [Andrew Or] Make history server's web UI port a Spark configuration
    2dfb494 [Andrew Or] Decouple checking for application completion from replaying
    d02dbaa [Andrew Or] Expose Spark version and include it in event logs
    2282300 [Andrew Or] Add documentation for the HistoryServer
    567474a [Andrew Or] Merge github.com:apache/spark
    6edf052 [Andrew Or] Merge github.com:apache/spark
    19e1fb4 [Andrew Or] Address Thomas' comments
    248cb3d [Andrew Or] Limit number of live applications + add configurability
    a3598de [Andrew Or] Do not close file system with ReplayBus + fix bind address
    bc46fc8 [Andrew Or] Merge github.com:apache/spark
    e2f4ff9 [Andrew Or] Merge github.com:apache/spark
    050419e [Andrew Or] Merge github.com:apache/spark
    81b568b [Andrew Or] Fix strange error messages...
    0670743 [Andrew Or] Decouple page rendering from loading files from disk
    1b2f391 [Andrew Or] Minor changes
    a9eae7e [Andrew Or] Merge branch 'master' of github.com:apache/spark
    d5154da [Andrew Or] Styling and comments
    5dbfbb4 [Andrew Or] Merge branch 'master' of github.com:apache/spark
    60bc6d5 [Andrew Or] First complete implementation of HistoryServer (only for finished apps)
    7584418 [Andrew Or] Report application start/end times to HistoryServer
    8aac163 [Andrew Or] Add basic application table
    c086bd5 [Andrew Or] Add HistoryServer and scripts ++ Refactor WebUI interface
    andrewor14 authored and pwendell committed Apr 10, 2014
    79820fe
  13. SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 …

    …instead of complaining
    
    Author: Sandeep <sandeep@techaddict.me>
    
    Closes apache#356 from techaddict/1428 and squashes the following commits:
    
    3bdf5f6 [Sandeep] SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
    techaddict authored and mateiz committed Apr 10, 2014
    3bd3129
  14. ee6543f
  15. Merge remote-tracking branch 'apache/master' into streaming-web-ui

    Conflicts:
    	core/src/main/scala/org/apache/spark/deploy/master/ui/MasterWebUI.scala
    	core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerWebUI.scala
    	core/src/main/scala/org/apache/spark/ui/SparkUI.scala
    	core/src/main/scala/org/apache/spark/ui/WebUI.scala
    	core/src/main/scala/org/apache/spark/ui/env/IndexPage.scala
    	core/src/main/scala/org/apache/spark/ui/exec/IndexPage.scala
    	core/src/main/scala/org/apache/spark/ui/jobs/IndexPage.scala
    	core/src/main/scala/org/apache/spark/ui/jobs/JobProgressTab.scala
    	core/src/main/scala/org/apache/spark/ui/jobs/PoolPage.scala
    	core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
    	core/src/main/scala/org/apache/spark/ui/storage/BlockManagerTab.scala
    	core/src/main/scala/org/apache/spark/ui/storage/IndexPage.scala
    	core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala
    tdas committed Apr 10, 2014
    6de06b0

Commits on Apr 11, 2014

  1. Wide refactoring of WebUI, UITab, and UIPage (see commit message)

    The biggest changes include
    (1) Decoupling the SparkListener from any member of the hierarchy. This was
        previously arbitrarily tied to the UITab.
    (2) Decoupling initializing a UITab from attaching it to a WebUI. This involves
        having each UITab initialize itself instead.
    (3) Adding an abstract parent for each UITab. This allows us to move access
        to the UI's header tabs into the UITab abstract class itself.
    (4) Abstracting bind() logic into WebUI.
    (5) Renaming UITab -> WebUITab, and UIPage -> WebUIPage.
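
    A condensed illustrative sketch of the resulting hierarchy (not the actual Spark classes): pages belong to tabs, tabs attach to a WebUI, and bind() lives once in the abstract parent:

    ```scala
    abstract class WebUIPage(val prefix: String) {
      def render(request: Map[String, String]): String
    }

    abstract class WebUITab(val name: String) {
      private var pages = Vector.empty[WebUIPage]
      def attachPage(page: WebUIPage): Unit = pages :+= page
      def allPages: Seq[WebUIPage] = pages
    }

    abstract class WebUI(val port: Int) {
      private var tabs = Vector.empty[WebUITab]
      def attachTab(tab: WebUITab): Unit = tabs :+= tab // header links are derived from attached tabs
      def headerTabs: Seq[WebUITab] = tabs
      def bind(): Unit = println(s"serving ${tabs.size} tab(s) on port $port") // shared bind() logic
    }
    ```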
    andrewor14 committed Apr 11, 2014
    548c98c
  2. 914b8ff
  3. Merge pull request apache#5 from andrewor14/ui-refactor

    Wide refactoring of WebUI, UITab, and UIPage
    tdas committed Apr 11, 2014
    585cd65
  4. Merge branch 'streaming-web-ui' of github.com:tdas/spark into streami…

    …ng-web-ui
    
    Conflicts:
    	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
    	streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingTab.scala
    tdas committed Apr 11, 2014
    caa5e05
  5. f8e1053
  6. Rename tabs and pages (No more IndexPage.scala)

    Previously there were 7 different IndexPage.scala files in different packages.
    andrewor14 committed Apr 11, 2014
    Configuration menu
    Copy the full SHA
    aa396d4 View commit details
    Browse the repository at this point in the history
  7. Added binary check exclusions

    tdas committed Apr 11, 2014
    2fc09c8
  8. Merge pull request apache#6 from andrewor14/ui-refactor

    Rename tabs and pages (No more IndexPage.scala)
    tdas committed Apr 11, 2014
    72fe256
  9. 89dae36
  10. Address Patrick's comments

    andrewor14 committed Apr 11, 2014
    90feb8d