SPARK-1251 Support for optimizing and executing structured queries #146

Closed
wants to merge 862 commits
Changes from 250 commits
Commits (862)
605255e
Added scalastyle checker.
rxin Jan 30, 2014
08e4d05
First round of style cleanup.
rxin Jan 30, 2014
7213a2c
style fix for Hive.scala.
rxin Jan 31, 2014
5c1e600
Added hash code implementation for AttributeReference
rxin Jan 31, 2014
7e24436
Removed dependency on JDK 7 (nio.file).
rxin Jan 31, 2014
41bbee6
Merge remote-tracking branch 'upstream/master' into exchangeOperator
yhuai Jan 31, 2014
f47c2f6
set outputPartitioning in BroadcastNestedLoopJoin
yhuai Jan 31, 2014
d91e276
Remove dependence on HIVE_HOME for running tests. This was done by m…
marmbrus Jan 31, 2014
bce024d
Merge remote-tracking branch 'databricks/master' into style
marmbrus Jan 31, 2014
d20b565
fix if style
marmbrus Jan 31, 2014
807b2d7
check style and publish docs with travis
marmbrus Feb 1, 2014
d3a3d48
add testing to travis
marmbrus Feb 1, 2014
271e483
Update build status icon.
marmbrus Feb 1, 2014
6015f93
Merge pull request #29 from rxin/style
marmbrus Feb 1, 2014
fc67b50
Check for a Sort operator with the global flag set instead of an Exch…
yhuai Feb 1, 2014
235cbb4
Merge remote-tracking branch 'upstream/master' into exchangeOperator
yhuai Feb 1, 2014
45b334b
fix comments
yhuai Feb 1, 2014
e079f2b
Add GenericUDAF wrapper and HiveUDAFFunction
tnachen Jan 16, 2014
8e0931f
Cast to avoid using deprecated hive API.
marmbrus Jan 28, 2014
b1151a8
Fix load data regex
tnachen Jan 29, 2014
5b7afd8
Merge pull request #10 from yhuai/exchangeOperator
marmbrus Feb 2, 2014
6eb5960
Merge remote-tracking branch 'databricks/master' into udafs
marmbrus Feb 2, 2014
41b41f3
Only cast unresolved inserts.
marmbrus Feb 2, 2014
63003e9
Fix spacing.
marmbrus Feb 2, 2014
2de89d0
Merge pull request #13 from tnachen/master
marmbrus Feb 2, 2014
cb775ac
get rid of SharkContext singleton
marmbrus Jan 12, 2014
dfb67aa
add test case
marmbrus Jan 13, 2014
19bfd74
store hive output in circular buffer
marmbrus Jan 22, 2014
1590568
add log4j.properties
marmbrus Feb 3, 2014
b649c20
fix test logging / caching.
marmbrus Feb 3, 2014
7845364
deactivate concurrent test.
marmbrus Feb 3, 2014
ea6f37f
fix style.
marmbrus Feb 3, 2014
82163e3
special case handling of partitionKeys when casting insert into tables
marmbrus Feb 3, 2014
9c22b4e
Support for parsing nested types.
marmbrus Feb 1, 2014
efa7217
Support for reading structs in HiveTableScan.
marmbrus Feb 1, 2014
d670e41
Print nested fields like hive does.
marmbrus Feb 1, 2014
dc6463a
Support for resolving access to nested fields using "." notation.
marmbrus Feb 1, 2014
6709441
Evaluation for accessing nested fields.
marmbrus Feb 1, 2014
da7ae9d
Add boolean writable that was breaking udf_regexp test. Not sure how…
marmbrus Feb 1, 2014
6420c7c
Memoize the ordinal in the GetField expression.
marmbrus Feb 1, 2014
1579eec
Only cast unresolved inserts.
marmbrus Feb 2, 2014
cf8d992
Use built in functions for creating temp directory.
marmbrus Feb 2, 2014
c654f19
Support for list and maps in hive table scan.
marmbrus Feb 2, 2014
c3feda7
use toArray.
marmbrus Feb 2, 2014
a9388fb
printing for map types.
marmbrus Feb 2, 2014
35a70fb
multi-letter field names.
marmbrus Feb 2, 2014
2c6deb3
improve printing compatibility.
marmbrus Feb 2, 2014
5b33216
work on decimal support.
marmbrus Feb 3, 2014
5b3d2c8
implement distinct.
marmbrus Feb 3, 2014
3f9e519
use names w/ boolean args
marmbrus Feb 3, 2014
3734a94
only quote string types.
marmbrus Feb 3, 2014
bbec500
update test coverage, new golden
marmbrus Feb 2, 2014
5e54aa6
quotes for struct field names.
marmbrus Feb 3, 2014
e4def6b
set dataType for HiveGenericUdfs.
marmbrus Feb 3, 2014
aa430e7
Update .travis.yml
marmbrus Feb 3, 2014
7661b6c
blacklist machines specific tests
marmbrus Feb 3, 2014
72a003d
revert regex change
marmbrus Feb 3, 2014
9c06778
fix serialization issues, add JavaStringObjectInspector.
marmbrus Feb 3, 2014
92e4158
Merge pull request #32 from marmbrus/tooManyProjects
rxin Feb 3, 2014
692a477
Support for wrapping arrays to be written into hive tables.
marmbrus Feb 4, 2014
ac9d7de
Resolve *s in Transform clauses.
marmbrus Feb 4, 2014
7a0f543
Avoid propagating types from unresolved nodes.
marmbrus Feb 4, 2014
010accb
add tinyint to metastore type parser.
marmbrus Feb 4, 2014
e7933e9
fix casting bug when working with fractional expressions.
marmbrus Feb 4, 2014
25288d0
Implement [] for arrays and maps.
marmbrus Feb 4, 2014
ab9a131
when UDFs fail they should return null.
marmbrus Feb 4, 2014
1679554
add toString for if and IS NOT NULL.
marmbrus Feb 4, 2014
ab5bff3
Support for get item of map types.
marmbrus Feb 4, 2014
42ec4af
improve complex type support in hive udfs/udafs.
marmbrus Feb 4, 2014
44d343c
Merge remote-tracking branch 'databricks/master' into complex
marmbrus Feb 4, 2014
e3c10bd
update whitelist.
marmbrus Feb 4, 2014
389525d
update golden, blacklist mr.
marmbrus Feb 4, 2014
2f27604
Address comments / style errors.
marmbrus Feb 4, 2014
cb57459
blacklist machine specific test.
marmbrus Feb 4, 2014
67128b8
Merge pull request #30 from marmbrus/complex
rxin Feb 4, 2014
b4be6a5
better logging when applying rules.
marmbrus Feb 5, 2014
ccdb07a
Fix bug where averages of strings are turned into sums of strings. R…
marmbrus Feb 5, 2014
d8cb805
Implement partial aggregation.
marmbrus Feb 5, 2014
f94345c
fix doc link
marmbrus Feb 5, 2014
e1999f9
Use Deserializer and Serializer instead of AbstractSerDe.
yhuai Feb 5, 2014
32b615b
add override to asPartial.
marmbrus Feb 5, 2014
883006d
improve tests.
marmbrus Feb 5, 2014
cab1a84
Fix PartialAggregate inheritance.
marmbrus Feb 5, 2014
dc6353b
turn off deprecation
marmbrus Feb 5, 2014
8017afb
fix copy paste error.
marmbrus Feb 5, 2014
5479066
Merge pull request #36 from marmbrus/partialAgg
rxin Feb 5, 2014
5e4d9b4
Merge pull request #35 from marmbrus/smallFixes
marmbrus Feb 7, 2014
02ff8e4
Correctly parse the db name and table name in a CTAS query.
yhuai Feb 7, 2014
8841eb8
Rename Transform -> ScriptTransformation.
marmbrus Feb 7, 2014
acb9566
Correctly type attributes of CTAS.
marmbrus Feb 7, 2014
016b489
fix typo.
marmbrus Feb 7, 2014
bea4b7f
Add SumDistinct.
marmbrus Feb 7, 2014
ea76cf9
Add NoRelation to planner.
marmbrus Feb 7, 2014
dd00b7e
initial implementation of generators.
marmbrus Feb 7, 2014
ba8897f
Merge remote-tracking branch 'yin/parseDBNameInCTAS' into lateralView
marmbrus Feb 7, 2014
0ce61b0
Docs for GenericHiveUdtf.
marmbrus Feb 7, 2014
740febb
Tests for tgfs.
marmbrus Feb 7, 2014
db92adc
more tests passing. clean up logging.
marmbrus Feb 7, 2014
ff5ea3f
new golden
marmbrus Feb 7, 2014
5cc367c
use berkeley instead of cloudbees
marmbrus Feb 8, 2014
b376d15
fix newlines at EOF
marmbrus Feb 8, 2014
7123225
Correctly parse the db name and table name in INSERT queries.
yhuai Feb 8, 2014
2897deb
fix scaladoc
marmbrus Feb 8, 2014
0e6c1d7
Merge pull request #38 from yhuai/parseDBNameInCTAS
rxin Feb 8, 2014
341116c
address comments.
marmbrus Feb 8, 2014
7785ee6
Tighten visibility based on comments.
marmbrus Feb 10, 2014
964368f
Merge pull request #39 from marmbrus/lateralView
marmbrus Feb 11, 2014
dce0593
move golden answer to the source code directory.
marmbrus Feb 11, 2014
9329820
add golden answer files to repository
marmbrus Feb 11, 2014
a7ad058
Merge pull request #40 from marmbrus/includeGoldens
marmbrus Feb 11, 2014
2407a21
Added optimized logical plan to debugging output
liancheng Feb 12, 2014
cf691df
Added the PhysicalOperation to generalize ColumnPrunings
liancheng Feb 12, 2014
f235914
Test case udf_regex and udf_like need BooleanWritable registered
liancheng Feb 12, 2014
f0c3742
Refactored PhysicalOperation
liancheng Feb 12, 2014
5720d2b
Fixed comment typo
liancheng Feb 12, 2014
bc9a12c
Move hive test files.
marmbrus Feb 13, 2014
7588a57
Break into 3 major components and move everything into the org.apache…
marmbrus Feb 13, 2014
1f7d00a
Merge pull request #41 from marmbrus/splitComponents
rxin Feb 14, 2014
887f928
Merge remote-tracking branch 'upstream/master' into SerDeNew
yhuai Feb 14, 2014
678341a
Replaced non-ascii text
markhamstra Feb 14, 2014
5ae010f
Merge pull request #42 from markhamstra/non-ascii
marmbrus Feb 14, 2014
1f6260d
Fixed package name and test suite name in Makefile
liancheng Feb 14, 2014
b6de691
Merge pull request #43 from liancheng/fixMakefile
marmbrus Feb 14, 2014
7f206b5
Add support for hive TABLESAMPLE PERCENT.
marmbrus Feb 14, 2014
ed3a1d1
Load data directly into Hive.
yhuai Feb 14, 2014
59e37a3
Merge remote-tracking branch 'upstream/master' into SerDeNew
yhuai Feb 14, 2014
346f828
Move SharkHadoopWriter to the correct location.
yhuai Feb 15, 2014
a9c3188
Fix udaf struct return
tnachen Feb 15, 2014
69adf72
Set cloneRecords to false.
yhuai Feb 15, 2014
566fd66
Whitelist tests and add support for Binary type
tnachen Feb 15, 2014
9ad474d
Merge pull request #44 from marmbrus/sampling
marmbrus Feb 15, 2014
3cb4f2e
Merge pull request #45 from tnachen/master
marmbrus Feb 15, 2014
8506c17
Address review feedback.
marmbrus Feb 15, 2014
3bb272d
move org.apache.spark.sql package.scala to the correct location.
marmbrus Feb 15, 2014
1596e1b
Cleanup imports to make IntelliJ happy.
yhuai Feb 15, 2014
5495fab
Remove cloneRecords which is no longer needed.
yhuai Feb 15, 2014
bdab5ed
Add a TODO for loading data into partitioned tables.
yhuai Feb 15, 2014
35c9a8a
Merge pull request #46 from marmbrus/reviewFeedback
marmbrus Feb 15, 2014
563bb22
Set compression info in FileSinkDesc.
yhuai Feb 16, 2014
e089627
Code style.
yhuai Feb 16, 2014
45ffb86
Merge remote-tracking branch 'upstream/master' into SerDeNew
yhuai Feb 16, 2014
eea75c5
Correctly set codec.
yhuai Feb 16, 2014
428aff5
Distinguish `INSERT INTO` and `INSERT OVERWRITE`.
yhuai Feb 16, 2014
a40d6d6
Loading the static partition specified in a INSERT INTO/OVERWRITE query.
yhuai Feb 16, 2014
334aace
New golden files.
yhuai Feb 16, 2014
d00260b
Strips backticks from partition keys.
yhuai Feb 17, 2014
555fb1d
Correctly set the extension for a text file.
yhuai Feb 17, 2014
feb022c
Partitioning key should be case insensitive.
yhuai Feb 17, 2014
a1a4776
Update comments.
yhuai Feb 17, 2014
017872c
Remove stats20 from whitelist.
yhuai Feb 17, 2014
128a9f8
Minor changes.
yhuai Feb 18, 2014
f670c8c
Throw a NotImplementedError for not supported clauses in a CTAS query.
yhuai Feb 18, 2014
c5a4fab
Merge branch 'master' into columnPruning
liancheng Feb 16, 2014
2682f72
Merge remote-tracking branch 'origin/master' into columnPruning
liancheng Feb 18, 2014
54f165b
Fixed spelling typo in two golden answer file names
liancheng Feb 18, 2014
cf4db59
Added golden answers for PruningSuite
liancheng Feb 18, 2014
f22df3a
Merge pull request #37 from yhuai/SerDe
marmbrus Feb 18, 2014
9990ec7
Merge pull request #28 from liancheng/columnPruning
marmbrus Feb 18, 2014
29effad
Include alias in attributes that are produced by overridden tables.
marmbrus Feb 24, 2014
c9116a6
Add combiner to avoid NPE when spark performs external aggregation.
marmbrus Feb 24, 2014
8c01c24
Move definition of Row out of execution to top level sql package.
marmbrus Feb 24, 2014
4905b2b
Add more efficient TopK that avoids global sort for logical Sort => S…
marmbrus Feb 24, 2014
532dd37
Allow the local warehouse path to be specified.
marmbrus Feb 24, 2014
a430895
Planning for logical Repartition operators.
marmbrus Feb 24, 2014
5fe7de4
Move table creation out of rule into a separate function.
marmbrus Feb 24, 2014
b922511
Fix insertion of nested types into hive tables.
marmbrus Feb 24, 2014
18a861b
Correctly convert nested products into nested rows when turning scala…
marmbrus Feb 24, 2014
df88f01
add a simple test for aggregation
marmbrus Feb 24, 2014
6e04e5b
Add insertIntoTable to the DSL.
marmbrus Feb 24, 2014
24eaa79
fix > 100 chars
marmbrus Feb 24, 2014
d393d2a
Review Comments: Add comment to map that adds a sub query.
marmbrus Feb 24, 2014
2225431
Merge pull request #48 from marmbrus/minorFixes
marmbrus Feb 24, 2014
3ac9416
Merge support for working with schema-ed RDDs using catalyst in as a …
marmbrus Feb 25, 2014
f5e7492
Add Apache license. Make naming more consistent.
marmbrus Feb 25, 2014
5f2963c
naming and continuous compilation fixes.
marmbrus Feb 27, 2014
4d57d0e
Fix test execution on travis.
marmbrus Feb 27, 2014
7413ac2
make test downloading quieter.
marmbrus Feb 28, 2014
608a29e
Add hive as a repl dependency
marmbrus Feb 28, 2014
c334386
Initial support for generating schema's based on case classes.
marmbrus Feb 24, 2014
b33e47e
First commit of Parquet import of primitive column types
AndreSchumacher Feb 16, 2014
99a9209
Expanding ParquetQueryTests to cover all primitive types
AndreSchumacher Feb 16, 2014
eb0e521
Fixing package names and other problems that came up after the rebase
AndreSchumacher Feb 17, 2014
6ad05b3
Moving ParquetRelation to spark.sql core
AndreSchumacher Feb 19, 2014
a11e364
Adding Parquet RowWriteSupport
AndreSchumacher Feb 19, 2014
0f17d7b
Rewriting ParquetRelation tests with RowWriteSupport
AndreSchumacher Feb 19, 2014
6a6bf98
Added column projections to ParquetTableScan
AndreSchumacher Feb 19, 2014
f347273
Adding ParquetMetaData extraction, fixing schema projection
AndreSchumacher Feb 20, 2014
75262ee
Integrating operations on Parquet files into SharkStrategies
AndreSchumacher Feb 24, 2014
18fdc44
Reworking Parquet metadata in relation and adding CREATE TABLE AS for…
AndreSchumacher Feb 26, 2014
3a0a552
Reorganizing Parquet table operations
AndreSchumacher Feb 26, 2014
3321195
Fixing one import in ParquetQueryTests.scala
AndreSchumacher Feb 27, 2014
61e3bfb
Adding WriteToFile operator and rewriting ParquetQuerySuite
AndreSchumacher Mar 2, 2014
c863bed
Codestyle checks
AndreSchumacher Mar 2, 2014
3ac9eb0
Rebasing to new main branch
AndreSchumacher Mar 2, 2014
3bda72d
Adding license banner to new files
AndreSchumacher Mar 2, 2014
d7fbc3a
Several performance enhancements and simplifications of the expressio…
marmbrus Feb 27, 2014
296fe50
Address review feedback.
marmbrus Feb 27, 2014
6fdefe6
Port sbt improvements from master.
marmbrus Mar 3, 2014
da9afbd
Add byte wrappers for hive UDFS.
marmbrus Mar 3, 2014
7b9d142
Update travis to increase permgen size.
marmbrus Mar 3, 2014
99e61fb
Merge pull request #51 from marmbrus/expressionEval
marmbrus Mar 3, 2014
8d5da5e
modify compute-classpath.sh to include datanucleus jars explicitly
marmbrus Feb 27, 2014
6d315bb
Added Row.unapplySeq to extract fields from a Row object.
liancheng Mar 5, 2014
70e489d
Fixed a spelling typo
liancheng Mar 5, 2014
1ce01c7
Merge pull request #56 from liancheng/unapplySeqForRow
marmbrus Mar 5, 2014
0040ae6
Feedback from code review
AndreSchumacher Mar 5, 2014
9d419a6
Merge remote-tracking branch 'catalyst/catalystIntegration' into parq…
marmbrus Mar 5, 2014
7d0f13e
Update parquet support with master.
marmbrus Mar 5, 2014
3c3f962
Fix a bug due to array reuse. This will need to be revisited after w…
marmbrus Mar 5, 2014
c9f8fb3
Merge pull request #53 from AndreSchumacher/parquet_support
marmbrus Mar 6, 2014
d371393
Add a framework for dealing with mutable rows to reduce the number of…
marmbrus Mar 5, 2014
959bdf0
Don't silently swallow all KryoExceptions, only the one that indicate…
marmbrus Mar 6, 2014
9049cf0
Extend MutablePair interface to support easy syntax for in-place upda…
marmbrus Mar 6, 2014
d994333
Remove copies before shuffle, this required changing the default shuf…
marmbrus Mar 6, 2014
ba28849
code review comments.
marmbrus Mar 6, 2014
c2a658d
Merge pull request #55 from marmbrus/mutableRows
marmbrus Mar 6, 2014
54637ec
First part of second round of code review feedback
AndreSchumacher Mar 9, 2014
5bacdc0
Moving towards mutable rows inside ParquetRowSupport
AndreSchumacher Mar 9, 2014
7ca4b4e
Improving checks in Parquet tests
AndreSchumacher Mar 11, 2014
aeaef54
Removing unnecessary Row copying and reverting some changes to Mutabl…
AndreSchumacher Mar 11, 2014
7386a9f
Initial example programs using spark sql.
marmbrus Mar 11, 2014
f0ba39e
Merge remote-tracking branch 'origin/master' into maven
marmbrus Mar 11, 2014
7233a74
initial support for maven builds
marmbrus Mar 11, 2014
3447c3e
Don't override the metastore / warehouse in non-local/test hive context.
marmbrus Mar 13, 2014
3386e4f
Merge pull request #58 from AndreSchumacher/parquet_fixes
marmbrus Mar 13, 2014
1a4bbd9
Merge pull request #60 from marmbrus/maven
marmbrus Mar 13, 2014
f93aa39
Better handling of path names in ParquetRelation
AndreSchumacher Mar 14, 2014
5d71074
Merge pull request #62 from AndreSchumacher/parquet_file_fixes
marmbrus Mar 14, 2014
8b35e0a
address feedback, work on DSL
marmbrus Mar 13, 2014
d2d9678
Make sure hive isn't in the assembly jar. Create a separate, optiona…
marmbrus Mar 14, 2014
9eb0294
Bring expressions implicits into SqlContext.
marmbrus Mar 14, 2014
f7d992d
Naming / spelling.
marmbrus Mar 14, 2014
ce8073b
clean up implicits.
marmbrus Mar 14, 2014
2f22454
WIP: Parquet example.
marmbrus Mar 14, 2014
c01470f
Clean up example
marmbrus Mar 14, 2014
013f62a
Fix documentation / code style.
marmbrus Mar 14, 2014
c2efad6
First draft of SQL documentation.
marmbrus Mar 14, 2014
e5e1d6b
Remove travis configuration.
marmbrus Mar 14, 2014
1d0eb63
update changes with spark core
marmbrus Mar 14, 2014
6978dd8
update docs, add apache license
marmbrus Mar 14, 2014
9dffbfa
Style fixes. Add downloading of test cases to jenkins.
marmbrus Mar 14, 2014
adcf1a4
Update sql-programming-guide.md
hcook Mar 14, 2014
461581c
Blacklist test that depends on JVM specific rounding behaviour
marmbrus Mar 15, 2014
48a99bc
Address first round of feedback.
marmbrus Mar 17, 2014
fee847b
Merge remote-tracking branch 'origin/master' into catalyst, integrati…
marmbrus Mar 17, 2014
fead0b6
Create a new RDD type, SchemaRDD, that is now the return type for all…
marmbrus Mar 19, 2014
778299a
Fix some old links to spark-project.org
marmbrus Mar 19, 2014
bdab185
Address another round of comments:
marmbrus Mar 20, 2014
0d638c3
Typo fix from @ash211.
marmbrus Mar 20, 2014
458bd1b
Update people.txt
marmbrus Mar 20, 2014
Files changed
5 changes: 5 additions & 0 deletions assembly/pom.xml
@@ -79,6 +79,11 @@
<artifactId>spark-graphx_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>net.sf.py4j</groupId>
<artifactId>py4j</artifactId>
31 changes: 27 additions & 4 deletions bin/compute-classpath.sh
@@ -33,23 +33,43 @@ fi
# Build up classpath
CLASSPATH="$SPARK_CLASSPATH:$FWDIR/conf"

# Support for interacting with Hive. Since hive pulls in a lot of dependencies that might break
# existing Spark applications, it is not included in the standard spark assembly. Instead, we only
# include it in the classpath if the user has explicitly requested it by running "sbt hive/assembly".
# Hopefully we will find a way to avoid uber-jars entirely and deploy only the needed packages in
# the future.
if [ -f "$FWDIR"/sql/hive/target/scala-$SCALA_VERSION/spark-hive-assembly-*.jar ]; then
echo "Hive assembly found, including hive support. If this isn't desired run sbt hive/clean."

# Datanucleus jars do not work if only included in the uberjar as plugin.xml metadata is lost.
DATANUCLEUSJARS=$(JARS=("$FWDIR/lib_managed/jars"/datanucleus-*.jar); IFS=:; echo "${JARS[*]}")
CLASSPATH=$CLASSPATH:$DATANUCLEUSJARS

ASSEMBLY_DIR="$FWDIR/sql/hive/target/scala-$SCALA_VERSION/"
else
ASSEMBLY_DIR="$FWDIR/assembly/target/scala-$SCALA_VERSION/"
fi

# First check if we have a dependencies jar. If so, include binary classes with the deps jar
if [ -f "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar ]; then
if [ -f "$ASSEMBLY_DIR"/spark-assembly*hadoop*-deps.jar ]; then
CLASSPATH="$CLASSPATH:$FWDIR/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/repl/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/mllib/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/classes"

DEPS_ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*-deps.jar`
DEPS_ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*-deps.jar`
CLASSPATH="$CLASSPATH:$DEPS_ASSEMBLY_JAR"
else
# Else use spark-assembly jar from either RELEASE or assembly directory
if [ -f "$FWDIR/RELEASE" ]; then
ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark-assembly*.jar`
ASSEMBLY_JAR=`ls "$FWDIR"/jars/spark*-assembly*.jar`
else
ASSEMBLY_JAR=`ls "$FWDIR"/assembly/target/scala-$SCALA_VERSION/spark-assembly*hadoop*.jar`
ASSEMBLY_JAR=`ls "$ASSEMBLY_DIR"/spark*-assembly*hadoop*.jar`
fi
CLASSPATH="$CLASSPATH:$ASSEMBLY_JAR"
fi
@@ -62,6 +82,9 @@ if [[ $SPARK_TESTING == 1 ]]; then
CLASSPATH="$CLASSPATH:$FWDIR/bagel/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/graphx/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/streaming/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/catalyst/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/core/target/scala-$SCALA_VERSION/test-classes"
CLASSPATH="$CLASSPATH:$FWDIR/sql/hive/target/scala-$SCALA_VERSION/test-classes"
fi

# Add hadoop conf dir if given -- otherwise FileSystem.*, etc fail !
4 changes: 4 additions & 0 deletions dev/download-hive-tests.sh
@@ -0,0 +1,4 @@
#!/bin/sh

wget -O hiveTests.tgz http://cs.berkeley.edu/~marmbrus/tmp/hiveTests.tgz
tar zxf hiveTests.tgz
3 changes: 3 additions & 0 deletions dev/run-tests
@@ -21,6 +21,9 @@
FWDIR="$(cd `dirname $0`/..; pwd)"
cd $FWDIR

# Download Hive Compatibility Files
dev/download-hive-tests.sh

# Remove work directory
rm -rf ./work

9 changes: 9 additions & 0 deletions docs/_layouts/global.html
@@ -66,6 +66,7 @@
<li><a href="python-programming-guide.html">Spark in Python</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">Spark SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
@@ -79,6 +80,14 @@
<li><a href="api/pyspark/index.html">Spark Core for Python</a></li>
<li class="divider"></li>
<li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
<li class="dropdown-submenu">
<a tabindex="-1" href="#">Spark SQL</a>
<ul class="dropdown-menu">
<li><a href="api/sql/core/org/apache/spark/sql/SQLContext.html">Spark SQL Core</a></li>
<li><a href="api/sql/hive/org/apache/spark/sql/hive/package.html">Hive Support</a></li>
<li><a href="api/sql/catalyst/org/apache/spark/sql/catalyst/package.html">Catalyst (Optimization)</a></li>
</ul>
</li>
<li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
<li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
<li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph Processing)</a></li>
13 changes: 13 additions & 0 deletions docs/_plugins/copy_api_dirs.rb
@@ -22,6 +22,7 @@
# Build Scaladoc for Java/Scala
core_projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"]
external_projects = ["flume", "kafka", "mqtt", "twitter", "zeromq"]
sql_projects = ["catalyst", "core", "hive"]

projects = core_projects + external_projects.map { |project_name| "external/" + project_name }

@@ -49,6 +50,18 @@
cp_r(source + "/.", dest)
end

sql_projects.each do |project_name|
source = "../sql/" + project_name + "/target/scala-2.10/api/"
dest = "api/sql/" + project_name

puts "echo making directory " + dest
mkdir_p dest

# From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't.
puts "cp -r " + source + "/. " + dest
cp_r(source + "/.", dest)
end

# Build Epydoc for Python
puts "Moving to python directory and building epydoc."
cd("../python")
1 change: 1 addition & 0 deletions docs/index.md
@@ -76,6 +76,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
* [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
* [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs
143 changes: 143 additions & 0 deletions docs/sql-programming-guide.md
@@ -0,0 +1,143 @@
---
layout: global
title: Spark SQL Programming Guide
---
**Spark SQL is currently an Alpha component. Therefore, the APIs may be changed in future releases.**

* This will become a table of contents (this text will be scraped).
{:toc}

# Overview
Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
Spark. At the core of this component is a new type of RDD,
[SchemaRDD](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed of
[Row](api/sql/catalyst/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, a Parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

**All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell.**

***************************************************************************************************

# Getting Started

The entry point into all relational functionality in Spark is the
[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its
descendants. To create a basic SQLContext, all you need is a SparkContext.

{% highlight scala %}
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import sqlContext._
{% endhighlight %}

## Running SQL on RDDs
One type of table that is supported by Spark SQL is an RDD of Scala case classes. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Sequences or Arrays. This RDD can be implicitly converted to a SchemaRDD and then be
registered as a table. Tables can be used in subsequent SQL statements.

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}

**Note that Spark SQL currently uses a very basic SQL parser, and the keywords are case sensitive.**
Users who want a more complete dialect of SQL should look at the HiveQL support provided by
`HiveContext`.
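
As a quick illustration of the case-sensitivity note above (a hypothetical sketch reusing the
`people` table registered earlier):

{% highlight scala %}
// With the basic SQL parser, keywords must be written in upper case.
val upperCase = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")  // parses

// The same query written with lower-case keywords (shown commented out) would
// be rejected by this parser.
// val lowerCase = sql("select name from people where age >= 13 and age <= 19")
{% endhighlight %}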

## Using Parquet

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL
provides support for both reading and writing Parquet files, automatically preserving the schema
of the original data. Using the data from the above example:

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val people: RDD[Person] // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD, allowing it to be stored using parquet.
people.saveAsParquetFile("people.parquet")

// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a parquet file is also a SchemaRDD.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerAsTable("parquetFile")
val teenagers = sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.collect().foreach(println)
{% endhighlight %}

## Writing Language-Integrated Relational Queries

Spark SQL also supports a domain specific language for writing queries. Once again,
using the data from the above examples:

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val people: RDD[Person] // An RDD of case class objects, from the first example.

// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
{% endhighlight %}

The DSL uses Scala symbols to represent columns in the underlying table, which are identifiers
prefixed with a tick (`'`). Implicit conversions turn these symbols into expressions that are
evaluated by the SQL execution engine. A full list of the functions supported can be found in the
[ScalaDoc](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD).
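
Because the result of a language-integrated query is itself a SchemaRDD (and therefore an ordinary
RDD), it composes with the usual RDD operations. A minimal sketch, reusing `people` from the first
example:

{% highlight scala %}
// The DSL query returns a SchemaRDD, so standard RDD operations such as map apply.
val teenagerNames = people.where('age >= 10).where('age <= 19).select('name)
teenagerNames.map(row => "Name: " + row(0)).collect().foreach(println)
{% endhighlight %}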

<!-- TODO: Include the table of operations here. -->

# Hive Support

Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
In order to use Hive you must first run '`sbt/sbt hive/assembly`'. This command builds a new assembly
jar that includes Hive. When this jar is present, Spark will use the Hive
assembly instead of the normal Spark assembly. Note that this Hive assembly jar must also be present
on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
(SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.

When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext` and
adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do
not have an existing Hive deployment can also experiment with the `LocalHiveContext`,
which is similar to `HiveContext`, but creates a local copy of the `metastore` and `warehouse`
automatically.

{% highlight scala %}
val sc: SparkContext // An existing SparkContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Importing the SQL context gives access to all the public SQL functions and implicit conversions.
import hiveContext._

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT key, value FROM src").collect().foreach(println)
{% endhighlight %}
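
Users who do not have an existing Hive deployment can use the `LocalHiveContext` mentioned above
instead. A minimal sketch, assuming its constructor mirrors `HiveContext`'s and takes a
SparkContext:

{% highlight scala %}
// LocalHiveContext creates the metastore and warehouse locally, so no Hive
// installation is required. (Constructor signature assumed to match HiveContext.)
val localHiveContext = new org.apache.spark.sql.hive.LocalHiveContext(sc)

// As with HiveContext, importing the context brings the SQL functions and implicits into scope.
import localHiveContext._
{% endhighlight %}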
6 changes: 6 additions & 0 deletions examples/pom.xml
@@ -70,6 +70,12 @@
<version>${project.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_${scala.binary.version}</artifactId>