This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-580] update doc on data source(DS V1/V2 usage) #587

Merged
merged 2 commits on Nov 28, 2021
1 change: 1 addition & 0 deletions README.md
@@ -33,6 +33,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) it's possible
![Overview](./docs/image/dataset.png)

A native Parquet reader was developed to speed up data loading. It is based on Apache Arrow Dataset. For details, please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source).
Note that both data source V1 and V2 are supported. Please check the [example section](arrow-data-source/#run-a-query-with-arrowdatasource-scala) for Arrow data source usage.
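For a quick impression of what the linked examples look like, here is a minimal sketch of reading a Parquet file through the Arrow data source from the Spark shell; the path and view name are placeholders, and the exact options may differ from the examples in the Arrow Data Source README:
```
// minimal sketch (assumed usage based on the Arrow Data Source examples):
// read a Parquet file through the "arrow" format and run a query on it
val df = spark.read
  .format("arrow")
  .load("hdfs:///path/to/some_table")   // placeholder path

df.createOrReplaceTempView("some_table")
spark.sql("SELECT * FROM some_table LIMIT 10").show(10)
```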

### Apache Arrow Compute/Gandiva based operators

32 changes: 32 additions & 0 deletions arrow-data-source/README.md
@@ -205,6 +205,38 @@ val df = spark.read
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view LIMIT 10").show(10)
```

### Example of integrating with Hive Metastore
Hive Metastore is commonly used in thrift-server based setups, and the Arrow data source supports this case as well. The APIs are a little different from vanilla Spark. Here's an example of how to create metadata tables for Parquet-based TPC-H tables:
```
// create a database first, otherwise those tables will be stored in the default database
spark.sql("create database testtpch;").show
spark.sql("use testtpch;").show

spark.catalog.createTable("lineitem", "hdfs:////user/root/date_tpchnp_1000/lineitem", "arrow")
spark.catalog.createTable("orders", "hdfs:////user/root/date_tpchnp_1000/orders", "arrow")
spark.catalog.createTable("customer", "hdfs:////user/root/date_tpchnp_1000/customer", "arrow")
spark.catalog.createTable("nation", "hdfs:////user/root/date_tpchnp_1000/nation", "arrow")
spark.catalog.createTable("region", "hdfs:////user/root/date_tpchnp_1000/region", "arrow")
spark.catalog.createTable("part", "hdfs:////user/root/date_tpchnp_1000/part", "arrow")
spark.catalog.createTable("partsupp", "hdfs:////user/root/date_tpchnp_1000/partsupp", "arrow")
spark.catalog.createTable("supplier", "hdfs:////user/root/date_tpchnp_1000/supplier", "arrow")

// need to recover the partitions if it's a partitioned table
spark.sql("ALTER TABLE lineitem RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE orders RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE customer RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE nation RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE region RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE partsupp RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE part RECOVER PARTITIONS").show;
spark.sql("ALTER TABLE supplier RECOVER PARTITIONS").show;
```
By default, the "arrow" format indicates reading Parquet files with the Arrow data source. Note that this step only creates the metadata; the original files are not changed. Other file formats (ORC, CSV) are also supported. Here's an example of how to create a metadata table for ORC files:
```
spark.catalog.createTable("web_site", "arrow", Map("path" -> "hdfs:///root/tmp/TPCDS_ORC/web_site", "originalFormat" -> "orc"))
```
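Once the tables are registered in the metastore, they can be queried like any other Hive table from the same Spark session or through the thrift-server. A minimal sketch against the TPC-H tables created above (results depend on your data):
```
// query the metastore-backed TPC-H tables created above
spark.sql("use testtpch;").show
spark.sql("SELECT COUNT(*) FROM lineitem").show
spark.sql("SELECT * FROM lineitem LIMIT 10").show(10)
```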

### To validate if ArrowDataSource works properly

To validate that ArrowDataSource works properly, check the DAG of the example query above (e.g. in the Spark UI) and confirm that ArrowScan is used.
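If you'd rather check from the shell than the web UI, inspecting the physical plan is a quick alternative. This sketch reuses the my_temp_view from the earlier example; the exact name of the scan node in the output may vary between versions:
```
// inspect the physical plan instead of the web UI;
// look for an Arrow-based scan node (e.g. ArrowScan) in the output
spark.sql("SELECT * FROM my_temp_view LIMIT 10").explain()
```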