diff --git a/README.md b/README.md
index c39ecd223..6f6fd485c 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ With [Spark 27396](https://issues.apache.org/jira/browse/SPARK-27396) its possib
 ![Overview](./docs/image/dataset.png)
 
 A native parquet reader was developed to speed up the data loading. it's based on Apache Arrow Dataset. For details please check [Arrow Data Source](https://github.com/oap-project/native-sql-engine/tree/master/arrow-data-source)
+Note that both data source V1 and V2 are supported. Please check the [example section](arrow-data-source/#run-a-query-with-arrowdatasource-scala) of Arrow Data Source.
 
 ### Apache Arrow Compute/Gandiva based operators
diff --git a/arrow-data-source/README.md b/arrow-data-source/README.md
index 5ba77b1fe..20e07cf9e 100644
--- a/arrow-data-source/README.md
+++ b/arrow-data-source/README.md
@@ -205,6 +205,38 @@ val df = spark.read
 df.createOrReplaceTempView("my_temp_view")
 spark.sql("SELECT * FROM my_temp_view LIMIT 10").show(10)
 ```
+
+### Example on integrating with Hive metastore
+Hive metastore is commonly used in a Thrift server based setup, and Arrow Data Source supports this case as well. The APIs are a little different from vanilla Spark. Here's an example of creating the metadata tables for Parquet-based TPC-H data:
+```
+// create a database first, otherwise the tables will be stored in the default database
+spark.sql("create database testtpch;").show
+spark.sql("use testtpch;").show
+
+spark.catalog.createTable("lineitem", "hdfs:////user/root/date_tpchnp_1000/lineitem", "arrow")
+spark.catalog.createTable("orders", "hdfs:////user/root/date_tpchnp_1000/orders", "arrow")
+spark.catalog.createTable("customer", "hdfs:////user/root/date_tpchnp_1000/customer", "arrow")
+spark.catalog.createTable("nation", "hdfs:////user/root/date_tpchnp_1000/nation", "arrow")
+spark.catalog.createTable("region", "hdfs:////user/root/date_tpchnp_1000/region", "arrow")
+spark.catalog.createTable("part", "hdfs:////user/root/date_tpchnp_1000/part", "arrow")
+spark.catalog.createTable("partsupp", "hdfs:////user/root/date_tpchnp_1000/partsupp", "arrow")
+spark.catalog.createTable("supplier", "hdfs:////user/root/date_tpchnp_1000/supplier", "arrow")
+
+// recover the partitions if the table is partitioned
+spark.sql("ALTER TABLE lineitem RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE orders RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE customer RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE nation RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE region RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE partsupp RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE part RECOVER PARTITIONS").show
+spark.sql("ALTER TABLE supplier RECOVER PARTITIONS").show
+```
+By default, the "arrow" format reads the files as Parquet through Arrow Data Source. Note that this step only creates the metadata; the original files are not changed. Other file formats (ORC, CSV) are also supported. Here's an example of creating a metadata table for ORC files:
+```
+spark.catalog.createTable("web_site", "arrow", Map("path" -> "hdfs:///root/tmp/TPCDS_ORC/web_site", "originalFormat" -> "orc"))
+```
+
 ### To validate if ArrowDataSource works properly
 
 To validate if ArrowDataSource works, you can go to the DAG to check if ArrowScan has been used from the above example query.
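
The last hunk above tells the reader to check the DAG for ArrowScan. A minimal sketch of that check, assuming the `testtpch` database and the TPC-H table names registered in the Hive metastore example; the query itself is illustrative, not taken from the README:
```
// switch to the database created in the Hive metastore example above
spark.sql("USE testtpch")

// illustrative sanity-check query against one of the registered TPC-H tables
spark.sql("SELECT COUNT(*) FROM lineitem").show()

// printing the physical plan is another way to confirm the scan operator;
// when the arrow data source is in use, the scan node in the plan / SQL UI DAG
// should show ArrowScan rather than the vanilla Parquet scan
spark.sql("SELECT COUNT(*) FROM lineitem").explain()
```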
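
The ORC example is the only non-Parquet case shown above; a CSV table would presumably follow the same pattern. The sketch below assumes `"csv"` is a valid value for the `originalFormat` option and uses a hypothetical HDFS path; neither is confirmed by the README:
```
// hypothetical CSV variant of the ORC example; the "csv" value for "originalFormat"
// and the HDFS path are assumptions, not taken from the README
spark.catalog.createTable("customer_csv", "arrow", Map("path" -> "hdfs:///root/tmp/TPCDS_CSV/customer", "originalFormat" -> "csv"))
```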