Bigquery connector changes #3001

Merged

merged 19 commits into main from arrow on Sep 6, 2023
Conversation


@k-anshul (Member) commented on Aug 30, 2023

This PR makes the following changes to the existing logic of the BigQuery connector:

  1. Creates local temporary parquet files from the arrow records returned by the BigQuery SDK (a sketch of this flow follows below). This logic currently lives in a fork of the BigQuery SDK that carries the changes from the draft PR feat(bigquery): expose Apache Arrow data through ArrowIterator (googleapis/google-cloud-go#8506); we will migrate to the upstream SDK once that PR is merged.
  2. The parquet files are then ingested into DuckDB.

For cases where the BigQuery SDK does not return arrow records, we fall back to dumping the records in JSON format.
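Roughly, the flow looks like the sketch below. This is a hedged illustration only (the package, function and table names are illustrative assumptions, not the connector's actual code), assuming the apache/arrow Go pqarrow writer and a DuckDB database/sql driver registered by the caller:

// Package example holds a hedged sketch of the flow described above: stream arrow records
// into a temporary parquet file, then load that file into DuckDB.
package example

import (
	"database/sql"
	"fmt"
	"os"

	"github.com/apache/arrow/go/v13/arrow"
	"github.com/apache/arrow/go/v13/parquet"
	"github.com/apache/arrow/go/v13/parquet/pqarrow"
)

// writeParquet streams arrow records into a parquet file at path.
func writeParquet(schema *arrow.Schema, records []arrow.Record, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w, err := pqarrow.NewFileWriter(schema, f, parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
	if err != nil {
		return err
	}
	for _, rec := range records {
		if err := w.Write(rec); err != nil {
			w.Close()
			return err
		}
	}
	return w.Close() // flushes buffered data and writes the parquet footer
}

// ingestIntoDuckDB loads the parquet file into a DuckDB table. db is assumed to be opened by the
// caller with a DuckDB database/sql driver; the path must be trusted (no escaping is done here).
func ingestIntoDuckDB(db *sql.DB, path string) error {
	q := fmt.Sprintf("CREATE OR REPLACE TABLE bq_data AS SELECT * FROM read_parquet('%s')", path)
	_, err := db.Exec(q)
	return err
}

DuckDB scans the parquet file directly via read_parquet, so no intermediate row-by-row conversion is needed.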

Changes compared to the previous implementation:

  1. BIGNUMERIC is no longer directly supported. Users can cast it to a string, or to NUMERIC in the SQL query if some loss of precision is acceptable.
  2. Repeated and nested types are now supported.
  3. Hopefully faster than the previous approach for larger datasets :)

@k-anshul changed the title from Arrow to Bigquery connector v2 on Aug 30, 2023
@k-anshul changed the title from Bigquery connector v2 to Bigquery connector changes on Aug 30, 2023
@k-anshul marked this pull request as ready for review on August 30, 2023 10:02
@k-anshul (Member, Author):

Sample performance results:
Query: SELECT * FROM bigquery-public-data.covid19_open_data.compatibility_view LIMIT 10000000
Old approach: 135 seconds
New approach: 60 seconds

Resolved review threads: go.mod; runtime/drivers/bigquery/sql_store.go (outdated)
Comment on lines +186 to +187
writer.Close()
fw.Close()
Contributor:
There are already deferred calls to these – is it safe/necessary to call twice?

Member Author:

Yes. In the deferred calls, the error returned by writer.Close() is simply ignored, and the second fw.Close() is a no-op. There are many return paths in this code, so I used defer as well so that both are closed on the error paths.
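For illustration, a minimal sketch of that pattern (names are illustrative, not the PR's code):

package example

import "os"

// writeAll defers Close so every early-return path releases the file, and also calls Close
// explicitly on the success path so its error can be checked rather than silently dropped.
func writeAll(path string, write func(*os.File) error) error {
	fw, err := os.Create(path)
	if err != nil {
		return err
	}
	// The deferred call covers the error paths; calling Close twice on an *os.File is harmless,
	// the second call simply returns an "already closed" error that the defer ignores.
	defer fw.Close()

	if err := write(fw); err != nil {
		return err
	}
	return fw.Close()
}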

Resolved review threads (outdated): runtime/pkg/fileutil/fileutil.go; runtime/drivers/duckdb/transporter/sqlstore_to_duckDB.go; runtime/drivers/bigquery/arrow.go
Comment on lines 100 to 154
// Next returns true if another record can be produced
func (rs *arrowRecordReader) Next() bool {
if rs.err != nil {
return false
}

if len(rs.records) == 0 {
tz := time.Now()
next, err := rs.bqIter.Next()
if err != nil {
rs.err = err
return false
}
rs.apinext += time.Since(tz)

rs.records, rs.err = rs.nextArrowRecords(next)
if rs.err != nil {
return false
}
}
if rs.cur != nil {
rs.cur.Release()
}
rs.cur = rs.records[0]
rs.records = rs.records[1:]
return true
}

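// Err returns the first error encountered, treating iterator.Done as the normal end of data rather than an error.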
func (rs *arrowRecordReader) Err() error {
if errors.Is(rs.err, iterator.Done) {
return nil
}
return rs.err
}

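// nextArrowRecords decodes a single BigQuery ArrowRecordBatch into arrow records by prepending the
// serialized arrow schema to the batch bytes and reading them through an IPC reader.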
func (rs *arrowRecordReader) nextArrowRecords(r *bigquery.ArrowRecordBatch) ([]arrow.Record, error) {
t := time.Now()
defer func() {
rs.ipcread += time.Since(t)
}()

buf := bytes.NewBuffer(rs.bqIter.SerializedArrowSchema())
buf.Write(r.Data)
rdr, err := ipc.NewReader(buf, ipc.WithSchema(rs.arrowSchema), ipc.WithAllocator(rs.allocator))
if err != nil {
return nil, err
}
defer rdr.Release()
records := make([]arrow.Record, 0)
for rdr.Next() {
rec := rdr.Record()
rec.Retain()
records = append(records, rec)
}
return records, rdr.Err()
Contributor:
Why is the extra layer of buffering in rs.records needed versus directly buffering rdr and proxying to it in Next()?

Member Author:

I tried proxying rdr in Next, but overall it feels a lot more complicated than the current implementation. There are multiple cases to consider: rdr is nil on the first Next call, and rdr.Next() can return false either because there is no more data or because a genuine error occurred (which needs to be handled separately as well). Overall it felt much easier to just look at the size of the slice.
Also, as of now an ArrowRecordBatch from BigQuery yields only one arrow record, but it's better to keep the slice in case the underlying implementation changes in the future.

@k-anshul merged commit ccd6fd2 into main on Sep 6, 2023
@k-anshul deleted the arrow branch on September 6, 2023 09:17