Bigquery connector changes #3001
Conversation
Sample performance results:
writer.Close()
fw.Close()
There are already deferred calls to these – is it safe/necessary to call twice?
Yes. The second writer.Close() will return an error, which we are ignoring, and the second fw.Close() will not do anything. There are many return paths in this code, so I used a defer as well to close in the error cases.
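For illustration, a minimal sketch of that pattern (assumptions: gzip.Writer stands in for the parquet writer used in the PR, and the file handling and helper name are made up; this is not the PR's actual code):

```go
package example

import (
	"compress/gzip"
	"os"
)

// writeCompressed shows the close-twice pattern: deferred Close calls cover the
// early return paths, while the explicit Close calls on the happy path surface
// flush errors. The deferred calls that run afterwards are harmless.
func writeCompressed(path string, data []byte) error {
	fw, err := os.Create(path)
	if err != nil {
		return err
	}
	defer fw.Close() // after the explicit Close below, this does nothing useful; error ignored

	writer := gzip.NewWriter(fw)
	defer writer.Close() // covers the error returns below; error ignored

	if _, err := writer.Write(data); err != nil {
		return err
	}
	if err := writer.Close(); err != nil { // explicit close: flush and check the error
		return err
	}
	return fw.Close() // explicit close: check the error
}
```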
runtime/drivers/bigquery/arrow.go (Outdated)
// Next returns true if another record can be produced
func (rs *arrowRecordReader) Next() bool {
	if rs.err != nil {
		return false
	}

	if len(rs.records) == 0 {
		tz := time.Now()
		next, err := rs.bqIter.Next()
		if err != nil {
			rs.err = err
			return false
		}
		rs.apinext += time.Since(tz)

		rs.records, rs.err = rs.nextArrowRecords(next)
		if rs.err != nil {
			return false
		}
	}
	if rs.cur != nil {
		rs.cur.Release()
	}
	rs.cur = rs.records[0]
	rs.records = rs.records[1:]
	return true
}

// Err returns the first error encountered during iteration, treating iterator.Done as a normal end
func (rs *arrowRecordReader) Err() error {
	if errors.Is(rs.err, iterator.Done) {
		return nil
	}
	return rs.err
}

// nextArrowRecords decodes one BigQuery ArrowRecordBatch into Arrow records via an IPC reader
func (rs *arrowRecordReader) nextArrowRecords(r *bigquery.ArrowRecordBatch) ([]arrow.Record, error) {
	t := time.Now()
	defer func() {
		rs.ipcread += time.Since(t)
	}()

	buf := bytes.NewBuffer(rs.bqIter.SerializedArrowSchema())
	buf.Write(r.Data)
	rdr, err := ipc.NewReader(buf, ipc.WithSchema(rs.arrowSchema), ipc.WithAllocator(rs.allocator))
	if err != nil {
		return nil, err
	}
	defer rdr.Release()
	records := make([]arrow.Record, 0)
	for rdr.Next() {
		rec := rdr.Record()
		rec.Retain()
		records = append(records, rec)
	}
	return records, rdr.Err()
}
Why is the extra layer of buffering in rs.records needed, versus directly buffering rdr and proxying to it in Next()?
I tried proxying rdr in Next, but overall it felt a lot more complicated than the current implementation. There are multiple cases to consider: rdr will be nil on the first Next call, and rdr can return false from Next either when there is no more data or when there is a genuine error (which needs to be handled separately as well). Overall it felt a lot easier to just look at the length of the slice.
Also, as of now an ArrowRecordBatch from BigQuery will only return one Arrow record, but it is better to keep the slice in case the underlying implementation changes in the future.
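For comparison, a rough hypothetical sketch of what the proxying alternative might look like (rs.rdr and nextProxied are made up; rs.rdr would be a new field holding the current ipc.Reader, and nothing here is in the PR). The extra cases Next would have to handle are marked in the comments:

```go
// Hypothetical alternative, not in the PR: proxy Next directly to an ipc.Reader
// kept in a new rs.rdr field instead of buffering records in a slice.
func (rs *arrowRecordReader) nextProxied() bool {
	if rs.err != nil {
		return false
	}
	for {
		// Case 1: no reader yet (first call) or the current reader is exhausted.
		if rs.rdr == nil || !rs.rdr.Next() {
			if rs.rdr != nil {
				// Case 2: distinguish a genuine decode error from plain exhaustion.
				if err := rs.rdr.Err(); err != nil {
					rs.err = err
					return false
				}
				rs.rdr.Release()
				rs.rdr = nil
			}
			// Case 3: fetch the next batch and rebuild the IPC reader for it.
			batch, err := rs.bqIter.Next()
			if err != nil {
				rs.err = err // includes iterator.Done, filtered later by Err()
				return false
			}
			buf := bytes.NewBuffer(rs.bqIter.SerializedArrowSchema())
			buf.Write(batch.Data)
			rs.rdr, err = ipc.NewReader(buf, ipc.WithSchema(rs.arrowSchema), ipc.WithAllocator(rs.allocator))
			if err != nil {
				rs.err = err
				return false
			}
			continue
		}
		// The reader produced a record; hold on to it until the next call.
		if rs.cur != nil {
			rs.cur.Release()
		}
		rs.cur = rs.rdr.Record()
		rs.cur.Retain()
		return true
	}
}
```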
This PR makes the following changes to the existing logic of the BigQuery connector:
- For cases when the BigQuery SDK does not return Arrow records, we will dump the records in JSON format.

Changes compared to the previous implementation:
- BigNumeric will not be directly supported. Users can cast to varchar, or to NUMERIC in the SQL query if loss of precision is acceptable (see the sketch after this list).
- Repeated and Nested types are now supported.
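For illustration, a hypothetical sketch of the cast workaround using the BigQuery Go client. Dataset, table, and column names are made up, and I am assuming "varchar" above refers to BigQuery's STRING text type:

```go
package example

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

// queryWithCast casts a BIGNUMERIC column in the source SQL so the connector
// receives directly supported types (NUMERIC with possible precision loss, or
// STRING to keep the full value as text).
func queryWithCast(ctx context.Context, client *bigquery.Client) error {
	q := client.Query(`
		SELECT
			CAST(big_amount AS NUMERIC) AS big_amount_numeric, -- acceptable precision loss
			CAST(big_amount AS STRING)  AS big_amount_text     -- full value as text
		FROM my_dataset.my_table
	`)
	it, err := q.Read(ctx)
	if err != nil {
		return err
	}
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			return nil
		}
		if err != nil {
			return err
		}
		fmt.Println(row)
	}
}
```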