BigQuery Time Values May Be Incorrect #1125

Closed
dhermes opened this issue Sep 8, 2015 · 12 comments

dhermes commented Sep 8, 2015

On the _datetime_from_json function, @arterrey commented that

I'm getting values out of BigQuery in seconds. So * 1000.0 is not enough - it needs to be * 1000000.0 to work. However, I'm not sure if that is a bug in the BigQuery API or a bug in this client library.

This contradicts the note in the code that

Field value will be in milliseconds.
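
For reference, here is a minimal sketch of the conversion in question, using the value @arterrey reported. The helper name matches the library; the epoch scaffolding is illustrative, assuming pytz for UTC:

    import datetime

    import pytz

    _EPOCH = datetime.datetime(1970, 1, 1, tzinfo=pytz.utc)

    def _datetime_from_microseconds(value):
        # Interpret ``value`` as microseconds since the UNIX epoch.
        return _EPOCH + datetime.timedelta(microseconds=value)

    value = u'1.33948831400937E9'  # what the API actually returns

    # Current behavior: treat the field value as milliseconds.
    _datetime_from_microseconds(1000.0 * float(value))
    # -> datetime.datetime(1970, 1, 16, 12, 4, 48, 314009, tzinfo=<UTC>)

    # What @arterrey reports needing: treat the value as seconds.
    _datetime_from_microseconds(1000000.0 * float(value))
    # -> datetime.datetime(2012, 6, 12, 8, 5, 14, 9370, tzinfo=<UTC>)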

@dhermes dhermes added the api: bigquery Issues related to the BigQuery API. label Sep 8, 2015
@dhermes dhermes changed the title Bigquery Time Values May Be Incorrect BigQuery Time Values May Be Incorrect Sep 8, 2015

dhermes commented Sep 8, 2015

@arterrey Can you elaborate on where you are "getting values out of BigQuery in seconds"? For example, a sample use of this library and how the output differs from what you expect?

dhermes referenced this issue Sep 8, 2015
Also making it require a float (or int) instead of bailing
out for NoneType. Changing the use of the method in
bigquery to not pass potentially null values and to multiply
by 1000.0 (convert from millis to micros). Also updating
micros -> datetime conversion in datastore to use the newly
converted method.

arterrey commented Sep 8, 2015

The field value from the web interface is "2012-06-12 08:05:14 UTC". From the pdb session below you can see that the value comes back as a datetime in 1970 rather than 2012.

(Pdb) list 762
757     def _bool_from_json(value, field):
758         if _not_null(value, field):
759             return value.lower() in ['t', 'true', '1']
760     
761     
762     def _datetime_from_json(value, field):
763         if _not_null(value, field):
764             import pdb; pdb.set_trace()
765             # Field value will be in milliseconds.        
766  ->         return _datetime_from_microseconds(1000.0 * float(value))
767     
(Pdb) value
u'1.33948831400937E9'
(Pdb) float(value)
1339488314.00937
(Pdb) _datetime_from_microseconds(1000.0 * float(value))
datetime.datetime(1970, 1, 16, 12, 4, 48, 314009, tzinfo=<UTC>)
(Pdb) _datetime_from_microseconds(1000000.0 * float(value))
datetime.datetime(2012, 6, 12, 8, 5, 14, 9370, tzinfo=<UTC>)
(Pdb) 

dhermes commented Sep 8, 2015

That's really great info! Thanks for showing the pdb stack.

It looks like

value == 1339.488314009370

Which request did you make to retrieve this data? Can you provide a code snippet?

arterrey commented Sep 8, 2015

I think you mean

    value == 1339488314.00937

just using fetch_data...

    def fetch_table(data_set, table_name):
        table = data_set.table(table_name)
        assert table.exists()
        table.reload()

        next_page = None
        while True:
            rows, total_count, next_page = table.fetch_data(page_token=next_page)
            for row in rows:
                # use the schema to convert the row tuple into a dict
                row_dict = {field.name: value
                            for field, value in zip(table.schema, row)}
                yield PathableDict(row_dict)

            if next_page is None:
                break

dhermes commented Sep 8, 2015

Sorry I overlooked the actual value:

(Pdb) value
u'1.33948831400937E9'

It seems that instead of

datetime.timedelta(microseconds=1000.0 * float(value))

we should just be using

datetime.timedelta(seconds=float(value))

Like you said, the value is seconds, not milliseconds.
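
Concretely, the helper would become something like this (a sketch, not a committed fix; it reuses the UTC epoch constant sketched in the issue description, and _not_null is the library's existing guard):

    def _datetime_from_json(value, field):
        if _not_null(value, field):
            # Field value is a UNIX timestamp in (fractional) seconds,
            # e.g. u'1.33948831400937E9' -- not milliseconds.
            return _EPOCH + datetime.timedelta(seconds=float(value))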

I couldn't find any evidence of "seconds" in the discovery doc:
https://www.googleapis.com/discovery/v1/apis/bigquery/v2/rest

FWIW, bigquery.tabledata.list returns a TableDataList; its schema has TableDataList.rows as a list of TableRow, each TableRow.f as a list of TableCell, and each cell value as an any type (which is kind of a black hole).

I'm off to bed for now but will try to throw in some test data (from our system tests) and inspect the output on the APIs explorer:
https://developers.google.com/apis-explorer/#p/bigquery/v2/bigquery.tabledata.list

Also FWIW, the link in our docstring:
https://cloud.google.com/bigquery/reference/rest/v2/tabledata/list
is broken. It should be:
https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list


Really just

rows, total_count, next_page = table.fetch_data(page_token=next_page)

was what I was looking for. Thanks!

tseaver commented Sep 8, 2015

I just created a table with the following schema:

[
   {
    "name": "Timestamp",
    "type": "TIMESTAMP",
    "mode": "REQUIRED"
   },
   {
    "name": "Description",
    "type": "STRING",
    "mode": "REQUIRED"
   },
   {
    "name": "Quantity",
    "type": "INTEGER",
    "mode": "REQUIRED"
   },
   {
    "name": "Unit_Price",
    "type": "FLOAT",
    "mode": "REQUIRED"
   }
]

and uploaded the following data:

$ cat /tmp/with_timestamp.csv 
2015-09-09T17:59:22Z,Testing,1,2.5

In the developer console, the data looks like:

Row  Timestamp                Description  Quantity Unit_Price   
1    2015-09-09 17:59:22 UTC  Testing      1        2.5

In the API explorer, tabledata.list returns:

{
 "kind": "bigquery#tableDataList",
 "etag": "\"m-9bZ2XqlR7n8jGvqgVmLOTsm1s/xddlFZR3Ob1xPfHtKV-u3wV2TLU\"",
 "totalRows": "1",
 "rows": [
  {
   "f": [
    {
     "v": "1.441821562E9"
    },
    {
     "v": "Testing"
    },
    {
     "v": "1"
    },
    {
     "v": "2.5"
    }
   ]
  }
 ]
}

The "Preparing data for BigQuery" docs don't specify how timestamps are returned, only the formats that can be used to load the data. They do say:

BigQuery stores TIMESTAMP data internally as a UNIX timestamp with microsecond precision.
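
Interpreting that returned value as seconds round-trips cleanly to the uploaded timestamp (a quick standard-library check):

    >>> import datetime
    >>> datetime.datetime.utcfromtimestamp(float(u'1.441821562E9'))
    datetime.datetime(2015, 9, 9, 17, 59, 22)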

dhermes commented Sep 8, 2015

So what shall we do? Also, how does 2015-09-09T17:59:22Z,Testing,1,2.5 make it through our pipeline?

tseaver commented Sep 8, 2015

Given that the values are always returned as scientific-formatted floats of seconds (unlike datastore timestamps, which are stored as microseconds), we should be using float(value).

How does 2015-09-09T17:59:22Z,Testing,1,2.5 make it through our pipeline?

ISO 8601 strings are one of the supported formats for loading TIMESTAMP data.

dhermes commented Sep 8, 2015

I mean: what is the code path?

tseaver commented Sep 8, 2015

Client.load_table_from_storage could trigger that path.
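
Something like the following would exercise that path (a sketch; the bucket, job name, and client object are hypothetical):

    # Hypothetical names. BigQuery parses the CSV server-side, so the
    # client library never touches the timestamp string itself.
    job = client.load_table_from_storage(
        'load-with-timestamp', table, 'gs://my-bucket/with_timestamp.csv')
    job.begin()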

dhermes commented Sep 8, 2015

I see. So the CSV goes directly through GCS, and our code never actually handles the timestamp string?

tseaver commented Sep 8, 2015

Yup. Table.insert_data expects clients to pass in real Python datetime.datetime instances for TIMESTAMP fields.
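
For contrast, a streaming insert against the table from the experiment above would look roughly like this (a sketch; table is assumed to be a loaded Table instance):

    import datetime

    import pytz

    # Rows are tuples matching the table schema; the TIMESTAMP column is
    # a timezone-aware datetime, not a string or a raw number.
    rows = [
        (datetime.datetime(2015, 9, 9, 17, 59, 22, tzinfo=pytz.utc),
         'Testing', 1, 2.5),
    ]
    errors = table.insert_data(rows)  # non-empty if any row was rejected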
