Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heads-up on future of telemetry data layout in BigQuery #53

Open
jklukas opened this issue Aug 2, 2019 · 12 comments
Open

Heads-up on future of telemetry data layout in BigQuery #53

jklukas opened this issue Aug 2, 2019 · 12 comments
Labels
big query Issues or feature requests related to Google's BigQuery

Comments

@jklukas
Copy link

jklukas commented Aug 2, 2019

I wanted to summarize implications from recent and upcoming changes to the layout of telemetry data in BQ as it relates to STMO.

First, we are starting to migrate all data into the moz-fx-data-shared-prod project from the moz-fx-data-derived-datasets project. shared-prod will be the combined home for both incoming data from the telemetry pipeline and for derived datasets. As we transition derived data, we will plan to present views of that data in both derived-datasets and in shared-prod, so that STMO users can still access the tables as they move. Eventually (by end of August?), we should be able to switch STMO's BigQuery sources to point to shared-prod with no change to queries.

Second, we are planning to present all user-facing data as views rather than having users hit tables directly. This gives a number of benefits in terms of allowing reconfiguration of backend tables without changing the user interface and conversely allowing us to evolve the user interface without having to rewrite backend tables. This will mostly be transparent to users, since we are replacing tables with views that have an identical schema. One issue we're aware of, however, is that views don't have the same preview functionality as tables (see #52). As we move all tables to being clustered, however, it is now efficient to run SELECT * LIMIT 10-type queries to investigate content; this may be sufficient that loss of table preview is not terrible.

Third, the overall data layout is becoming more complex. For each document namespace, we now have four separate BigQuery datasets. For telemetry, there are tables in telemetry_live, telemetry_stable, and telemetry_derived; then there are user-facing views in telemetry (unsuffixed). We will be producing documentation to explain the full function of each of these datasets, but for the vast majority of use cases, we will be expecting users to query only the views presented in the unsuffixed datasets (primarily telemetry). It may be useful to have an option in STMO to hide all suffixed datasets (those ending in _live, _stable, or _derived) so that users see only the views that are intended for end-user queries.

@jezdez
Copy link
Contributor

jezdez commented Aug 6, 2019

I wanted to summarize implications from recent and upcoming changes to the layout of telemetry data in BQ as it relates to STMO.

@jklukas Thanks for the notes, but I'm surprised by the way this is being communicated since this seems like a non-trivial project and the time-frame (3 weeks!) is a bit small to make sure things will continue to work. Is @rafrombrc and @jasonthomas aware about those changes?

First, we are starting to migrate all data into the moz-fx-data-shared-prod project from the moz-fx-data-derived-datasets project. shared-prod will be the combined home for both incoming data from the telemetry pipeline and for derived datasets. As we transition derived data, we will plan to present views of that data in both derived-datasets and in shared-prod, so that STMO users can still access the tables as they move. Eventually (by end of August?), we should be able to switch STMO's BigQuery sources to point to shared-prod with no change to queries.

Will those changes be communicated via the usual channels (e.g. fx-data-dev etc)? As maintainers of the Redash fork it's usually not our task to talk to query owners about upcoming data changes. Since your changes will have a user experience effect I would recommend to err on the side of over-communication to reduce the level of stress such subtle changes may introduce.

Second, we are planning to present all user-facing data as views rather than having users hit tables directly. This gives a number of benefits in terms of allowing reconfiguration of backend tables without changing the user interface and conversely allowing us to evolve the user interface without having to rewrite backend tables. This will mostly be transparent to users, since we are replacing tables with views that have an identical schema. One issue we're aware of, however, is that views don't have the same preview functionality as tables (see #52). As we move all tables to being clustered, however, it is now efficient to run SELECT * LIMIT 10-type queries to investigate content; this may be sufficient that loss of table preview is not terrible.

The preview functionality via tabledata.list isn't being used right now, I think. Can you clarify which API endpoints are considered "preview"?

Here's the query runner being used: https://github.com/getredash/redash/blob/master/redash/query_runner/big_query.py

FWIW we do have the ability to override functionality in an own query runner (in fact we're working on one in #38).

Regarding the use of SELECT * LIMIT 10-type queries in the future, how does this match Google's recommendation to not use LIMIT for cost control? Would our partitions simply make that problem moot?

Third, the overall data layout is becoming more complex. For each document namespace, we now have four separate BigQuery datasets. For telemetry, there are tables in telemetry_live, telemetry_stable, and telemetry_derived; then there are user-facing views in telemetry (unsuffixed). We will be producing documentation to explain the full function of each of these datasets, but for the vast majority of use cases, we will be expecting users to query only the views presented in the unsuffixed datasets (primarily telemetry). It may be useful to have an option in STMO to hide all suffixed datasets (those ending in _live, _stable, or _derived) so that users see only the views that are intended for end-user queries.

FWIW, I'm trying to understand why there are separate views when only one is user-facing? Who is using the non-user-facing views?

Can we prevent the non-user-facing views to appear in the API results by using permissions instead of having to hard-code schema information in a consumer of the views and doing the filtering on the client side?

I'm asking since the implementation complexity to filter out specific tables from the schema browser isn't super high, but it's there and we've been burned with a non-trivial maintenance burden with Redash customizations. IOW I'm reluctant to implement something that can be easier be implemented on the API layer.

E.g. do you expect STMO users to sometimes use non-user-facing views (_live, _stable and _derived)?

@jezdez jezdez added the big query Issues or feature requests related to Google's BigQuery label Aug 6, 2019
@jklukas
Copy link
Author

jklukas commented Aug 6, 2019

@jezdez - I wanted to make sure we started a conversation about these things, but I hope I haven't induced too much panic. In general, we are expecting that these changes will be basically invisible to users, but I want to make sure we're not missing any details.

The ideas here were advertised as proposals and shared with the overall data engineering team in the Monday meetings for feedback and discussion. @jasonthomas has been actively engaged in that feedback and @rafrombrc should also be at least marginally aware.

Will those changes be communicated via the usual channels (e.g. fx-data-dev etc)?

Indeed. And they'll be documented in docs.t.m.o. It's my expectation that we will keep derived-datasets and shared-prod projects compatible by duplicating views in both places, so users of Redash should never be aware of the data underneath moving. Redash itself shouldn't need any code changes in that sense; we will handle everything via backend views and then eventually by changing the BigQuery source configuration to point to the new location.

Regarding the use of SELECT * LIMIT 10-type queries in the future, how does this match Google's recommendation to not use LIMIT for cost control? Would our partitions simply make that problem moot?

Yes, that point no longer applies for clustered tables. We have asked Google to update that documentation to make that distinction, and we have some relevant notes on query optimization in DTMO.

FWIW, I'm trying to understand why there are separate views when only one is user-facing? Who is using the non-user-facing views?

It's generally the underlying tables that are non-user-facing, and the SQL views are what we present to users.

E.g. do you expect STMO users to sometimes use non-user-facing views (_live, _stable and _derived)?

There will probably be some cases where advanced users will want to query a _stable table for a particular version of a ping, or they might want to directly query _live tables for use cases that want to see near-realtime data rather than completed days.

Now that I think more about it, the existing toggle in redash for hiding tables with a version suffix is likely sufficient here as well. The views we present in the unsuffixed datasets will mostly not have a version attached, but the tables in _live, _stable, and _derived will always have versions. So hiding tables/views with a version suffix will effectively hide everything in _live, _stable, and _derived.

@jklukas
Copy link
Author

jklukas commented Aug 6, 2019

The preview functionality via tabledata.list isn't being used right now, I think. Can you clarify which API endpoints are considered "preview"?

Looking at STMO now, I'm not seeing anything that looks like the "preview" that's available in BigQuery. I think I am misremembering a conversation about that functionality being added to STMO. I believe you are correct that it's the tabledata.list API that populates the "Preview" section of the BigQuery UI.

So, to get back to your original concern:

this seems like a non-trivial project and the time-frame (3 weeks!)

I don't think I see anything that needs to change in STMO code to support these backend changes, and I apologize again for causing concern. I do want to better socialize the upcoming changes to see if there are more thoughts about how STMO can provide an even better experience for users interacting with BigQuery.

@rafrombrc
Copy link
Member

Hey there... mostly I think you're right, @jklukas, that the preview functionality isn't in use. However, @emtwo has been working on code to generate sample data for each field to show up in the schema drawer, and I can't recall whether that was making use of the preview functionality. Marina, can you let us know if you think these changes will impact your data sample work?

@emtwo
Copy link
Contributor

emtwo commented Aug 8, 2019

Unfortunately, we do use tabledata().list() here: https://github.com/mozilla/redash/blob/master/redash/query_runner/big_query.py#L339

It's used to gather samples for tables, and I confirmed that we don't get a response for views. Instead we get this error Cannot list a table of type VIEW.

Since we can use limit more safely now, it's likely that we can fall back on it for displaying samples for views.

@jklukas
Copy link
Author

jklukas commented Aug 8, 2019

Since we can use limit more safely now, it's likely that we can fall back on it for displaying samples for views.

Yes, issuing a query like SELECT * FROM foo LIMIT 10 should now return quickly and be relatively cheap, so I do expect it can be a suitable replacement for tabledata().list(). BigQuery should also cache the query results in most cases, so repeated queries on the same day will be free.

@jklukas
Copy link
Author

jklukas commented Aug 9, 2019

For doc days today, I put together a draft of dtmo updates to reflect the upcoming prod structure and give some more guidance about how to navigate the layout: mozilla/data-docs#302

That's a formalization of what wanted to convey in the original issue here.

@emtwo
Copy link
Contributor

emtwo commented Aug 26, 2019

@jklukas since the deployment of this change: mozilla/redash@32c910a we now fallback to a SELECT * FROM foo LIMIT 1 fallback for when a BigQuery API call to tabledata().list() call fails for whatever reason.

One such reason is due to attempting to list a view, as I mentioned above. I have found a couple of views, e.g. telemetry.addons_v2 and telemetry.active_profiles (if I'm understanding the BQ console view labels correctly) that are visible in stmo and also have data samples available for them. So I believe this is confirmation that this new change seems to be working.

There are, however, several views that still don't have data samples. This could potentially be caused by a couple of things:

  1. This issue: Allow data samples to be gathered for partitioned tables. redash#966 where partitioned tables need a more specific query
  2. With all the new schema changes, the queue for updating samples might be backlogged with incomplete updates.

@jklukas
Copy link
Author

jklukas commented Aug 26, 2019

This issue: mozilla/redash#966 where partitioned tables need a more specific query

Great point; most of these tables will need a filter on submission_date (for most views on derived tables) or DATE(submission_timestamp) (for views on ping tables).

A complete solution might look like this:

  • Fetch the Table object from the BigQuery API
  • Check if timePartitioning.requirePartitionFilter is set to true
  • If so, check timePartitioning.field to get the name of the field and add WHERE DATE({timePartitioning.field}) = '{date}' where {date} is filled in with yesterday's date; if the field is already a date rather than a timestamp, the DATE wrapper may need to be dropped

@emtwo
Copy link
Contributor

emtwo commented Aug 26, 2019

Thanks for pointing out the timePartitioning.requirePartitionFilter field! Sounds like exactly what we need here.

I am not super familiar with the data though, and I wonder if filtering on the most recent date (yesterday) would be sufficient for all tables? Are there any old tables that have stopped being updated but we might still need samples for? What do you think @jklukas?

@jklukas
Copy link
Author

jklukas commented Aug 26, 2019

I wonder if filtering on the most recent date (yesterday) would be sufficient for all tables? Are there any old tables that have stopped being updated but we might still need samples for?

It should be a minority of tables that aren't getting new data daily, but your are definitely right that this is an edge case that we'll need to handle.

Because we're doing a LIMIT 1 here, we can be pretty loose about the partition filter. The following would still be efficient:

SELECT * FROM mytable
WHERE submission_timestamp > '0001-01-01'
LIMIT 1

So that query effectively sidesteps the requirePartitionFilter field without actually applying any filter on partitions. But we only need 1 record so we only scan a small amount of data.

I was tempted to recommend grabbing the most recent row by putting an ORDER BY in there, but it looks like that breaks partition filtering and it ends up scanning all partitions in that case.

If we decided it was desirable to get a recent record, we could list partitions in the table in order to be sure that we're choosing the most recent non-empty date.

@emtwo
Copy link
Contributor

emtwo commented Aug 26, 2019

Cool! I like your idea with > '0001-01-01' it's simple and addresses the issue. And I'm glad it's still efficient! :) I don't think it's necessary (at least not for now) to get the most recent non-empty date, but just to find a partition that has some data we can use for a sample.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
big query Issues or feature requests related to Google's BigQuery
Projects
None yet
Development

No branches or pull requests

4 participants