Using incremental syncing, the final table shows 0 records (the raw table contains records) #8286
Comments
Related to #8028.
Maybe this is the same problem discussed in another issue: #7479 (comment).
I've been trying to reproduce this setup but was unable to get the same results (it's working as expected on my side):
Find below the logs from the first time the sync is run (with log messages stating that tables are created from scratch).
Then, when running incrementally, the rows are updated with new inserts. The dbt logs confirm this in more detail, and the destination properly shows the right number of rows being emitted/processed; the SQL query used to compare row counts per emitted date is sketched below.
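A minimal sketch of that per-date row-count comparison, using the table names that appear in the dbt logs below (an illustration, not the exact query from the original comment):

```sql
-- Compare raw vs. final row counts per emitted date (illustrative sketch).
select 'raw' as source,
       trunc(cast(_airbyte_emitted_at as timestamp)) as emitted_date,
       count(*) as row_count
from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
group by 1, 2
union all
select 'final' as source,
       trunc(cast(_airbyte_emitted_at as timestamp)) as emitted_date,
       count(*) as row_count
from "snowplow".test_gsc."search_analytics_by_page"
group by 1, 2
order by 2, 1;
```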
Maybe @danieldiamond, if you could please post equivalent log files and results from the same queries on your environment, that would help compare what you have against my setup?
First sync dbt.log:

```sql
2021-11-28 21:54:52.765966 (Thread-12): On model.airbyte_utils.search_analytics_by_page: /* {"app": "dbt", "dbt_version": "0.21.0", "profile_name": "normalize", "target_name": "prod", "node_id": "model.airbyte_utils.search_analytics_by_page"} */
create temporary table
"search_analytics_by_page__dbt_tmp215452545898"
compound sortkey(_airbyte_emitted_at)
as (
with __dbt__cte__search_analytics_by_page_ab1 as (
-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
-- depends_on: "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
select
case when json_extract_path_text(_airbyte_data, 'ctr', true) != '' then json_extract_path_text(_airbyte_data, 'ctr', true) end as ctr,
case when json_extract_path_text(_airbyte_data, 'date', true) != '' then json_extract_path_text(_airbyte_data, 'date', true) end as date,
case when json_extract_path_text(_airbyte_data, 'page', true) != '' then json_extract_path_text(_airbyte_data, 'page', true) end as page,
case when json_extract_path_text(_airbyte_data, 'clicks', true) != '' then json_extract_path_text(_airbyte_data, 'clicks', true) end as clicks,
case when json_extract_path_text(_airbyte_data, 'position', true) != '' then json_extract_path_text(_airbyte_data, 'position', true) end as position,
case when json_extract_path_text(_airbyte_data, 'site_url', true) != '' then json_extract_path_text(_airbyte_data, 'site_url', true) end as site_url,
case when json_extract_path_text(_airbyte_data, 'impressions', true) != '' then json_extract_path_text(_airbyte_data, 'impressions', true) end as impressions,
case when json_extract_path_text(_airbyte_data, 'search_type', true) != '' then json_extract_path_text(_airbyte_data, 'search_type', true) end as search_type,
_airbyte_ab_id,
_airbyte_emitted_at,
getdate() as _airbyte_normalized_at
from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page as table_alias
-- search_analytics_by_page
where 1 = 1
), __dbt__cte__search_analytics_by_page_ab2 as (
-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
-- depends_on: __dbt__cte__search_analytics_by_page_ab1
select
    cast(ctr as float) as ctr,
    cast(nullif(date, '') as date) as date,
    cast(page as varchar) as page,
    cast(clicks as bigint) as clicks,
    cast(position as float) as position,
    cast(site_url as varchar) as site_url,
    cast(impressions as bigint) as impressions,
    cast(search_type as varchar) as search_type,
    _airbyte_ab_id,
    _airbyte_emitted_at,
    getdate() as _airbyte_normalized_at
from __dbt__cte__search_analytics_by_page_ab1
-- search_analytics_by_page
where 1 = 1
), __dbt__cte__search_analytics_by_page_ab3 as (
-- SQL model to build a hash column based on the values of this record
-- depends_on: __dbt__cte__search_analytics_by_page_ab2
select
md5(cast(coalesce(cast(ctr as varchar), '') || '-' || coalesce(cast(date as varchar), '') || '-' || coalesce(cast(page as varchar), '') || '-' || coalesce(cast(clicks as varchar), '') || '-' || coalesce(cast(position as varchar), '') || '-' || coalesce(cast(site_url as varchar), '') || '-' || coalesce(cast(impressions as varchar), '') || '-' || coalesce(cast(search_type as varchar), '') as varchar)) as _airbyte_search_analytics_by_page_hashid,
tmp.*
from __dbt__cte__search_analytics_by_page_ab2 tmp
-- search_analytics_by_page
where 1 = 1
)
-- Final base SQL model
-- depends_on: __dbt__cte__search_analytics_by_page_ab3
select
ctr,
date,
page,
clicks,
position,
site_url,
impressions,
search_type,
_airbyte_ab_id,
_airbyte_emitted_at,
getdate() as _airbyte_normalized_at,
_airbyte_search_analytics_by_page_hashid
from __dbt__cte__search_analytics_by_page_ab3
-- search_analytics_by_page from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
where 1 = 1
and cast(_airbyte_emitted_at as timestamp with time zone) >= (
    select max(cast(_airbyte_emitted_at as timestamp with time zone))
    from "snowplow".test_gsc."search_analytics_by_page"
)
);
```

Second (incremental) sync dbt.log:

```sql
2021-11-28 21:56:39.327054 (Thread-12): On model.airbyte_utils.search_analytics_by_page: /* {"app": "dbt", "dbt_version": "0.21.0", "profile_name": "normalize", "target_name": "prod", "node_id": "model.airbyte_utils.search_analytics_by_page"} */
create temporary table
"search_analytics_by_page__dbt_tmp215639087673"
compound sortkey(_airbyte_emitted_at)
as (
with __dbt__cte__search_analytics_by_page_ab1 as (
-- SQL model to parse JSON blob stored in a single column and extract into separated field columns as described by the JSON Schema
-- depends_on: "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
select
case when json_extract_path_text(_airbyte_data, 'ctr', true) != '' then json_extract_path_text(_airbyte_data, 'ctr', true) end as ctr,
case when json_extract_path_text(_airbyte_data, 'date', true) != '' then json_extract_path_text(_airbyte_data, 'date', true) end as date,
case when json_extract_path_text(_airbyte_data, 'page', true) != '' then json_extract_path_text(_airbyte_data, 'page', true) end as page,
case when json_extract_path_text(_airbyte_data, 'clicks', true) != '' then json_extract_path_text(_airbyte_data, 'clicks', true) end as clicks,
case when json_extract_path_text(_airbyte_data, 'position', true) != '' then json_extract_path_text(_airbyte_data, 'position', true) end as position,
case when json_extract_path_text(_airbyte_data, 'site_url', true) != '' then json_extract_path_text(_airbyte_data, 'site_url', true) end as site_url,
case when json_extract_path_text(_airbyte_data, 'impressions', true) != '' then json_extract_path_text(_airbyte_data, 'impressions', true) end as impressions,
case when json_extract_path_text(_airbyte_data, 'search_type', true) != '' then json_extract_path_text(_airbyte_data, 'search_type', true) end as search_type,
_airbyte_ab_id,
_airbyte_emitted_at,
getdate() as _airbyte_normalized_at
from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page as table_alias
-- search_analytics_by_page
where 1 = 1
), __dbt__cte__search_analytics_by_page_ab2 as (
-- SQL model to cast each column to its adequate SQL type converted from the JSON schema type
-- depends_on: __dbt__cte__search_analytics_by_page_ab1
select
    cast(ctr as float) as ctr,
    cast(nullif(date, '') as date) as date,
    cast(page as varchar) as page,
    cast(clicks as bigint) as clicks,
    cast(position as float) as position,
    cast(site_url as varchar) as site_url,
    cast(impressions as bigint) as impressions,
    cast(search_type as varchar) as search_type,
    _airbyte_ab_id,
    _airbyte_emitted_at,
    getdate() as _airbyte_normalized_at
from __dbt__cte__search_analytics_by_page_ab1
-- search_analytics_by_page
where 1 = 1
), __dbt__cte__search_analytics_by_page_ab3 as (
-- SQL model to build a hash column based on the values of this record
-- depends_on: __dbt__cte__search_analytics_by_page_ab2
select
md5(cast(coalesce(cast(ctr as varchar), '') || '-' || coalesce(cast(date as varchar), '') || '-' || coalesce(cast(page as varchar), '') || '-' || coalesce(cast(clicks as varchar), '') || '-' || coalesce(cast(position as varchar), '') || '-' || coalesce(cast(site_url as varchar), '') || '-' || coalesce(cast(impressions as varchar), '') || '-' || coalesce(cast(search_type as varchar), '') as varchar)) as _airbyte_search_analytics_by_page_hashid,
tmp.*
from __dbt__cte__search_analytics_by_page_ab2 tmp
-- search_analytics_by_page
where 1 = 1
)
-- Final base SQL model
-- depends_on: __dbt__cte__search_analytics_by_page_ab3
select
ctr,
date,
page,
clicks,
position,
site_url,
impressions,
search_type,
_airbyte_ab_id,
_airbyte_emitted_at,
getdate() as _airbyte_normalized_at,
_airbyte_search_analytics_by_page_hashid
from __dbt__cte__search_analytics_by_page_ab3
-- search_analytics_by_page from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
where 1 = 1
and cast(_airbyte_emitted_at as timestamp with time zone) >= (
    select max(cast(_airbyte_emitted_at as timestamp with time zone))
    from "snowplow".test_gsc."search_analytics_by_page"
)
);
```
@ChristopheDuong you can see in the logs I posted earlier the INSERT statements from dbt showing 0 records inserted.
Yes, I saw that the final insert is processing 0 rows. The question is then: what is the content of the intermediate queries and temporary tables? That's why I wanted to see a comparison of the data in your raw tables vs. your final tables. I don't see the row counts per emitted date in the final table; I am guessing it's because the table is empty.
Just ran that above SQL: the final table is empty, which will cause that query to return nothing. Guessing by looking at it that it's because `max(_airbyte_emitted_at)` over an empty table returns NULL, so the `>=` filter in the incremental model matches no rows.
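To make the failure mode concrete, here is a minimal sketch using the table names from the logs above (it assumes the final table exists but is empty, as in this report):

```sql
-- max() over an empty table yields NULL:
select max(cast(_airbyte_emitted_at as timestamp with time zone))
from "snowplow".test_gsc."search_analytics_by_page";
-- -> NULL

-- A >= comparison against NULL is never true, so the incremental filter
-- in the model above selects zero rows even though the raw table has data:
select count(*)
from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
where cast(_airbyte_emitted_at as timestamp with time zone) >= (
    select max(cast(_airbyte_emitted_at as timestamp with time zone))
    from "snowplow".test_gsc."search_analytics_by_page"
);
-- -> 0
```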
I see, so if you delete the table `search_analytics_by_page` completely, it is rebuilt from scratch and the sync works. But if the table exists and is empty, the incremental behavior does not work because of the `max(_airbyte_emitted_at)` subquery returning NULL.
OK, I reproduced and confirmed it locally; I will make a PR to handle cases where the destination table already exists and is empty. The workaround for the moment is to drop the final table. Thanks for your help!
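One plausible shape for such a fix, sketched here only as an illustration (not necessarily what the eventual PR implements), is to make the incremental cutoff tolerate an empty destination table by coalescing the NULL max to an epoch floor:

```sql
-- Sketch: an incremental cutoff that is robust to an empty final table.
-- coalesce() substitutes an epoch floor when max() returns NULL, so a run
-- against an empty (e.g. freshly reset) destination still sees all raw rows.
select count(*)
from "snowplow".test_gsc._airbyte_raw_search_analytics_by_page
where cast(_airbyte_emitted_at as timestamp with time zone) >= coalesce(
    (select max(cast(_airbyte_emitted_at as timestamp with time zone))
     from "snowplow".test_gsc."search_analytics_by_page"),
    cast('1970-01-01 00:00:00+00' as timestamp with time zone)
);
-- -> full raw row count when the final table is empty
```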
Thank you! Handling empty tables is definitely the right move, but as future work I wonder whether the reset implementation should be revisited, since a reset is what leaves the final table in this empty state.
Yes, you are right, the reset implementation does introduce some edge cases here and there...
Environment
Reported via Slack
Is this your first time deploying Airbyte: no
OS Version / Instance: Linux EC2 m5.2xlarge
Deployment: Docker
Airbyte Version: 0.32.6-alpha
Source name: GSC 0.1.7
Destination: Redshift 0.3.20
This is happening on a brand new instance of Airbyte
Expected Behavior
The normalized table should be populated with data.
Logs
LOG
Steps to Reproduce
Sync data on a new instance from GSC to Redshift with normalization turned on.