Add high level overview to normalization doc. #6445

Merged
merged 2 commits into from
Sep 28, 2021

@avaidyanatha avaidyanatha commented Sep 24, 2021

Main Changes

  • Makes the Basic Normalization doc a little more readable to first-time deployers.

@avaidyanatha avaidyanatha added the area/documentation Improvements or additions to documentation label Sep 24, 2021
@avaidyanatha avaidyanatha changed the title Add high level overview to normalization Add high level overview to normalization doc. Sep 24, 2021
@avaidyanatha avaidyanatha temporarily deployed to more-secrets September 24, 2021 21:23 Inactive
@@ -50,6 +44,24 @@ The [normalization rules](basic-normalization.md#Rules) are _not_ configurable.

Airbyte places the JSON blob version of your data in a table called `_airbyte_raw_<stream name>`. If basic normalization is turned on, it will place a separate copy of the data in a table called `<stream name>`. Under the hood, Airbyte uses dbt, which means that the data is loaded into the data store only once. The normalization happens as a query within the data store. This implementation avoids extra network time and costs.
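
As a rough illustration of the raw-table and normalized-table layout described above, here is a minimal sketch that uses SQLite as a stand-in data store. The stream name `users`, the sample records, and the single `_airbyte_data` column are assumptions made for this example, and the plain SQL below only stands in for the dbt models that Airbyte actually generates. It also assumes a SQLite build with the JSON functions enabled.

```python
import json
import sqlite3

# In-memory SQLite database standing in for the destination data store.
conn = sqlite3.connect(":memory:")

# Raw table: one JSON blob per record.
# The column name `_airbyte_data` is an assumption for this sketch.
conn.execute("CREATE TABLE _airbyte_raw_users (_airbyte_data TEXT)")

# Hypothetical records for a stream named `users`.
records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
conn.executemany(
    "INSERT INTO _airbyte_raw_users (_airbyte_data) VALUES (?)",
    [(json.dumps(record),) for record in records],
)

# Normalization runs as a query inside the data store: the records are
# loaded once, and the typed copy is derived from the raw table in place.
conn.execute(
    """
    CREATE TABLE users AS
    SELECT
        json_extract(_airbyte_data, '$.id') AS id,
        json_extract(_airbyte_data, '$.name') AS name
    FROM _airbyte_raw_users
    """
)

normalized_rows = conn.execute(
    "SELECT id, name FROM users ORDER BY id"
).fetchall()
raw_count = conn.execute(
    "SELECT COUNT(*) FROM _airbyte_raw_users"
).fetchone()[0]
```

The raw table is left untouched by the second statement, so the normalized table could be dropped and rebuilt with a different query without loading the records again.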

## Why does Airbyte have Basic Normalization?

At its core, Airbyte is geared to handle the EL \(Extract Load\) steps of an ELT process. In Airbyte's dialect, these steps are also referred to as "Source" and "Destination".
Contributor

I think it would be helpful to explain why the raw table exists since that is something we get questions about a lot.

e.g. (you can word it better) A core tenet of the ELT approach is that the E and L steps mutate the data as little as possible. By getting a copy of the unmodified data into the destination, we reduce the need for resending data in the future, because the "original" data is already in the destination. If you change your mind on how you want to materialize the data, Airbyte can use the untouched raw version that is already in the destination to do it and doesn't need to resend anything.

(Of course we do actually resend data in a lot of cases right now, but aspirationally this is what we are going for and why we adhere to this philosophy.)

Contributor Author

This absolutely makes sense and I think it's good to explain why it exists. I've included a short explanation on the philosophy.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Abhi Vaidyanatha seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

Labels
area/documentation Improvements or additions to documentation
5 participants