diff --git a/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (1).png b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (1).png new file mode 100644 index 000000000000..015d141aed6a Binary files /dev/null and b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (1).png differ diff --git a/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (2).png b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (2).png new file mode 100644 index 000000000000..015d141aed6a Binary files /dev/null and b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (2).png differ diff --git a/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (3).png b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (3).png new file mode 100644 index 000000000000..015d141aed6a Binary files /dev/null and b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (3).png differ diff --git a/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (4).png b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (4).png new file mode 100644 index 000000000000..015d141aed6a Binary files /dev/null and b/docs/.gitbook/assets/change-to-per-week (3) (3) (4) (4).png differ diff --git a/docs/.gitbook/assets/datasources (4) (4) (4) (1).png b/docs/.gitbook/assets/datasources (4) (4) (4) (1).png new file mode 100644 index 000000000000..bbf0b03d6817 Binary files /dev/null and b/docs/.gitbook/assets/datasources (4) (4) (4) (1).png differ diff --git a/docs/.gitbook/assets/datasources (4) (4) (4) (2).png b/docs/.gitbook/assets/datasources (4) (4) (4) (2).png new file mode 100644 index 000000000000..bbf0b03d6817 Binary files /dev/null and b/docs/.gitbook/assets/datasources (4) (4) (4) (2).png differ diff --git a/docs/.gitbook/assets/datasources (4) (4) (4) (3).png b/docs/.gitbook/assets/datasources (4) (4) (4) (3).png new file mode 100644 index 000000000000..bbf0b03d6817 Binary files /dev/null and b/docs/.gitbook/assets/datasources (4) (4) (4) (3).png differ diff --git a/docs/.gitbook/assets/datasources (4) (4) (4) (4).png b/docs/.gitbook/assets/datasources (4) (4) (4) (4).png new file mode 100644 index 000000000000..bbf0b03d6817 Binary files /dev/null and b/docs/.gitbook/assets/datasources (4) (4) (4) (4).png differ diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (1).png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (1).png new file mode 100644 index 000000000000..23054ffc2ebc Binary files /dev/null and b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (1).png differ diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (2).png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (2).png new file mode 100644 index 000000000000..23054ffc2ebc Binary files /dev/null and b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (2).png differ diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (3).png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (3).png new file mode 100644 index 000000000000..23054ffc2ebc Binary files /dev/null and b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (3).png differ diff --git a/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (4).png b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (4).png new file mode 100644 index 000000000000..23054ffc2ebc Binary files /dev/null and b/docs/.gitbook/assets/evolution-of-meetings-per-week (3) (3) (4) (4).png differ diff --git a/docs/.gitbook/assets/launch (3) (3) (4) (1).png 
b/docs/.gitbook/assets/launch (3) (3) (4) (1).png new file mode 100644 index 000000000000..cfcc543a16bf Binary files /dev/null and b/docs/.gitbook/assets/launch (3) (3) (4) (1).png differ diff --git a/docs/.gitbook/assets/launch (3) (3) (4) (2).png b/docs/.gitbook/assets/launch (3) (3) (4) (2).png new file mode 100644 index 000000000000..cfcc543a16bf Binary files /dev/null and b/docs/.gitbook/assets/launch (3) (3) (4) (2).png differ diff --git a/docs/.gitbook/assets/launch (3) (3) (4) (3).png b/docs/.gitbook/assets/launch (3) (3) (4) (3).png new file mode 100644 index 000000000000..cfcc543a16bf Binary files /dev/null and b/docs/.gitbook/assets/launch (3) (3) (4) (3).png differ diff --git a/docs/.gitbook/assets/launch (3) (3) (4) (4).png b/docs/.gitbook/assets/launch (3) (3) (4) (4).png new file mode 100644 index 000000000000..cfcc543a16bf Binary files /dev/null and b/docs/.gitbook/assets/launch (3) (3) (4) (4).png differ diff --git a/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (1).png b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (1).png new file mode 100644 index 000000000000..2f943f18e902 Binary files /dev/null and b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (1).png differ diff --git a/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (2).png b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (2).png new file mode 100644 index 000000000000..2f943f18e902 Binary files /dev/null and b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (2).png differ diff --git a/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (3).png b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (3).png new file mode 100644 index 000000000000..2f943f18e902 Binary files /dev/null and b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (3).png differ diff --git a/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (4).png b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (4).png new file mode 100644 index 000000000000..2f943f18e902 Binary files /dev/null and b/docs/.gitbook/assets/meetings-participant-ranked (3) (3) (4) (4).png differ diff --git a/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (1).png b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (1).png new file mode 100644 index 000000000000..b56bc6dc50a0 Binary files /dev/null and b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (1).png differ diff --git a/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (2).png b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (2).png new file mode 100644 index 000000000000..b56bc6dc50a0 Binary files /dev/null and b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (2).png differ diff --git a/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (3).png b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (3).png new file mode 100644 index 000000000000..b56bc6dc50a0 Binary files /dev/null and b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (3).png differ diff --git a/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (4).png b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (4).png new file mode 100644 index 000000000000..b56bc6dc50a0 Binary files /dev/null and b/docs/.gitbook/assets/postgres_credentials (3) (3) (4) (4).png differ diff --git a/docs/.gitbook/assets/schema (3) (3) (4) (1).png b/docs/.gitbook/assets/schema (3) (3) (4) (1).png new file mode 100644 index 000000000000..9d4e5d4b6920 Binary files /dev/null and 
b/docs/.gitbook/assets/schema (3) (3) (4) (1).png differ diff --git a/docs/.gitbook/assets/schema (3) (3) (4) (2).png b/docs/.gitbook/assets/schema (3) (3) (4) (2).png new file mode 100644 index 000000000000..9d4e5d4b6920 Binary files /dev/null and b/docs/.gitbook/assets/schema (3) (3) (4) (2).png differ diff --git a/docs/.gitbook/assets/schema (3) (3) (4) (3).png b/docs/.gitbook/assets/schema (3) (3) (4) (3).png new file mode 100644 index 000000000000..9d4e5d4b6920 Binary files /dev/null and b/docs/.gitbook/assets/schema (3) (3) (4) (3).png differ diff --git a/docs/.gitbook/assets/schema (3) (3) (4) (4).png b/docs/.gitbook/assets/schema (3) (3) (4) (4).png new file mode 100644 index 000000000000..9d4e5d4b6920 Binary files /dev/null and b/docs/.gitbook/assets/schema (3) (3) (4) (4).png differ diff --git a/docs/.gitbook/assets/setup-successful (3) (2) (1) (1).png b/docs/.gitbook/assets/setup-successful (3) (2) (1) (1).png new file mode 100644 index 000000000000..7d4f42151d9a Binary files /dev/null and b/docs/.gitbook/assets/setup-successful (3) (2) (1) (1).png differ diff --git a/docs/.gitbook/assets/setup-successful (3) (2) (1) (2).png b/docs/.gitbook/assets/setup-successful (3) (2) (1) (2).png new file mode 100644 index 000000000000..7d4f42151d9a Binary files /dev/null and b/docs/.gitbook/assets/setup-successful (3) (2) (1) (2).png differ diff --git a/docs/.gitbook/assets/setup-successful (3) (2) (1) (3).png b/docs/.gitbook/assets/setup-successful (3) (2) (1) (3).png new file mode 100644 index 000000000000..7d4f42151d9a Binary files /dev/null and b/docs/.gitbook/assets/setup-successful (3) (2) (1) (3).png differ diff --git a/docs/.gitbook/assets/sync-screen (3) (3) (3) (1).png b/docs/.gitbook/assets/sync-screen (3) (3) (3) (1).png new file mode 100644 index 000000000000..7c031ca71c36 Binary files /dev/null and b/docs/.gitbook/assets/sync-screen (3) (3) (3) (1).png differ diff --git a/docs/.gitbook/assets/sync-screen (3) (3) (3) (2).png b/docs/.gitbook/assets/sync-screen (3) (3) (3) (2).png new file mode 100644 index 000000000000..7c031ca71c36 Binary files /dev/null and b/docs/.gitbook/assets/sync-screen (3) (3) (3) (2).png differ diff --git a/docs/.gitbook/assets/sync-screen (3) (3) (3) (3).png b/docs/.gitbook/assets/sync-screen (3) (3) (3) (3).png new file mode 100644 index 000000000000..7c031ca71c36 Binary files /dev/null and b/docs/.gitbook/assets/sync-screen (3) (3) (3) (3).png differ diff --git a/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (1).png b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (1).png new file mode 100644 index 000000000000..b3c5c91b7bac Binary files /dev/null and b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (1).png differ diff --git a/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (2).png b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (2).png new file mode 100644 index 000000000000..b3c5c91b7bac Binary files /dev/null and b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (2).png differ diff --git a/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (3).png b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (3).png new file mode 100644 index 000000000000..b3c5c91b7bac Binary files /dev/null and b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (3).png differ diff --git a/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (4).png b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (4).png new file mode 100644 index 000000000000..b3c5c91b7bac Binary files /dev/null and b/docs/.gitbook/assets/tableau-dashboard (3) (3) (3) (4).png 
differ diff --git a/docs/.gitbook/assets/ux_hierarchy_pyramid2.png b/docs/.gitbook/assets/ux_hierarchy_pyramid2.png new file mode 100644 index 000000000000..80f23dcf6d3d Binary files /dev/null and b/docs/.gitbook/assets/ux_hierarchy_pyramid2.png differ diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (1).png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (1).png new file mode 100644 index 000000000000..f481e066aaa2 Binary files /dev/null and b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (1).png differ diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (2).png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (2).png new file mode 100644 index 000000000000..f481e066aaa2 Binary files /dev/null and b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (2).png differ diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (3).png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (3).png new file mode 100644 index 000000000000..f481e066aaa2 Binary files /dev/null and b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (3).png differ diff --git a/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (4).png b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (4).png new file mode 100644 index 000000000000..f481e066aaa2 Binary files /dev/null and b/docs/.gitbook/assets/zoom-marketplace-build-screen (3) (3) (1) (4).png differ diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 38ae3223b0c4..b009e9aaf37f 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -9,11 +9,11 @@ * [Deploying Airbyte](deploying-airbyte/README.md) * [Local Deployment](deploying-airbyte/local-deployment.md) * [On Airbyte Cloud](deploying-airbyte/on-cloud.md) - * [On AWS \(EC2\)](deploying-airbyte/on-aws-ec2.md) - * [On AWS ECS \(Coming Soon\)](deploying-airbyte/on-aws-ecs.md) - * [On Azure\(VM\)](deploying-airbyte/on-azure-vm-cloud-shell.md) - * [On GCP \(Compute Engine\)](deploying-airbyte/on-gcp-compute-engine.md) - * [On Kubernetes \(Beta\)](deploying-airbyte/on-kubernetes.md) + * [On AWS (EC2)](deploying-airbyte/on-aws-ec2.md) + * [On AWS ECS (Coming Soon)](deploying-airbyte/on-aws-ecs.md) + * [On Azure(VM)](deploying-airbyte/on-azure-vm-cloud-shell.md) + * [On GCP (Compute Engine)](deploying-airbyte/on-gcp-compute-engine.md) + * [On Kubernetes (Beta)](deploying-airbyte/on-kubernetes.md) * [On Oracle Cloud Infrastructure VM](deploying-airbyte/on-oci-vm.md) * [Operator Guides](operator-guides/README.md) * [Upgrading Airbyte](operator-guides/upgrading-airbyte.md) @@ -23,9 +23,9 @@ * [Using the Airflow Airbyte Operator](operator-guides/using-the-airflow-airbyte-operator.md) * [Windows - Browsing Local File Output](operator-guides/locating-files-local-destination.md) * [Transformations and Normalization](operator-guides/transformation-and-normalization/README.md) - * [Transformations with SQL \(Part 1/3\)](operator-guides/transformation-and-normalization/transformations-with-sql.md) - * [Transformations with dbt \(Part 2/3\)](operator-guides/transformation-and-normalization/transformations-with-dbt.md) - * [Transformations with Airbyte \(Part 3/3\)](operator-guides/transformation-and-normalization/transformations-with-airbyte.md) + * [Transformations with SQL (Part 1/3)](operator-guides/transformation-and-normalization/transformations-with-sql.md) + * [Transformations with dbt (Part 
2/3)](operator-guides/transformation-and-normalization/transformations-with-dbt.md) + * [Transformations with Airbyte (Part 3/3)](operator-guides/transformation-and-normalization/transformations-with-airbyte.md) * [Scaling Airbyte](operator-guides/scaling-airbyte.md) * [Connector Catalog](integrations/README.md) * [Sources](integrations/sources/README.md) @@ -90,7 +90,7 @@ * [Microsoft Dynamics Customer Engagement](integrations/sources/microsoft-dynamics-customer-engagement.md) * [Microsoft Dynamics GP](integrations/sources/microsoft-dynamics-gp.md) * [Microsoft Dynamics NAV](integrations/sources/microsoft-dynamics-nav.md) - * [Microsoft SQL Server \(MSSQL\)](integrations/sources/mssql.md) + * [Microsoft SQL Server (MSSQL)](integrations/sources/mssql.md) * [Microsoft Teams](integrations/sources/microsoft-teams.md) * [Mixpanel](integrations/sources/mixpanel.md) * [Monday](integrations/sources/monday.md) @@ -154,7 +154,7 @@ * [DynamoDB](integrations/destinations/dynamodb.md) * [Elasticsearch](integrations/destinations/elasticsearch.md) * [Chargify](integrations/destinations/chargify.md) - * [Google Cloud Storage \(GCS\)](integrations/destinations/gcs.md) + * [Google Cloud Storage (GCS)](integrations/destinations/gcs.md) * [Google PubSub](integrations/destinations/pubsub.md) * [Kafka](integrations/destinations/kafka.md) * [Keen](integrations/destinations/keen.md) @@ -170,12 +170,9 @@ * [Redshift](integrations/destinations/redshift.md) * [S3](integrations/destinations/s3.md) * [Snowflake](integrations/destinations/snowflake.md) -<<<<<<< HEAD * [Cassandra](integrations/destinations/cassandra.md) -======= * [Scylla](integrations/destinations/scylla.md) ->>>>>>> b92e2f803bb0610d05681ac0c4aec8e8fdee6e42 - * [Custom or New Connector](integrations/custom-connectors.md) + * [Custom or New Connector](integrations/custom-connectors.md) * [Connector Development](connector-development/README.md) * [Tutorials](connector-development/tutorials/README.md) * [Python CDK Speedrun: Creating a Source](connector-development/tutorials/cdk-speedrun.md) @@ -192,7 +189,7 @@ * [Building a Python Source](connector-development/tutorials/building-a-python-source.md) * [Building a Python Destination](connector-development/tutorials/building-a-python-destination.md) * [Building a Java Destination](connector-development/tutorials/building-a-java-destination.md) - * [Connector Development Kit \(Python\)](connector-development/cdk-python/README.md) + * [Connector Development Kit (Python)](connector-development/cdk-python/README.md) * [Basic Concepts](connector-development/cdk-python/basic-concepts.md) * [Defining Stream Schemas](connector-development/cdk-python/schemas.md) * [Full Refresh Streams](connector-development/cdk-python/full-refresh-stream.md) @@ -200,7 +197,7 @@ * [HTTP-API-based Connectors](connector-development/cdk-python/http-streams.md) * [Python Concepts](connector-development/cdk-python/python-concepts.md) * [Stream Slices](connector-development/cdk-python/stream-slices.md) - * [Connector Development Kit \(Javascript\)](connector-development/cdk-faros-js.md) + * [Connector Development Kit (Javascript)](connector-development/cdk-faros-js.md) * [Airbyte 101 for Connector Development](connector-development/airbyte101.md) * [Testing Connectors](connector-development/testing-connectors/README.md) * [Source Acceptance Tests Reference](connector-development/testing-connectors/source-acceptance-tests-reference.md) @@ -232,7 +229,7 @@ * [High-level View](understanding-airbyte/high-level-view.md) * [Workers & 
Jobs](understanding-airbyte/jobs.md) * [Technical Stack](understanding-airbyte/tech-stack.md) - * [Change Data Capture \(CDC\)](understanding-airbyte/cdc.md) + * [Change Data Capture (CDC)](understanding-airbyte/cdc.md) * [Namespaces](understanding-airbyte/namespaces.md) * [Json to Avro Conversion](understanding-airbyte/json-avro-conversion.md) * [Glossary of Terms](understanding-airbyte/glossary.md) @@ -253,4 +250,3 @@ * [On Setting up a New Connection](troubleshooting/new-connection.md) * [On Running a Sync](troubleshooting/running-sync.md) * [On Upgrading](troubleshooting/on-upgrading.md) - diff --git a/docs/connector-development/ux-handbook.md b/docs/connector-development/ux-handbook.md index d8ef910ab90a..3bbe45d0be72 100644 --- a/docs/connector-development/ux-handbook.md +++ b/docs/connector-development/ux-handbook.md @@ -1,18 +1,22 @@ -# Connector Development UX Handbook +# UX Handbook -![Connector UX Handbook](https://imgs.xkcd.com/comics/ui_vs_ux.png) +## Connector Development UX Handbook -## Overview -The goal of this handbook is to allow scaling high quality decision making when developing connectors. +![Connector UX Handbook](https://imgs.xkcd.com/comics/ui\_vs\_ux.png) + +### Overview + +The goal of this handbook is to allow scaling high quality decision making when developing connectors. The Handbook is a living document, meant to be continuously updated. It is the best snapshot we can produce of the lessons learned from building and studying hundreds of connectors. While helpful, this snapshot is never perfect. Therefore, this Handbook is not a replacement for good judgment, but rather learnings that should help guide your work. -## How to use this handbook +### How to use this handbook 1. When thinking about a UX-impacting decision regarding connectors, consult this Handbook. 2. If the Handbook does not answer your question, then consider proposing an update to the Handbook if you believe your question will be applicable to more cases. -## Definition of UX-impacting changes +### Definition of UX-impacting changes + UX-impacting changes are ones which impact how the user directly interacts with, consumes, or perceives the product. **Examples**: @@ -25,97 +29,116 @@ UX-impacting changes are ones which impact how the user directly interacts with, 6. Wait time for human-at-keyboard 7. Anything that negatively impacts the runtime of the connector (e.g: a change that makes the runtime go from 10 minutes to 20 minutes on the same data size) 8. Any other change which you deem UX-impacting - 1. The guide can’t cover everything, so this is an escape hatch based on the developer’s judgment. + 1. The guide can’t cover everything, so this is an escape hatch based on the developer’s judgment. -**Examples of UX-impacting changes**: +**Examples of UX-impacting changes**: -1. Adding or removing an input field to/from spec.json +1. Adding or removing an input field to/from spec.json 2. Adding or removing fields from the output schema 3. Adding a new stream or category of stream (e.g: supporting views in databases) 4. Adding OAuth support -**Examples of non-UX-impacting changes**: +**Examples of non-UX-impacting changes**: + 1. Refactoring without changing functionality -2. Bugfix (e.g: pagination doesn’t work correctly) +2. Bugfix (e.g: pagination doesn’t work correctly) + +### Guiding Principles -## Guiding Principles -Would you trust AWS or Docker if it only worked 70, 80, or 90% of the time or if it leaked your business secrets? Yeah, me neither. 
You would only build on a tool if it worked at least 99% of the time. Infrastructure should give you back your time, rather than become a debugging timesink. +Would you trust AWS or Docker if it only worked 70, 80, or 90% of the time or if it leaked your business secrets? Yeah, me neither. You would only build on a tool if it worked at least 99% of the time. Infrastructure should give you back your time, rather than become a debugging timesink. The same is true with Airbyte: if it worked less than 99% of the time, many users will stop using it. Airbyte is an infrastructure component within a user’s data pipeline. Our users’ goal is to move data; Airbyte is an implementation detail. In that sense, it is much closer to Terraform, Docker, or AWS than an end application. -### Trust & Reliability are the top concerns -Our users have the following hierarchy of needs: -![needs](../../.gitbook/assets/ux_hierarchy_pyramid.png) +#### Trust & Reliability are the top concerns + +Our users have the following hierarchy of needs: + +![](../.gitbook/assets/ux\_hierarchy\_pyramid2.png) + +**Security** -#### Security Users often move very confidential data like revenue numbers, salaries, or confidential documents through Airbyte. A user therefore must trust that their data is secure. This means no leaking credentials in logs or plain text, no vulnerabilities in the product, no frivolous sharing of credentials or data over internal slack channels, video calls, or other communications etc. -#### Data integrity -Data replicated by Airbyte must be correct and complete. If a user moves data with Airbyte, then all of the data must be present, and it must all be correct - no corruption, incorrect values, or wrongly formatted data. +**Data integrity** + +Data replicated by Airbyte must be correct and complete. If a user moves data with Airbyte, then all of the data must be present, and it must all be correct - no corruption, incorrect values, or wrongly formatted data. Some tricky examples which can break data integrity if not handled correctly: - -* Zipcodes for the US east coast should not lose their leading zeros because of being detected as integer + +* Zipcodes for the US east coast should not lose their leading zeros because of being detected as integer * Database timezones could affect the value of timestamps -* Esoteric text values (e.g: weird UTF characters) +* Esoteric text values (e.g: weird UTF characters) + +**Reliability** -#### Reliability -A connector needs to be reliable. Otherwise, a user will need to spend a lot of time debugging, and at that point, they’re better off using a competing product. The connector should be able to handle large inputs, weirdly formatted inputs, all data types, and basically anything a user should throw at it. +A connector needs to be reliable. Otherwise, a user will need to spend a lot of time debugging, and at that point, they’re better off using a competing product. The connector should be able to handle large inputs, weirdly formatted inputs, all data types, and basically anything a user should throw at it. In other words, a connector should work 100% of the time, but 99.9% is occasionally acceptable. -#### Ease of use -People love and trust a product that is easy to use. This means that it works as quickly as possible, introduces no friction, and uses sensible defaults that are good enough for 95% of users. +#### Speed -An important component of usability is predictability. 
That is, as much as possible, a user should know before running a connector what its output will be and what the different tables will mean. +Sync speed minimizes the time needed for deriving value from data. It also provides a better user experience as it allows quick iteration on connector configurations without suffering through long wait times. + +**Ease of use** + +People love and trust a product that is easy to use. This means that it works as quickly as possible, introduces no friction, and uses sensible defaults that are good enough for 95% of users. + +An important component of usability is predictability. That is, as much as possible, a user should know before running a connector what its output will be and what the different tables will mean. Ideally, they would even see an ERD describing the output schema they can expect to find in the destination. (This particular feature is tracked [here](https://github.com/airbytehq/airbyte/issues/3731)). -#### Feature Set -Our connectors should cover as many use cases as is feasible. While it may not always work like that given our incremental delivery preference, we should always strive to provide the most featureful connectors which cover as much of the underlying API or database surface as possible. +**Feature Set** -There is also a tension between featureset and ease of use. The more features are available, the more thought it takes to make the product easy and intuitive to use. We’ll elaborate on this later. +Our connectors should cover as many use cases as is feasible. While it may not always work like that given our incremental delivery preference, we should always strive to provide the most featureful connectors which cover as much of the underlying API or database surface as possible. -## Airbyte's Target Personas -Without repeating too many details mentioned elsewhere, the important thing to know is Airbyte serves all the following personas: +There is also a tension between featureset and ease of use. The more features are available, the more thought it takes to make the product easy and intuitive to use. We’ll elaborate on this later. +### Airbyte's Target Personas -| **Persona** | **Level of technical knowledge** | -| ------- | ---------------------------- | -| Data Analyst | Proficient with:

Data manipulation tools like Excel or SQL
Dashboard tools like Looker

Not very familiar with reading API docs and doesn't know what a curl request is. But might be able to generate an API key if you tell them exactly how. | -| Analytics Engineer | Proficient with:

SQL & DBT
Git
A scripting language like Python
Shallow familiarity with infra tools like Docker

Much more technical than a data analyst, but not as much as a data engineer| -| Data Engineer| Proficient with:

SQL & DBT
Git
2 or more programming languages
Infra tools like Docker or Kubernetes
Cloud technologies like AWS or GCP
Building or consuming APIs
Orchestration tools like Airflow<br>

The most technical persona we serve. Think of them like an engineer on your team| +Without repeating too many details mentioned elsewhere, the important thing to know is Airbyte serves all the following personas: +| **Persona** | **Level of technical knowledge** | +| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Data Analyst |

Proficient with:

Data manipulation tools like Excel or SQL
Dashboard tools like Looker

Not very familiar with reading API docs and doesn't know what a curl request is. But might be able to generate an API key if you tell them exactly how.

| +| Analytics Engineer |

Proficient with:

SQL & DBT
Git
A scripting language like Python
Shallow familiarity with infra tools like Docker

Much more technical than a data analyst, but not as much as a data engineer

| +| Data Engineer |

Proficient with:

SQL & DBT
Git
2 or more programming languages
Infra tools like Docker or Kubernetes
Cloud technologies like AWS or GCP
Building or consuming APIs
Orchestration tools like Airflow<br>

The most technical persona we serve. Think of them like an engineer on your team

| +Keep in mind that the distribution of served personas will differ per connector. Data analysts are highly unlikely to form the majority of users for a very technical connector like say, Kafka. -Keep in mind that the distribution of served personas will differ per connector. Data analysts are highly unlikely to form the majority of users for a very technical connector like say, Kafka. +## Specific Guidelines + +### Input Configuration -# Specific Guidelines -## Input Configuration _aka spec.json_ -#### Avoid configuration completely when possible +**Avoid configuration completely when possible** + Configuration means more work for the user and more chances for confusion, friction, or misconfiguration. If I could wave a magic wand, a user wouldn’t have to configure anything at all. Unfortunately, this is not reality, and some configuration is strictly required. When this is the case, follow the guidelines below. -#### Avoid exposing implementation details in configuration +**Avoid exposing implementation details in configuration** + If a configuration controls an implementation detail (like how many retries a connector should make before failing), then there should be almost no reason to expose this. If you feel a need to expose it, consider it might be a smell that the connector implementation is not robust. -Put another way, if a configuration tells the user how to do its job of syncing data rather than what job to achieve, it’s a code smell. +Put another way, if a configuration tells the user how to do its job of syncing data rather than what job to achieve, it’s a code smell. + +For example, the memory requirements for a database connector which syncs a table with very wide rows (50mb rows) can be very different than when syncing a table with very narrow rows (10kb per row). In this case, it may be acceptable to ask the user for some sort of “hint”/tuning parameter in configuration (hidden behind advanced configuration) to ensure the connector performs reliably or quickly. But even then, this option would strictly be a necessary evil/escape hatch. It is much more preferable for the connector to auto-detect what this setting should be and never need to bother the user with it. + +**Minimize required configurations by setting defaults whenever possible** -For example, the memory requirements for a database connector which syncs a table with very wide rows (50mb rows) can be very different than when syncing a table with very narrow rows (10kb per row). In this case, it may be acceptable to ask the user for some sort of “hint”/tuning parameter in configuration (hidden behind advanced configuration) to ensure the connector performs reliably or quickly. But even then, this option would strictly be a necessary evil/escape hatch. It is much more preferable for the connector to auto-detect what this setting should be and never need to bother the user with it. +In many cases, a configuration can be avoided by setting a default value for it but still making it possible to set other values. Whenever possible, follow this pattern. -#### Minimize required configurations by setting defaults whenever possible -In many cases, a configuration can be avoided by setting a default value for it but still making it possible to set other values. Whenever possible, follow this pattern. 
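To make the pattern concrete, here is a minimal sketch of a `spec.json` fragment (the `lookback_window_days` property and its default are hypothetical, shown only for illustration): the tuning knob carries a sensible default and is not listed as required, so only `api_key` demands the user's attention.

```json
{
  "type": "object",
  "required": ["api_key"],
  "properties": {
    "api_key": {
      "type": "string",
      "title": "API Key",
      "description": "API key used to authenticate requests."
    },
    "lookback_window_days": {
      "type": "integer",
      "title": "Lookback Window (Days)",
      "description": "How many days of previously synced data to re-fetch on each sync. The default works for most accounts.",
      "default": 7
    }
  }
}
```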
+**Hide technical or niche parameters under an “Advanced” section** -#### Hide technical or niche parameters under an “Advanced” section -Sometimes, it’s inevitable that we need to expose some advanced or technical configuration. For example, the option to upload a TLS certificate to connect to a database, or the option to configure the number of retries done by an API connector: while these may be useful to some small percentage of users, it’s not worth making all users think or get confused about them. +Sometimes, it’s inevitable that we need to expose some advanced or technical configuration. For example, the option to upload a TLS certificate to connect to a database, or the option to configure the number of retries done by an API connector: while these may be useful to some small percentage of users, it’s not worth making all users think or get confused about them. Note: this is currently blocked by this [issue](https://github.com/airbytehq/airbyte/issues/3681). -#### Add a “title” and “description” property for every input parameter +**Add a “title” and “description” property for every input parameter** + This displays this information to the user in a polished way and gives less technical users (e.g: analysts) confidence that they can use this product. Be specific and unambiguous in the wording, explaining more than just the field name alone provides. -For example, the following spec: +For example, the following spec: + ```json { "type": "object", @@ -126,10 +149,13 @@ For example, the following spec: } } ``` -produces the following input field in the UI: -![bad](../../.gitbook/assets/ux_username_bad.png) + +produces the following input field in the UI: + +![](../.gitbook/assets/ux\_username\_bad.png) Whereas the following specification: + ```json { "type": "object", @@ -145,93 +171,109 @@ Whereas the following specification: produces the following UI: -![good](../../.gitbook/assets/ux_username_good.png) +![](../.gitbook/assets/ux\_username\_good.png) The title should use Pascal Case “with spaces” e.g: “Attribution Lookback Window”, “Host URL”, etc... -#### Clearly document the meaning and impact of all parameters -All configurations must have an unmistakable explanation describing their purpose and impact, even the obvious ones. Remember, something that is obvious to an analyst may not be obvious to an engineer, and vice-versa. +**Clearly document the meaning and impact of all parameters** -For example, in some Ads APIs like Facebook, the user’s data may continue to be updated up to 28 days after it is created. This happens because a user may take action because of an ad (like buying a product) many days after they see the ad. In this case, the user may want to configure a “lookback” window for attributing. +All configurations must have an unmistakable explanation describing their purpose and impact, even the obvious ones. Remember, something that is obvious to an analyst may not be obvious to an engineer, and vice-versa. -Adding a parameter “attribution_lookback_window” with no explanation might confuse the user more than it helps them. Instead, we should add a clear title and description which describes what this parameter is and how different values will impact the data output by the connector. +For example, in some Ads APIs like Facebook, the user’s data may continue to be updated up to 28 days after it is created. This happens because a user may take action because of an ad (like buying a product) many days after they see the ad. 
In this case, the user may want to configure a “lookback” window for attributing. + +Adding a parameter “attribution\_lookback\_window” with no explanation might confuse the user more than it helps them. Instead, we should add a clear title and description which describes what this parameter is and how different values will impact the data output by the connector. + +**Document how users can obtain configuration parameters** -#### Document how users can obtain configuration parameters If a user needs to obtain an API key or host name, tell them exactly where to find it. Ideally you would show them screenshots, though include a date and API version in those if possible, so it’s clear when they’ve aged out of date. -#### Fail fast & actionably -A user should not be able to configure something that will not work. If a user’s configuration is invalid, we should inform them as precisely as possible about what they need to do to fix the issue. +**Fail fast & actionably** + +A user should not be able to configure something that will not work. If a user’s configuration is invalid, we should inform them as precisely as possible about what they need to do to fix the issue. + +One helpful aid is to use the json-schema “pattern” keyword to accept inputs which adhere to the correct input shape. + +### Output Data & Schemas + +#### Strongly Favor ELT over ETL + +Extract-Load-Transform (ELT) means extracting and loading the data into a destination while leaving its format/schema as unchanged as possible, and making transformation the responsibility of the consumer. By contrast, ETL means transforming data before it is sent to the destination, for example changing its schema to make it easier to consume in the destination. + +When extracting data, strongly prefer ELT to ETL for the following reasons: -One helpful aid is to use the json-schema “pattern” keyword to accept inputs which adhere to the correct input shape. +**Removes Airbyte as a development bottleneck** -## Output Data & Schemas -### Strongly Favor ELT over ETL -Extract-Load-Transform (ELT) means extracting and loading the data into a destination while leaving its format/schema as unchanged as possible, and making transformation the responsibility of the consumer. By contrast, ETL means transforming data before it is sent to the destination, for example changing its schema to make it easier to consume in the destination. +If we get into the habit of structuring the output of each source according to how some users want to use it, then we will invite more feature requests from users asking us to transform data in a particular way. This introduces Airbyte’s dev team as an unnecessary bottleneck for these users. -When extracting data, strongly prefer ELT to ETL for the following reasons: +Instead, we should set the standard that a user should be responsible for transformations once they’ve loaded data in a destination. -#### Removes Airbyte as a development bottleneck -If we get into the habit of structuring the output of each source according to how some users want to use it, then we will invite more feature requests from users asking us to transform data in a particular way. This introduces Airbyte’s dev team as an unnecessary bottleneck for these users. +**Will always be backwards compatible** -Instead, we should set the standard that a user should be responsible for transformations once they’ve loaded data in a destination. +APIs already follow strong conventions to maintain backwards compatibility. 
By transforming data, we break this guarantee, which means we may break downstream flows for our users. -#### Will always be backwards compatible -APIs already follow strong conventions to maintain backwards compatibility. By transforming data, we break this guarantee, which means we may break downstream flows for our users. +**Future proof** -#### Future proof -We may have a vision of what a user needs today. But if our persona evolves next year, then we’ll probably also need to adapt our transformation logic, which would require significant dev and data migration efforts. +We may have a vision of what a user needs today. But if our persona evolves next year, then we’ll probably also need to adapt our transformation logic, which would require significant dev and data migration efforts. + +**More flexible** -#### More flexible Current users have different needs from data. By being opinionated on how they should consume data, we are effectively favoring one user persona over the other. While there might be some cases where this is warranted, it should be done with extreme intentionality. -#### More efficient +**More efficient** + With ETL, if the “T” ever needs to change, then we need to re-extract all data for all users. This is computationally and financially expensive and will place a lot of pressure on the source systems as we re-extract all data. -### Describe output schemas as completely and reliably as possible -Our most popular destinations are strongly typed like Postgres, BigQuery, or Parquet & Avro. +#### Describe output schemas as completely and reliably as possible + +Our most popular destinations are strongly typed like Postgres, BigQuery, or Parquet & Avro. -Being strongly typed enables optimizations and syntactic sugar to make it very easy & performant for the user to query data. +Being strongly typed enables optimizations and syntactic sugar to make it very easy & performant for the user to query data. To provide the best UX when moving data to these destinations, Airbyte source connectors should describe their schema in as much detail as correctness allows. -In some cases, describing schemas is impossible to do reliably. For example, MongoDB doesn’t have any schemas. To infer the schema, one needs to read all the records in a particular table. And even then, once new records are added, they also must all be read in order to update the inferred schema. At the time of writing, this is infeasible to do performantly in Airbyte since we do not have an intermediate staging area to do this. In this case, we should do the “best we can” to describe the schema, keeping in mind that reliability of the described schema is more important than expressiveness. +In some cases, describing schemas is impossible to do reliably. For example, MongoDB doesn’t have any schemas. To infer the schema, one needs to read all the records in a particular table. And even then, once new records are added, they also must all be read in order to update the inferred schema. At the time of writing, this is infeasible to do performantly in Airbyte since we do not have an intermediate staging area to do this. In this case, we should do the “best we can” to describe the schema, keeping in mind that reliability of the described schema is more important than expressiveness. -That is, we would rather not describe a schema at all than describe it incorrectly, as incorrect descriptions **will** lead to failures downstream. 
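As a small illustrative sketch (the stream fields below are invented, not taken from any real connector), a declared schema can make types explicit so that, for example, zip codes remain strings with their leading zeros and timestamps keep a well-defined format:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": {
      "type": "integer"
    },
    "zip_code": {
      "type": "string"
    },
    "amount": {
      "type": "number"
    },
    "updated_at": {
      "type": "string",
      "format": "date-time"
    }
  }
}
```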
+That is, we would rather not describe a schema at all than describe it incorrectly, as incorrect descriptions **will** lead to failures downstream. -To keep schema descriptions reliable, [automate schema generation](https://docs.airbyte.io/connector-development/cdk-python/schemas#generating-schemas-from-openapi-definitions) whenever possible. +To keep schema descriptions reliable, [automate schema generation](https://docs.airbyte.io/connector-development/cdk-python/schemas#generating-schemas-from-openapi-definitions) whenever possible. -### Be very cautious about breaking changes to output schemas -Assuming we follow ELT over ETL, and automate generation of output schemas, this should come up very rarely. However, it is still important enough to warrant mention. +#### Be very cautious about breaking changes to output schemas -If for any reason we need to change the output schema declared by a connector in a backwards breaking way, consider it a necessary evil that should be avoided if possible. Basically, the only reasons for a backwards breaking change should be: +Assuming we follow ELT over ETL, and automate generation of output schemas, this should come up very rarely. However, it is still important enough to warrant mention. + +If for any reason we need to change the output schema declared by a connector in a backwards breaking way, consider it a necessary evil that should be avoided if possible. Basically, the only reasons for a backwards breaking change should be: * a connector previously had an incorrect schema, or * It was not following ELT principles and is now being changed to follow them -Other breaking changes should probably be escalated for approval. +Other breaking changes should probably be escalated for approval. + +### Prerequisite Configurations & assumptions + +**Document all assumptions** -## Prerequisite Configurations & assumptions -#### Document all assumptions If a connector makes assumptions about the underlying data source, then these assumptions must be documented. For example: for some features of the Facebook Pages connector to work, a user must have an Instagram Business account linked to an Instagram page linked to their Facebook Page. In this case, the externally facing documentation page for the connector must be very clear about this. -#### Provide how-tos for prerequisite configuration -If a user needs to set up their data source in a particular way to pull data, then we must provide documentation for how they should do it. +**Provide how-tos for prerequisite configuration** + +If a user needs to set up their data source in a particular way to pull data, then we must provide documentation for how they should do it. For example, to set up CDC for databases, a user must create logical replication slots and do a few other things. These steps should be documented with examples or screenshots wherever possible (e.g: here are the SQL statements you need to run, with the following permissions, on the following screen, etc.) -## External Documentation +### External Documentation + This section is concerned with the external-facing documentation of a connector that goes in [https://docs.airbyte.io](https://docs.airbyte.io) e.g: [this one](https://docs.airbyte.io/integrations/sources/amazon-seller-partner) -#### Documentation should communicate persona-impacting behaviors -When writing documentation ask: who is the intended target persona for a piece of documentation, and what information do they need to understand how this connector impacts their workflows? 
+**Documentation should communicate persona-impacting behaviors** + +When writing documentation ask: who is the intended target persona for a piece of documentation, and what information do they need to understand how this connector impacts their workflows? + +For example, data analysts & analytics engineers probably don’t care if we use Debezium for database replication. To them, the important thing is that we provide Change Data Capture (CDC) -- Debezium is an implementation detail. Therefore, when communicating information about our database replication logic, we should emphasize the end behaviors, rather than implementation details. -For example, data analysts & analytics engineers probably don’t care if we use Debezium for database replication. To them, the important thing is that we provide Change Data Capture (CDC) -- Debezium is an implementation detail. Therefore, when communicating information about our database replication logic, we should emphasize the end behaviors, rather than implementation details. +**Example**: Don’t say: “Debezium cannot process UTF-16 character set“. -**Example**: -Don’t say: “Debezium cannot process UTF-16 character set“. - Instead, say: “When using CDC, UTF-16 characters are not currently supported” A user who doesn’t already know what Debezium is might be left confused by the first phrasing, so we should use the second phrasing. -*: _this is a fake example. AFAIK there is no such limitation in Debezi-- I mean, the Postgres connector._ +\*: _this is a fake example. AFAIK there is no such limitation in Debezi-- I mean, the Postgres connector._ diff --git a/docs/integrations/sources/file.md b/docs/integrations/sources/file.md index 3170dcffdbe4..01bdcd0efaef 100644 --- a/docs/integrations/sources/file.md +++ b/docs/integrations/sources/file.md @@ -2,57 +2,57 @@ ## Features -| Feature | Supported? | -| :--- | :--- | -| Full Refresh Sync | Yes | -| Incremental Sync | No | -| Replicate Incremental Deletes | No | -| Replicate Folders \(multiple Files\) | No | -| Replicate Glob Patterns \(multiple Files\) | No | +| Feature | Supported? | +| ---------------------------------------- | ---------- | +| Full Refresh Sync | Yes | +| Incremental Sync | No | +| Replicate Incremental Deletes | No | +| Replicate Folders (multiple Files) | No | +| Replicate Glob Patterns (multiple Files) | No | -This source produces a single table for the target file as it replicates only one file at a time for the moment. Note that you should provide the `dataset_name` which dictates how the table will be identified in the destination \(since `URL` can be made of complex characters\). +This source produces a single table for the target file as it replicates only one file at a time for the moment. Note that you should provide the `dataset_name` which dictates how the table will be identified in the destination (since `URL` can be made of complex characters). ### Storage Providers -| Storage Providers | Supported? | -| :--- | :--- | -| HTTPS | Yes | -| Google Cloud Storage | Yes | -| Amazon Web Services S3 | Yes | -| SFTP | Yes | -| SSH / SCP | Yes | -| local filesystem | Local use only (inaccessible for Airbyte Cloud) | +| Storage Providers | Supported? | +| ---------------------- | ----------------------------------------------- | +| HTTPS | Yes | +| Google Cloud Storage | Yes | +| Amazon Web Services S3 | Yes | +| SFTP | Yes | +| SSH / SCP | Yes | +| local filesystem | Local use only (inaccessible for Airbyte Cloud) | ### File / Stream Compression | Compression | Supported? 
| -| :--- | :--- | -| Gzip | Yes | -| Zip | No | -| Bzip2 | No | -| Lzma | No | -| Xz | No | -| Snappy | No | +| ----------- | ---------- | +| Gzip | Yes | +| Zip | No | +| Bzip2 | No | +| Lzma | No | +| Xz | No | +| Snappy | No | ### File Formats -| Format | Supported? | -| :--- | :--- | -| CSV | Yes | -| JSON | Yes | -| HTML | No | -| XML | No | -| Excel | Yes | -| Excel Binary Workbook | Yes | -| Feather | Yes | -| Parquet | Yes | -| Pickle | No | +| Format | Supported? | +| --------------------- | ---------- | +| CSV | Yes | +| JSON | Yes | +| HTML | No | +| XML | No | +| Excel | Yes | +| Excel Binary Workbook | Yes | +| Feather | Yes | +| Parquet | Yes | +| Pickle | No | -## Getting Started \(Airbyte Cloud\) +## Getting Started (Airbyte Cloud) Setup through Airbyte Cloud will be exactly the same as the open-source setup, except for the fact that local files are disabled. -## Getting Started \(Airbyte Open-Source\) +## Getting Started (Airbyte Open-Source) 1. Once the File Source is selected, you should define both the storage provider along its URL and format of the file. 2. Depending on the provider choice and privacy of the data, you will have to configure more options. @@ -61,27 +61,27 @@ Setup through Airbyte Cloud will be exactly the same as the open-source setup, e * In case of GCS, it is necessary to provide the content of the service account keyfile to access private buckets. See settings of [BigQuery Destination](../destinations/bigquery.md) * In case of AWS S3, the pair of `aws_access_key_id` and `aws_secret_access_key` is necessary to access private S3 buckets. -* In case of AzBlob, it is necessary to provide the `storage_account` in which the blob you want to access resides. Either `sas_token` [\(info\)](https://docs.microsoft.com/en-us/azure/storage/blobs/sas-service-create?tabs=dotnet) or `shared_key` [\(info\)](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal) is necessary to access private blobs. +* In case of AzBlob, it is necessary to provide the `storage_account` in which the blob you want to access resides. Either `sas_token` [(info)](https://docs.microsoft.com/en-us/azure/storage/blobs/sas-service-create?tabs=dotnet) or `shared_key` [(info)](https://docs.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?tabs=azure-portal) is necessary to access private blobs. ### Reader Options -The Reader in charge of loading the file format is currently based on [Pandas IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). It is possible to customize how to load the file into a Pandas DataFrame as part of this Source Connector. This is doable in the `reader_options` that should be in JSON format and depends on the chosen file format. See pandas' documentation, depending on the format: +The Reader in charge of loading the file format is currently based on [Pandas IO Tools](https://pandas.pydata.org/pandas-docs/stable/user\_guide/io.html). It is possible to customize how to load the file into a Pandas DataFrame as part of this Source Connector. This is doable in the `reader_options` that should be in JSON format and depends on the chosen file format. See pandas' documentation, depending on the format: -For example, if the format `CSV` is selected, then options from the [read\_csv](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table) functions are available. 
+For example, if the format `CSV` is selected, then options from the [read\_csv](https://pandas.pydata.org/pandas-docs/stable/user\_guide/io.html#io-read-csv-table) functions are available. -* It is therefore possible to customize the `delimiter` \(or `sep`\) to `\t` in case of tab separated files. +* It is therefore possible to customize the `delimiter` (or `sep`) to in case of tab separated files. * Header line can be ignored with `header=0` and customized with `names` * etc We would therefore provide in the `reader_options` the following json: -```text +``` { "sep" : "\t", "header" : 0, "names": "column1, column2"} ``` -In case you select `JSON` format, then options from the [read\_json](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-json-reader) reader are available. +In case you select `JSON` format, then options from the [read\_json](https://pandas.pydata.org/pandas-docs/stable/user\_guide/io.html#io-json-reader) reader are available. -For example, you can use the `{"orient" : "records"}` to change how orientation of data is loaded \(if data is `[{column -> value}, … , {column -> value}]`\) +For example, you can use the `{"orient" : "records"}` to change how orientation of data is loaded (if data is `[{column -> value}, … , {column -> value}]`) #### Changing data types of source columns @@ -91,27 +91,27 @@ Normally, Airbyte tries to infer the data type from the source, but you can use Here are a list of examples of possible file inputs: -| Dataset Name | Storage | URL | Reader Impl | Service Account | Description | -| :--- | :--- | :--- | :--- | :--- | :--- | -| epidemiology | HTTPS | [https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv](https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv) | | | [COVID-19 Public dataset](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program?filter=solution-type:dataset&id=7d6cc408-53c8-4485-a187-b8cb9a5c0b56) on BigQuery | -| hr\_and\_financials | GCS | gs://airbyte-vault/financial.csv | smart\_open or gcfs | {"type": "service\_account", "private\_key\_id": "XXXXXXXX", ...} | data from a private bucket, a service account is necessary | -| landsat\_index | GCS | gcp-public-data-landsat/index.csv.gz | smart\_open | | Using smart\_open, we don't need to specify the compression \(note the gs:// is optional too, same for other providers\) | +| Dataset Name | Storage | URL | Reader Impl | Service Account | Description | +| ------------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| epidemiology | HTTPS | [https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv](https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv) | | | [COVID-19 Public dataset](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-public-data-program?filter=solution-type:dataset\&id=7d6cc408-53c8-4485-a187-b8cb9a5c0b56) on BigQuery | +| hr\_and\_financials | GCS | gs://airbyte-vault/financial.csv | smart\_open or gcfs | {"type": "service\_account", 
"private\_key\_id": "XXXXXXXX", ...} | data from a private bucket, a service account is necessary | +| landsat\_index | GCS | gcp-public-data-landsat/index.csv.gz | smart\_open | | Using smart\_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) | Examples with reader options: -| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description | -| :--- | :--- | :--- | :--- | :--- | :--- | -| landsat\_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to `read_csv` | -| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | Here is TSV data separated by tabs without header row from [AWS Open Data](https://registry.opendata.aws/gdelt/) | -| server\_logs | local | /local/logs.log | | {"sep": ";"} | After making sure a local text file exists at `/tmp/airbyte_local/logs.log` with logs file from some server that are delimited by ';' delimiters | +| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description | +| -------------- | ------- | ----------------------------------------------- | ----------- | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| landsat\_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to `read_csv` | +| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | Here is TSV data separated by tabs without header row from [AWS Open Data](https://registry.opendata.aws/gdelt/) | +| server\_logs | local | /local/logs.log | | {"sep": ";"} | After making sure a local text file exists at `/tmp/airbyte_local/logs.log` with logs file from some server that are delimited by ';' delimiters | Example for SFTP: -| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description | -| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | -| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": \["text"\], "engine": "python"} | We use `python` engine for `read_csv` in order to handle delimiter of more than 1 character while providing our own column names. | +| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description | +| ------------ | ------- | ---- | -------- | --------------- | ----------------------- | ----------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | +| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": \["text"], "engine": "python"} | We use `python` engine for `read_csv` in order to handle delimiter of more than 1 character while providing our own column names. | -Please see \(or add\) more at `airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py` for further usages examples. +Please see (or add) more at `airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py` for further usages examples. 
## Performance Considerations and Notes @@ -122,20 +122,19 @@ In order to read large files from a remote location, this connector uses the [sm ## Changelog -| Version | Date | Pull Request | Subject | -| :--- | :--- | :--- | :--- | -| 0.2.7 | 2021-10-28 | [7387](https://github.com/airbytehq/airbyte/pull/7387) | Migrate source to CDK structure, add SAT testing. | -| 0.2.6 | 2021-08-26 | [5613](https://github.com/airbytehq/airbyte/pull/5613) | Add support to xlsb format | -| 0.2.5 | 2021-07-26 | [4953](https://github.com/airbytehq/airbyte/pull/4953) | Allow non-default port for SFTP type | -| 0.2.4 | 2021-06-09 | [3973](https://github.com/airbytehq/airbyte/pull/3973) | Add AIRBYTE\_ENTRYPOINT for Kubernetes support | -| 0.2.3 | 2021-06-01 | [3771](https://github.com/airbytehq/airbyte/pull/3771) | Add Azure Storage Blob Files option | -| 0.2.2 | 2021-04-16 | [2883](https://github.com/airbytehq/airbyte/pull/2883) | Fix CSV discovery memory consumption | -| 0.2.1 | 2021-04-03 | [2726](https://github.com/airbytehq/airbyte/pull/2726) | Fix base connector versioning | -| 0.2.0 | 2021-03-09 | [2238](https://github.com/airbytehq/airbyte/pull/2238) | Protocol allows future/unknown properties | -| 0.1.10 | 2021-02-18 | [2118](https://github.com/airbytehq/airbyte/pull/2118) | Support JSONL format | -| 0.1.9 | 2021-02-02 | [1768](https://github.com/airbytehq/airbyte/pull/1768) | Add test cases for all formats | -| 0.1.8 | 2021-01-27 | [1738](https://github.com/airbytehq/airbyte/pull/1738) | Adopt connector best practices | -| 0.1.7 | 2020-12-16 | [1331](https://github.com/airbytehq/airbyte/pull/1331) | Refactor Python base connector | -| 0.1.6 | 2020-12-08 | [1249](https://github.com/airbytehq/airbyte/pull/1249) | Handle NaN values | -| 0.1.5 | 2020-11-30 | [1046](https://github.com/airbytehq/airbyte/pull/1046) | Add connectors using an index YAML file | - +| Version | Date | Pull Request | Subject | +| ------- | ---------- | ------------------------------------------------------ | ------------------------------------------------- | +| 0.2.7 | 2021-10-28 | [7387](https://github.com/airbytehq/airbyte/pull/7387) | Migrate source to CDK structure, add SAT testing. 
| +| 0.2.6 | 2021-08-26 | [5613](https://github.com/airbytehq/airbyte/pull/5613) | Add support to xlsb format | +| 0.2.5 | 2021-07-26 | [4953](https://github.com/airbytehq/airbyte/pull/4953) | Allow non-default port for SFTP type | +| 0.2.4 | 2021-06-09 | [3973](https://github.com/airbytehq/airbyte/pull/3973) | Add AIRBYTE\_ENTRYPOINT for Kubernetes support | +| 0.2.3 | 2021-06-01 | [3771](https://github.com/airbytehq/airbyte/pull/3771) | Add Azure Storage Blob Files option | +| 0.2.2 | 2021-04-16 | [2883](https://github.com/airbytehq/airbyte/pull/2883) | Fix CSV discovery memory consumption | +| 0.2.1 | 2021-04-03 | [2726](https://github.com/airbytehq/airbyte/pull/2726) | Fix base connector versioning | +| 0.2.0 | 2021-03-09 | [2238](https://github.com/airbytehq/airbyte/pull/2238) | Protocol allows future/unknown properties | +| 0.1.10 | 2021-02-18 | [2118](https://github.com/airbytehq/airbyte/pull/2118) | Support JSONL format | +| 0.1.9 | 2021-02-02 | [1768](https://github.com/airbytehq/airbyte/pull/1768) | Add test cases for all formats | +| 0.1.8 | 2021-01-27 | [1738](https://github.com/airbytehq/airbyte/pull/1738) | Adopt connector best practices | +| 0.1.7 | 2020-12-16 | [1331](https://github.com/airbytehq/airbyte/pull/1331) | Refactor Python base connector | +| 0.1.6 | 2020-12-08 | [1249](https://github.com/airbytehq/airbyte/pull/1249) | Handle NaN values | +| 0.1.5 | 2020-11-30 | [1046](https://github.com/airbytehq/airbyte/pull/1046) | Add connectors using an index YAML file | diff --git a/docs/understanding-airbyte/glossary.md b/docs/understanding-airbyte/glossary.md index e69136025958..e09919cba1a0 100644 --- a/docs/understanding-airbyte/glossary.md +++ b/docs/understanding-airbyte/glossary.md @@ -2,11 +2,15 @@ ### Airbyte CDK -The Airbyte CDK \(Connector Development Kit\) allows you to create connectors for Sources or Destinations. If your source or destination doesn't exist, you can use the CDK to make the building process a lot easier. It generates all the tests and files you need and all you need to do is write the connector-specific code for your source or destination. We created one in Python which you can check out [here](../connector-development/cdk-python/) and the Faros AI team created a Javascript/Typescript one that you can check out [here](../connector-development/cdk-faros-js.md). +The Airbyte CDK (Connector Development Kit) allows you to create connectors for Sources or Destinations. If your source or destination doesn't exist, you can use the CDK to make the building process a lot easier. It generates all the tests and files you need and all you need to do is write the connector-specific code for your source or destination. We created one in Python which you can check out [here](../connector-development/cdk-python/) and the Faros AI team created a Javascript/Typescript one that you can check out [here](../connector-development/cdk-faros-js.md). ### DAG -DAG stands for **Directed Acyclic Graph**. It's a term originally coined by math graph theorists that describes a tree-like process that cannot contain loops. For example, in the following diagram, you start at A and can choose B or C, which then proceed to D and E, respectively. This kind of structure is great for representing workflows and is what tools like [Airflow](https://airflow.apache.org/) use to orchestrate the execution of software based on different cases or states. ![](../.gitbook/assets/glossary_dag_example.png) +DAG stands for **Directed Acyclic Graph**. 
It's a term originally coined by math graph theorists that describes a tree-like process that cannot contain loops. For example, in the following diagram, you start at A and can choose B or C, which then proceed to D and E, respectively. This kind of structure is great for representing workflows and is what tools like [Airflow](https://airflow.apache.org) use to orchestrate the execution of software based on different cases or states. + + + +![](../.gitbook/assets/glossary\_dag\_example.png) ### ETL/ELT @@ -54,5 +58,4 @@ This refers to the functions that a Source or Destination must implement to succ This is only relevant for individuals who want to learn about or contribute to our underlying platform. {% endhint %} -[Temporal](https://temporal.io/) is a development kit that lets you create workflows, parallelize them, and handle failures/retries gracefully. We use it to reliably schedule each step of the ELT process, and a Temporal service is always deployed with each Airbyte installation. - +[Temporal](https://temporal.io) is a development kit that lets you create workflows, parallelize them, and handle failures/retries gracefully. We use it to reliably schedule each step of the ELT process, and a Temporal service is always deployed with each Airbyte installation.