Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend web_backend/connections/get to include diff of catalog #13232

Closed
timroes opened this issue May 26, 2022 · 9 comments · Fixed by #13918
Closed

Extend web_backend/connections/get to include diff of catalog #13232

timroes opened this issue May 26, 2022 · 9 comments · Fixed by #13918
Assignees

Comments

@timroes
Copy link
Collaborator

timroes commented May 26, 2022

For the connection isolation changes, we're requiring information about which streams have changed and how, when we get a new catalog. When the user uses the "Refresh source schema" button in the UI we're doing a call to v1/web_backend/connections/get and set the withRefreshedCatalog: true in the query, which will fetch a new catalog and return that as syncCatalog in the response. To implement the designs we have for showing changes, we'll also need this API to have a new key syncCatalogChanges, that will contain the changes between the new catalog and the previously synced one.

I'd expect the format to be as follows:

syncCatalogChanges: null, if there was no previous catalog (i.e. if we're just creating the connection for the first time).
Otherwise syncCataglogChanges should be an object in the following format:

{
  streamsRemoved: ['stream1', 'stream2'], // Names of the streams that have been removed
  streamsAdded: ['streamNew1'], // Names of the streams added,
  streamsChanged: { // an Object containing one entry per stream that has changes
    stream3: { // Name of the stream that changed
      fieldsAdded: [ // An array of fields that have been added including their jsonSchema. 
        { name: 'newField1', schema: { type: 'string'} },
      ],
      fieldsRemoved: [ // Same format as fieldsAdded but for removed fields
        { name: 'removedField1', schema: { type: 'boolean'} }
      ],
      fieldsChanged: [ // a list of all field changes
        // type change contain previous and current schema of that field
        { name: 'fieldWithTypeChange', schemaChange: { from: { type: 'string' }, to: { type: 'boolean' } }},
      ]
    }
  }
}

@benmoriceau Please let me know if there are any questions around the API.

cc @andyjih

@alovew
Copy link
Contributor

alovew commented May 27, 2022

Thanks for this ticket @timroes! @lmossman and I were discussing this yesterday and I don't think we need an oldName field here, since we don't process table name changes. If a table name changes, we would read it as a new table being added and another one being removed, so those would be listed under streamsRemoved and streamsAdded.

@timroes
Copy link
Collaborator Author

timroes commented May 27, 2022

Thanks for that info. We had that in the designs so I thought it was planned to track. If not we can of course drop this. Does the same apply for field names?

@alovew
Copy link
Contributor

alovew commented May 27, 2022

We can track field name changes. We discussed possibly pulling the field name change display out into a separate feature, but given this new API format maybe it makes sense to include it here. @lmossman what do you think?

@timroes
Copy link
Collaborator Author

timroes commented May 31, 2022

@alovew I think it will be fine even extending the API later to include field name changes, since it's just an extension to the API, not breaking any backwards compatibility.

@benmoriceau
Copy link
Contributor

Think will be complicated to do. This means that the connection/get endpoint will have to run and wait the end of the run of the discover catalog job and the process the diff. Processing the diff will be doable but fetching the new catalog should not be in this endpoint because of the time it will take.

@timroes
Copy link
Collaborator Author

timroes commented Jun 1, 2022

Sorry I missed the crutial part here. This should only be calculated as long as withRefreshedCatalog is true in the request (updated that in the description above), in which case this API already as of today does refetch the catalog (also just for double checking, we're talking about the v1/web_backend/connections/get API, not the v1/connections/get endpoint!).

Since it's current behavior is already to refretch the catalog with that parameter set to true I think we're not adding any problem into that. Whether or not that is an ideal API scenario I'd leave up for discussion on whether it would make more sense to separate those APIs away from each other, but I think that might be out of the scope of this project, and might be part of further API refactorings we would want to do.

@lmossman
Copy link
Contributor

lmossman commented Jun 1, 2022

We can track field name changes. We discussed possibly pulling the field name change display out into a separate feature, but given this new API format maybe it makes sense to include it here. @lmossman what do you think?

Sorry, didn't see this earlier. I think field name changes would be hard for us to detect, for the same reason that stream name changes are hard to detect - in either case, we would be comparing the new schema with the old schema and we would just see that some field/stream name is missing from the old schema and a new field/stream name is present in the new schema. I don't think there is any way we can tell that it was changed.

The only thing we can reasonably detect is field type changes, since we can compare the types of fields with the same name in a given stream

@benmoriceau benmoriceau self-assigned this Jun 9, 2022
@cgardens cgardens assigned cgardens and unassigned benmoriceau Jun 15, 2022
@teallarson
Copy link
Contributor

teallarson commented Jun 23, 2022

Two things I noticed working on the frontend:

  1. I'm seeing a 500 error on the API call when I refresh the source schema and there was a transformed field. Internal Server Error: No enum constant io.airbyte.api.model.generated.FieldTransform.TransformTypeEnum.UPDATE_FIELD
  2. Any non-enabled stream shows up as an "add_stream" on every API response, regardless of if it's "new". Is that the expected behavior?

@timroes
Copy link
Collaborator Author

timroes commented Jun 27, 2022

@cgardens Can you have a look at the above questions, please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants