Normalization not working on "HTTP Request" connector #2459
Can you share what error you're getting? You should be able to see it in the logs in the UI. @ChristopheDuong for visibility.
The "problem" being reported here is that the catalog being sent by the source is too simple (just object) So normalization is not able to unnest anything even though the data from the HTTP source contains arrays of rows/nested structures:
|
The job completes successfully. Only the normalization is not happening as expected. Multiple JSON records get loaded as one record.
@meenagangadharan this is an implementation decision we made in the connector, since it is not possible to reliably know from the URL alone what the output schema should be. It seems like there are some reasonable paths forward to address your request:
Would having a solution like 2 fix your problem?
Hi, I am the data architect / PO managing @meenagangadharan. The main problem is the fact that individual rows are not being created when importing what is returned.
From what I can see, our API does not return data this way?
@bashyroger The data from the API has this structure (see the sample in the Steps to Reproduce section below).
@sherifnada, if we are to proceed with providing a schema file, are you expecting any particular format? Can you give an example of what it looks like? It would be helpful.
There are examples of what JSON schema files look like here:
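For illustration only, a JSON schema describing the sample response shown later in this issue might look roughly like this (the field names come from that sample; everything else is an assumption, not an official Airbyte format):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "id": { "type": "integer" },
          "email": { "type": "string" },
          "first_name": { "type": "string" }
        }
      }
    }
  }
}
```

With data declared as an array of objects like this, instead of just "object", normalization would have enough structure to unnest the rows.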
@sherifnada @ChristopheDuong, the API that I am working on currently has an OpenAPI schema specification. I need to convert it to the format specified here. Will there be a new version where we can optionally update the schema? It would be very helpful if we could have the option to use the OpenAPI metadata.
@meenagangadharan We could potentially support OpenAPI and JSON Schema as the format. I'm realizing there are a couple of complications with this approach though:
I think we can definitely do this given the caveats, but I want to make sure that these are acceptable caveats. A simpler alternative to all this is to split the JSON in the destination using dbt or a custom SQL query. This is more in line with ELT as described here, which is what Airbyte is going to be best at. Is this something that would work for you? Let me know which of the two approaches would make the most sense for your use case. Also, would you be open to contributing this feature to the connector? (It's written in Python.)
@sherifnada, I am looking for the data to be split and loaded as multiple rows in BigQuery as the initial step through Airbyte. Further unnesting the JSON record and loading it into multiple columns can be done using dbt/SQL.
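For illustration only, a custom SQL step along these lines could split the raw JSON into one row per element of the data array; the project, dataset, table, and _airbyte_data column names below are assumptions about the raw BigQuery layout, not confirmed in this issue:

```sql
-- Sketch: explode the "data" array of the raw JSON payload into one row per element.
-- Project, dataset, table, and column names are assumptions.
SELECT
  JSON_EXTRACT_SCALAR(item, '$.id')         AS id,
  JSON_EXTRACT_SCALAR(item, '$.email')      AS email,
  JSON_EXTRACT_SCALAR(item, '$.first_name') AS first_name
FROM `my_project.my_dataset._airbyte_raw_usersdata`,
  UNNEST(JSON_EXTRACT_ARRAY(_airbyte_data, '$.data')) AS item;
```

A query like this could also live in a dbt model so it runs after each sync.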
Hi @sherifnada, as @meenagangadharan mentioned, it "is ok" for us to extract the keys / flatten / apply the schema after the initial raw data import. Not doing this will create problems when importing a large data set, as some databases (like BigQuery / Redshift) have a limit on how large a JSON-containing cell can be. So being able to at least import multiple rows when schema metadata is not there is paramount.
@bashyroger great point about the max row size. We'll prioritize working on this soon. More generally, FWIW, it might be much more straightforward to implement a custom connector for your use case rather than try to fit the HTTP connector to all use cases. It's tricky to implement a generic HTTP source connector that will work for most APIs, as they all contain slightly different permutations of features (auth, iteration, pagination, rate limiting, data extraction like in this case, different schemas, etc.). It's absolutely the gold standard we want to reach (we want to expose a library which allows you to effectively build a connector via YAML), but it's hard to make an accurate estimate about the timeline right now.
@sherifnada, @ChristopheDuong, I am working on building a custom connector. Hope this works out. I would appreciate any support in the process if required! Thanks.
@meenagangadharan please reach out here or to me on Slack -- happy to support however I can!
Hi @sherifnada @ChristopheDuong, I created a connector, source-reducept, and tested it using Docker and Python commands. As mentioned in Step 8, I am trying to execute the Gradle build command and am getting the below error. Can you please help me with this?
Deprecated Gradle features were used in this build, making it incompatible with Gradle 7.0. BUILD FAILED in 7m 32s
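For reference, the Step 8 build is typically run from the root of the airbyte repository roughly like this (the exact project path for source-reducept is an assumption):

```shell
# Run from the root of the airbyte repository; the project path is assumed.
./gradlew :airbyte-integrations:connectors:source-reducept:build
```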
@meenagangadharan can you run the command with the --scan flag and share the link it generates?
Hi @sherifnada, @ChristopheDuong, cc @bashyroger, while testing through Python and Docker commands, it executed and displayed output records. In config.json, the hardcoded contents exist. Would they be overwritten when passing them from here?
Done. The link is generated as https://gradle.com/s/pg3j7u5ogwlas
Hi @sherifnada,
@meenagangadharan are you happy with the connector's behavior locally? If so, you can just maintain it on your own -- no need to submit it to Airbyte. In this case there is no need to pass the standard tests. Just publish the connector to a Docker repository like Docker Hub (or just build it locally on the node that you are running Airbyte on).
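As a rough sketch of that local option (the image name and tag below are placeholders, not a required convention), building and tagging the connector image on the node could look like:

```shell
# Build the connector image from its directory; image name and tag are placeholders.
cd airbyte-integrations/connectors/source-reducept
docker build . -t your-dockerhub-user/source-reducept:dev
```

The image can then be referenced by that name and tag when adding the custom source in the Airbyte UI.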
@meenagangadharan Airbyte doesn't currently offer the ability to delete connector definitions. The best way would be to reset to a clean Airbyte instance, or wait until a future version offers that ability. For now you can probably just rename all your connectors to indicate they are the old ones.
I edited the code to remove pagination and kept it simple, to extract the data with columns/rows separated. Earlier, with pagination, it also extracted all records, but at the end it is still placed as one record.
I am extracting the r['data'] part from the response and keeping it as a dict, since that is the required format for AirbyteMessage. It throws the below error:
Code:
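For what it's worth, a minimal sketch of emitting one record per element of r['data'] (instead of the whole payload as a single dict) could look like this; the import path, stream name, and function shape are assumptions and depend on the CDK/base version the connector uses:

```python
from datetime import datetime

import requests
from airbyte_cdk.models import AirbyteMessage, AirbyteRecordMessage, Type  # import path assumed


def read_records(url: str, stream_name: str = "usersdata"):
    """Yield one AirbyteMessage per element of the response's 'data' array."""
    response = requests.get(url).json()
    emitted_at = int(datetime.now().timestamp()) * 1000
    for row in response["data"]:
        # Each row is already a dict, which is the shape AirbyteRecordMessage.data expects.
        yield AirbyteMessage(
            type=Type.RECORD,
            record=AirbyteRecordMessage(stream=stream_name, data=row, emitted_at=emitted_at),
        )
```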
@meenagangadharan let's move this conversation to Slack. Any questions will be answered more quickly since all team members are there and messages are better surfaced to us. I'm going to close this ticket for now.
Expected Behavior
I am loading data from a private REST API to BigQuery. Normalization is enabled. Using the latest version, 0.17.1-alpha.
The job completes successfully. The JSON data extracted from the API is expected to be split and loaded into multiple records of the BigQuery table.
Current Behavior
The output JSON is loaded as a single record in the data field of the BigQuery table.
Logs
The destination_catalog.json shows the below content:

```shell
docker run -it --rm --volume airbyte_workspace:/data busybox cat /data/7/0/destination_catalog.json
```

```json
{"streams":[{"stream":{"name":"/usersdata","json_schema":{"type":"object","$schema":"http://json-schema.org/draft-07/schema#","properties":{"data":{"type":"object"}},"additionalProperties":true},"supported_sync_modes":["full_refresh"],"default_cursor_field":[]},"sync_mode":"full_refresh","cursor_field":[]}]}
```
Steps to Reproduce
The API returns data with this structure:
```json
{
  "data": [
    {
      "id": 4971,
      "email": "meena@gmail.com",
      "first_name": "Meena"
    }
  ],
  "links": {
    "first": "#####################",
    "last": "#####################",
    "prev": null,
    "next": "#####################"
  },
  "meta": {
    "current_page": 1,
    "from": 1,
    "last_page": 126,
    "links": [
      {
        "url": null,
        "label": "« Previous",
        "active": false
      },
      {
        "url": "#####################",
        "label": "1",
        "active": true
      },
      {
        "url": "https:#####################",
        "label": "2",
        "active": false
      }
    ],
    "path": "https:#####################",
    "per_page": 50,
    "to": 50,
    "total": 6270
  }
}
```
Severity of the bug for you
High
Airbyte Version
Found in the .env file in the root of the project: 0.17.1-alpha
Connector Version (if applicable)
HTTP Request - 0.2.1
Additional context
Airbyte is installed on GCP