Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle records in reverse chronological order #20

Closed
castorm opened this issue May 25, 2020 · 7 comments · Fixed by #21
Closed

Handle records in reverse chronological order #20

castorm opened this issue May 25, 2020 · 7 comments · Fixed by #21
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers
Milestone

Comments

@castorm
Copy link
Owner

castorm commented May 25, 2020

Problem

Some APIs don't offer control over the ordering of results.

This might be especially common on APIs that offer a snapshot of the N most recent records.

The source connector bases its offset advancement on chronologically ordered results, so latest committed offset would be that of the oldest record instead of the newest. Ignoring this would result in undesired consequences:

  • when offset is used in the query, offset would advance one record at a time and it would require as many queries as records in the response.
  • inability to deduplicate, which would only make the problem above worst.

Potential Solutions

  • automatically sorting records in chronological order, this would be the simplest from user perspective as it doest require even knowing of this concern, however, it would penalize performance, especially when records are reversed
  • allowing the user to specify the sorting direction of records in the response.
@castorm castorm self-assigned this May 25, 2020
@castorm castorm added good first issue Good for newcomers and removed prioritization labels May 25, 2020
@castorm
Copy link
Owner Author

castorm commented May 25, 2020

Unveiled as part of #19 as a result of inspecting NY Times API

@castorm castorm added the enhancement New feature or request label May 25, 2020
@rumline
Copy link

rumline commented May 25, 2020

Perhaps a 3-way behavior with two configuration settings?

connector.sorting.enabled as boolean (defaults to false)
connector.sorting.direction as forward or reverse

Allow the end user to not take the performance penalty if they know the source API will sort correctly for them. Also allowing the end user to choose the direction of connector sorting will cover strange use cases easily.

@castorm
Copy link
Owner Author

castorm commented May 25, 2020

I'm still thinking about it, I was considering the possibility of providing different ordering criteria, or whether to make this extensible, but to be honest, I think I might have been over-engineering in my head.

What you suggest makes perfect sense, I think I'll do something in those lines. If anything I would probably go for reducing it to a single property with a default value, for the sake of simplicity. As in direction=Forward|Reverse, defaulting to Forward, which is the current behavior.

Thanks for your input, and stay tuned! :)

@rumline
Copy link

rumline commented May 25, 2020

If using just a single property, then make it take None so the connector does no work at all. You are relying on a known, well written API. You still get 3-way behavior that way.

@castorm
Copy link
Owner Author

castorm commented May 25, 2020

Yes, you are right, I misunderstood the purpose of the flag the first time I read it.

Actually I think it's a 4-way behavior:

  • Not sorted (server-sorted)
    • As-is (your None)
    • Reversed
  • Chronologically sorted
    • Oldest first (your Forward)
    • Newest first (your Reverse)

@castorm
Copy link
Owner Author

castorm commented May 25, 2020

I think this categorization just made my mind, I think the cleanest would be for it to be two different properties in the end. Using SQL-like naming (default in parenthesis):

order.by: (None)|Timestamp
order.direction: (Asc)|Desc

I guess it's a matter of picking the right naming now.

@castorm castorm linked a pull request May 26, 2020 that will close this issue
@castorm castorm added this to the v0.7.4 milestone May 26, 2020
@castorm
Copy link
Owner Author

castorm commented May 26, 2020

This has been released as part of v0.7.4

Documentation can be found here:
https://github.com/castorm/kafka-connect-http#sorter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants