
2023 05 30


2023-05-30 9pm in the middle of the night

Endpoints

  • GET/PUT /episodes
    • returns only episodes that have changed
    • parameter since
  • GET/PUT /episodes/{guid-hash}
    • Don't offer this endpoint, to prevent problems with duplicate GUIDs
  • GET /subscriptions/{guid}/episodes
    • parameter since
    • parameter guid?
  • GET/PUT /subscriptions/{guid}/episodes/{fetch-hash} (hash: SHA1?)
    • If two episodes produce the same fetch-hash (a clash), the server is expected to return 400 Bad Request
    • A hash is used here because episode GUIDs can be arbitrary strings, not necessarily safe for use in a URL path
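As a rough illustration only, a client call against these endpoints could look like the sketch below; the base URL, the bearer-token authentication and the JSON response shape are assumptions, nothing here is decided yet.

import requests

BASE = "https://sync.example.com/v1"           # hypothetical server, not part of the spec
HEADERS = {"Authorization": "Bearer <token>"}  # auth mechanism still undecided

# All episodes changed since a given timestamp
resp = requests.get(
    f"{BASE}/episodes",
    params={"since": "2023-05-30T21:00:00Z"},
    headers=HEADERS,
)
changed_episodes = resp.json()

# Episodes of a single subscription, identified by its podcast GUID (example value)
podcast_guid = "917393e3-1b1e-5cef-ace4-edaa54e1f810"
resp = requests.get(
    f"{BASE}/subscriptions/{podcast_guid}/episodes",
    params={"since": "2023-05-30T21:00:00Z"},
    headers=HEADERS,
)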

We want to explain in the specs why we have endpoints 'under' subscriptions, and why we might refuse updates. (i.e. how this will help avoid gPodder API pitfalls.)

Episode endpoint

The episode endpoint is required to synchronize playback positions and played status for specific episodes. At a minimum, the endpoint should accept and return the following:

  1. The episode's Podcast GUID (most recent)
  2. The episode's GUID (sent by the client if found in the RSS feed, or generated by the server if not): a string (not necessarily GUID- or URL-formatted)
  3. A Status field containing lifecycle statuses. E.g.:
    • New
    • Played
    • Ignored
    • Queued
  4. A Playback position marker, updated by a PUT request
  5. A timestamp of the last time the episode was played/paused (used for conflict resolution on the playback position)
  6. A Favorite field to mark episodes
  7. A timestamp for the last time some metadata (except playback position) was updated
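Purely as an illustration, a single episode object carrying these fields might look like the following; the field names, the position unit and the status values are placeholders, not agreed-upon spec.

# Hypothetical episode payload, expressed as a Python dict; all names are placeholders
episode = {
    "podcast_guid": "917393e3-1b1e-5cef-ace4-edaa54e1f810",  # 1. podcast GUID (example value)
    "guid": "episode-123",                                   # 2. episode GUID, any string
    "status": "played",                                      # 3. lifecycle status: new/played/ignored/queued
    "playback_position": 1234,                               # 4. playback position (unit to be decided)
    "playback_updated_at": "2023-05-30T21:05:00Z",           # 5. last played/paused timestamp
    "favorite": True,                                        # 6. favourite flag
    "metadata_updated_at": "2023-05-30T20:00:00Z",           # 7. last metadata change (excl. position)
}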

We discussed whether it makes sense to use episode numbers, but they are not part of the feed anyway, so we don't have that information and don't need it.

https://www.rssboard.org/rss-specification#ltguidgtSubelementOfLtitemgt

Episode identification

Fetch-hash vs GUID

Discussion: should we generate a new (static?) identifier per episode and use it for synchronisation (clients would then have to store it additionally per episode), or use existing GUIDs as the sync identifier and generate one if none is present (the per-episode endpoint would then need GUIDs to be passed as a hash or Base64 value for REST compliance)?

Fetch-hash

Fetch-hash creation: SHA1/MD5 hash of

  1. <guid> https://www.rssboard.org/rss-specification#ltguidgtSubelementOfLtitemgt
  2. <link> https://www.rssboard.org/rss-specification#hrelementsOfLtitemgt
  3. <enclosure> (aka media file URL) https://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt

Priority of the latter two is TBD: <link> might be less likely to be unique, while <enclosure> might be less stable (more likely to change).

Consideration: why not Base64? (URL-safe for REST, and it can be decoded back to the original value, so the identifier wouldn't have to be stored on the server.)
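A minimal sketch of both candidates, assuming the hash input is simply the three fields joined with a separator (the exact input format, separator and digest are still open):

import base64
import hashlib

def fetch_hash(guid: str, link: str, enclosure: str) -> str:
    # Assumption: join the fields with a newline and take the SHA-1 hex digest
    raw = "\n".join([guid or "", link or "", enclosure or ""])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def base64_id(guid: str) -> str:
    # Alternative: URL-safe Base64 of the GUID itself; reversible, so the server
    # wouldn't have to store a separate hash value
    return base64.urlsafe_b64encode(guid.encode("utf-8")).decode("ascii").rstrip("=")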

Good practice/required: store all 3 (GUID, link, media file URL). This will allow for later matching of episodes if one or two of these are missing. For example, if a totally new client is connecting to a server, and an episode doesn't have a GUID and the <link> has changed, matching would still be possible based on the media file URL. (If we don't do this, finding the right episode locally might be hard when receiving a fetch-hash that's not unique, or a GUID that's missing. We know the podcast, and within each podcast there'll be only a limited set of 'wrong' episodes, so a client would only have to create hashes for a few episodes in order to find a match. But still, not very economical.)

Matching proposal in pseudo-code
are_episodes_equal(client-episode c, server-episode s):
  // this filters out any potential GUID duplicates
  if c.podcast_guid != s.podcast_guid then
    return False
  
  // if GUID is present, decide exclusively according to it
  if c.guid not empty then
    return c.guid == s.guid
  
  // if enclosure matches, probably the same (since they share the media file)
  if c.enclosure not empty && c.enclosure == s.enclosure then
    return True
  
  // case: no media file
  if c.enclosure empty then
    // no guid, enclosure or link -> not matchable
    if c.link empty then
      return False
      
    // no media file, but episode URL matches - very probably the same
    // (how large is the error here?)
    if c.link == s.link then
      return True
      
  // All other cases: not matching
  return False

Open question: each field that is empty/not present in the RSS is stored and sent empty. The fetch-hash is only used when sending a request about a specific episode (that wouldn't work well for batch updates, see below). Payloads don't contain fetch-hashes, only the three separate fields.

Two options for identifying episodes in communication: [I don't think these are the only options, see here]

  • For each episode (e.g. in the queue; batch updates), all three fields/tags are included. This means a lot of (unnecessary) data exchange.
  • Each episode gets a calculated fetch-hash, which is used for communication. Clients can decide to store or generate on the fly. (Generating on-the-fly is dangerous, episode identifier should be static even if episode changes)

The server creates the fetch-hash, similarly to the creation of the Podcast GUID, based on the logic described above.

Why do we trust the server more than the client to create the hash? Because each user probably has just one server in the game, but likely multiple clients. So even if the server messes it up, there's still a single outcome for each user.
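Putting the pieces together, a client update for a single episode might then look like this sketch, reusing the fetch_hash() helper sketched above; the URL, body shape and field names are assumptions:

import requests

BASE = "https://sync.example.com/v1"   # hypothetical server
podcast_guid = "917393e3-1b1e-5cef-ace4-edaa54e1f810"   # example value
fh = fetch_hash("episode-123", "https://example.org/ep123", "https://example.org/ep123.mp3")

# The payload carries the three raw fields, not the fetch-hash itself
resp = requests.put(
    f"{BASE}/subscriptions/{podcast_guid}/episodes/{fh}",
    json={
        "guid": "episode-123",
        "link": "https://example.org/ep123",
        "enclosure": "https://example.org/ep123.mp3",
        "playback_position": 1234,
    },
    headers={"Authorization": "Bearer <token>"},
)
if resp.status_code == 400:
    # proposed behaviour: 400 Bad Request on a fetch-hash clash
    pass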

GUID

Why shouldn't the server just create a GUID (seeded from the available payload or the whole episode, or simply random) and send it back to the client? The client would map it using <enclosure> and <link> and then store this GUID. [Advantage: fewer payload fields, only <enclosure>, <link> and <guid>, and after the first sync only <guid> (the guid-hash is only needed for PUT /subs../{guid}/epis../{guid-hash}).] [Further advantage: easier to implement for clients, which probably already have an episode_guid field in their DB.]

Only create a GUID if none is present; otherwise use the existing one. Always identify an episode by podcast_guid + episode_guid (e.g. when referencing queue items, settings, ...). [PodcastIndex seems to handle this the same way.]

The workflow if a new client connects could then be:

  1. Get subscriptions & fetch feeds
  2. Get episodes
  3. Feed with GUIDs: map by GUID
  4. Feed without GUIDs: map by matching algorithm [above], then store GUID from sync server
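A sketch of steps 3 and 4, assuming the are_episodes_equal matching function proposed above; the function name and data shapes are illustrative only:

def adopt_server_guids(local_episodes, server_episodes, are_episodes_equal):
    # Step 3: episodes that already carry a feed GUID need no extra mapping,
    # because client and server share the same GUID.
    # Step 4: episodes without a GUID are matched with the algorithm above
    # and then store the GUID generated by the sync server.
    for local in local_episodes:
        if local.guid:
            continue
        for remote in server_episodes:
            if are_episodes_equal(local, remote):
                local.guid = remote.guid
                break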

Deduplication

Two options:

  a. Agree on a deduplication logic as part of the spec, to be executed at server level (hard to 'enforce').
  b. Let clients figure out deduplication, and spec the calls that will allow clients to merge episodes.

To be discussed further. The latter is easier for us :-) The latter should be in the spec in either case, so that we don't have to change the whole spec if some podcast feeds mess up in a way we never anticipated; clients can adapt a lot faster.

New GUID/Fetch-hash logic

Necessary for changing GUIDs, can also be used for deduplication?

Options:

  1. PUT /episodes with additional field old_fetch-hash (or old_guid)
  2. PUT /subscriptions/{guid}/episodes/{guid-/fetch-hash} with additional field new_fetch-hash (or new_guid)

Case where both episodes are contained in the feed (the episode didn't change, but the podcasters published it twice): to mark a duplicate, an additional boolean is_duplicate tells the server to treat the fetch-hashes/GUIDs of both as aliases (tombstoning one; if one of them is requested, the aliases are returned in a field/array aliases/duplicate_fetch-hashes/guids).

In both cases, the server changes the fetch-hash/GUID of the episode entry, sets a fetch-hash/guid_changed timestamp and creates a tombstone for the old value. [On GET /episodes, the old value is in fetch-hash/guid and the new value in new_fetch-hash/new_guid, same behaviour as in Subscriptions.]
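On the wire, option 2 could look roughly like the sketch below; new_guid and is_duplicate are the fields proposed above, while the URL, the body shape and the reuse of the base64_id() sketch are assumptions:

import requests

BASE = "https://sync.example.com/v1"   # hypothetical server
podcast_guid = "917393e3-1b1e-5cef-ace4-edaa54e1f810"   # example value
old_ref = base64_id("old-episode-guid-1")   # or fetch_hash(...), depending on the identifier chosen

# Tell the server that the old episode is superseded by (or duplicates) the new one
resp = requests.put(
    f"{BASE}/subscriptions/{podcast_guid}/episodes/{old_ref}",
    json={
        "new_guid": "new-episode-guid-2",
        "is_duplicate": False,   # True if both episodes are still present in the feed
    },
    headers={"Authorization": "Bearer <token>"},
)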

Case to handle:

  1. Client 1 marks {fetch-hash2/guid2} as new guid of {fetch-hash1/guid1}
  2. Client 2 receives & stores this
  3. Client 2 marks {fetch-hash1/guid1} as new guid of {fetch-hash2/guid2}

(This could happen through slightly different podcast feeds, e.g. one feed contains MP3s and the other AACs, but the podcast GUID is the same.)
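One way a server might guard against that ping-pong is to follow the alias chain before accepting a new mapping; this is only a sketch, assuming aliases are kept as a simple old-GUID → new-GUID map:

def would_create_cycle(aliases: dict, old_guid: str, new_guid: str) -> bool:
    # Returns True if storing old_guid -> new_guid would close an alias loop,
    # e.g. guid1 -> guid2 already stored and guid2 -> guid1 now requested.
    seen = {old_guid}
    current = new_guid
    while current in aliases:
        if current in seen:
            return True
        seen.add(current)
        current = aliases[current]
    return current in seen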

Excursus: Database schema in the specs

  • We should focus on the format of the communications, not how the database is stored
  • We have all field data types specified anyway in the API endpoint specification
  • We can leave the proposed database schema as an example
tags: project-management meeting-notes OpenPodcastAPI