Support for remote syncing with Dropbox #6

Open
clintharris opened this issue Apr 6, 2021 · 0 comments
Labels
enhancement New feature or request

Comments


clintharris commented Apr 6, 2021

Add a new plugin that allows oplog entry data to be synced with a user's Dropbox account.

Summary:

  • The Dropbox API endpoints for searching files are very limited (especially compared to those of Google Drive).
  • Would probably need to use a custom, manually-maintained index file to be able to find and order oplog entries.

Dropbox API

Unfortunately the Dropbox API is pretty limited in terms of search and retrieval of files.

  • POST files/search
    • path specifies path to folder to search
    • query can be used for very primitive filename matching: bat c matches "bat cave" but not "batman car"
    • mode: 'filename' to limit search to file names and not content
    • no option for sorting/ordering results
  • POST files/search_v2: it's primitive
    • query: string to search for (file name or contents). no boolean logic, regex patterns, etc.
    • options.path: /Folder
    • options.filename_only: true: to limit query search to file names
    • No way to sort by HLC timestamp; only possible to order by relevance or last_modified_time (which cannot be specified by the client).
  • POST /properties/search allows very basic/limited searching of files by metadata props.
    • Doesn't seem like there's any advantage to this over the files/search endpoints, given that it's the same basic string matching and doesn't offer any way to sort the results.
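
For illustration, here's a minimal sketch of what a files/search_v2 call might look like from a browser-based plugin, using fetch. The request body shape follows the endpoint parameters listed above; the access-token handling and the /oplog folder path are assumptions, not part of any existing plugin:

```ts
// Sketch: search for oplog entry files by filename via files/search_v2.
// Assumes `accessToken` was obtained via Dropbox OAuth, and that entry
// files live in a hypothetical "/oplog" folder.
async function searchOplogFiles(accessToken: string, query: string) {
  const res = await fetch('https://api.dropboxapi.com/2/files/search_v2', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${accessToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      query, // primitive string matching only--no regex, no boolean logic
      options: {
        path: '/oplog',
        filename_only: true, // match file names, not contents
      },
    }),
  });
  if (!res.ok) throw new Error(`search_v2 failed: ${res.status}`);
  const { matches } = await res.json();
  // Note: Dropbox can't sort these by HLC timestamp; any ordering has to
  // happen client-side after the fact.
  return matches;
}
```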

Custom Index File

The general idea: if the remote storage service doesn't provide adequate methods for searching oplog entry files and ordering the results by HLC timestamp, then the plugin itself maintains a separate data structure. The plugin can download that data structure and use it to figure out which oplog entry files to fetch. A solution like this would likely be slow (e.g., possibly downloading only one oplog entry file at a time) and involve a lot of work. It should only be pursued to support a remote storage service that app users actually want as a home for their data (e.g., someone really doesn't want to keep their data in Google Drive and prefers Dropbox, for whatever reason). Using an HTTP-based email API would likely be easier if that API supports more advanced searching and ordering by custom times.

  • each client updates and uploads an "index" file: an ordered list of all the HLC timestamps (i.e., filenames) for oplog entries it has created
  • these files should all have a standard extension so they can be easily discovered via filename pattern searching (e.g., {client ID}.index.txt)
  • each line in the file has the HLC timestamp for an entry created by that client and whatever additional data is needed to retrieve the corresponding file for that oplog entry. For example:
    • 2021-04-05T12:29:02.790Z 0000: this is enough info to download the file from Dropbox by filename.
    • 2021-04-05T12:29:02.790Z 0000 | { dropboxFileId: 'a4ayc_80_OEAAAAAAAAAY' }: this basically maps the HLC timestamp to some structured metadata--in this case, the Dropbox file ID.
  • the file lines are sorted by HLC time
  • this file should be uploaded to the server on each sync (overwriting the existing file if one exists)
  • other clients can download this (pre-sorted) list, jump to a specific HLC time (e.g., via binary search or by iterating lines), and then walk all following lines to get the info about which files need to be requested, one by one (see the sketch below)
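
A rough sketch of how a client might parse a downloaded index file and "jump" to a given HLC time. The line format follows the proposal above; the interface and helper names are hypothetical:

```ts
// Sketch: parse a downloaded `{client ID}.index.txt` file and find all
// oplog entries created after a given HLC timestamp. Assumes one entry per
// line, sorted ascending, in the "timestamp | JSON metadata" format above.
interface IndexEntry {
  hlcTimestamp: string; // e.g., "2021-04-05T12:29:02.790Z 0000"
  meta?: { dropboxFileId?: string };
}

function parseIndex(text: string): IndexEntry[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => {
      const [hlcTimestamp, json] = line.split(' | ');
      return { hlcTimestamp, meta: json ? JSON.parse(json) : undefined };
    });
}

// Because lines are pre-sorted and the timestamp format is fixed-width,
// plain string comparison orders entries correctly, and binary search can
// find the first entry after `sinceTime` without scanning the whole file.
function entriesAfter(entries: IndexEntry[], sinceTime: string): IndexEntry[] {
  let lo = 0;
  let hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (entries[mid].hlcTimestamp <= sinceTime) lo = mid + 1;
    else hi = mid;
  }
  return entries.slice(lo);
}
```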

Custom index downsides, ideas for improvement

Using a manually-maintained oplog index like this has some real downsides and should be avoided if the remote storage service offers a way to filter and order files. Index files could get BIG (e.g., adding 1M lines of timestamps to a text file resulted in a 45 MB file). Some thoughts on ways to reduce the data transfer and/or speed up "search by time" operations:

  • compression could help; this could be left to the server and browser (assuming server compression is enabled), or maybe done in the browser (JSZip compressed a realistic 45 MB text file to 134 KB in 1.2 sec)
  • partition the indices into separate files using a filename format that allows the client to list the files and decide which part(s) of the index to download--for example, by putting timestamp ranges in index filenames: {nodeId}.{firstTimeStamp__lastTimeStamp}.index.txt (see the sketch after this list)
  • storing the index as some data structure other than "one timestamp per line" could make the "get all entries after time X" operation faster
    • a tree where the paths to leaves are an encoded version of the times (similar to the merkle trees in crdt-example-app) might be an option
    • maybe store the file as a serialized, compressed SQLite database? this seems like it would have all kinds of problems and would require the client to do something like use a WASM module to run sqlite in the browser...
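
And a sketch of the partitioning idea from the list above: given a listing of part filenames, the client downloads only the parts that could contain entries newer than some HLC time. The {nodeId}.{firstTimeStamp__lastTimeStamp}.index.txt scheme is just the proposal above, not an implemented format:

```ts
// Sketch: given a directory listing of partitioned index files named like
// "nodeA.2021-04-01T00:00:00.000Z 0000__2021-04-05T23:59:59.999Z 0000.index.txt",
// decide which parts could contain entries after `sinceTime`. The naming
// scheme is hypothetical (the proposal above), not an implemented format.
function indexPartsToDownload(filenames: string[], sinceTime: string): string[] {
  return filenames.filter((name) => {
    const match = name.match(/^(.+?)\.(.+?)__(.+?)\.index\.txt$/);
    if (!match) return false; // not an index part file
    const lastTimestampInPart = match[3];
    // Only fetch parts whose newest entry is after the time of interest;
    // older parts can't contain anything this client is missing.
    return lastTimestampInPart > sinceTime;
  });
}
```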
@clintharris clintharris added the enhancement New feature or request label Apr 6, 2021