Bounty: Public database syncing system ($150) #373

Closed
ajayyy opened this issue Sep 29, 2021 · 18 comments

Comments

@ajayyy
Owner

ajayyy commented Sep 29, 2021

There should be a system that allows anyone to have a near-realtime (5 minutes optimally, 20 minutes at worst) mirror of at least the sponsorTimes table.

This was discussed in Discord(message) / Matrix(message).

Possible options:

  • Use a system similar to MusicBrainz where a Postgres replication feed gets mirrored in a read-only way for other users to download. There should then be a way for the server to automatically read from this replication feed and import it into their local database. This would have to be done in a secure way that only replicates the public tables.

  • Reuse the existing Webhook implementation and create a receiving end for it that can be configured to save this info to the database.

  • Host the CSVs via IPFS and have a wrapper to automatically retrieve the latest versions using IPFS. This should work since IPFS stores files as chunks and will know which chunks are new and which are old. It also means that when someone first downloads the entire db, everyone's mirror acts as a seeder, even if it is on a slightly outdated version of the file (since it still has the unchanged chunks). A rough sketch follows this list.
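
A rough sketch of the IPFS option, assuming the standard IPFS CLI; the CID is a placeholder:

```
# publish a new dump; IPFS chunks and deduplicates it, so unchanged chunks keep their hashes
ipfs add --cid-version=1 sponsorTimes.csv          # prints the CID of the new version
# a mirror fetches the latest published CID; chunks it already has are not re-downloaded
ipfs get <latest-cid> -o sponsorTimes.csv
```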


The solution should be easy to run when starting up the SponsorBlock server. Docker would probably be needed to ease deployment if it requires external dependencies beyond just Node.


Bounty has been solved. The bounty of $150 USD via PayPal goes to whoever solves this issue. To avoid duplicate work, make sure to comment on this issue with what you are doing.

@ajayyy added the "help wanted" label Sep 29, 2021
@ajayyy pinned this issue Sep 29, 2021
@mchangrh
Contributor

I'm not sure how useful IPFS chunking would be, as a large amount of the CSV is modified in day-to-day operations, most notably votes. You can try this yourself by chunking and deduping different days from my historical archive:

http://sb-archive.mchang.xyz/mirror/

@ajayyy
Owner Author

ajayyy commented Sep 29, 2021

The Postgres approach is probably the best, but it does restrict all mirrors to using Postgres instead of SQLite.

@mchangrh
Contributor

braindump while I finish my readings

https://gist.github.com/mchangrh/d8e2837b372756c867429eb20e946798

@mchangrh
Contributor

mchangrh commented Sep 30, 2021

I think I have a very basic implementation that would only require

  • rsync
  • cron

and high-bandwidth rsync+nginx server(s) for hosting, but that's fairly obvious + I am willing to provide them (offtopic)

https://gist.github.com/mchangrh/3c3a9e870dfc1b8aee574c791dab83ad

tl;dr

master (Ajay)

  • base_backup every 12 hours
  • ship WALs to storage once done (can also ship via rclone to "cloud storage")
  • prod (notify) storage w/ ssh / webhook / anything

storage (single/ geodistributed) / T1 Mirror

  • retain full history OR last base + WAL
  • rsync, nginx
  • (on prod) - generate archive (tar with base + all WAL)
  • (on prod) - generate .db for sqlite3 (wal replays are too cumbersome on stock sqlite)
  • (on prod) - prod mirrors

active mirror

  • (on prod) | cron - get latest WAL(s)
  • any new WALs are automatically replayed

mirror from scratch / back from the dead

  • docker-compose pull etc...
  • get latest archive OR base + WAL
  • restore base, start in recovery to stream in WALs
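
A minimal sketch of the Postgres pieces above, assuming PostgreSQL 12+; hostnames, paths, and the rsync module name are placeholders, and this replicates the whole cluster, so restricting the feed to the public tables still needs separate handling:

```
# master: base_backup every 12 hours (run from cron), tar format, compressed
pg_basebackup -U replicator -D /backups/base_$(date +%Y%m%d%H) -Ft -z

# master postgresql.conf: ship each completed WAL segment to the storage box
#   archive_mode = on
#   archive_command = 'rsync -a %p storage.example.org::sb-wal/%f'

# mirror from scratch: restore the latest base backup, then let Postgres replay shipped WALs
tar -xf base.tar.gz -C /var/lib/postgresql/data
touch /var/lib/postgresql/data/standby.signal
# mirror postgresql.conf: restore_command = 'rsync storage.example.org::sb-wal/%f %p'
```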

@ajayyy
Owner Author

ajayyy commented Sep 30, 2021

high bandwidth RSync+nginx server(s)

Is there a reason this would require more bandwidth than now? Kind of the whole point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much.

@mchangrh
Contributor

Is there a reason this would require more bandwidth than now? Kind of the whole point of this is to reduce the bandwidth used by many people downloading CSVs. I think the 16 MB WALs should keep it from taking too much.

The point would be to offload bandwidth from the main server onto these mirrors. They would probably be serving exports as well as these base dumps and WALs, but if they are slower than the main server or just generally bad, clients will probably just fall back to the main server.

@ajayyy
Owner Author

ajayyy commented Sep 30, 2021

I guess that part really depends on the amount of uptake beyond services like sb.ltn.fi, but that's a good point. Separation of concerns is nice.

@mchangrh
Contributor

the main advantage is that the tier 1 mirrors can be dumb

small amount of storage (1-3 GB) with minimal CPU usage and some RAM, since all they run are

  • nginx server
  • rsync server
  • nginx webhook listener / sshd

if the primary storage server isn't generating the archive, then tar (which is minimal) and .db generation can happen on a Docker runner
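
To show how dumb a tier-1 mirror can be, here is a sketch of the rsync side (module name and path are placeholders); nginx just serves the same directory over HTTP:

```
# /etc/rsyncd.conf on a tier-1 mirror: export the dump directory read-only
[sponsorblock]
    path = /srv/sb-mirror
    read only = yes

# then start the daemon and point nginx at the same directory:
#   rsync --daemon
#   location / { root /srv/sb-mirror; autoindex on; }
```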

@stranger-danger-zamu

stranger-danger-zamu commented Oct 5, 2021

I found the addon today and it's great!

You could also abuse HTTP range requests and host an index to the CSV lines based on time. The program asks for the next CSV slice based on its newest ID, and the server sends back the correct range (and possibly a URL). Then the program downloads just the new lines.

The beauty of this approach is that any plain old nginx instance could do this without any major changes. Additionally, since the server is acting as an index, we can point to mirrors and round-robin the load.
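
A rough sketch of the client side, assuming a hypothetical index endpoint that maps "rows newer than ID X" to a byte offset; the static CSV itself can be served with Range support by plain nginx:

```
# ask the (hypothetical) index endpoint where the new rows start
OFFSET=$(curl -s "https://mirror.example.org/index?after=LAST_KNOWN_ID")
# fetch only the new lines from the static CSV with an HTTP Range request
curl -s -H "Range: bytes=${OFFSET}-" "https://mirror.example.org/sponsorTimes.csv" >> sponsorTimes.csv
```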


Real talk though. Compress those CSV files. Even gzip is good enough:

file                 original size  gzip'd size  xz'd size
categoryVotes.csv    3.4M           1.2M         996K
lockCategories.csv   3.0M           292K         216K
sponsorTimes.csv     635M           272M         208M
unlistedVideos.csv   5.5M           1.8M         1.2M
userNames.csv        5.1M           2.6M         2.4M
vipUsers.csv         4.0K           4.0K         4.0K
warnings.csv         84K            32K          28K

gzip and xz are both installed on most systems, and anyone downloading these files from the website should know how to deal with them.

If they are being used programmatically for updating, all the popular languages (and the shell) have ways of opening and reading the compressed file directly.
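
As a quick illustration (the exact compressed filenames are assumptions, mirroring the table above):

```
# stream a compressed dump directly, without writing the uncompressed file to disk
zcat sponsorTimes.csv.gz | head -n 5       # peek at the header
xz -dc sponsorTimes.csv.xz | wc -l         # count rows straight out of the xz archive
```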

@mchangrh
Contributor

mchangrh commented Oct 5, 2021

Would this work for modified lines? A lot of the delta is not just new segments being added, but votes and views changing on existing segments too.

@stranger-danger-zamu

stranger-danger-zamu commented Oct 6, 2021

would this work for modified lines? A lot of the delta is not just new segments being added but votes and views on existing segments too

I don't see why you couldn't have the update logic simply overwrite the record when it encounters the same ID (essentially just upsert each record based on its ID).

While duplication is an issue, you can also provide a periodic base file, and if someone is too far behind they just redownload the entire base file. These base files would just be snapshots of the database dumped periodically, folding the updated fields back in.

It definitely keeps things simple, which I think is the biggest benefit.
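
A minimal sketch of that upsert, assuming Postgres, a unique index on the UUID column, and database/column names taken from the public CSV (the live schema may differ):

```
# load the downloaded delta into a temp table, then upsert by UUID so edited rows
# (votes, views) overwrite the old values and genuinely new rows get inserted
psql sponsorblock <<'SQL'
CREATE TEMP TABLE delta (LIKE "sponsorTimes");
\copy delta FROM 'delta.csv' WITH (FORMAT csv, HEADER true)
INSERT INTO "sponsorTimes" SELECT * FROM delta
ON CONFLICT ("UUID") DO UPDATE
  SET "votes" = EXCLUDED."votes",
      "views" = EXCLUDED."views";
SQL
```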

@mchangrh
Contributor

1/2 done - distribution and downloading of files

gist explaining flags and why lz4 was chosen (w/ benchmarks)
https://gist.github.com/mchangrh/3d4a967732f8994cfcfdc05a8e22dc4f

docker container for mirroring csv files
https://github.com/mchangrh/sb-mirror

aports merge for lz4 https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if closed, will probably use debian as a base instead or build ourselves)

@whizzzkid
Contributor

It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun

@mchangrh
Contributor

It might not be a bad idea to migrate to a decentralized database like https://github.com/amark/gun

Very cool, but I don't know if it would fit SponsorBlock, since there would need to be an absolute point of truth; having a publicly writeable database without the application layer would not be ideal.

@mchangrh
Contributor

1/2 done - distribution and downloading of files

gist explaining flags and why lz4 was chosen (w/ benchmarks) https://gist.github.com/mchangrh/3d4a967732f8994cfcfdc05a8e22dc4f

docker container for mirroring csv files https://github.com/mchangrh/sb-mirror

aports merge for lz4 https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26616 (if closed, will probably use debian as a base instead or build ourselves)

2/2 done - with mode=mirror in config, almost all of the work is done

only things missing would be

  • MSSQL support with BULKIMPORT
  • possibly finagling SQLite to not do version updates
  • (improbable but doable - will probably do for fun tomorrow) - runner to start up node server automatically

If everything else is okay, then I'll do a quick change from Alpine to Debian for rsync's lz4, pending the aports merge.

@whizzzkid
Contributor

@mchangrh I think that's a misconception: there is no "single source of truth", which means there are multiple sources of truth, as each node just stores the graph that concerns it. The entire database is the sum of all nodes. This does away with the need for distributing and syncing databases; it's handled automatically by the Gun protocol.

since there would need to be an absolute point of truth, having a publicly writeable database without the application layer would not be ideal.

Not really; by that logic cryptocurrencies could never exist. We can reach out to the Gun team on Gitter to see how they can help this project.

@mchangrh
Contributor

mchangrh commented Oct 22, 2021

Not really, by that logic crypto currencies could never exist.

Yes, but SponsorBlock isn't a cryptocurrency. A cryptocurrency doesn't need a single source of truth; truth is established by verifying with other peers. Quite a while ago, when the BTC market was booming, there were concerns because a single pool mined 3 consecutive blocks. If a malicious actor wanted, they could spin up a majority of Gun peers and override most of the database. It's cool tech, but it doesn't fit with the master/slave model and single source of truth that SponsorBlock requires.

This is required by SponsorBlock since the vipUsers table should only be modifiable by the admin; a bad actor could change the vipUsers table to remove or add locked segments, or ban users.

It's very cool, but even if we used Gun for only sponsorTimes, there would still be the issue of distributing the other tables.

@ajayyy removed the "help wanted" label Oct 27, 2021
@ajayyy closed this as completed in 19f7bbc Oct 28, 2021
@ajayyy unpinned this issue Oct 28, 2021
@ajayyy
Owner Author

ajayyy commented Oct 28, 2021

This is now complete! Check out @mchangrh's project to see how to host a mirror yourself: https://github.com/mchangrh/sb-mirror
