
GigaChannel: BitTorrent is not enough #4677

Closed
ichorid opened this issue Jul 13, 2019 · 9 comments · Fixed by #7726

Comments

@ichorid
Contributor

ichorid commented Jul 13, 2019

Long story short: metadata requires crowdsourcing, which requires a low-latency delivery system, which BitTorrent can't provide. So, in addition to BitTorrent, we need something like a DHT-based pub-sub.

Motivation

Our primary objective is to provide relevant information to humans, and humans should be able to process it efficiently. Human information-processing capacity is very limited: we can't take in more than a few dozen lines of text at once, and no one ever looks beyond the 3rd page of Google. Therefore, we must limit the information that we show to the user. There are three primary ways to do this:

  1. Filter the information on some criteria (text tokens/tags/type, etc.).
  2. Sort it based on some criteria (popularity/date/size, etc.).
  3. Organize it in balanced trees (ontologies), so each level is small enough not to overwhelm the human.

All three ways work together nicely, complementing each other. For example, one can search for the word foo in a local database, sort the results by recency, click on the most promising entry, and then browse the collection holding it to look for similar entries.
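
As a rough illustration of how filtering, sorting, and browsing combine, here is a minimal sketch against a local SQLite store. The table and column names (entries, title, added_on, channel_id) are hypothetical, not Tribler's actual schema:

```python
# A minimal sketch of filter + sort + browse over a local metadata store.
# Schema names are hypothetical and only illustrate the three mechanisms.
import sqlite3

conn = sqlite3.connect("metadata.db")
rows = conn.execute(
    """
    SELECT title, channel_id, added_on
    FROM entries
    WHERE title LIKE ?          -- 1. filter on a text token
    ORDER BY added_on DESC      -- 2. sort by recency
    LIMIT 50                    -- keep the result small enough for a human
    """,
    ("%foo%",),
).fetchall()

# 3. from a promising hit, browse the collection (channel) that holds it
if rows:
    _, channel_id, _ = rows[0]
    siblings = conn.execute(
        "SELECT title FROM entries WHERE channel_id = ? LIMIT 200",
        (channel_id,),
    ).fetchall()
```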

Knowledge creation

People add information to the system. The information comes in the form of independent entries, possibly organized into collections. Every person has their own domain of information, enforced by the public key infrastructure, so no one has (direct) power over others' creations.

A person can either add some original information (e.g., a personal podcast) or copy it from others. Humans can't meaningfully produce more than a dozen original entries per day, so the influx of truly original content per person is minimal. However, when a person copies stuff into their collection and shares it, they effectively produce new information: the act of selection is an information-producing event. We will call this the grouping information.
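
To make the "domain of information" concrete, here is a minimal sketch of a personally signed entry, assuming one Ed25519 keypair per person. The library choice and serialization are illustrative, not Tribler's actual wire format:

```python
# A minimal sketch of a personally signed metadata entry. The keypair,
# JSON serialization, and field names are illustrative assumptions.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

author_key = Ed25519PrivateKey.generate()

entry = {
    "title": "My personal podcast, episode 1",
    "infohash": "0000000000000000000000000000000000000000",  # placeholder
    "timestamp": 1563000000,
}
payload = json.dumps(entry, sort_keys=True).encode()
signature = author_key.sign(payload)

# Anyone holding the author's public key can check that the entry belongs
# to that key's domain; nobody else can alter or forge it.
author_key.public_key().verify(signature, payload)  # raises on tampering
```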

The problem of duplicates

When one can't browse personal channels, there is no grouping information. Therefore, if entry E comes from both peers A and B, it makes no sense to store it twice, and the second copy can be dropped on receipt into the local DB. But as soon as we start to account for any kind of grouping information, cutting entries means losing or distorting that information. Of course, we can store database relationships instead of the duplicate entries themselves. This storage scheme helps with indexing, but still results in the same O(n) linear storage requirements. Thus, we must drop grouping information based on some criteria, lest it overwhelm the system.
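
For illustration, a sketch of the "relationships instead of duplicates" scheme; the schema is hypothetical, and the point is that the link table still grows linearly with the number of (channel, entry) inclusions:

```python
# A minimal sketch of deduplicated storage: each entry is stored once,
# and each channel that includes it adds only a link row. The link table
# still grows as O(n) in the number of inclusions, so grouping
# information is not free. Schema names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entries (
        infohash TEXT PRIMARY KEY,     -- entry E stored exactly once
        title    TEXT
    );
    CREATE TABLE channel_entries (
        channel_pk TEXT,               -- peer A's or B's public key
        infohash   TEXT REFERENCES entries(infohash),
        PRIMARY KEY (channel_pk, infohash)
    );
    """
)
# Peers A and B both share entry E: one entry row, two link rows.
conn.execute("INSERT INTO entries VALUES ('e1', 'Some torrent')")
conn.executemany(
    "INSERT INTO channel_entries VALUES (?, 'e1')", [("peer_A",), ("peer_B",)]
)
```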

Ontology tree balancing

When users can only produce single-level channels, it is impossible to build an ontology that includes more than about 100^2 entries, even if all users cooperate: a human can't look through more than a hundred channels, and can't look through more than a couple hundred entries in a channel. A perfect ontology is a balanced tree (or, even better, a perfect encoding). Thus, users must be able to create multi-level channels, which means even more grouping information.
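
A quick back-of-the-envelope check of that limit, assuming a human can comfortably skim on the order of 100 items per level:

```python
# Browsable ontology size as a function of nesting depth, assuming a
# human-friendly branching factor of roughly 100 items per level.
branching = 100

for depth in range(1, 5):
    print(f"depth {depth}: ~{branching ** depth:,} reachable entries")

# depth 1: ~100          (a flat list of entries)
# depth 2: ~10,000       (single-level channels: the current ceiling)
# depth 3: ~1,000,000    (channels containing collections)
# depth 4: ~100,000,000  (one more level of nesting)
```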

Group selection

When users figure out that their grouping information is dropped when it clashes with others', they will start organizing into communities to coordinate their efforts. This organization will require some instruments for collective authoring and communication.

Scaling

At some point, there will be channels with thousands of collective authors and sub-channels (collections). Each collection will be updated a few times per day, which adds up to multiple updates per second for the root channel as the system grows. This is essentially the same problem faced by the Bitcoin ledger. Therefore, there should be only loose connections between the root channel and the sub-channels.
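
Rough numbers behind that claim, using assumed figures for the sub-channel count and update frequency:

```python
# A rough estimate of the root-channel update rate. The sub-channel count
# and per-channel update frequency below are assumptions, not measurements.
sub_channels = 5000          # assumed
updates_per_day_each = 3     # assumed
seconds_per_day = 24 * 60 * 60

root_updates_per_second = sub_channels * updates_per_day_each / seconds_per_day
print(f"~{root_updates_per_second:.2f} root updates per second")
# ~0.17 with these numbers; it reaches several per second as soon as the
# sub-channel count or update frequency grows by an order of magnitude.
```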

The BitTorrent problem

The power of torrents comes from exploiting "the network effect": one can start sharing the data as soon as one has at least one piece of the torrent. This is made possible by hashing the data as a whole, so anyone can check any piece for correctness and immediately share it. Even with "mutable torrents" support, sharing a single collectively-authored channel means either:
a. seeding thousands of tiny torrents for each sub-channel;
b. downloading the updated version several times per second, while simultaneously serving thousands of previous versions.

As a system, BitTorrent was never intended for this kind of usage: low-latency updates and low-volume transfers.
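
For reference, a minimal sketch of the piece-hashing idea that creates the network effect (v1-style SHA-1 piece hashes; the piece size is illustrative):

```python
# The torrent creator publishes a hash per piece, so a downloader can
# verify and re-share any single piece without having the rest.
import hashlib

PIECE_SIZE = 256 * 1024  # 256 KiB, a typical v1 piece size

def piece_hashes(data: bytes) -> list[bytes]:
    """SHA-1 hash of every fixed-size piece, as published in the metainfo."""
    return [
        hashlib.sha1(data[i : i + PIECE_SIZE]).digest()
        for i in range(0, len(data), PIECE_SIZE)
    ]

def piece_is_valid(piece: bytes, index: int, published: list[bytes]) -> bool:
    """A peer with just this one piece can check it and start seeding it."""
    return hashlib.sha1(piece).digest() == published[index]
```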

The solution

BitTorrent should only be used for bulk transfers of big channels. The channel infohashes should be automatically updated, say, daily. For the online updates, we should employ something like a DHT-based pub-sub system (e.g. PolderCast). Together, the two systems cover all the bases (a rough sketch follows the list below):

  • Pub-sub: low-latency, low throughput;
  • BitTorrent: high-latency, high throughput;
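
A rough sketch of how the split could look in code, with hypothetical pubsub and torrent_builder objects; only the dispatch rule matters here, not any real API:

```python
# A minimal sketch of the proposed split between the two transports.
# The transport objects and method names are hypothetical.
import time

SNAPSHOT_INTERVAL = 24 * 60 * 60  # roll a new channel torrent daily

class ChannelPublisher:
    def __init__(self, pubsub, torrent_builder):
        self.pubsub = pubsub                    # low latency, low throughput
        self.torrent_builder = torrent_builder  # high latency, high throughput
        self.pending = []
        self.last_snapshot = time.time()

    def publish(self, entry: bytes) -> None:
        # Individual updates go out immediately over the pub-sub overlay.
        self.pubsub.broadcast(entry)
        self.pending.append(entry)
        # Once a day, fold everything accumulated into a new bulk torrent
        # and announce the fresh infohash.
        if time.time() - self.last_snapshot >= SNAPSHOT_INTERVAL:
            infohash = self.torrent_builder.create_snapshot(self.pending)
            self.pubsub.broadcast(b"NEW_SNAPSHOT:" + infohash)
            self.pending.clear()
            self.last_snapshot = time.time()
```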
@synctext
Member

Therefore, if entry E comes both from peers A and B, it makes no sense to store it twice, and the second one can be dropped on receiving it into a local DB.

We must use duplicates and simplicity until we have exceeded 1 million users. Please don't future-engineer this stuff before we have actual wasting of terabyte hard disks with Tribler 9.

The above assumption on crowdsourcing needs to be validated in the real world first. Linus is a single person managing thousands of crowdsourcers. Complex systems evolve into unexpected solutions with remarkable efficiency.

@synctext
Member

synctext commented Jul 14, 2019

(As background, I dislike DHTs and pub/sub with religious passion due to their fundamental incentive misalignment.)
Channels should be atomic until that simple design starts to waste way too much cheap disk space. Otherwise we risk repeating the Dispersy mistakes. Dispersy channels 1.0 were also not atomic.

@ichorid
Contributor Author

ichorid commented Jul 14, 2019

One reason why Git became so successful is that Linus designed it with separation of influence in mind, so developers can merge their work into the main tree without interference from others.

I do not insist on using DHT or any specific technology at all. I merely point out that:

  • structured collections of information will always win over unstructured ones;
  • structuring large collections of information requires collective effort;
  • collective effort means simultaneous editing;
  • simultaneous editing means super-frequent updates or splitting stuff into small independent parts;
  • BitTorrent was designed for sharing big indivisible things which are rarely updated.

@ichorid
Contributor Author

ichorid commented Jul 14, 2019

Regarding the database size:

User adoption of a social platform is like a nuclear reactor: there are catalysts that increase reactivity (e.g. useful information provided by the system) and inhibitors that decrease it (e.g. bad UI). To "blow up", the reaction must become self-sustaining, meaning that the rate of catalysis prevails over the inhibition. Sometimes, when the system is on the threshold of becoming self-sustaining, it only needs a small "push". One analogy is how a nuclear bomb works: enriched uranium is (relatively) stable by itself, but if you compress it with a small explosion, it reaches critical density and the fission reaction becomes self-sustaining. Buying ads for a start-up social platform can be seen as this kind of "push".

Applying this to Channels:

  • Information volume provided by Channels system is the catalyst.
  • Database size is the inhibitor (one of many inhibitors).

It is very possible that we will never be able to reach the "critical density" of information because of the inhibition caused by the database size growth rate and its side effects. The "hard" inhibition threshold of database size that will repel 99% of potential users could lie much lower than the said 1TB (say, at a 100GB DB size). However, the self-sustaining level of the catalyst (useful information) could require 1TB databases with our current technology. In that case, no ads, features, or performance tweaks will ever get us to 1TB real-world databases. We would then be stuck in a vicious circle: no 1TB databases - no improvement of database density; no improvement of database density - no userbase big enough to generate content - no 1TB databases.

That would be a pretty sad scenario.

@Dmole
Contributor

Dmole commented Sep 14, 2020

Is low-latency really required though?

Could use one master mutable torrent that only mutates on the creation of a new channel,
with only one mutable torrent per channel,
with per file hashes for each torrent-file in the channel permitting cross channel swarms,
with a cache timestamp in the UI to indicate when the view was last updated.

That way peers only get updates when viewing the channel list and there is a new channel,
or when viewing a channel with new content.

It would be nice to include an index.html in each channel too.
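
A rough sketch of the update-on-view idea above, with a hypothetical resolver standing in for the mutable-torrent (e.g. BEP 46) lookup; the cache timestamp is what the UI would display:

```python
# A minimal sketch of "peers only get updates when viewing the channel".
# The resolver object and its methods are hypothetical placeholders for a
# mutable-torrent lookup and download; only the caching logic is shown.
import time

class ChannelView:
    def __init__(self, resolver):
        self.resolver = resolver   # e.g. a BEP 46 DHT lookup, not shown here
        self.cached = {}           # channel_id -> {entries, version, fetched_at}

    def open_channel(self, channel_id):
        # Only when the user actually views the channel do we check the
        # mutable torrent for a newer version.
        latest_version = self.resolver.latest_version(channel_id)
        cached = self.cached.get(channel_id)
        if cached is None or cached["version"] < latest_version:
            cached = {
                "entries": self.resolver.fetch(channel_id, latest_version),
                "version": latest_version,
                "fetched_at": time.time(),
            }
            self.cached[channel_id] = cached
        return cached   # the UI shows "last updated" from cached["fetched_at"]
```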

@ichorid
Contributor Author

ichorid commented Sep 14, 2020

Well, two years ago we had a discussion with @synctext about using mutable torrents and cross-torrent swarms. For some reason, he dismissed the idea of using anything but the vanilla BitTorrent protocol 🤷

One way or another, we can't use "one swarm to rule them all", for the obvious reason that someone (us) would have to maintain it. That is pure centralization, and it is exactly the thing we are trying to fight by developing Tribler. Also, this will never scale, for the same reason Bitcoin does not scale. The current system of gossiping around subscribed channels is doing well enough to spread popular channels.

Having said all that, it would be very nice to eventually have some channel swarms share common metadata elements, like pictures, etc.

@Dmole, could you please explain further what "per file hashes" means?

@Dmole
Contributor

Dmole commented Sep 14, 2020

...The current system of gossiping around subscribed channels is doing well enough to spread popular channels....

Glad it's working out.

... mutable torrents ... anything but vanilla BitTorrent protocol ...
...centralization...

I understand the general desire for compatibility, but normal v1 torrents can be used inside mutable torrents just for channels. Avoiding centralization would require trusting peers not to poison the list of channels (every client would have the pk)... which may not be a practical issue.
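
A minimal sketch of the "every client would have the pk" point: a channel list is accepted only if it verifies against a public key shipped with the client. The library and wire format here are illustrative:

```python
# A minimal sketch of verifying a signed channel list against a baked-in
# public key. The key, payload format, and helper name are assumptions.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey
from cryptography.exceptions import InvalidSignature

TRUSTED_PK = bytes.fromhex("00" * 32)  # placeholder for the client's baked-in key

def accept_channel_list(payload: bytes, signature: bytes):
    """Return the channel list only if the signature checks out."""
    try:
        Ed25519PublicKey.from_public_bytes(TRUSTED_PK).verify(signature, payload)
    except InvalidSignature:
        return None                    # a poisoned list is simply ignored
    return json.loads(payload)
```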

... @Dmole , could you please further explain what is "per file hashes"?

https://blog.libtorrent.org/2020/09/bittorrent-v2/
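
For context, BitTorrent v2 (BEP 52) hashes every file into its own SHA-256 Merkle root, so identical files produce identical roots in any torrent, which is what enables cross-torrent swarms. A simplified sketch, with the leaf-padding details glossed over:

```python
# A simplified sketch of BitTorrent v2's per-file hashing: each file is
# split into 16 KiB blocks, hashed with SHA-256, and rolled up into a
# Merkle root. Padding of the leaf layer is simplified relative to BEP 52.
import hashlib

BLOCK = 16 * 1024

def file_merkle_root(data: bytes) -> bytes:
    leaves = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for i in range(0, len(data), BLOCK)] or [b"\x00" * 32]
    # pad the leaf layer to a power of two with zero hashes
    while len(leaves) & (len(leaves) - 1):
        leaves.append(b"\x00" * 32)
    # fold pairs of hashes until only the root remains
    while len(leaves) > 1:
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```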

@ichorid
Contributor Author

ichorid commented Sep 14, 2020

https://blog.libtorrent.org/2020/09/bittorrent-v2/

Nice, Arvid even cites Tribler as the source of Merkle tree inspiration! 😄

@ichorid
Contributor Author

ichorid commented Sep 28, 2021

https://github.com/Tribler/tribler/discussions/5721 describes the solution

@ichorid ichorid removed their assignment Feb 17, 2023