
Gossiping torrent popularity to scale to millions of torrents #4256

Closed
xoriole opened this issue Feb 26, 2019 · 14 comments

xoriole commented Feb 26, 2019

A Tribler Giga channel can scale to billions of torrents (#3971) and is shared in the network; however, torrent health (being a dynamic value) is not included in the channel and is not propagated. This gives access to a large number of torrents without any information on whether they are alive or not. So, we need a mechanism to share the health of the torrents in the Giga channel, so that popular content surfaces and users can browse through it.

To determine the popularity of the torrents in the Giga channel and to disseminate the popular torrents in the network, we start with a simple approach based on random selection.

Some key points:

  • Do a DHT health check for a random torrent every 4 minutes(?). There is no need to store the metainfo.
  • Only update a given torrent's health weekly.
  • Gossip daily the top 5 and a random 5 of the torrents that you have checked and found to be alive.

After the first deployment, we will know how well the random approach works, and optimizations can follow.
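A minimal sketch of what this checking schedule could look like (hypothetical names and data layout, not the actual Tribler code):

        import random
        import time

        CHECK_INTERVAL = 4 * 60        # seconds between random DHT health checks
        RECHECK_AFTER = 7 * 24 * 3600  # only re-check a given torrent after a week

        def check_random_torrent(torrent_db, dht_check):
            """torrent_db: {infohash: {"seeders": ..., "leechers": ..., "last_check": ...}}
            dht_check: callable performing the DHT lookup, returns (seeders, leechers)."""
            now = time.time()
            candidates = [ih for ih, t in torrent_db.items()
                          if now - t.get("last_check", 0) > RECHECK_AFTER]
            if not candidates:
                return
            infohash = random.choice(candidates)
            # The DHT lookup only needs the infohash; the metainfo is not stored.
            seeders, leechers = dht_check(infohash)
            torrent_db[infohash].update(seeders=seeders, leechers=leechers, last_check=now)

        def run(torrent_db, dht_check):
            # Check one random (stale) torrent every CHECK_INTERVAL seconds.
            while True:
                check_random_torrent(torrent_db, dht_check)
                time.sleep(CHECK_INTERVAL)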

devos50 commented Feb 26, 2019

First results of a DHT lookup experiment to establish the amount of traffic it takes to do torrent lookups. In this small experiment, I perform multiple lookups of random torrents from an academic dataset using the torrent health endpoint of Tribler (nothing else is running). After every lookup, I query how many bytes the libtorrent DHT has processed (both incoming and outgoing) using the dht.dht_bytes_in and dht.dht_bytes_out fields provided by the libtorrent session. The following graph shows the lookup ID on the horizontal axis and the bytes uploaded/downloaded on the vertical axis, in KB.
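For reference, a minimal sketch of how such counters can be polled through the libtorrent Python bindings (assuming a binding version where the session stats alert exposes the counters as a name-to-value mapping; the exact interface differs per libtorrent version):

        import time
        import libtorrent as lt

        ses = lt.session()

        def dht_bytes(ses):
            # Ask libtorrent to post a session_stats alert and read the DHT byte counters.
            ses.post_session_stats()
            time.sleep(1)  # give libtorrent a moment to post the alert
            for alert in ses.pop_alerts():
                if isinstance(alert, lt.session_stats_alert):
                    # 'values' is assumed here to be a dict of counter name -> value.
                    return alert.values["dht.dht_bytes_in"], alert.values["dht.dht_bytes_out"]
            return None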

[Figure: bytes uploaded/downloaded per DHT lookup, in KB]

As we can see, the average number of bytes per DHT lookup increases over time. In addition, we require around 8.5KB of non-DHT traffic per lookup (as reported by the net.sent_bytes and net.recv_bytes fields).

In total, the session processed around 80MB of incoming DHT traffic and 120MB of outgoing DHT traffic. Full libtorrent DHT statistics after 3638 lookups:

        "dht.dht_allocated_observers": 1023,
        "dht.dht_announce_peer_in": 9413,
        "dht.dht_announce_peer_out": 29168,
        "dht.dht_bytes_in": 81144356,
        "dht.dht_bytes_out": 123248854,
        "dht.dht_find_node_in": 89649,
        "dht.dht_find_node_out": 0,
        "dht.dht_get_in": 0,
        "dht.dht_get_out": 0,
        "dht.dht_get_peers_in": 188270,
        "dht.dht_get_peers_out": 264774,
        "dht.dht_immutable_data": 0,
        "dht.dht_invalid_announce": 415,
        "dht.dht_invalid_get": 0,
        "dht.dht_invalid_get_peers": 0,
        "dht.dht_invalid_put": 0,
        "dht.dht_messages_in": 549747,
        "dht.dht_messages_out": 630264,
        "dht.dht_messages_out_dropped": 106,
        "dht.dht_mutable_data": 0,
        "dht.dht_node_cache": 163,
        "dht.dht_nodes": 391,
        "dht.dht_peers": 267,
        "dht.dht_ping_in": 39504,
        "dht.dht_ping_out": 7798,
        "dht.dht_put_in": 0,
        "dht.dht_put_out": 0,
        "dht.dht_torrents": 182,

Net stats:

        "net.has_incoming_connections": 1,
        "net.limiter_down_bytes": 0,
        "net.limiter_down_queue": 0,
        "net.limiter_up_bytes": 0,
        "net.limiter_up_queue": 0,
        "net.on_accept_counter": 146622,
        "net.on_disk_counter": 2210,
        "net.on_disk_queue_counter": 0,
        "net.on_lsd_counter": 198,
        "net.on_lsd_peer_counter": 0,
        "net.on_read_counter": 728848,
        "net.on_tick_counter": 118829,
        "net.on_udp_counter": 1199577,
        "net.on_write_counter": 168701,
        "net.recv_bytes": 29734253,
        "net.recv_failed_bytes": 0,
        "net.recv_ip_overhead_bytes": 1796640,
        "net.recv_payload_bytes": 0,
        "net.recv_redundant_bytes": 22488,
        "net.recv_tracker_bytes": 0,
        "net.sent_bytes": 1306179,
        "net.sent_ip_overhead_bytes": 3207080,
        "net.sent_payload_bytes": 0,
        "net.sent_tracker_bytes": 0,

CPU usage during these lookups varies between 5% and 10% and is relatively stable.

devos50 commented Mar 5, 2019

Tomorrow I will plot all available statistics regarding DHT traffic and try to figure out why the traffic increases. A possible solution is to reset the DHT session after a specific number of lookups.

devos50 commented Mar 6, 2019

Findings so far:

  • Getting accurate health information from an infohash alone is hard.
  • Our current way to get torrent health data, based on the information in the libtorrent DHT, is as follows: start downloading the torrent with the specific infohash in upload mode. As soon as we have received the metadata from the DHT, we take a snapshot of the current list of peers we are connected to, and remove the download. This only reports (a select number of) peers in the swarm and does not consider seeders. I've tested this manually.
  • The get_peers method seems to be exposed. My next step is to try and use this method to get peers that are actively seeding the torrent. This also avoids joining the torrent swarm.
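A rough sketch of such a lookup through the Python bindings (assuming dht_get_peers and the corresponding reply alert are exposed in the libtorrent version in use; names may differ per version):

        import time
        import libtorrent as lt

        def lookup_peers(ses, infohash_hex, timeout=30):
            # Ask the DHT for peers of this infohash without adding a download.
            ses.dht_get_peers(lt.sha1_hash(bytes.fromhex(infohash_hex)))
            deadline = time.time() + timeout
            peers = []
            while time.time() < deadline:
                ses.wait_for_alert(1000)  # wait up to one second for new alerts
                for alert in ses.pop_alerts():
                    if isinstance(alert, lt.dht_get_peers_reply_alert):
                        # peers() is assumed to return the endpoints from this reply.
                        peers.extend(alert.peers())
                        return peers  # first reply only; more replies may arrive later
            return peers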

devos50 commented Mar 6, 2019

Update: I managed to get the peers from the DHT. Now we should find a way to classify them as seeders or leechers.

qstokkink commented Mar 6, 2019

#109, also #143

devos50 commented Mar 6, 2019

@qstokkink thanks for the links!

devos50 commented Mar 18, 2019

I performed 1300 BEP33 lookups with the modified libtorrent code (see #4321 for details).

Some results:

Messages sent and received

[Figure: DHT messages sent and received]

Here we plot statistics on the messages sent and received in the DHT, as reported by the libtorrent session. Note that there are many outgoing get_peers messages, which is not unexpected since we are doing DHT lookups frequently. We also receive numerous incoming get_peers messages.

Bandwidth usage

[Figure: bandwidth usage during BEP33 lookups]

Note how BEP33 does not require any libtorrent session bandwidth apart from the DHT traffic. Since we are not actually joining the swarm, libtorrent does not have to maintain connections to other peers, only to DHT nodes. Bandwidth usage increases roughly linearly, and there seems to be no additional bandwidth overhead accumulating after each lookup.

After 1300 lookups, each lookup requires on average 13266 bytes (13.0KB) of incoming traffic and 10731 bytes (10.5KB) of outgoing traffic, so 23997 bytes (23.4KB) in total. Most of this traffic is probably related to the bloom filters being sent around: these bloom filters are 256 bytes each, and there are two of them in every incoming get_peers response.
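For illustration, a small sketch of how such a response can be turned into a size estimate, following the formula in BEP33 (decoding of the raw DHT packets is omitted):

        import math

        M_BITS = 2048   # a BEP33 bloom filter is 256 bytes = 2048 bits
        K_HASHES = 2    # BEP33 uses two hash positions per inserted peer

        def combine(filters):
            """Bitwise-OR several 256-byte bloom filters (e.g. all seed filters) into one."""
            combined = bytearray(256)
            for f in filters:
                for i, byte in enumerate(f):
                    combined[i] |= byte
            return bytes(combined)

        def estimate_count(bloom):
            """Estimate the number of peers inserted into a combined BEP33 bloom filter."""
            zero_bits = sum(8 - bin(byte).count("1") for byte in bloom)
            c = max(zero_bits, 1)  # avoid log(0) when the filter is saturated
            return math.log(c / M_BITS) / (K_HASHES * math.log(1 - 1 / M_BITS))

Since BEP33 carries separate filters for seeds ("BFsd") and downloaders ("BFpe"), seeders and leechers can be estimated independently this way.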

DHT statistics

[Figure: DHT statistics during BEP33 lookups]

We now plot several statistics of the DHT, like the number of nodes in our routing table and the torrents tracked. The number of peers tracked remains around 100. The number of nodes in our routing table quickly increases after startup to around 400. We do not track any mutable/immutable data items.

Dead torrents

Of the 1300 torrents looked up, 823 did not have any seeders and are considered dead (61.9%).

Conclusions

I think we should implement BEP33 lookups in Tribler. The tradeoffs are as follows:

Advantages:

  • It is a relatively easy method to estimate the size of torrent swarms, and the results are a reliable approximation of the real size since we can combine multiple bloom filters. I manually verified this by doing lookups for some very healthy torrents (1000+ seeders/leechers), dead torrents (0 seeders) and almost-dead torrents (a few seeders and/or leechers). In all situations, BEP33 gave me reliable results.
  • Results come in quickly, usually within a few seconds of calling dht_get_peers.
  • BEP33 bloom filters are already filled by libtorrent and uTorrent implementations.
  • We do not actually join the swarm when doing a health lookup, and libtorrent does not maintain peer connections to other peers.

Disadvantages:

  • It requires modifications to the libtorrent code. This means that BEP33 lookups are not supported in Debian, at least not until the lookups are supported in the upstream code (which requires a PR). We can, however, ship a modified libtorrent core on Windows and macOS (which account for 80-90% of our user base).
  • We require a dedicated libtorrent session for these lookups, and we have to manually decode every incoming DHT packet and extract the bloom filters from it. This might affect performance.
  • The size of the bloom filter (256 bytes) limits us to counting at most around 6000 seeders/leechers. I do not think this is a real problem here, since the difference between 10000 and 6000 seeders hardly matters to the end user.

@synctext

BEP33 for the win!
Given the scalability and bandwidth efficiency, this is very useful. Finally, we can make progress with relevance ranking through BEP33 popularity.

devos50 commented Mar 29, 2019

With #4434 and Tribler/gumby#409, I consider the second iteration of the PopularityCommunity done. The algorithm is relatively simple: the overlay performs a random walk through the network, and every five seconds you gossip at most 5 random torrents you have checked, plus the top 5 torrents you have checked (based on seeders), to a random connected peer.
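As a rough sketch of that selection step (hypothetical names, not the actual PopularityCommunity code):

        import random

        GOSSIP_INTERVAL = 5.0  # seconds; the caller schedules gossip_step at this interval

        def select_torrents_to_gossip(checked, count=5):
            """checked: list of (infohash, seeders, leechers) tuples we checked ourselves."""
            alive = [t for t in checked if t[1] > 0]
            top = sorted(alive, key=lambda t: t[1], reverse=True)[:count]
            rest = [t for t in alive if t not in top]
            return top + random.sample(rest, min(count, len(rest)))

        def gossip_step(checked, verified_peers, send):
            """send(peer, payload): hypothetical callable that serializes and sends the message."""
            if verified_peers:
                send(random.choice(verified_peers), select_torrents_to_gossip(checked))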

A new validation experiment has been added to verify the correctness of the popularity community. Below, I plot the total number of torrents that have health information (based on the last_check field) over the duration of the experiment. Note that torrent health dissemination only starts after 45 seconds (when the random walk in the popularity community starts). The slow start can be explained by the fact that peers do not yet have health information for at least five torrents to gossip around. The total experiment duration is two minutes.

[Figure: number of torrents with health information over the experiment duration]

I now think we should observe how this mechanism behaves in the wild.

devos50 commented May 13, 2019

I consider this issue appropriately addressed for the 7.3 release. As said before, we should observe how it works in the wild. Therefore, I am moving this issue to 7.4 so we can revisit and improve our torrent selection algorithm if necessary.

devos50 commented Jul 9, 2020

Since this issue is leaning more towards a full research project, I'm unassigning myself. My recommendations for the next steps:

  • Further testing and better integration of BEP33 across all platforms.
  • Set up a crawler to get some performance numbers from the deployed network, and iterate on these results.

Highly related to #3868

@synctext

I would like to make this feature a key issue of an upcoming release, for example 7.6 "Bug Fix" and then 7.7 "search & popularity".

@qstokkink

The functionality described in the OP has been implemented for several years now. Closing.
