content popularity community: performance evaluation #3868
ToDo:
Bumping this issue. The key selling points of Tribler 7.6 are a maturing popularity community (good enough for the coming 2 years) and superior keyword search using relevance ranking. Goal: 100k swarm tracking. This has priority over channel improvements. Our process is to bump each critical feature to a superior design and then move on to the next. A key lesson within distributed systems is: you can't get it perfect the first time (unless you have 20 years of failure experience). Iteration and relentlessly improving deployed code are key. After we close this performance-evaluation issue we can build upon it. We need to know how well it performs and tweak it for 100k swarm tracking. We can then do a first version of real-time relevance ranking. Read our 2010 work for background: Improving P2P keyword search by combining .torrent metadata and user preference in a semantic overlay.
Repeating key research questions from above (@ichorid):
Concrete graphs from a single crawl:
See also #4256 for BEP33 measurements & discussion.
Please check out @grimadas' tool for crawling and analysing Trustchain, and enhance it for the popularity community:
Hopefully we can soon add the health of the ContentPopularity Community to our overall dashboard.
Currently, a peer shares with its connected neighbors the 5 most popular and 5 random torrents it has checked. Since a peer starts sharing them from the beginning, it is not always the case that popular torrents are shared. This results in sharing torrents that don't have enough seeders (see the SEEDERS_ZERO count), which does not contribute much to the sharing of popular torrents. So, two things come to mind that could improve the sharing of popular torrents.
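For reference, a minimal sketch of the current selection policy described above (function and field names are illustrative, not Tribler's actual API):

```python
import random

def select_torrents_to_gossip(checked_torrents, popular_count=5, random_count=5):
    """Pick the 5 most-seeded checked torrents plus 5 random other ones.

    checked_torrents: list of dicts with 'infohash' and 'seeders' keys.
    """
    by_seeders = sorted(checked_torrents, key=lambda t: t["seeders"], reverse=True)
    popular = by_seeders[:popular_count]
    rest = by_seeders[popular_count:]
    return popular + random.sample(rest, min(random_count, len(rest)))
```

Filtering `rest` on `t["seeders"] > 0` before sampling would be one obvious tweak to avoid gossiping the dead torrents behind the SEEDERS_ZERO count.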
https://jenkins-ci.tribler.org/job/Test_tribler_popularity/plot/
Nice work! I assume that this experiment is using the live overlay? As a piece of advice, I would first try to keep the mechanism simple for now, while analyzing the data from the raw network (as you did just now). Extending the mechanism with (arbitrary) rules might lead to biased results, which I learned the hard way when designing the matchmaking mechanism in our decentralized market. Sharing the 5 popular and 5 random torrents might look like a naive sharing policy, but it is a solid starting point to get at least a basic popularity gossip system up and running. Also, we have a DAS5 experiment where popularity scores are gossiped around (which might actually be broken after some channel changes). This might be helpful to test specific changes to the algorithm before deploying them 👍.
@devos50 Yes, it is using the live overlay.
Yes, good point. I'll create experiments to test the specific changes.
Thnx @xoriole! We now have our first deployment measurement infrastructure, impressive.
Can we (@kozlovsky @drew2a @xoriole) come up with a dashboard graph to quantify how far we are from our Key Performance Indicator: the goal of tracking 100k swarms? To kickstart the brainstorm:
As @devos50 indicated, this sort of tuning is best saved for last. You want to have an unbiased view of your raw data for as long as possible. Viewing raw data improves accurate understanding. {Very unscientific: we design this gossip stuff with intuition. If we had 100+ million users, people would be interested in our design principles.} Repeating the long-term key research questions from above (@ichorid):
For every popular torrent, there are a thousand dead ones. Therefore, information about what is alive is much more precious and scarce than information about what is dead. It would be much more efficient to only share torrents that are well seeded. Though, the biggest questions are:
It would be very nice if we could find (or develop) some Python-based Mainline DHT implementation, to precisely control the DHT packet parameters.
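To illustrate the level of control a Python implementation gives, here is a minimal hand-rolled BEP 5 `get_peers` query (a sketch; `router.bittorrent.com` is a public bootstrap node, and response parsing is omitted):

```python
import os
import socket

def bencode(obj) -> bytes:
    # Bencoding per BEP 3: ints, byte strings, lists and dicts.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(i) for i in obj) + b"e"
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(f"cannot bencode {type(obj)}")

def get_peers_query(node_id: bytes, infohash: bytes) -> bytes:
    # Every field of the packet is under our control here: transaction id,
    # node id, even deliberately malformed values for testing.
    return bencode({
        b"t": b"aa",           # transaction id, echoed back by the remote node
        b"y": b"q",            # message type: query
        b"q": b"get_peers",    # query name (BEP 5)
        b"a": {b"id": node_id, b"info_hash": infohash},
    })

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)
sock.sendto(get_peers_query(os.urandom(20), os.urandom(20)),
            ("router.bittorrent.com", 6881))
print(sock.recvfrom(65536))  # raw bencoded response (may time out when offline)
```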
Writing down our objectives here:
Background
Initial documentation of deployed Tribler 7.12 algorithms
Repeating the Popularity community experiment here. Similar to the experiment done in September, we show how the reported (or received) health info and the locally checked health info differ for the 24 popular torrents received via the community. The numbers in the graph are counts, and the graph uses a logarithmic scale for easier comparison, since the variation in the values is large.

A. Based on count
B. Normalized in percentages

Seeders % = (checked seeders / reported seeders) × 100%
Leechers % = (checked leechers / reported leechers) × 100%
Peers % = ((checked seeders + checked leechers) / (reported seeders + reported leechers)) × 100%
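The same normalization as a small helper (a sketch; names are illustrative, and reported counts are assumed to be nonzero):

```python
def health_percentages(checked: dict, reported: dict) -> dict:
    # checked/reported: {'seeders': int, 'leechers': int} for one torrent.
    return {
        "seeders_pct": 100 * checked["seeders"] / reported["seeders"],
        "leechers_pct": 100 * checked["leechers"] / reported["leechers"],
        "peers_pct": 100 * (checked["seeders"] + checked["leechers"])
                     / (reported["seeders"] + reported["leechers"]),
    }

print(health_percentages({"seeders": 2, "leechers": 5},
                         {"seeders": 10, "leechers": 20}))
# {'seeders_pct': 20.0, 'leechers_pct': 25.0, 'peers_pct': 23.33...}
```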
Observations:
I would point out another reason: a lower number of users using Tribler might skew the results or at least give an erratic response. I do not know why, but it seems that the user base has decreased quite a lot. For newer torrents, I get download/upload speeds of around 20 MBps in qBittorrent (without VPN), but in Tribler I hardly cross a maximum of 4 MBps (without hops). Is this because of a low number of users, an inability to connect to peers, or cooperative downloading (of which I have no technical knowledge)?
@absolutep Interesting thought, thx! We need to measure that and compensate for it. @xoriole The final goal of this work is to either write or contribute the technical content to a (technical/scientific) paper, like: https://github.com/Tribler/tribler/files/10186800/LTR_Thesis_v1.1.pdf
Discussed progress. Next sprint: how good are the popularity statistics with the latest 12.1 Tribler (filtered results, compared to ground truth)? DHT self-attack issue to investigate next?
Comparing the results from naked libtorrent with those from Tribler, I found that the popular torrents received via the popularity community, when checked locally, show up as dead torrents, which is likely not the case. This is because of an issue in the torrent checker (DHT session checker). After BEP33 was removed, the earlier way of getting the health response mostly returns zero seeders and zero or a few leechers, which the UI shows as a dead torrent.
Could this bug (#6131) be related to the described issues?
Yes, it is the same bug.
While working on #7286 I've found strange behavior that may shed light on some of the other oddities. Maybe it is the bug that @xoriole describes above. UPDATED 03.02.22 after verification from @kozlovsky. I also found that one automatic check seemingly does not store the received results in the DB. There are three automatic checks: tribler/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py, lines 72 to 75 in 87916f7.
The first: tribler/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py, lines 159 to 163 in 87916f7.
CC: @kozlovsky
Also, I'm posting an algorithm example of getting the seeders' and leechers' counts in case there is more than one source of information available.
Proof: tribler/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py, lines 320 to 324 in 87916f7.
Intuitively, it is not the correct algorithm. Maybe we should use the mean. Something like:

```python
from statistics import mean

DHT_response = {'seeders': 10, 'leechers': 23}
tracker_response = {'seeders': 4, 'leechers': 37}
result = {'seeders': None, 'leechers': None}

for key in result.keys():
    # Use a list, not a set: a set would collapse duplicate values if more
    # than two sources happened to report the same count.
    result[key] = mean([DHT_response[key], tracker_response[key]])

print(result)  # {'seeders': 7, 'leechers': 30}
```

Or we might prioritize the sources. Let's say:
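For instance, a sketch of what source prioritization could look like (the ordering below is my assumption, not a decided policy):

```python
# Hypothetical priority order: trackers give exact scrape counts, the DHT
# gives estimates, and gossiped values are second-hand.
SOURCE_PRIORITY = ['tracker', 'dht', 'popularity_community']

def best_health(responses: dict) -> dict:
    """responses maps a source name to {'seeders': int, 'leechers': int}."""
    for source in SOURCE_PRIORITY:
        if source in responses:
            return responses[source]
    raise ValueError('no health information available')

print(best_health({'dht': {'seeders': 10, 'leechers': 23},
                   'tracker': {'seeders': 4, 'leechers': 37}}))
# {'seeders': 4, 'leechers': 37}  (the tracker wins under this ordering)
```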
I suspect you are right, and this check does not store the received results in the DB.
I think these checks work properly. The function they call is not actually async:

```python
@task
async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
    ...
```

This function appears to be async, but it is effectively synchronous for its callers: the `@task` decorator schedules the coroutine as a fire-and-forget background task, so nothing awaits its result. We should probably replace the `@task` decorator with explicit task handling.
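To illustrate the mechanism, a simplified sketch of a `@task`-style decorator (not the actual implementation used by Tribler):

```python
import asyncio

def task(coro_fn):
    # Calling the decorated coroutine function schedules it on the event
    # loop and returns immediately, so the call site behaves synchronously
    # and never awaits the result.
    def wrapper(*args, **kwargs):
        return asyncio.ensure_future(coro_fn(*args, **kwargs))
    return wrapper

class Checker:
    @task
    async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
        await asyncio.sleep(0.1)  # stand-in for the real DHT/tracker round-trip
        print(f"checked {infohash}")

async def main():
    Checker().check_torrent_health("abc")  # no await needed: returns a Task
    await asyncio.sleep(0.2)               # let the background task finish

asyncio.run(main())
```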
New PR for torrent checking. After the Tribler 7.13 release is done, we can re-measure the popularity community. ToDo @xoriole. The new code deployment will hopefully fix these issues and improve the general health and efficiency of the popularity info for each swarm.
Let's focus on the latest 7.13 release and re-measure. Can we find 1 million unique swarms in 50 days? How many long-tail swarms are we finding? Did we achieve the key goal of tracking 100k swarms for {rough} popularity? If we succeeded, we can move on and focus fully on bug fixing, tooling, metadata enrichment, tagging and semantic search. @egbertbouman Measure for 24 hours the metadata we get from the network at runtime:
Just did a simple experiment with Tribler running idle for 24h in order to get some idea about which mechanisms in Tribler are primarily responsible for getting torrent information. It turns out that the vast majority of torrents are discovered through the lists of random torrents gossiped within the popularity community. When zooming in on the first 10 minutes, we see that a lot of the discovered torrents come in through the channels mechanism. While examining the database after the experiment had completed, it turned out that a little under 80% of all torrents were "free-for-all" torrents, meaning that they did not have a signature associating them with any channel.
Also, if you look at the top channels, they were all updated 2-3 years ago. That means their creators basically dumped a lot of stuff, then forgot about it and lost interest. But their creations continue to dominate the top, which is now a bunch of 🧟 🧟 🧟. "Free-for-all" torrents gossip is the most basic form of collective authoring, and it won out over the channels in real usage (e.g., search and top torrents), as @egbertbouman demonstrated above. This means that without a collective editing feature, Channels are useless and misleading. Overall, I'd say Channels must be removed (especially because of their clumsy "Channel Torrent" engine) and replaced with some more bottom-up system, e.g. collectively edited tags.
😲 😲 😲
Now we understand a big source of performance loss. GigaChannels is way too aggressive in the first seconds and first minute of startup: no rate control or limit on I/O or networking (32 torrents/second). It should be smoother and first rely on …
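A sketch of the kind of rate control that startup phase lacks: a token bucket capping metadata processing at a few torrents per second (parameters and names are illustrative, not Tribler code):

```python
import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.updated = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accumulate.
            await asyncio.sleep((1 - self.tokens) / self.rate)

async def process_torrents(torrents):
    bucket = TokenBucket(rate=4, burst=8)   # e.g. at most ~4 torrents/second
    for t in torrents:
        await bucket.acquire()
        print("processing", t)              # stand-in for DB insert + health check

asyncio.run(process_torrents(range(16)))
```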
Web3Discover: Trustworthy Content Discovery for Web3. @xoriole it would be great to write a scientific arXiv paper on this, beyond the traditional developer role, also contributing to performance evaluation and the scientific core. See the IPFS example reporting, tooling, and general pubsub work, plus "GossipSub: Attack-Resilient Message Propagation in the Filecoin and ETH2.0 Networks": problem description, design, graphs, performance analysis, and a vital role for search plus relevance ranking.
The problem remains relevant to this day. I see two main issues and possible directions for improvement:
For context, the long-term megalomaniac objectives (update Sep 2022):
@arvidn indicated: tracking popularity is known to be a hard problem.
We deployed the first version into Tribler in #3649, after prior Master thesis research in #2783. However, we lack documentation or a specification of the deployed protocol.
Key research questions:
Concrete graphs from a single crawl:
Implementation of `on_torrent_health_response(self, source_address, data)` (a hypothetical sketch follows at the end of this list).
ToDo @xoriole: document the deployed algorithm in 20+ lines (swarm check algorithm, pub/sub, hash selection algorithm, handshakes, search integration, etc.).
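Until that documentation exists, a hypothetical sketch of the handler's general shape (illustrative only; names and storage layout are assumptions, and the deployed algorithm is exactly what remains to be documented):

```python
class PopularityHandler:
    """Illustrative sketch, not the deployed Tribler code."""

    def __init__(self):
        self.health = {}  # infohash -> (seeders, leechers, timestamp)

    def on_torrent_health_response(self, source_address, data):
        infohash, seeders, leechers, timestamp = data
        known = self.health.get(infohash)
        # Keep only the freshest measurement per infohash.
        if known is None or timestamp > known[2]:
            self.health[infohash] = (seeders, leechers, timestamp)
```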