content popularity community: performance evaluation #3868
ToDo:
Bumping this issue. The key selling points of Tribler 7.6 are a maturing popularity community (good enough for the coming 2 years) and superior keyword search using relevance ranking. Goal: 100k swarm tracking. This has priority over channel improvements. Our process is to bump each critical feature to a superior design and then move on to the next. A key lesson within distributed systems is: you can't get it perfect the first time (unless you have 20 years of failure experience). Iteration and relentlessly improving deployed code are key. After we close this performance-evaluation issue we can build upon it. We need to know how well it performs and tweak it for 100k swarm tracking. We can then do a first version of real-time relevance ranking. Read our 2010 work for background: Improving P2P keyword search by combining .torrent metadata and user preference in a semantic overlay.
Repeating key research questions from above (@ichorid):
Concrete graphs from a single crawl:
See also #4256 for BEP33 measurements & discussion.
Please check out @grimadas' tool for crawling and analysing Trustchain, and enhance it for the popularity community:
Hopefully we can soon add the health of the ContentPopularity Community to our overall dashboard.
Currently, a peer shares with its connected neighbors the 5 most popular and 5 random torrents it has checked. Since a peer starts sharing them from the beginning, it is not always the case that popular torrents are shared. This results in sharing torrents that don't have enough seeders (see the SEEDERS_ZERO count), which does not contribute much to the sharing of popular torrents. So, two things come to mind that could improve the sharing of popular torrents.
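For reference, a minimal sketch of the current selection policy described above (function and field names are illustrative, not Tribler's actual API):

```python
import random

def select_torrents_to_gossip(checked_torrents, popular_count=5, random_count=5):
    """Pick the 5 most-seeded checked torrents plus 5 random other ones.

    checked_torrents: list of dicts with 'infohash' and 'seeders' keys.
    """
    by_seeders = sorted(checked_torrents, key=lambda t: t["seeders"], reverse=True)
    popular = by_seeders[:popular_count]
    rest = by_seeders[popular_count:]
    return popular + random.sample(rest, min(random_count, len(rest)))
```

Filtering `rest` on `t["seeders"] > 0` before sampling would be one obvious tweak to avoid gossiping the dead torrents behind the SEEDERS_ZERO count.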
https://jenkins-ci.tribler.org/job/Test_tribler_popularity/plot/
Nice work! I assume that this experiment is using the live overlay? As a piece of advice, I would first try to keep the mechanism simple for now, while analyzing the data from the raw network (as you did just now). Extending the mechanism with (arbitrary) rules might lead to biased results, which I learned the hard way when designing the matchmaking mechanism in our decentralized market. Sharing the 5 popular and 5 random torrents might look like a naive sharing policy, but it is a solid starting point to get at least a basic popularity gossip system up and running. Also, we have a DAS5 experiment where popularity scores are gossiped around (which might actually be broken after some channel changes). This might be helpful to test specific changes to the algorithm before deploying them 👍.
@devos50 Yes, it is using the live overlay.
Yes, good point. I'll create experiments to test the specific changes.
Thnx @xoriole! We now have our first deployment measurement infrastructure, impressive.
Can we (@kozlovsky @drew2a @xoriole) come up with a dashboard graph to quantify how far we are from our Key Performance Indicator: the goal of tracking 100k swarms? To kickstart the brainstorm:
As @devos50 indicated, this sort of tuning is best saved for last. You want to have an unbiased view of your raw data for as long as possible. Viewing raw data improves accurate understanding. {Very unscientific: we design this gossip stuff with intuition. If we had 100+ million users, people would be interested in our design principles.} Repeating the long-term key research questions from above (@ichorid):
For every popular torrent, there are a thousand dead ones. Therefore, information about what is alive is much more precious and scarce than information about what is dead. It would be much more efficient to only share torrents that are well seeded. Though, the biggest questions are:
It would be very nice if we could find (or develop) some Python-based Mainline DHT implementation, to precisely control the DHT packet parameters.
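To illustrate the level of control a Python implementation gives, here is a minimal hand-rolled BEP 5 `get_peers` query (a sketch; `router.bittorrent.com` is a public bootstrap node, and response parsing is omitted):

```python
import os
import socket

def bencode(obj) -> bytes:
    # Bencoding per BEP 3: ints, byte strings, lists and dicts.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(i) for i in obj) + b"e"
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(f"cannot bencode {type(obj)}")

def get_peers_query(node_id: bytes, infohash: bytes) -> bytes:
    # Every field of the packet is under our control here: transaction id,
    # node id, even deliberately malformed values for testing.
    return bencode({
        b"t": b"aa",           # transaction id, echoed back by the remote node
        b"y": b"q",            # message type: query
        b"q": b"get_peers",    # query name (BEP 5)
        b"a": {b"id": node_id, b"info_hash": infohash},
    })

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)
sock.sendto(get_peers_query(os.urandom(20), os.urandom(20)),
            ("router.bittorrent.com", 6881))
print(sock.recvfrom(65536))  # raw bencoded response (may time out when offline)
```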
Writing down our objectives here:
Background
Initial documentation of deployed Tribler 7.12 algorithms
Repeating the Popularity community experiment here. Similar to the experiment done in September, we show how the reported (or received) health info and the locally checked health info differ for the 24 popular torrents received via the community. The numbers in the graph are counts, and the graph uses a logarithmic scale for easier comparison, since the variation in the values is large.

A. Based on count
B. Normalized in percentages

Seeders % = (checked seeders / reported seeders) × 100%
Leechers % = (checked leechers / reported leechers) × 100%
Peers % = ((checked seeders + checked leechers) / (reported seeders + reported leechers)) × 100%
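The same normalization as a small helper (a sketch; names are illustrative, and reported counts are assumed to be nonzero):

```python
def health_percentages(checked: dict, reported: dict) -> dict:
    # checked/reported: {'seeders': int, 'leechers': int} for one torrent.
    return {
        "seeders_pct": 100 * checked["seeders"] / reported["seeders"],
        "leechers_pct": 100 * checked["leechers"] / reported["leechers"],
        "peers_pct": 100 * (checked["seeders"] + checked["leechers"])
                     / (reported["seeders"] + reported["leechers"]),
    }

print(health_percentages({"seeders": 2, "leechers": 5},
                         {"seeders": 10, "leechers": 20}))
# {'seeders_pct': 20.0, 'leechers_pct': 25.0, 'peers_pct': 23.33...}
```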
Observations:
I would point out another reason: a lower number of users using Tribler might skew the results or at least give an erratic response. I do not know why, but it seems that the user base has decreased quite a lot. For newer torrents, I get download/upload speeds of around 20 MBps in qBittorrent (without VPN), but in Tribler I hardly cross a maximum of 4 MBps (without hops). Is this because of a low number of users, an inability to connect to peers, or cooperative downloading (of which I have no technical knowledge)?
@absolutep Interesting thought, thx! We need to measure that and compensate for it. @xoriole The final goal of this work is to either write or contribute the technical content to a (technical/scientific) paper, like: https://github.com/Tribler/tribler/files/10186800/LTR_Thesis_v1.1.pdf
Discussed progress. Next sprint: how good are the popularity statistics with the latest 12.1 Tribler (filtered results, compared to ground truth)? DHT self-attack issue to investigate next?
Comparing the results from naked libtorrent with those from Tribler, I found that the popular torrents received via the popularity community, when checked locally, show up as dead torrents, which is likely not the case. This is because of an issue in the torrent checker (DHT session checker). After BEP33 was removed, the earlier way of getting the health response mostly returns zero seeders and zero or a few leechers, which the UI shows as a dead torrent.
Could this bug (#6131) be related to the described issues?
Yes, it is the same bug.
While working on #7286 I've found strange behavior that may shed light on some of the other oddities. Maybe it is the bug that @xoriole describes above. UPDATED 03.02.22 after verification from @kozlovsky. I also found that one automatic check seemingly does not store the received results in the DB. There are three automatic checks: tribler/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py, lines 72 to 75 in 87916f7.
The first: tribler/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py, lines 159 to 163 in 87916f7.
CC: @kozlovsky
Also, I'm posting an algorithm example of getting the seeders' and leechers' counts in case there is more than one source of information available.
Proof: tribler/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py, lines 320 to 324 in 87916f7.
Intuitively, it is not the correct algorithm. Maybe we should use the mean. Something like:

```python
from statistics import mean

DHT_response = {'seeders': 10, 'leechers': 23}
tracker_response = {'seeders': 4, 'leechers': 37}
result = {'seeders': None, 'leechers': None}

for key in result.keys():
    # Use a list, not a set: a set would collapse duplicate values if more
    # than two sources happened to report the same count.
    result[key] = mean([DHT_response[key], tracker_response[key]])

print(result)  # {'seeders': 7, 'leechers': 30}
```

Or we might prioritize the sources. Let's say:
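For instance, a sketch of what source prioritization could look like (the ordering below is my assumption, not a decided policy):

```python
# Hypothetical priority order: trackers give exact scrape counts, the DHT
# gives estimates, and gossiped values are second-hand.
SOURCE_PRIORITY = ['tracker', 'dht', 'popularity_community']

def best_health(responses: dict) -> dict:
    """responses maps a source name to {'seeders': int, 'leechers': int}."""
    for source in SOURCE_PRIORITY:
        if source in responses:
            return responses[source]
    raise ValueError('no health information available')

print(best_health({'dht': {'seeders': 10, 'leechers': 23},
                   'tracker': {'seeders': 4, 'leechers': 37}}))
# {'seeders': 4, 'leechers': 37}  (the tracker wins under this ordering)
```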
I suspect you are right, and this check does not store the received results in the DB.
I think these checks work properly. The function they call is not actually async:

```python
@task
async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
    ...
```

This function appears to be async, but it is effectively synchronous for its callers: the `@task` decorator schedules the coroutine as a fire-and-forget background task, so nothing awaits its result. We should probably replace the `@task` decorator with explicit task handling.
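To illustrate the mechanism, a simplified sketch of a `@task`-style decorator (not the actual implementation used by Tribler):

```python
import asyncio

def task(coro_fn):
    # Calling the decorated coroutine function schedules it on the event
    # loop and returns immediately, so the call site behaves synchronously
    # and never awaits the result.
    def wrapper(*args, **kwargs):
        return asyncio.ensure_future(coro_fn(*args, **kwargs))
    return wrapper

class Checker:
    @task
    async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
        await asyncio.sleep(0.1)  # stand-in for the real DHT/tracker round-trip
        print(f"checked {infohash}")

async def main():
    Checker().check_torrent_health("abc")  # no await needed: returns a Task
    await asyncio.sleep(0.2)               # let the background task finish

asyncio.run(main())
```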
New PR for torrent checking. After the Tribler 7.13 release is done, we can re-measure the popularity community. ToDo @xoriole. The new code deployment will hopefully fix these issues and improve the general health and efficiency of the popularity info for each swarm.
Let's focus on the latest 7.13 release and re-measure. Can we find 1 million unique swarms in 50 days? How many long-tail swarms are we finding? Did we achieve the key goal of tracking 100k swarms for {rough} popularity? If we succeeded, we can move on and focus fully on bug fixing, tooling, metadata enrichment, tagging and semantic search. @egbertbouman Measure for 24 hours the metadata we get from the network at runtime:
Just did a simple experiment with Tribler running idle for 24h in order to get some idea about which mechanisms in Tribler are primarily responsible for getting torrent information. It turns out that the vast majority of torrents are discovered through the lists of random torrents gossiped within the popularity community. When zooming in on the first 10 minutes, we see that a lot of the discovered torrents come in through the channels mechanism. While examining the database after the experiment had completed, it turned out that a little under 80% of all torrents were "free-for-all" torrents, meaning that they did not have a signature associating them with any channel.
Also, if you look at the top channels, they were all updated 2-3 years ago. That means their creators basically dumped a lot of stuff, then forgot about it and lost interest. But their creations continue to dominate the top, which is now a bunch of 🧟 🧟 🧟. "Free-for-all" torrents gossip is the most basic form of collective authoring, and it won out over the channels in real usage (e.g., search and top torrents), as @egbertbouman demonstrated above. This means that without a collective editing feature, Channels are useless and misleading. Overall, I'd say Channels must be removed (especially because of their clumsy "Channel Torrent" engine) and replaced with some more bottom-up system, e.g. collectively edited tags.
😲 😲 😲
Now we understand a big source of performance loss. GigaChannels is way too aggressive in the first seconds and first minute of startup: no rate control or limit on I/O or networking (32 torrents/second). It should be smoother and first rely on …
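A sketch of the kind of rate control that startup phase lacks: a token bucket capping metadata processing at a few torrents per second (parameters and names are illustrative, not Tribler code):

```python
import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.updated = time.monotonic()

    async def acquire(self):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the next token to accumulate.
            await asyncio.sleep((1 - self.tokens) / self.rate)

async def process_torrents(torrents):
    bucket = TokenBucket(rate=4, burst=8)   # e.g. at most ~4 torrents/second
    for t in torrents:
        await bucket.acquire()
        print("processing", t)              # stand-in for DB insert + health check

asyncio.run(process_torrents(range(16)))
```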
Web3Discover: Trustworthy Content Discovery for Web3. @xoriole it would be great to write a scientific arXiv paper on this, beyond the traditional developer role, also contributing to performance evaluation and the scientific core. See the IPFS example reporting, tooling, and general pubsub work, plus "GossipSub: Attack-Resilient Message Propagation in the Filecoin and ETH2.0 Networks": problem description, design, graphs, performance analysis, and a vital role for search plus relevance ranking.
The problem remains relevant to this day. I see two main issues and possible directions for improvement:
For context, the long-term megalomaniac objectives (update Sep 2022):
@arvidn indicated: tracking popularity is known to be a hard problem.
We deployed the first version into Tribler in #3649, after prior Master thesis research in #2783. However, we lack documentation or a specification of the deployed protocol.
Key research questions:
Concrete graphs from a single crawl:
Implementation of `on_torrent_health_response(self, source_address, data)` (a hypothetical sketch follows at the end of this list).
ToDo @xoriole: document the deployed algorithm in 20+ lines (swarm check algorithm, pub/sub, hash selection algorithm, handshakes, search integration, etc.).
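Until that documentation exists, a hypothetical sketch of the handler's general shape (illustrative only; names and storage layout are assumptions, and the deployed algorithm is exactly what remains to be documented):

```python
class PopularityHandler:
    """Illustrative sketch, not the deployed Tribler code."""

    def __init__(self):
        self.health = {}  # infohash -> (seeders, leechers, timestamp)

    def on_torrent_health_response(self, source_address, data):
        infohash, seeders, leechers, timestamp = data
        known = self.health.get(infohash)
        # Keep only the freshest measurement per infohash.
        if known is None or timestamp > known[2]:
            self.health[infohash] = (seeders, leechers, timestamp)
```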