Phd Placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586

Open
synctext opened this issue Sep 4, 2023 · 40 comments

@synctext
Member

synctext commented Sep 4, 2023

ToDo: determine phd focus and scope

Phd Funding project: https://www.tudelft.nl/en/2020/tu-delft/eur33m-research-funding-to-establish-trust-in-the-internet-economy
Duration: 1 Sep 2023 - 1 Sep 2027

First weeks: reading and learning. See this looong Tribler reading list of 1999-2023 papers, the "short version". Long version is 236 papers 😄 . Run Tribler from the sources.

Before doing fancy decentralised machine learning, learn-to-rank; first have stability, semantic search, and classical algorithms deployed. Current Dev team focus: #3868

update: Sprint focus? reading more Tribler articles and get this code going again: https://github.com/devos50/decentralized-rules-prototype

Dreams from a young man 👴 From IETF Journal Oct 2012, "Moving Toward a Censorship-free Internet" (page 16), using phone-to-phone communication as used during the Arab Spring uprising.

Wise words on difficulty of Distributed Systems for young engineers/scientists (also discussion on Hacker News)

@pneague

pneague commented Oct 3, 2023

I have taken to understanding the work done by Martijn on ticket 42. I read through it and downloaded the code attached.

The last version of the code had a couple of functions not yet implemented so I reverted to the 22-06-2022 version (instead of the last version uploaded on 27-06-2022).

The 22-06-2022 version also had a few outdated functions and small bugs here and there, but they were minor enough that I was able to fix them.

I downloaded the required dataset and then successfully ran the parser and scenario-creation functions implemented by Martijn. After that I ran the experiment itself based on the above-mentioned scenario, resulting in a couple of CSV files and graphs.

I understand the general idea of the experiments and how they work; however, the code still eludes me because it is barely commented.
Here's an example of the graph of an experiment run with Martijn's code so far:
image

@synctext
Member Author

synctext commented Oct 3, 2023

Hmmm, very difficult choice.
For publications we should focus on something like Web3AI: deploying decentralised artificial intelligence

@pneague

pneague commented Oct 18, 2023

I re-read papers on learn-to-rank and learned how to use IPv8. With it I created a simulation in which a number of nodes send messages to one another. From there I worked with Marcel and started implementing a system in which one node sends a query to the swarm and then receives content recommendations back from it. The progress is detailed in ticket 7290.
The idea at the moment is to implement a version of Mixture-of-Experts (https://arxiv.org/pdf/2002.04013.pdf) in which one node sends the query to nearby nodes and receives recommendations. These are then aggregated into a shortened, sorted list of recommendations for the querying node.
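
A minimal sketch of the aggregation step described above, assuming each queried neighbour answers with a list of (doc_id, score) pairs. The function name `aggregate_recommendations`, the score-summing rule, and the `top_k` parameter are illustrative assumptions and not the prototype's actual code.

```python
from collections import defaultdict


def aggregate_recommendations(responses, top_k=5):
    """Merge per-peer recommendation lists into one short, sorted list.

    responses: one list of (doc_id, score) pairs per answering peer.
    Summing the scores is an assumed aggregation rule, chosen for simplicity.
    """
    totals = defaultdict(float)
    for peer_results in responses:
        for doc_id, score in peer_results:
            totals[doc_id] += score
    # Highest total score first; ties are broken by doc_id for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


if __name__ == "__main__":
    peer_a = [("doc1", 0.9), ("doc2", 0.4)]
    peer_b = [("doc2", 0.7), ("doc3", 0.1)]
    print(aggregate_recommendations([peer_a, peer_b], top_k=2))
```

Summing favours documents that several peers agree on, which is one way to deal with peers returning partly different lists.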

There are 2 design options: we could gossip the (query, doc_inferior, doc_superior) triplets, or (as we do at the moment) send the model updates around every run. We'll look deeper into both.

One issue we discovered concerns the IPv8 network packet size, which is currently smaller than the entire model serialized with PyTorch; Marcel is working on that. We have 720k weights at the moment and the maximum network packet size for IPv8 is 2.7MB, so we have to fit as many weight updates as possible into each packet.
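
The size mismatch can be checked with a little arithmetic. This sketch assumes 32-bit floats and reads "2.7MB" as 2,700,000 bytes; it ignores serialization overhead and packet headers, so the real numbers will differ somewhat. All names are illustrative.

```python
BYTES_PER_FLOAT32 = 4         # assumption: weights are stored as float32
NUM_WEIGHTS = 720_000         # from the text above
MAX_PACKET_BYTES = 2_700_000  # 2.7 MB from the text, read as 10^6 bytes per MB


def raw_model_bytes(num_weights, bytes_per_weight=BYTES_PER_FLOAT32):
    """Size of the bare weights, without any serialization overhead."""
    return num_weights * bytes_per_weight


def weights_per_packet(max_bytes=MAX_PACKET_BYTES,
                       bytes_per_weight=BYTES_PER_FLOAT32):
    """How many weight values fit into one packet of max_bytes."""
    return max_bytes // bytes_per_weight


def packets_needed(num_weights):
    """Ceiling division: number of packets required for all weights."""
    per_packet = weights_per_packet()
    return -(-num_weights // per_packet)


if __name__ == "__main__":
    print(raw_model_bytes(NUM_WEIGHTS), "bytes of raw weights")
    print(weights_per_packet(), "weights fit in one packet")
    print(packets_needed(NUM_WEIGHTS), "packets for a full model")
```

Under these assumptions the bare weights alone are already slightly larger than one packet, which matches the observation that the serialized model does not fit.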

You can see a demonstration of the prototype below:
Alt Text

I'm currently working on how to aggregate the recommendations of the swarm (for example, what happens if the recommendations of each node which received the query are entirely different). My branch on Marcel's repository: https://github.com/mg98/p2p-ol2r/tree/petrus-branch

@synctext
Member Author

synctext commented Oct 18, 2023

It's beyond amazing what you accomplished in 6 weeks after starting your phd. 🦄 🦄 🦄
Is the lab now All-In on Distributed AI? 🎲

Can we upgrade to transformers? That is the cardinal question for scientific output. We had Distributed AI in unusable form deployed already in 2012 within our Tribler network. Doing model updates is too complex compared to simply starting with sending training triplets around in an IPv8 community. The key is simplicity, ease of deployment, correctness, and ease of debugging. Nobody has a self-organising live AI with lifelong learning, as you have today in embryonic form. We even removed our deployed clicklog code in 2015 because it was not good enough. Options:

For a YouTube-alternative smartphone app we have a single, simple network primitive:
Query, content-item-clicked, content-item-NOT-clicked, clicked-item-popularity, signature. In TikTok form, without queries and with added viewing attention time: content-item-long-attention, long-attention-time, content-item-low-attention, low-attention-time, long-attention-item-popularity, signature. Usable for content discovery, cold starts, content recommendation, and obviously semantic search.
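
A minimal sketch of what the query-based primitive could look like as a message type. The class name `ClickTriplet`, the field names, the JSON encoding, and the HMAC are all assumptions made for illustration; in particular the HMAC is only a stand-in for a real public-key signature, and none of this uses the IPv8 API.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClickTriplet:
    query: str
    clicked_item: str        # content-item-clicked
    not_clicked_item: str    # content-item-NOT-clicked
    clicked_popularity: int  # clicked-item-popularity

    def payload(self) -> bytes:
        # Canonical JSON so signer and verifier hash identical bytes.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


def sign(triplet: ClickTriplet, key: bytes) -> str:
    # Stand-in only: an HMAC is NOT a public-key signature.
    return hmac.new(key, triplet.payload(), hashlib.sha256).hexdigest()


def verify(triplet: ClickTriplet, key: bytes, signature: str) -> bool:
    return hmac.compare_digest(sign(triplet, key), signature)


if __name__ == "__main__":
    triplet = ClickTriplet("ubuntu iso", "item-a", "item-b", 42)
    signature = sign(triplet, b"demo-key")
    print(verify(triplet, b"demo-key", signature))
```

The TikTok-style variant would swap the query and click fields for the attention fields listed above while keeping the same sign-then-gossip shape.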

Next sprint goal: get a performance graph!
We need to get a paper on this soon, because the field is moving at lightning speed. So up and running before X-Mas, Tribler test deployment, and usage of NanoGPT in Jan, paper in Feb 🚀

@pneague

pneague commented Nov 2, 2023

After looking into what datasets we could use for training a hypothetical model, I found ORCAS, which consists of almost 20 million queries and the relevant website link for each query. It was compiled by Microsoft and represents searches made on Bing over a period of a few months (with a few caveats to preserve privacy, such as only including queries that were searched a number of times and omitting user IDs).

The data seems good, but the fact that we have links instead of document titles made it impossible to use the triplet model we have right now (where we need to calculate the 768-dimensional embedding of the document title: since we only have a link and no title, we cannot do that).

So I looked for another model architecture that would be usable in our situation and I found Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document which is the nearest neighbour of the query) we can use the differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document one number, give the output layer of the network as many neurons as there are documents, and have the network essentially assign a probability to each document, given a query. Additionally, the paper performs this work with a Transformer architecture, raising the possibility of us integrating NanoGPT into the future architecture.

I implemented an intermediary version of the network in which the same encoder that Marcel used (the allenai/specter language model) encodes a query and the output is the probability of each individual document. The rest of the architecture is left unmodified:
from torch import nn

number_of_documents = 884  # e.g. the smaller of the two tests described below
layers = [
    ('lin1', nn.Linear(768, 256)),  # encoded query, 768 dimensions
    ('relu1', nn.ReLU()),
    ('lin2', nn.Linear(256, 256)),
    ('relu2', nn.ReLU()),
    ('lin3', nn.Linear(256, 256)),
    ('relu3', nn.ReLU()),
    ('lin4', nn.Linear(256, number_of_documents)),  # one output per document (softmax gives probabilities)
]
In my preliminary tests so far, with 884 documents (i.e. 884 output neurons) we can perform 50 searches in 4 seconds (about 0.08 seconds per search). With 1,066,561 documents, 50 searches complete in 200 seconds (4 seconds per search). Under some circumstances this may be acceptable for Tribler users, but people with older computers might experience significant difficulties. I will need to look at ways of reducing the computation time required.
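
One likely contributor to the slowdown is that the output layer grows linearly with the number of documents. The sketch below only counts parameters for the layer sizes listed above; it is not a measurement, and the helper names are made up for illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def dsi_mlp_params(number_of_documents):
    """Return (hidden, output) parameter counts for the MLP listed above."""
    hidden = linear_params(768, 256) + 2 * linear_params(256, 256)
    output = linear_params(256, number_of_documents)
    return hidden, output


if __name__ == "__main__":
    for n_docs in (884, 1_066_561):
        hidden, output = dsi_mlp_params(n_docs)
        share = output / (hidden + output)
        print(f"{n_docs:>9} docs: hidden={hidden:,} output={output:,} "
              f"output share={share:.1%}")
```

With about a million documents, practically all parameters sit in the last layer, so reducing computation time probably means shrinking or restructuring that layer.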

Moving forward, I'm looking to finally implement a good number of peers in a network that send each other the query and answer (from ORCAS) and get the model to train.

@qstokkink
Contributor

Cool stuff 👍 Could you tell me more about your performance metrics? I have two questions:

  1. Are these SIMD results (i.e., one batch of 50 searches takes 200 seconds but a batch with 1 search also takes 200 seconds)?
  2. What hardware did you use (e.g., CPU, some crappy laptop GPU, HPC node with 10 Tesla V100's, ..)?

This matters a lot for deployment in Tribler.

@pneague

pneague commented Nov 3, 2023

  1. They are not SIMD. One search actually takes 1/50th of the mentioned time
  2. I used a Mac laptop with M2 Pro Chip

But keep in mind that this is extremely preliminary. I did not implement NanoGPT with this setup, so that's bound to increase computing requirements

@synctext
Member Author

synctext commented Nov 8, 2023

Paper idea to try out for 2 weeks:

LLM for search related work example on Github called vimGPT:

vimgpt.mov

@pneague

pneague commented Nov 22, 2023

I got the T5 LLM to generate the IDs of ORCAS documents.
Current Setup:

  • From the entire dataset, I took 100 documents which each have around 600 associated queries, yielding around 60k query-document pairs. No query-document pair appears more than once.
  • I split the dataset into train/test with a split factor of 50%
  • Two agents read the same data from the disk, initially the train set
  • They send each other sequentially every row of the data (which at this point looks like [query, doc_id] )
  • They train on the message received but not the one sent (as they both have the same data I'm avoiding training on the same data twice)
  • The model predicts the doc_id given a query
  • After the whole train set has been iterated through, I count this as an epoch and iterate through it all over again. I count the number of times the model guessed the doc_id correctly, and this is how I calculate accuracy
  • After each 'epoch', if accuracy on the train set reaches >=90% I save the model and tokenizer
  • Training took about 12 hours
  • Then I calculate accuracy on the test set using the same method (but without training on the new data)
  • This way, accuracy on the test set was found to be 93%, suggesting that the model generalizes well to unseen queries for these documents
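
The accuracy counting described in the list above can be sketched as follows. `predict` stands for any callable that maps a query to a doc_id (in the experiment, the T5 model); the function names and the unshuffled 50/50 split are assumptions for illustration.

```python
def exact_match_accuracy(predict, pairs):
    """Fraction of (query, doc_id) pairs for which predict(query) == doc_id."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(1 for query, doc_id in pairs if predict(query) == doc_id)
    return hits / len(pairs)


def split_half(pairs):
    """Deterministic 50/50 split; the real experiment may have shuffled first."""
    middle = len(pairs) // 2
    return pairs[:middle], pairs[middle:]


if __name__ == "__main__":
    # Toy stand-in for the model: a lookup table with one wrong entry.
    lookup = {"q1": "d1", "q2": "d2", "q3": "d9", "q4": "d4"}
    pairs = [("q1", "d1"), ("q2", "d2"), ("q3", "d3"), ("q4", "d4")]
    train, test = split_half(pairs)
    print(exact_match_accuracy(lookup.get, train))
    print(exact_match_accuracy(lookup.get, test))
```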

I was looking for what to do moving forward.

I found a survey paper on the use of LLMs in the context of information retrieval. It was very informative; there's a LOT of research in this area at the moment. I made a list of 23 papers referenced there that I'm planning to go through at an accelerated pace. At the moment I'm still wondering what to do next to make the work I've already performed publishable by the conference deadline on the 5th of Jan.

@synctext
Member Author

synctext commented Nov 22, 2023

update
Please try to think a bit already about the next step/article idea for upcoming summer 🌞 🍹 ? Can you think of something where users donate their GPU to Tribler and get a boost in their MeritRank as a reward 🥇 ➕ the Marcel angle of "active learning" by donating perfect metadata. Obviously we need the ClickLog deployment and crawling deployed first.

@pneague

pneague commented Dec 12, 2023

In the past weeks I've managed to introduce 10 users who send each other query-doc_id pairs.

The mechanism implemented is the following:

  • from the start, 100 documents per available peer are selected from the entire ORCAS dataset to act as the experiment's dataset
  • this new dataset is split into train/test sets, keeping the ratio of each document equal across both (so if there are 20 queries for a document, 10 will be in the train set and 10 in the test set). I've hardcoded the exclusion of documents with only 1 associated query, since they would appear in only one of the train or test sets. The test set is excluded from training; only data from the train set is sampled in the training process;
  • from the documents available, each peer samples a random number of documents between 80 and 120, which act as the peer's own dataset. Peers may sample documents which have already been sampled by another peer. In total, for the experiment with 10 peers, 661 of the 1000 documents (100 docs per peer * 10 peers) were sampled by at least 1 peer;
  • each peer initiates its own T5 model (small version) and sets it to train mode;
  • training is now performed in batches of 32. Each peer has a list (corresponding to the batch-data) containing the query and another list containing the doc_id. When the list reaches 32 items, the peer trains its model on the data from those 2 lists and then resets them;
  • every 0.1 seconds, each peer selects a random query-doc_id pair from its own dataset and sends it to another random peer, but does not append it to its own current_batch_list. This is done so that a peer's training is not dominated by its own data over that of the other peers. Instead, each peer appends an amount of its own data (equal to 32 / nbr_of_peers_currently_identified) to its batch_list whenever the list is empty. This way we can more or less ensure that the data fed into each peer's model is approximately equally likely to come from any peer in the network, including the peer itself;
  • I've tried experiments with 2, 10, and 32 peers so far. The experiments with 2 and 10 peers have performed well. For the case with 10 peers, training finished within 6 hours and all peers have an accuracy of 99-100% on the train set and 90-91% on the test set (for the 661 sampled documents out of 1000). The experiment with 32 peers ran out of RAM (as each peer holds its own model) and started performing erratically; I don't think we can trust those results. I've talked with Sandip and got an account for DAS6, as I don't think we can scale the experiments further without a training server. I'll be working to understand how to use it;
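
The batching mechanism from the list above can be sketched like this. The `Peer` class, its method names, and the stubbed-out `train_step` are illustrative assumptions; the real peers hold a T5-small model and exchange messages over the network.

```python
import random

BATCH_SIZE = 32


class Peer:
    def __init__(self, name, dataset, n_known_peers, rng):
        self.name = name
        self.dataset = dataset              # list of (query, doc_id) pairs
        self.n_known_peers = n_known_peers  # peers currently identified
        self.rng = rng
        self.batch_queries = []
        self.batch_doc_ids = []
        self.batches_trained = 0

    def _seed_with_own_data(self):
        # When the buffer is empty, add this peer's own share of one batch.
        own_share = BATCH_SIZE // self.n_known_peers
        for query, doc_id in self.rng.choices(self.dataset, k=own_share):
            self.batch_queries.append(query)
            self.batch_doc_ids.append(doc_id)

    def pick_outgoing(self):
        # Sent to a random other peer; NOT appended to the local buffer.
        return self.rng.choice(self.dataset)

    def on_receive(self, query, doc_id):
        if not self.batch_queries:
            self._seed_with_own_data()
        self.batch_queries.append(query)
        self.batch_doc_ids.append(doc_id)
        if len(self.batch_queries) >= BATCH_SIZE:
            self.train_step()

    def train_step(self):
        # Placeholder for one optimizer step on the two batch lists.
        self.batches_trained += 1
        self.batch_queries.clear()
        self.batch_doc_ids.clear()


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(f"query {i}", f"doc {i}") for i in range(10)]
    peers = [Peer(f"p{i}", data, n_known_peers=4, rng=rng) for i in range(4)]
    for _ in range(200):  # each tick stands for one 0.1 s interval
        for sender in peers:
            receiver = rng.choice([p for p in peers if p is not sender])
            receiver.on_receive(*sender.pick_outgoing())
    print([p.batches_trained for p in peers])
```

With 4 known peers, each batch starts with 8 of the peer's own samples and is completed by 24 received ones, so own data makes up roughly one peer's share of the batch.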

For the future I think trying to use DAS6 to perform a test with 100 peers may be worthwhile to check the integrity of the model and the evolution as the number of peers increases.

@synctext
Member Author

synctext commented Dec 12, 2023

AI with access to all human knowledge, art, and entertainment.

AGI could help humanity by developing new drugs, treatments for diseases, and turbocharging the global economy.
Who would own this AGI? Our dream is to contribute to this goal by pioneering a new ownership model for AI and a novel model for training. AI should be public and contribute to the common good. More than just open weights: full democratic self-governance. An open problem is how to govern such a project and devise a single roadmap with conflicting expert opinions. Current transformer-based AI has significant knowledge gaps and needs thousands or even millions of people to tune it. It needs the Wikipedia paradigm! Gemini example: what is the most popular YouTube video? The state-of-the-art AI fails to understand the concept of media popularity, front-page coverage, and the modern attention economy in general.

  • It all starts with Learn-to-Rank in full decentral setting {current ongoing work}
  • Unlock swarm-based data
  • Continuous learning at next level: eternal learning
  • Get a few thousand people to contribute (e.g. like Linux,Wikipedia,Bittorrent,Bitcoin, etc.)

Related: How is AI impacting science? (Metascience 2023 Conference in Washington, D.C., May 2023.)

@synctext
Member Author

synctext commented Jan 29, 2024

Public AI with associative democracy

Who owns AI? Who owns The Internet, Bitcoin, and Bittorrent? We applied public infrastructure principles to AI. We build an AI ecosystem which is owned by both nobody and everybody. The result is a democratically self-governing association for AI.

We pioneered 1) a new ownership model for AI, 2) a novel model for training, and 3) competitive access to GPU hardware. AI should be public and contribute to the common good. More than just open weights, we envision full democratic self-governance.
Numerous proposals have been made for making AI safe, democratic, and public. Yet, these proposals are often grounded exclusively in either philosophy or technology. Technological experts, from the builders of databases, operating systems, and clouds, rarely interact with the experts who deeply understand the question 'who has control?'. Democracy is still a contested concept after centuries. Self-governance is the topic of active research, both in the world of atoms and the world of bits. Complex collective infrastructure with self-governance is an emerging scientific field. Companies such as OpenAI run on selling their AI dream to ageing companies such as Microsoft. There is great need for market competition and a fine-grained supply chain. Furthermore, lack of fine-grained competition in a supply chain ecosystem is hampering progress. Real-world performance results irrefutably show that the model architecture is not really that important; it can be classical transformers, Mamba, SSM, or RWKV. The training set dominates the AI effectiveness equation. Each iteration brings more small improvements to a whole ecosystem, all based on human intelligence. Collective engineering on collective infrastructure is the key building block towards creating intelligence superior to the human intellect.

AI improvements are a social process! The way to create long-enduring communities is to slowly grow and evolve them. The first permissionless open source machine learning infrastructure was Internet-deployed in 2012.
However, such self-ruled communities only play a minor role in the AI ecosystem today. The dominating AI architecture is fundamentally unfair. AI is expensive and requires huge investments. An exclusive game for the global tech elite. Elon Musk compared the ongoing AI race to a game of poker, with table stakes of a few billion dollars a year. Such steep training costs and limited access to GPUs cause Big Tech to dominate this field. These hurdles notably affect small firms and research bodies, constraining their progress in the field. Our architecture splits the ecosystem by creating isolated competitive markets for GPU renting and training set storage. Our novel training model brings significant synergy, similar to the Linux and Wikipedia efforts. By splitting the architecture and having fine-grained competition between efforts, the total system efficiency is significantly boosted. It enables independent evolution of dataset gathering, data storage, GPU rental, and AI models.
Our third pioneering element is the democratic access to GPU hardware. One branch of distributed machine learning studies egalitarian architectures, where even a tiny smartphone can be used to contribute to the collective. A billion smartphones, in theory, could significantly outsmart expensive hardware. Wikipedia and Linux have proven that you can't compete with free. We mastered the distributed, permissionless, and egalitarian aspects of AI. The next stage of evolution is to add democratic decision-making processes. A team of 60 master students is currently attempting to engineer this world-first innovation collectively.
Another huge evolutionary leap is AI with access to all human knowledge, art, and entertainment. Currently datasets and training hardware are expensive to gather and store. For instance, the open access movement to scientific knowledge has not yet succeeded in creating a single repository. The training of next-generation AI requires completion of this task. All creative commons content (text, audio, video, DNA, robotics, 3D) should be scripted in an expanding living dataset, similar to the SuperGLUE set-of-datasets. The cardinal problem is building trust in the data, accuracy, and legal status. We pioneered in prior work a collective data vault based on passport-grade digital identity.

@pneague

pneague commented Jan 30, 2024

In the last few weeks I ran experiments with ensembles of peers. Experiments with more than 10 peers make the laptop run out of RAM and start behaving erratically, so I had to change the direction of my work.
The current idea is that T5 small cannot fit that many doc_ids inside its weights (because it is so small). But we need it to be small for it to run on Tribler peers' computers.

So in order to increase the number of retrievable documents I thought of sharding the datasets, with each shard having its own peers. In the experiments performed, each shard consists of 10 peers.

  • Each of the experiments was successfully run, with each peer achieving good results on the shard's test set (as described in a previous entry here).
  • Each shard was trained in an independent run so the laptop I'm using wouldn't run out of RAM.
  • Each shard had different doc-id's from the other shards.
  • I've used 5000 documents per shard and let each peer catch a random number of documents between 200 and 300 (as in the previous entry).
  • Documents not chosen by any peer were discarded.
  • After successfully training all models on their respective shards, I experimented with using ensembles to aggregate the results of multiple shards. Initially, the idea was that the system would pick a random number of models, belonging to all shards, and each picked model would vote on a document_id, given a query. But this relied on chance picking the models belonging to the right shard for each tested query. Marcel came up with the idea that we could in principle gossip the shard-number of each peer and then we would know to ask models from each shard given a query.
  • The idea was that models trained on the right data would pick the correct document (as each had a top1 accuracy of 90%), while models not trained on the right data would output either random documents, different from one model to the other, or hallucinate doc_id's, different from one model to the other. So when we see two models voting for the same doc_id, we know that they were trained on data matching the query in question.
  • Another ensemble idea was to get the top5 results for a query with beam-search, and get their model-scores for those 5 beams. After that we could take softmax of the 5 results so we know the confidence that the model has on each of them. Then, instead of summing the number of times a result was suggested by a model, we would sum the confidences of each model for each result.
  • At the moment I'm still running some experiments but here are the accuracy results for each shard:
    Accs_by_shard_and_beam
    The image above depicts the accuracy on the test set of each shard of each peer belonging to that shard. Blue is top1 accuracy, and red is top5 accuracy (obtained with beam-search).
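
The two ensemble ideas from the list above (majority voting, and summing softmax confidences over the top-5 beams) can be sketched as follows. Each model's output is assumed to be a best-first list of (doc_id, beam_score) pairs; the function names and the toy scores are illustrative only.

```python
import math
from collections import Counter, defaultdict


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vote_ensemble(model_outputs):
    """Majority vote over each model's top-1 doc_id."""
    votes = Counter(beams[0][0] for beams in model_outputs)
    return votes.most_common(1)[0][0]


def confidence_ensemble(model_outputs):
    """Sum each model's softmax confidence per doc_id and pick the largest."""
    totals = defaultdict(float)
    for beams in model_outputs:
        confidences = softmax([score for _, score in beams])
        for (doc_id, _), confidence in zip(beams, confidences):
            totals[doc_id] += confidence
    return max(totals, key=totals.get)


if __name__ == "__main__":
    outputs = [
        [("d7", -0.1), ("d3", -2.0)],    # model from the matching shard
        [("d7", -0.2), ("d5", -1.5)],    # another model from that shard
        [("d99", -0.9), ("d42", -1.0)],  # model from an unrelated shard
    ]
    print(vote_ensemble(outputs), confidence_ensemble(outputs))
```

In the toy input the two models trained on the matching shard agree, while the unrelated model spreads its confidence over documents nobody else proposes, which is the effect both schemes rely on.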

Model Ensemble from different shards drawio
This diagram shows how a 2-shard ensemble would work in the voting and confidence mechanism (in the previous iteration where the models were chosen randomly, without caring how many models we get from each shard)

@synctext
Member Author

synctext commented Feb 29, 2024

Solid progress! Operational decentralised machine learning 🚀 🚀 🚀 De-DSI for the win.

Possible next step is enabling unbounded scalability and on-device LLM. See Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis or the knowledge graph direction. We might want to schedule both! New hardware will come for the on-device 1-bit LLM era

update: Nature paper 😲 Uses LLM for parsing of 1200 sentences and 1100 abstracts of scientific papers. Avoids the hard work of PDF knowledge extraction. "Structured information extraction from scientific text with large language models": this work outputs entities and their relationships as JSON documents or other hierarchical structures

@pneague

pneague commented Mar 26, 2024

Fresh results from DAS6 for magnet link prediction:
1000 docs - 90.5%
5000 docs - 77%
10000 docs - 65%

See comparison between predicting docids vs magnet links:
image

When the dataset is relatively small, the accuracies are the same for both top-1 and top-5. As more data appears in the dataset, we can see a divergence in the accuracies posted in both metrics. We hypothesize that the limited number of weights in our model efficiently captures URL patterns in scenarios with sparse data. However, as the data complexity increases, this constraint appears to hinder the model's ability to accurately recall the exact sequence of tokens in each URL. This is merely a guess, and we intend to investigate this further in future work. However, the observed discrepancy in accuracy levels remains marginal, amounting to merely a few percentage points across a corpus of 10K documents.

@pneague

pneague commented Apr 10, 2024

Poster for the De-DSI paper:
De-DSI Poster.pdf

@pneague

pneague commented May 21, 2024

One of the ideas to further develop the De-DSI paper was to perform the sharding division of documents in a semantically meaningful way. This is what I've done in the past couple of weeks.

So the problem was that if you shard documents randomly, you can end up with 2 similar documents in different shards, and when querying all shards for one of the 2 documents you'd get high confidence from both shards. This leads to a 50/50 chance that the correct shard will have a higher confidence than the incorrect one.

The idea was to perform semantic sharding such that all documents of a type would be in one shard. This would resolve the confusions between shards as each one would know which document needs to be retrieved and the others will have low confidence in their result this way.

So I:

  • Trained 10 T5 models on 1k docs each, with docs having at least 10 queries each and got the ensemble accuracy
  • Got the T5-small embeddings of all queries for each document and averaged them to obtain the embedding of the document;
  • Afterwards I used K-means to get the splitting done. K in this case is the number of shards = 10;
  • Trained 10 T5 models on the documents in each cluster calculated with K-means, and got the ensemble results of that;
  • Plotted the accuracy distribution in boxplots for top1-top5. Each boxplot represents the accuracy on the dataset of 10 shards (aggregated) by either the individual shard or the ensemble for both the random-sharding and semantic-sharding setting.
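
The sharding steps above (average the query embeddings per document, then cluster with K-means) can be sketched in pure Python. This is a plain Lloyd's-algorithm toy with made-up 2-dimensional embeddings; the real experiment used T5-small embeddings and K = 10, and may well have used a library implementation of K-means.

```python
import random


def mean_vector(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return assignment


def semantic_shards(query_embeddings_by_doc, k):
    """Map each doc_id to a shard index based on its mean query embedding."""
    doc_ids = sorted(query_embeddings_by_doc)
    doc_vectors = [mean_vector(query_embeddings_by_doc[d]) for d in doc_ids]
    return dict(zip(doc_ids, kmeans(doc_vectors, k)))


if __name__ == "__main__":
    toy = {
        "a": [[-1.0, 0.0], [1.0, 0.0]],
        "b": [[0.0, 1.0]],
        "c": [[10.0, 10.0]],
        "d": [[9.0, 11.0], [11.0, 11.0]],
    }
    print(semantic_shards(toy, k=2))
```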

I compared the results and it turns out it doesn't work as hoped
Screenshot 2024-05-21 at 14 22 37

I believe the issue is that if a shard contains semantically similar documents, it is harder to distinguish between them, and so the confidence of the correct shard is lower than before. This means that different-but-still-slightly-similar documents in other shards have a higher chance than before of beating the confidence of the correct document in the correct shard.

I thought about what exactly I could do about this but I haven't come up with anything yet.
Jeremie recommended I look into fully decentralized ML training which is resistant to some kinds of attacks. I have an idea on how it may be done but I need to read more on it first, as it's a new topic to me.

@pneague

pneague commented May 27, 2024

In the last few days I've read papers on

  • federated learning and how gradient passing is anonymized;
  • personalized federated learning;

I also thought about how a mixture-of-experts with multi-layered semantic sharding would work. At the moment something that I could try would be:

  • Take 10,000 documents and use the average of each document's query embeddings as the representation of the document, as before
  • Use K-means with K = 5 to split the documents into 5 shards, assign to each a number 1-5
  • Then, for each of the 5 clusters, consider only the documents belonging to it and perform another K-means with K = 5, assigning each sub-cluster a number 1-5
  • Then, for each of the resulting 25 sub-clusters, perform another K-means with K = 4, assigning each a number 1-4
  • Thus ending up with a nested sharding method with 5 × 5 × 4 = 100 shards
  • Use a master DSI model in a mixture-of-experts method to predict the ID of the shard, e.g. 2-1-4 would represent the shard belonging to cluster 2-1-4;
  • Having the shard I can ask it specifically to give me the prediction, or I could take the ensemble;
  • If using an ensemble, this would still have the issue that slightly less relevant documents present in another shard would outcompete in confidence the correct document present in a shard which has many relevant documents. Thus, I would predict this not to work super well in ensembles, but I am not 100% sure of it
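
The nested shard labels from the list above can be enumerated directly, which also confirms the shard count. The names `BRANCHING`, `shard_labels`, and `route` are illustrative; `route` only shows how per-level predictions of a hypothetical master model would be joined into a label.

```python
from itertools import product

BRANCHING = (5, 5, 4)  # K used at each of the three K-means levels


def shard_labels(branching=BRANCHING):
    """All shard labels such as '2-1-4' for a nested K-means hierarchy."""
    ranges = [range(1, k + 1) for k in branching]
    return ["-".join(map(str, path)) for path in product(*ranges)]


def route(level_predictions):
    """Join per-level cluster predictions into a shard label."""
    return "-".join(str(cluster) for cluster in level_predictions)


if __name__ == "__main__":
    labels = shard_labels()
    print(len(labels), labels[:3], route((2, 1, 4)) in labels)
```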

I also haven't found any paper on personalized models in decentralized federated learning, so it would be a gap which is unexplored and thus maybe easy to publish about.

@synctext
Member Author

synctext commented May 27, 2024

Focus on finding a phd problem to solve. Avoid "Technology push" that makes much science useless. We need GPUs for training. We need a dataset. We need a publishable problem.

Perhaps it is time to dive for 3 weeks into a production system? Some ideas and links

Hipster publishable idea: secure information dissemination for decentralised AI (e.g. MeritRank, clicklog, long-lived ID, sharing data, not unverifiable vector of gradient descent)

@pneague

pneague commented Jun 10, 2024

In the last few weeks I looked into methods of estimating reputation and defending against sybils in a graph network by using ML models. There are quite a few such methods across different areas, for example edge-computing devices and social networks.

After talking with Bulat, he suggested we could try to use Meritrank and some kind of model to limit the amount of resources that a sybil attack could sap from the network. The idea is still in the incipient phase and it's not clear to me if it works. Bulat suggested that instead of doing what the other papers have done (for example the papers doing reputation estimation with social networks were using social network information to find sybils), we could try to do this solely by using the graph data. I'm not sure if this is possible but I think it's in the realm of possibility.

Additionally, we would not use a supervised-learning method where we have the sybils clearly mapped. Instead, we would take a dataset where we assume all members of the graph to be honest, perform all possible types of sybil attack on the network, and see if we can somehow limit how much attackers gain from this. We could also implement the methods of previous papers and compare our results to theirs in a situation where all types of sybil attack are simulated. Bulat mentioned he doesn't know of a paper taking this approach so far.

I have also talked with Quinten about the dataset his code is collecting. It's interesting but not very rich, even if we may have lots of data. You can see a very small sample meant as an example below: image.

Basically we have query, infohash, score, parent_query_forwarding_pk. The score is calculated as follows:
If you search for a query, click a link, and don't search for the same query again, you're assumed to be satisfied with the link, so the score = 1.0.
If you search for a query, click a link, and are not satisfied, you search for the query again. If you click a second link and are satisfied, you stop there. In that case, the first link clicked has a score of 0.2 and the second link clicked has a score of 0.8.
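
A minimal sketch of this scoring rule, covering only the two cases described above. The function name `click_scores` is made up, and how three or more clicks are scored is not stated here, so the sketch refuses to guess.

```python
def click_scores(clicked_infohashes):
    """Score the links clicked for one query, given in click order.

    One click: the user is assumed satisfied, so the link scores 1.0.
    Two clicks: the first link scores 0.2 and the second 0.8.
    """
    if len(clicked_infohashes) == 1:
        return {clicked_infohashes[0]: 1.0}
    if len(clicked_infohashes) == 2:
        first, second = clicked_infohashes
        return {first: 0.2, second: 0.8}
    raise ValueError("scoring for this number of clicks is not described")


if __name__ == "__main__":
    print(click_scores(["infohash-a"]))
    print(click_scores(["infohash-a", "infohash-b"]))
```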

This is interesting, and may provide a way to get reputation (for the person who's seeding the content in the first link and for the person who's gossiping the queries). But I am not sure we can do it well if we have few users relative to number_of_links_available. We'll have to see how much data we end up with in a few months.

@synctext
Member Author

synctext commented Jun 10, 2024

btw about teaching...prepare for helping out with msc students more + master course of Blockchain Engineering.

update : machine learning for 1) personalisation 2) de-DSI content discovery 3) decentralised seeder content discovery {DHT becomes 👉 IPv4 generative AI} 4) sybil protection 5) spam protection 6) learn-to-rank

@pneague

pneague commented Jul 2, 2024

In the last few weeks I was on vacation. After that I got a recommendation engine working based on collaborative filtering of the MovieLens dataset. Nothing too fancy, just an SVD algorithm applied to the MovieLens-1M data. I've also read a few papers, including a literature review on foundation models in recommendation algorithms.

I got two preliminary ideas for future research that I haven't yet seen implemented:

  • Recommendation engine based on local agent suggestions: If in the near future we all have LLM agents that we interact with to get work done, these LLMs will know quite a lot about us. This may allow the agent to suggest items (or even search terms) on a video-providing website. Say I interact a lot with my agent to learn about ancient history: if a website could ask the agent what I am interested in viewing, it might answer 'ancient history documentaries', thus enhancing the experience of websites like Youtube/Netflix or Tribler.

  • AI algorithm to detect spam in p2p network: using LLMs to determine whether files are what their titles imply they are.

The two ideas could be used together as well, I imagine.
I'll think some more on how I could conceivably get one or both of these ideas done if we deem them interesting.

@synctext
Member Author

synctext commented Jul 2, 2024

Still a few months left to find a great paper idea 🕙
Edge AI and recommenders are hot. Rough paper idea: 1) classic collaborative filtering, 2) LLM-based CF, 3) LLM-based CF plus Differentiable Search Index (DSI). Compare performance cost of a "local-first search engine".

"As simple as possible" architecture: 3 items sent; 3 recommended items received.

Paper idea: aim to have a recommender without clicklog leakage. No text queries. Peers do not explicitly exchange profiles. Spread real clicklog snippets, from an unknown peer. Focus on unlinkability. They replay old recommendation requests to hide their own request. Use this as a naive approach, with known spam vulnerability.

goal for 19 Aug 2024: Above architecture. 100 IPv8 peers listening, send 3 items to random peer, you get 3 recommended items back. Movielens. Outcome format: single amazing .GIF .... 🎉
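
A toy sketch of the "3 items in, 3 recommended items out" exchange, assuming the responding peer holds a handful of clicklog snippets (short lists of item IDs) and ranks candidates by simple co-occurrence. The function name `recommend`, the co-occurrence scoring, and the tie-breaking rule are all assumptions for illustration. The IPv8 messaging and the replay of old requests for unlinkability are not modelled.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def recommend(sent: Sequence[int],
              snippets: Iterable[Sequence[int]],
              k: int = 3) -> List[int]:
    """Return up to k item IDs that co-occur with the sent items.

    Each snippet that shares items with the request votes for its other
    items, weighted by the size of the overlap.
    """
    sent_set = set(sent)
    votes: Counter = Counter()
    for snippet in snippets:
        items = set(snippet)
        overlap = len(sent_set & items)
        if overlap == 0:
            continue  # unrelated snippet, no vote
        for item in items - sent_set:
            votes[item] += overlap
    # Highest vote first; ties broken by item ID to keep output deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]
```

With MovieLens, the item IDs would be movie IDs and each snippet a small slice of some peer's ratings.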

update: share the embedding with another user. This could somehow be used to train an on-device model. 1 protocol query/response for both real-time search/recommendation and online continual learning in background. Build upon our strength: permissionless gen-AI with full scalability.

Possible goal:

  • 15 Sep: 3 readable draft paper ideas
  • Only part-time experimental work!
  • Each paper has only 3 sections: intro+related_work+Problem_Description + title

@pneague

pneague commented Aug 26, 2024

In the last 2 months I went with Marcel to the Oxford NLP summer school, took vacation back home and worked on an idea I had recently. I refreshed my understanding of the topic, which I had not touched professionally in the last few years.

The professor was Naeemullah Khan, from King Abdullah University in Saudi Arabia.

While there I thought more deeply about an idea I came up with previously, and pitched it to Prof. Khan and to Dr. Naman Goel, a postdoc from a lab at Oxford. The idea is to use the upcoming Microsoft Recall feature (which takes screenshots of the activity on the PC every few minutes) to get an idea of the user's preferences. These preferences can be used to generate query recommendations for web services, including Tribler.
There is an understandable reluctance in the internet community to use this kind of feature because of privacy concerns, but I think that as AI becomes stronger and chips become cheaper, this kind of AI, which looks at everything you do and then helps you in different ways, could become essential for daily life. I think the privacy problems will eventually be addressed in some way, so starting to work on this topic now positions us well for the future.

Both Prof. Khan and Dr. Goel gave their approval and Dr. Goel even said he's willing to contribute with weekly calls and analysis of results (the code would be my task).

@synctext
Member Author

synctext commented Aug 30, 2024

Venue: LCN or the collective intelligence Journal: https://journals.sagepub.com/editorial-board/COL

@pneague

pneague commented Sep 9, 2024

Research plan.pdf

@qstokkink
Contributor

A potentially interesting topic for your PhD is to check out self-evolving distributed ontologies based on Tries and - at least in this text-based proof of concept - based on Gemini (but other models like ChatGPT should also work). Of course, communicating using human language with the Gemini model is (probably) not a good way forward and this would need some more sophisticated hooking into the underlying model (i.e., Gemini here).

My txt-based intuition is here: learningtrees.txt

@synctext
Member Author

synctext commented Sep 9, 2024

Great progress! For next sprint

  • follow a structured process for your 2nd thesis article. Maximise yield of 12 months of your time.
  • 2 ideas into first-draft stage. 1-2 page writeup each.
  • Show them before next meeting to a fellow phd student (not publishable, workshop, conference level)

update
Reading list (lots of reading, focus on single thing at end of Oct??)

update2
Found that the original paper doing "movie mining" is from 2015 and has over 3000 citations now. See the MIT and Toronto arXiv paper "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".
image
"The first challenge we need to address, and the focus of this paper, is to align books with their movie releases in order to obtain rich descriptions for the visual content. We aim to align the two sources with two types of information: visual, where the goal is to link a movie shot to a book paragraph, and dialog, where we want to find correspondences between sentences in the movie’s subtitle and sentences in the book."
and
"To evaluate the book-movie alignment model we collected a dataset with 11 movie/book pairs annotated with 2,070 shot-to-sentence correspondences."

@mg98
Contributor

mg98 commented Sep 21, 2024

Since you're doing automatic content analysis for decentralized search, here are some papers for related work:

| Source | What they do |
|---|---|
| ipfs-search.com | centralized metadata extraction using Apache Tika (e.g. audio length, artist, title, in case of an MP3) |
| Khudhur et al. 2019. Siva-the ipfs search engine | metadata extraction using Apache Tika |
| Zhu et al. 2020. Keyword search in decentralized storage systems | metadata extraction using Apache Tika |
| Wang et al. 2020. Keyword search technology in content addressable storage system | they mention something but don't describe it further ("Extract file attribute information as keywords. This step is important for non-text files because we cannot extract keywords from the contents of non-text files.") |
| Dix et al. 2006. Position Paper: Concept Classification for Decentralised Search | they also mention metadata extraction |
| Keizer et al. 2023. Ditto: Towards decentralised similarity search for Web3 services | automatic keyword extraction (on text though) using YAKE, this is a bit more semantic/ML flavored |

@pneague

pneague commented Oct 7, 2024

For the past 2 weeks I was reading papers and trying to understand the cutting edge in distributed training.

In particular I focused on a recent preprint paper.
The idea there is to combat data inference attacks in distributed training (which leverage the model gradients received from peers to infer the data those peers hold) by splitting the model into multiple parts and sending each peer only one part. This way, every peer ends up with a complete model composed of parts from multiple peers, which is meant to prevent data inference.
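
To make the chunking idea concrete, here is a deliberately simplified sketch: each peer's flat parameter vector is cut into k chunks, and a receiver assembles a full vector in which every chunk comes from a different sender. The function names and the round-robin sender assignment are assumptions for illustration only. This is not the preprint's actual protocol, which this thread does not spell out.

```python
from typing import List


def split_chunks(params: List[float], k: int) -> List[List[float]]:
    """Split a flat parameter vector into k contiguous chunks.

    Chunk sizes differ by at most one element.
    """
    base, extra = divmod(len(params), k)
    chunks, start = [], 0
    for c in range(k):
        size = base + (1 if c < extra else 0)
        chunks.append(params[start:start + size])
        start += size
    return chunks


def assemble_from_peers(models: List[List[float]],
                        receiver: int, k: int) -> List[float]:
    """Build the receiver's mixed model from k different senders.

    Chunk c is taken from peer (receiver + 1 + c) % n. Requiring k < n
    guarantees the k senders are distinct and exclude the receiver, so no
    single sender reveals more than one chunk to this receiver.
    """
    n = len(models)
    if not 0 < k < n:
        raise ValueError("need 0 < k < number of peers")
    assembled: List[float] = []
    for c in range(k):
        sender = (receiver + 1 + c) % n
        assembled.extend(split_chunks(models[sender], k)[c])
    return assembled
```

A real system would also aggregate (for example average) what it receives with its own parameters; that step is left out here.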

I spent time understanding the mathematics of the issue (convergence and privacy guarantees) and made good progress. I realised that in order to be able to perform this kind of work I would need to go through the references to understand the theorems that are used in this field. This would take a while. It remains to be decided whether it's a good use of my time.

Additionally, I ran their algorithm, posted here.
image

@synctext
Member Author

synctext commented Oct 7, 2024

  • Road to get 3 written down ideas...
  1. inference and mixnet architecture
  1. Microsoft recall; local search agent idea. Purpose: personal search, recall your history, personal media consumption. Do this privacy-preserving, fully decentralised!
  • selling a system is a bad ML idea. Algorithm 1 novelty 👍
  • decouple item modeling from user modeling, as advised by Bytedance HLLM paper
  1. Search scientific literature. embedding of PDF files, link to Global Brain idea.
  • Science: scalable models of intelligence
  • De-DSI reusage? Build upon existing code?
  • Re-use this TPI-LLM code? "Our TPI-LLM system addresses the privacy issue by enabling LLM inference on edge devices with limited resources. The system leverages multiple edge devices to perform inference through tensor parallelism, combined with a sliding window memory scheduler to minimize memory usage. Currently, TPI-LLM can run Yi-34B in full precision on 4 laptops with 5GB of memory on each laptop, and run Llama 2-70B on 8 devices with 3GB of memory on each device."
  1. reference baseline for decentralised learning. Lab-idea for longer time. Egbert, Quinten, Bulat time investment.

Systems or networking storyline for publication IEEE LCN or PETS or Middleware. Future ambition is NeurIPS or ICML

For next meeting in 2 weeks: attack ideas, IPv8 porting effort, get an experiment graph out of Shatter

@pneague

pneague commented Oct 21, 2024

I have further looked into the code from SHATTER, data inference/reconstruction attack methods, and (as per Jeremie's recommendation) into MixNN which does similar work, though more basic.

I presented to Dr. Naman Goel from the Oxford lab the idea of attacking models which mix their parameters and send them to different people. He suggested that, since the method is not widely accepted, this would be an attack on an architecture that not many people use, and thus not very interesting.

I thought of looking into byzantine attacks in decentralized networks, then saw that a normal gradient similarity method has been published already in June this year, so I'd have to see if I can come up with something new. I found a literature review on the topic which I believe would be useful to read.

Idea: A user has consumed some content items, each with a semantic coordinate (calculated with an LLM, for example). We then calculate the semantic coordinate of the user as the average of the coordinates of the content they have consumed. If I search with a query, I get the coordinate of the query and check around me for the people whose semantic coordinates are closest to it. Then I ask them, as they are the users most likely to have content I'm interested in.
I'm not sure the idea is feasible, but it is plausible.

@synctext
Member Author

synctext commented Oct 21, 2024

Document needed for phd progress meeting. Mixture of Experts scaling is a great opportunity for decentralisation, which we already talked about on 18 Oct 2023. Idea outline:

update: much related work exists on 6G federated learning, yet it is highly theoretical, impractical, and immature. Great stuff to help realise for real 😃 IEEE/ACM Transactions on Networking cfp

AI on Networks
● Decentralized learning, distributed training and inference, federated learning over device-edge-cloud continuum
● Trustworthy and privacy-preserving AI over a wide spectrum of networks
● Large foundation model pre-training, fine-tuning and inference over large-scale networks
● Robust adversarial machine learning over wired and wireless networks
● Resource-constrained and edge deployable AI solutions, experiments and testbeds for ML-driven wireless systems.

15 Jan 2025 deadline, super rush! 🤔

@pneague

pneague commented Oct 29, 2024

Idea 1: Decentralized file-search based on taste embeddings:

Description: When searching for a file in a decentralized network, instead of flooding the network with the query, the system finds people who hold items similar to my query and queries only them.
Methodology:

  • Use the ORCAS dataset (which contains query – doc_id pairs), and assign 10 nodes (representing users) a number of documents with their associated queries;

  • Use an LLM to get the embeddings of all queries;

  • Calculate embeddings of a document as the average embeddings of its queries;

  • Calculate embeddings of a node (representing his/her taste) as the average embedding of their documents;

  • For a new query, calculate cosine similarity between query and all nodes we’re aware of, and send the query to the most similar nodes. Average DSI retrieved doc-ids. This way it works like a mixture of experts;

  • Open Question: What is the state of the art in decentralized search?
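
The averaging and routing steps in the methodology above can be sketched as follows. The hand-written 2-D vectors stand in for LLM query embeddings, and the names `node_embeddings` and `route` are hypothetical. ORCAS loading and the DSI retrieval step on each node are left out.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def node_embeddings(query_vecs: Dict[str, Vector],
                    doc_queries: Dict[str, List[str]],
                    node_docs: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Document = mean of its query vectors; node = mean of its documents."""
    doc_vecs = {doc: mean_vector([query_vecs[q] for q in queries])
                for doc, queries in doc_queries.items()}
    return {node: mean_vector([doc_vecs[d] for d in docs])
            for node, docs in node_docs.items()}


def route(query_vec: Vector, node_vecs: Dict[str, Vector],
          top_k: int = 2) -> List[str]:
    """Return the top_k nodes whose taste embedding is closest to the query."""
    ranked = sorted(node_vecs,
                    key=lambda n: (-cosine(query_vec, node_vecs[n]), n))
    return ranked[:top_k]
```

Only the nodes returned by `route` would receive the query, instead of the whole network.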

Idea 2: Decentralized learning with model-parallelism:

Description: Investigate different aspects of model training in decentralized networks when single nodes can hold only a section of the model.

Methodology:

  • Look into privacy aspects of the matter, such as how the node holding the first section of the model perturbs the updates sent to the others in order to safeguard its data;

  • Look into fault tolerance strategies, including how nodes can reorganize themselves to address issues as they arise;

  • Consider additional aspects such as communication efficiency and load balancing in a fully decentralized environment.

Brief update after discussion with Naman:
Idea 1 is decent, low risk, low reward. It's publishable but doesn't have many avenues for future research, and it doesn't have great conference potential.
Idea 2 is ambitious, high risk, high reward. It's probably a direction with a lot of competition, but it has potential for future research and is more publishable at better conferences.

Both ideas should be pursued at the same time. If the second fails to deliver because the field is too crowded, at least I have the first one.

So the general plan with De-DSI:

  • Write the Go/Nogo doc
  • Find appropriate baseline for comparison
  • Start implementation

And the general plan with the decentralized model-parallel training:

  • Get acquainted with the literature and write it down here
  • Once I understand the current state of the art in the topic, I'll see if I can find anything interesting to do

@qstokkink
Contributor

Some reading pointers:

Semantic Overlay Networks. Arturo Crespo and Hector Garcia-Molina
https://resources.mpi-inf.mpg.de/d5/teaching/ws03_04/p2p-data/12-09-writeup2.pdf

Kademlia: A Peer-to-Peer Information System Based on the XOR Metric
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=eb51cb223fb17995085af86ac70f765077720504

Epidemic Broadcast Trees
https://www.dpss.inesc-id.pt/~ler/reports/srds07.pdf

@synctext
Member Author

@pneague

pneague commented Nov 25, 2024

Efforts after a week of trying to reproduce the baseline paper (not great looking yet).
It represents an aggregation of the last 4 graphs in the baseline paper.
X axis: Number of docs
Y axis: Accuracy at TTL = 50

Here's a sample of golden documents and their nearest query:
'impressions': 'impression',
'wwe': 'wcw',
'burgettstown': 'clarion-limestone',
'kariya': 'rucchin',
'fey': 'tina',
'damiano': 'cunego',
'mlh1': 'pms2',
'second-team': 'third-team',
'dooling': 'keyon',
'f-84f': 'thunderstreaks',
'harmonization': 'harmonisation'

The different lines represent combinations of alpha (teleportation factor) and distance from the golden document, measured as the smallest number of nodes the query would need to pass through to reach the golden document.

The implementation details on the baseline paper are fuzzy, so the results reflect me trying to infer what they did here and there (for example with regards to the 'uniform' distribution of documents to nodes: for 10 documents I have distributed 10 documents across the 4k nodes, so most nodes had no document in that experiment).

image

@synctext
Member Author

synctext commented Nov 25, 2024

  • wow, this was again a fast one!
  • Just like your 1-week of magic productivity with your first paper
  • These results do not make any sense to me
    • the input data is not human readable (e.g. red red wine versus blue beer)
    • no pattern in the alpha and distance parameters
    • Please make debug plots and trivial correct experiments!
      • dataset characteristics
      • diffusion process (each document's number versus how many of the 4k nodes know about it; sorted)
      • how many messages, iterations, matrix thingies, how many hops with certain alpha
    • ORCAS seems superior versus this GLOVE (Query,doc) artificial construct
  • Sprint for coming 2 weeks: semantic overlay using ORCAS
    • embedding of the queries
    • We need embedding for documents, but dataset includes only URLs.
      • Take top-1000 documents from ORCAS
      • this ensures we have lot of queries
      • take average embedding of the queries
    • 4k network simulation on laptop? Take budget of 1000 messages for spreading the clicklog and URL
      • 25% network coverage with random spreading
      • Build a semantic overlay
        • we do not have user profiles in ORCAS 🤔
        • How to create accurate user-profiles?
  • Fri 13th Dec: have a trivial and correct overlay experiment operational (3 datasets? Glove, Orcas,AOL4PS)
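
A quick sanity check on the 25% coverage figure above, assuming it comes from 1000 messages over 4000 nodes. If every message lands on a uniformly random node, independently of the others (an assumption, since the spreading rule is not specified here), some messages hit nodes that already know the item, so the expected coverage is 1 - (1 - 1/N)^M rather than M/N.

```python
def expected_coverage(num_nodes: int, num_messages: int) -> float:
    """Expected fraction of nodes reached by uniformly random messages.

    Each message independently picks one of num_nodes targets, so a given
    node is missed by all messages with probability (1 - 1/N) ** M.
    """
    return 1.0 - (1.0 - 1.0 / num_nodes) ** num_messages


if __name__ == "__main__":
    # 4k-node simulation with a budget of 1000 messages.
    print(f"{expected_coverage(4000, 1000):.3f}")
```

Under these assumptions the expected coverage is a little over 22%, so 25% is the upper bound reached only when no message is wasted on a node that is already covered.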

@pneague

pneague commented Dec 3, 2024

Had a discussion with Quinten and Naman and we came up with the following experiments:

  1. Construction speed of tree vs diffusion convergence speed. How fast can we construct the structure necessary for the execution of the queries? We can have a measure of convergence for the baseline (the details of which they don't specify), and we can have a measure of time for calculating the k-means in all leaf nodes for the same number of nodes.
  2. Execution speed of the queries of our method vs baseline method. Since the baseline requires a graph of connections, we can assume different types of graphs (for example a star graph) and compare the execution speed or success_chance_within_50_hops for our method vs their method. Our method would not use the graph structure in any way. Thus, we can examine in which types of graphs their method works better, in which it works worse.
    We can also vary the size of the graphs. The baseline fails for more than 3 hops almost completely so I would expect better results for our method when the graph is large.
  3. Our/baseline method with GLOVE Vs LLM embeddings. This would be useful because using an LLM takes a lot longer to get embeddings than the GLOVE model, and also has more dimensions, so calculating nearest neighbors takes slightly longer. We may want to have a graph to check how much better using an LLM is in terms of accuracy and compare the plus in accuracy with the minus in terms of tree-construction / diffusion-convergence speed.
  4. Experiment with many documents per person. When a node holds many documents, the average will not be very representative of the interests of the user. We may show how our method breaks when we consider 100 documents per person somehow. We can implement a solution to this by having a new parameter which measures how divergent a user's interests are. When the divergence factor is larger than a specified threshold, we can cluster our interests and we can duplicate ourselves for each cluster, and thus place ourselves in multiple places in the graph according to the number of clusters we have. The details of this are to be figured out.
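
One possible way to define the divergence factor from experiment 4, purely as a starting point since the details are still open: the mean cosine distance between each document embedding and the user's average embedding. The names `interest_divergence` and `should_split` and the threshold value 0.2 are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def interest_divergence(doc_vecs: Sequence[Vector]) -> float:
    """Mean cosine distance of the documents from their centroid.

    Close to 0 when all documents point the same way; grows as the
    user's documents spread over unrelated topics.
    """
    centroid = [sum(col) / len(doc_vecs) for col in zip(*doc_vecs)]
    distances = [1.0 - _cosine(vec, centroid) for vec in doc_vecs]
    return sum(distances) / len(distances)


def should_split(doc_vecs: Sequence[Vector], threshold: float = 0.2) -> bool:
    """True when the average is a poor summary and clustering is warranted."""
    return interest_divergence(doc_vecs) > threshold
```

When `should_split` returns true, the node would cluster its documents and place one virtual copy of itself per cluster, as described in experiment 4.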

Additional updates:

  • I think I have finished implementing the baseline. I get decent results in situations similar to those in their paper. I have made some inferences, as they didn't detail every part of their experiments. Nevertheless, I will produce a graph and ask all of the authors for their code again. Maybe this time I will get an answer.
  • I thought of future experiments (as they are probably out of scope for this paper):
    * Query Reformulation: Use LLMs to reformulate a query into multiple similar queries and follow each of them through the tree. Once you get a number of results from all of them, aggregate the responses.
    * Cluster Labelling: Use LLMs to label leaf nodes so the regions are more interpretable.
