
Search engine #8

Open
davidar opened this issue Aug 29, 2015 · 28 comments

@davidar
Collaborator

davidar commented Aug 29, 2015

Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:

  • the static way: each node stores their index in a trie-like structure distributed across multiple IPFS objects, in such a way that clients only need to download a small subset of the index for any given query (a rough sketch follows this list)
  • the dynamic way: queries are broadcast to the network, nodes perform the search on their local index, and provide the results back over IPFS
  • the magic way: somehow storing the index in the IPFS DHT?
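To make the static option concrete, here's a minimal sketch; it's not a proposal for the actual format, the in-memory `put`/`get` map stands in for an IPFS blockstore, and the `Qm...` strings are placeholder document references. The point is that a client only ever fetches the trie nodes along its query's path.

```js
// Trie nodes stored as separate content-addressed objects.
const crypto = require('crypto')

const store = new Map() // stand-in for an IPFS blockstore
const put = (obj) => {
  const data = JSON.stringify(obj)
  const hash = crypto.createHash('sha256').update(data).digest('hex')
  store.set(hash, data)
  return hash
}
const get = (hash) => JSON.parse(store.get(hash))

// Plain in-memory trie: term -> list of documents containing it.
function insert (node, term, docs) {
  if (term === '') { node.docs = docs; return }
  const ch = term[0]
  node.children[ch] = node.children[ch] || { children: {}, docs: [] }
  insert(node.children[ch], term.slice(1), docs)
}

// Publish the trie as one object per node, children linked by hash.
function publish (node) {
  const links = {}
  for (const [ch, child] of Object.entries(node.children)) {
    links[ch] = publish(child)
  }
  return put({ docs: node.docs, children: links })
}

// Lookup fetches only the nodes along the term's path.
function lookup (rootHash, term) {
  let node = get(rootHash)
  for (const ch of term) {
    if (!node.children[ch]) return []
    node = get(node.children[ch])
  }
  return node.docs
}

const root = { children: {}, docs: [] }
insert(root, 'ipfs', ['QmDoc1'])
insert(root, 'index', ['QmDoc2'])
console.log(lookup(publish(root), 'ipfs')) // => ['QmDoc1']
```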

Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?

@rht

rht commented Aug 30, 2015

+1 for static indexes. Indexes (and any objects computed from the archives) should be static and versioned in the merkledag, just like the archives themselves.

dependency: file metadata ipfs/ipfs#36

@davidar
Collaborator Author

davidar commented Aug 30, 2015

@rht I'm leaning towards the first one too. I added the latter two mainly because I believe YaCy works that way.

The two main problems that (I think) need to be solved for static indexes are discovering which nodes provide them (ipfs/notes#15) and aggregating them all. Index aggregation could be done either (a) entirely client-side (easy to implement but inefficient, especially once the network gets large), or (b) through cooperation between the index nodes (more difficult to implement, but also more efficient for clients).

@jbenet suggested that CRDTs could be used to achieve (b), but I'm not familiar enough with them to make any intelligent comments :). I do know that trie merging is commutative (unlike merging some other search trees or hash tables), so it at least fits within a CRDT's assumptions. The main thing to keep in mind is that these indexes are likely to become quite large, so you'd want to coordinate the index merging without nodes having to replicate much of each other's indexes (I'm not sure how this would work within a CRDT). One possible way of doing this is to partition prefixes across the nodes (since you only need to look at shared prefixes when performing the merge), but I'm not sure how this would work within IPFS.
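To illustrate the commutativity point, here's a minimal merge sketch (plain in-memory objects, nothing IPFS-specific, and all names are mine): doc sets are unioned and children merged key-by-key, so merge(a, b) and merge(b, a) yield the same trie, which is exactly the property a CRDT-style merge needs.

```js
// Minimal commutative trie merge: nodes are { docs: [...], children: { ch: node } }.
function mergeTries (a, b) {
  if (!a) return b // one side has nothing under this prefix
  if (!b) return a
  const docs = [...new Set([...a.docs, ...b.docs])].sort() // union of postings
  const children = {}
  const keys = [...new Set([...Object.keys(a.children), ...Object.keys(b.children)])].sort()
  for (const ch of keys) {
    children[ch] = mergeTries(a.children[ch], b.children[ch])
  }
  return { docs, children }
}
```

Partitioning prefixes across nodes would then amount to each node owning a subset of the top-level `children` keys and merging only those subtries.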

@davidar
Collaborator Author

davidar commented Sep 1, 2015

Looks like @krl is working on search functionality for starlog: https://github.com/krl/aolog (IRC)

@nbingham1

If you go with one of the first two options, you lose flexibility, mainly because the IPFS executable determines what the index should look like. If you go with option 3, then you can have any number of indices with different structures storing different things. I don't think the search functionality should be integrated into the IPFS structure itself; it should instead be built on top as a web service. The real question to ask is not what an index should look like, but what a database built on top of IPFS should look like.

https://github.com/ipfs/ipfs/issues/82

@davidar
Collaborator Author

davidar commented Sep 4, 2015

@nbingham1 None of them necessarily needs to be integrated into ipfs itself; I'm just assuming nodes have some kind of indexing service running, which could be an entirely separate executable (and you could have multiple different ones running if you wanted). In terms of what a database on top of ipfs would look like, I'm in favor of one big JSON object transparently distributed over ipfs. You could then easily build radix trees and such on top of that.
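As a toy illustration of the "one big JSON object" idea (the layout and names here are made up): the index is just nested objects, and a query is a path walk, much like /ipfs/<hash>/a/b/c walks a merkledag. With IPLD, each level could be a separate object fetched lazily.

```js
// A radix-tree-ish index as plain nested JSON.
const index = {
  i: { p: { f: { s: { docs: ['QmDoc1'] } } } },
  w: { e: { b: { docs: ['QmDoc2'] } } }
}

// Walk the object one path segment at a time.
const resolve = (obj, path) =>
  path.split('/').reduce((node, seg) => node && node[seg], obj)

console.log(resolve(index, 'i/p/f/s').docs) // => ['QmDoc1']
```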

@jbenet
Contributor

jbenet commented Sep 4, 2015

@nbingham1 @davidar indeed. very much indeed.

@nbingham1

Ah, OK, that makes much more sense. So each node runs the indexing service and updates the index whenever that node adds a file, but the index is still stored in a database format on the IPFS network. I was misinterpreting the options (sorry! I just found out about IPFS yesterday!).

@davidar
Collaborator Author

davidar commented Sep 4, 2015

@nbingham1 No need to apologise, clarifying questions are good :)

@BrendanBenshoof

Merging the tries via gossip is really viable: essentially, each node just shares the "head node" of the trie it stores on ipfs with its peers. Each peer performs a recursive merge of the index (which normally costs on the order of the smaller index, since you can reuse unaltered entries) and then propagates the new index.
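A sketch of the reuse argument, assuming subtries are content-addressed (JSON.stringify stands in for a real hash here, and the node shape matches the merge sketched earlier): the recursive merge stops wherever both sides share a hash, so unaltered entries are reused as-is and the work is roughly proportional to the differing (typically smaller) part of the index.

```js
// Recursive merge with a hash short-circuit for shared subtries.
const hashOf = (node) => JSON.stringify(node) // stand-in for a content hash

function mergeHeads (a, b) {
  if (!a) return b
  if (!b) return a
  if (hashOf(a) === hashOf(b)) return a // unaltered entry: reuse, no recursion
  const docs = [...new Set([...a.docs, ...b.docs])].sort()
  const children = {}
  const keys = [...new Set([...Object.keys(a.children), ...Object.keys(b.children)])].sort()
  for (const ch of keys) children[ch] = mergeHeads(a.children[ch], b.children[ch])
  return { docs, children }
}
```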

Then we can implement search in js by pointing to that ipfs node's search index at /ipns/local/searchindexroot or somesuch.

Then anybody who wants can implement a search engine on the client side with whatever UI or features they want.

This falls into the classic Byzantine generals problem of people messing with the index. A proof-of-accuracy-based blockchain actually looks like a decent solution to this problem.

@davidar
Collaborator Author

davidar commented Sep 13, 2015

@BrendanBenshoof cool, sounds good

Do we actually need a blockchain though? I was thinking all of the (non-malicious) nodes would agree on the merged root node after the gossiping finishes, after which you should just be able to do a simple popular vote?

@BrendanBenshoof

We could evolve it into a web of trust, but straight democracy is highly vulnerable to Sybil attacks.

We just wandered into the big open issue in distributed systems: how to reach consensus in the presence of attackers. I'm not a fan of proof-of-work blockchains as a general solution to this problem. I like the idea of a blockchain here because there is a clear "proof of truthiness": nodes can sanity-check a "sub-index" against the contents of the files it indexes to ensure it is at least accurate. Once all the added "sub-indexes" have been inspected, you can also confirm the deterministic merging of them into the "primary index" and agree on the new "root" hash.
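A sketch of what that sanity check could look like, under the assumption that an accurate sub-index only lists a document for a term if the term actually occurs in it; `fetchText` is a hypothetical stand-in for fetching file contents (e.g. via `ipfs cat`):

```js
// Accept a sub-index only if every claimed (term, document) pair checks out.
async function sanityCheck (subIndex, fetchText) {
  for (const [term, docRefs] of Object.entries(subIndex)) {
    for (const ref of docRefs) {
      const text = (await fetchText(ref)).toLowerCase()
      if (!text.includes(term.toLowerCase())) return false // inaccurate entry
    }
  }
  return true
}
```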

My only worry is that it is too wasteful in processing power and too complex. It might be more efficient just to deal with crap in the database.

A block would look like:

```
address of last block
new source id | index_head
new source id | index_head
...
new root id
```

Basic dissemination would look like:

  • I get a new block:
    • I sanity check the new indexes
    • I merge the indexes into the "master index" and confirm the new head node
  • I propagate the block if it passes muster
  • I listen for newly advertised indexes
    • I sanity check that index
    • I propagate the index
    • If I have more than k indexes, publish a new block

This also has the side effect of aggressively caching newly indexed files by all the search engine participants. It also feeds into my preference that indexing be an active choice by a content producer rather than a spider.
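For concreteness, a sketch of the block format and the two checks above; all names and the hashing are illustrative, `sanityCheck` is the per-sub-index accuracy check, and `merge` is the deterministic index merge:

```js
const crypto = require('crypto')

// sources: [{ source: <node id>, head: <index head hash> }, ...]
function makeBlock (prevBlockId, sources, newRootId) {
  const body = { prev: prevBlockId, sources, root: newRootId }
  const id = crypto.createHash('sha256').update(JSON.stringify(body)).digest('hex')
  return { id, ...body }
}

async function validateBlock (block, currentRoot, sanityCheck, merge) {
  // 1. sanity check each newly added sub-index
  for (const s of block.sources) {
    if (!(await sanityCheck(s.head))) return false
  }
  // 2. confirm that the deterministic merge yields the advertised root
  let root = currentRoot
  for (const s of block.sources) root = merge(root, s.head)
  return root === block.root // only propagate the block if this holds
}
```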

@davidar
Collaborator Author

davidar commented Sep 14, 2015

@BrendanBenshoof OK, I think I see what you mean, but distributed systems and their associated attacks aren't really my area, and it sounds like you're more familiar with them than I am.

Just to clarify my thoughts (though, as you say, I haven't thought about possible attacks on this): suppose you have two nodes, and they swap information in order to agree on the merge of both their indexes. They then both sign this hash, so that everyone else can verify that they're satisfied this merged index contains their own subindexes. You then do a similar thing with merged indexes, and so on, so that you end up with a global root signed by everyone with a stake in it.
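A sketch of the co-signing step using Node's built-in Ed25519 support (the keys and the root hash are purely illustrative):

```js
const { generateKeyPairSync, sign, verify } = require('crypto')

const nodeA = generateKeyPairSync('ed25519')
const nodeB = generateKeyPairSync('ed25519')

const mergedRoot = Buffer.from('QmMergedRootHash') // agreed merge of both indexes

// Both nodes sign the merged root to attest "this contains my subindex".
const sigA = sign(null, mergedRoot, nodeA.privateKey)
const sigB = sign(null, mergedRoot, nodeB.privateKey)

// Anyone can later check that a given provider endorsed the merged index.
console.log(verify(null, mergedRoot, nodeA.publicKey, sigA)) // => true
console.log(verify(null, mergedRoot, nodeB.publicKey, sigB)) // => true
```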

Thinking about it some more, I agree popular vote wouldn't work in general; I guess what I meant was a vote from nodes that you trust (e.g. the bootstrap nodes), which you assume aren't malicious, although some of them might make mistakes.

But like I said, I have no idea how well these things will work in practice ☺

@jbenet @whyrusleeping thoughts?

@cryptix

cryptix commented Nov 5, 2015

Maybe cothorities are relevant here? Warning: I came across this recently and am still trying to make sense of it.

To summarize from the slides: collective authorities want to make tamper-evident public logs while solving the forking and freshness problems, without falling back on a proof-of-work scheme, the risk of temporary forks, or a 51% attack vulnerability.

I imagine IPFS could offer this, and individual projects could use it to sign their latest versions and indexes?

@rht

rht commented Nov 7, 2015

@cryptix I haven't read the paper fully. Regardless, from its scope and aim, it could be used in two ways:

  1. Indirectly, only to keep track of whitelisted index providers. Providers are certified in the same way as HTTP servers. One issue: what if one requests content from a provider (which returns a hash), along with the content's search index, but that provider is not in the whitelist log? It looks like the log has to be updated manually and witnessed by the cosigners. In this case, maybe the content credential (since one has to trust the provider to supply the correct hash) can be reused for the index.
  2. Directly (provider-independent), where every new index is verified by all cosigners, provided that there is a cheap way to verify an index without recreating it from scratch. One way is to index a few search terms and check them against the full index (much like a Rabin fingerprint); a sketch follows this list. However, this is not as complete as PKI or validating the proof of a theorem.
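Here's a rough sketch of that spot check (names are illustrative): the cosigner indexes just a few sampled terms itself and compares the postings against the provider's full index, instead of rebuilding the whole thing.

```js
// fullIndex: term -> list of docRefs; docs: docRef -> text.
function spotCheck (fullIndex, docs, sampleTerms) {
  for (const term of sampleTerms) {
    const expected = Object.keys(docs)
      .filter(ref => docs[ref].toLowerCase().includes(term.toLowerCase()))
      .sort()
    const claimed = (fullIndex[term] || []).slice().sort()
    if (expected.join(',') !== claimed.join(',')) return false // postings disagree
  }
  return true
}
```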

Also, since it is not crucial to get the timestamps of indexes (of immutable content) in order, I don't think a full-blown blockchain (proof-of-work or not) is necessary. There should be a much faster way than method 2 above.

An alternative to fingerprinting the index is to verify the code used to build it, provided that the code can be verified to have run in a sandbox with known parameters.

(Worth a look: SCP, which uses quorum slices, enabling more organic growth of the set of signing nodes.)

@rht

rht commented Nov 7, 2015

(OK, the fingerprint check is basically @BrendanBenshoof's "proof of truthiness", whose drawbacks have been mentioned. I favor the weak-ordering method described in the CRDT discussion, if there is a protocol for this that has been fleshed out.)

@davidar
Collaborator Author

davidar commented Nov 7, 2015

I think it would make sense for index providers to sign the aggregated index once they've verified that it contains their own index. That way, "verification" is simply checking that it's been signed by the provider of the part of the index you're interested in.

@doesntgolf

I made a prototype search tool here (also available here if it's unavailable on ipfs). The search index is built by scraping JSON search results from a public searx instance (which itself aggregates results from the major search providers). This means the only things being searched are the page title, URL, and a snippet of the page's text. It also means the search index only has about 1100 items. The client-side search is done with lunr.js. (search source, scraper source)
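Roughly, the lunr.js setup looks like this (a simplified sketch, not the actual source; requires `npm install lunr`, and the documents are made up):

```js
const lunr = require('lunr')

const pages = [
  { url: '/ipfs/QmPage1', title: 'IPFS introduction', snippet: 'the content-addressed web' },
  { url: '/ipfs/QmPage2', title: 'Search engines', snippet: 'indexing and ranking' }
]

// Build the index over title/URL/snippet, keyed by URL.
const idx = lunr(function () {
  this.ref('url')
  this.field('title')
  this.field('snippet')
  pages.forEach(p => this.add(p))
})

console.log(idx.search('ipfs')) // matches are returned by their 'url' ref
```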

It's nothing impressive but was fun to hack together!

@ghost

ghost commented Dec 13, 2015

@doesntgolf Very nice to see progress here ;)

@dokterbob

I've just launched an early alpha of a proper search engine, based on Elasticsearch: http://ipfs-search.com

@victorb
Contributor

victorb commented Nov 8, 2016

@dokterbob thanks for sharing, looks very cool :) Maybe it's something you can submit a PR to add to https://github.com/ipfs/awesome-ipfs?

@rcurrie

rcurrie commented Jan 7, 2017

On a related vertical note: for the GA4GH Cancer Gene Trust, an IPFS-based distributed genomic store, I've implemented an Elasticsearch-based crawler with a web UI showing the network and allowing searching by gene and clinical features.

@madorian

madorian commented Oct 9, 2017

Is nobody working on a distributed search engine? Something where searching costs money.

@Kubuxu
Contributor

Kubuxu commented Oct 9, 2017

@magik6k was experimenting with pre-computed sharded search for Wikipedia dumps. Magik, any input on this?

@magik6k

magik6k commented Oct 9, 2017

There is https://github.com/magik6k/distributed-wiki-search. The code there isn't great, though it should give some idea of how it might be implemented.
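Roughly, the sharding idea looks like this (a simplified sketch, not the code from that repo): terms are hashed into a fixed number of shards, each shard is its own object under the index root, and a query fetches only one shard. `fetchShard` is a hypothetical stand-in for resolving something like /ipfs/<indexRoot>/shards/<n>.

```js
const crypto = require('crypto')
const NUM_SHARDS = 256

// Pick a shard from the first byte of the term's hash.
const shardOf = (term) =>
  crypto.createHash('sha256').update(term).digest()[0] % NUM_SHARDS

// shard: term -> list of matching pages; one small object, not the whole index.
async function search (term, fetchShard) {
  const shard = await fetchShard(shardOf(term))
  return shard[term] || []
}
```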

@garyee

garyee commented Jan 12, 2018

I think this is a critical point if there is to be an HTML page-web (a hyperlinked information space, i.e. the whole of the HTML-based web out there in the HTTP world) on IPFS.
The point where the internet as we have it today really took off and became a thing for the average person, and not just for nerds, was (ta-da!) when search engines were invented. So without the means to search through the swarm for information using simple words, IPFS will stay a distributed file system for nerds.
Just remember the effect when you had just discovered IPFS, set it up, ran your daemon, discovered the swarm ... and ... and ... nothing.
In order to find some information or file, you had to refer to the old web or some other kind of index.

So I don't know if there is going to be a web layer on top of IPFS, but this is going to be a tough nut to crack...
I tried "Google Scholar"-ing it and found this:
http://nsfcac.rutgers.edu/TASSL/Papers/squid-hpdc-03.pdf ... which gives an idea of the complexity.

I must admit that I discovered IPFS fairly recently, but I have not found searching acknowledged anywhere as a critical point in the plans to build web 3.0, which it is...

@RangerMauve

Having a decentralized search engine is definitely a difficult challenge, and I think this will be the job of a company rather than the IPFS core team.

Google search already works fine with IPFS through gateways, so centralized search engines should be able to function just fine while a decentralized solution is developed by the community.

@chainify-io

As mentioned by dokterbob above, we are currently developing an IPFS search engine that indexes within IPFS rather than searching through a gateway. Our search engine listens to DHT chatter to index content. We just revived it, and it is in an early (let's say) PoC state, but in general it works. There's a lot to do, though.
We think the world needs an (as much as possible) decentralized search engine for IPFS. The major challenge we currently face is getting people on board to help out.
So, if there is someone who wants to join us in building an open, decentralized search engine for the web 3.0 -- just raise your hand! :-)

@RangerMauve

@chainify-io Would you be interested in another collaborator in your efforts? I've been writing down ideas for distributed search of web pages, and it'd be cool to discuss them and see if we can work together.
