
Search engine #8

Open
davidar opened this issue Aug 29, 2015 · 28 comments

@davidar
Collaborator

davidar commented Aug 29, 2015

Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:

  • the static way: each node stores their index in a trie-like structure distributed across multiple IPFS objects, in such a way that clients only need to download a small subset of the index for any given query (a rough sketch follows this list)
  • the dynamic way: queries are broadcast to the network, nodes perform the search on their local index, and provide the results back over IPFS
  • the magic way: somehow storing the index in the IPFS DHT?
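To make the static option concrete, here's a minimal sketch; it's not a proposal for the actual format, the in-memory `put`/`get` map stands in for an IPFS blockstore, and the `Qm...` strings are placeholder document references. The point is that a client only ever fetches the trie nodes along its query's path.

```js
// Trie nodes stored as separate content-addressed objects.
const crypto = require('crypto')

const store = new Map() // stand-in for an IPFS blockstore
const put = (obj) => {
  const data = JSON.stringify(obj)
  const hash = crypto.createHash('sha256').update(data).digest('hex')
  store.set(hash, data)
  return hash
}
const get = (hash) => JSON.parse(store.get(hash))

// Plain in-memory trie: term -> list of documents containing it.
function insert (node, term, docs) {
  if (term === '') { node.docs = docs; return }
  const ch = term[0]
  node.children[ch] = node.children[ch] || { children: {}, docs: [] }
  insert(node.children[ch], term.slice(1), docs)
}

// Publish the trie as one object per node, children linked by hash.
function publish (node) {
  const links = {}
  for (const [ch, child] of Object.entries(node.children)) {
    links[ch] = publish(child)
  }
  return put({ docs: node.docs, children: links })
}

// Lookup fetches only the nodes along the term's path.
function lookup (rootHash, term) {
  let node = get(rootHash)
  for (const ch of term) {
    if (!node.children[ch]) return []
    node = get(node.children[ch])
  }
  return node.docs
}

const root = { children: {}, docs: [] }
insert(root, 'ipfs', ['QmDoc1'])
insert(root, 'index', ['QmDoc2'])
console.log(lookup(publish(root), 'ipfs')) // => ['QmDoc1']
```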

Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?

@rht

rht commented Aug 30, 2015

+1 for static indexes. Indexes (and any objects computed from the archives) should be static and versioned in the merkledag, just like the archives themselves.

dependency: file metadata ipfs/ipfs#36

@davidar
Collaborator Author

davidar commented Aug 30, 2015

@rht I'm leaning towards the first one too. I added the latter two mainly because I believe YaCy works that way.

The two main problems that (I think) need to be solved for static indexes are discovering which nodes provide them (ipfs/notes#15) and aggregating them all. Index aggregation could be done either (a) entirely client-side (easy to implement but inefficient, especially once the network gets large), or (b) through cooperation between the index nodes (more difficult to implement, but also more efficient for clients).

@jbenet suggested that CRDTs could be used to achieve (b), but I'm not familiar enough with them to make any intelligent comments :). I do know that trie merging is commutative (unlike merging some other search trees or hash tables), so it at least fits within a CRDT's assumptions. The main thing to keep in mind is that these indexes are likely to become quite large, so you'd want to coordinate the index merging without nodes having to replicate much of each other's indexes (I'm not sure how this would work within a CRDT). One possible way of doing this is to partition prefixes across the nodes (since you only need to look at shared prefixes when performing the merge), but I'm not sure how this would work within IPFS.
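To illustrate the commutativity point, here's a minimal merge sketch (plain in-memory objects, nothing IPFS-specific, and all names are mine): doc sets are unioned and children merged key-by-key, so merge(a, b) and merge(b, a) yield the same trie, which is exactly the property a CRDT-style merge needs.

```js
// Minimal commutative trie merge: nodes are { docs: [...], children: { ch: node } }.
function mergeTries (a, b) {
  if (!a) return b // one side has nothing under this prefix
  if (!b) return a
  const docs = [...new Set([...a.docs, ...b.docs])].sort() // union of postings
  const children = {}
  const keys = [...new Set([...Object.keys(a.children), ...Object.keys(b.children)])].sort()
  for (const ch of keys) {
    children[ch] = mergeTries(a.children[ch], b.children[ch])
  }
  return { docs, children }
}
```

Partitioning prefixes across nodes would then amount to each node owning a subset of the top-level `children` keys and merging only those subtries.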

@davidar
Collaborator Author

davidar commented Sep 1, 2015

Looks like @krl is working on search functionality for starlog: https://github.com/krl/aolog (IRC)

@nbingham1

If you go with one of the first two options, you lose flexibility, mainly because the IPFS executable determines what the index should look like. If you go with option 3, then you can have any number of indices with different structures storing different things. I don't think the search functionality should be integrated into the IPFS structure itself; it should instead be built on top as a web service. The real question to ask is not what an index should look like, but what a database built on top of IPFS should look like.

https://github.com/ipfs/ipfs/issues/82

@davidar
Collaborator Author

davidar commented Sep 4, 2015

@nbingham1 None of them necessarily needs to be integrated into ipfs itself; I'm just assuming nodes have some kind of indexing service running, which could be an entirely separate executable (and you could have multiple different ones running if you wanted). In terms of what a database on top of ipfs would look like, I'm in favor of one big JSON object transparently distributed over ipfs. You could then easily build radix trees and such on top of that.
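As a toy illustration of the "one big JSON object" idea (the layout and names here are made up): the index is just nested objects, and a query is a path walk, much like /ipfs/<hash>/a/b/c walks a merkledag. With IPLD, each level could be a separate object fetched lazily.

```js
// A radix-tree-ish index as plain nested JSON.
const index = {
  i: { p: { f: { s: { docs: ['QmDoc1'] } } } },
  w: { e: { b: { docs: ['QmDoc2'] } } }
}

// Walk the object one path segment at a time.
const resolve = (obj, path) =>
  path.split('/').reduce((node, seg) => node && node[seg], obj)

console.log(resolve(index, 'i/p/f/s').docs) // => ['QmDoc1']
```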

@jbenet
Contributor

jbenet commented Sep 4, 2015

@nbingham1 @davidar indeed. very much indeed.

@nbingham1

Ah, OK, that makes much more sense. So each node runs the indexing service and updates the index whenever that node adds a file, but the index is still stored in a database format on the IPFS network. I was misinterpreting the options (sorry! I just found out about IPFS yesterday!).

@davidar
Collaborator Author

davidar commented Sep 4, 2015

@nbingham1 No need to apologise, clarifying questions are good :)

@BrendanBenshoof

Merging the tries via gossip is really viable: essentially, each node just shares the "head node" of the trie it stores on ipfs with its peers. Each peer performs a recursive merge of the index (which normally costs on the order of the smaller index, since you can reuse unaltered entries) and then propagates the new index.
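A sketch of the reuse argument, assuming subtries are content-addressed (JSON.stringify stands in for a real hash here, and the node shape matches the merge sketched earlier): the recursive merge stops wherever both sides share a hash, so unaltered entries are reused as-is and the work is roughly proportional to the differing (typically smaller) part of the index.

```js
// Recursive merge with a hash short-circuit for shared subtries.
const hashOf = (node) => JSON.stringify(node) // stand-in for a content hash

function mergeHeads (a, b) {
  if (!a) return b
  if (!b) return a
  if (hashOf(a) === hashOf(b)) return a // unaltered entry: reuse, no recursion
  const docs = [...new Set([...a.docs, ...b.docs])].sort()
  const children = {}
  const keys = [...new Set([...Object.keys(a.children), ...Object.keys(b.children)])].sort()
  for (const ch of keys) children[ch] = mergeHeads(a.children[ch], b.children[ch])
  return { docs, children }
}
```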

Then we can implement search in js by pointing to that ipfs node's search index at /ipns/local/searchindexroot or somesuch.

Then anybody who wants can implement a search engine on the client side with whatever UI or features they want.

This falls into the classic Byzantine generals problem of people messing with the index. A proof-of-accuracy-based blockchain actually looks like a decent solution to this problem.

@davidar
Collaborator Author

davidar commented Sep 13, 2015

@BrendanBenshoof cool, sounds good

Do we actually need a blockchain though? I was thinking all of the (non-malicious) nodes would agree on the merged root node after the gossiping finishes, after which you should just be able to do a simple popular vote?

@BrendanBenshoof

We could evolve it into a web of trust, but straight democracy is highly vulnerable to Sybil attacks.

We just wandered into the big open issue in distributed systems: how to reach consensus in the presence of attackers. I'm not a fan of proof-of-work blockchains as a general solution to this problem. I like the idea of a blockchain here because there is a clear "proof of truthiness": nodes can sanity-check a "sub-index" against the contents of the files it indexes to ensure it is at least accurate. Once all the added "sub-indexes" have been inspected, you can also confirm the deterministic merging of them into the "primary index" and agree on the new "root" hash.
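A sketch of what that sanity check could look like, under the assumption that an accurate sub-index only lists a document for a term if the term actually occurs in it; `fetchText` is a hypothetical stand-in for fetching file contents (e.g. via `ipfs cat`):

```js
// Accept a sub-index only if every claimed (term, document) pair checks out.
async function sanityCheck (subIndex, fetchText) {
  for (const [term, docRefs] of Object.entries(subIndex)) {
    for (const ref of docRefs) {
      const text = (await fetchText(ref)).toLowerCase()
      if (!text.includes(term.toLowerCase())) return false // inaccurate entry
    }
  }
  return true
}
```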

My only worry is that it is too wasteful in processing power and too complex. It might be more efficient just to deal with crap in the database.

A block would look like:

```
address of last block
new source id | index_head
new source id | index_head
...
new root id
```

Basic dissemination would look like:

  • I get a new block:
    • I sanity check the new indexes
    • I merge the indexes into the "master index" and confirm the new head node
  • I propagate the block if it passes muster
  • I listen for newly advertised indexes
    • I sanity check that index
    • I propagate the index
    • If I have more than k indexes, publish a new block

This also has the side effect of aggressively caching newly indexed files by all the search engine participants. It also feeds into my preference that indexing be an active choice by a content producer rather than a spider.
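For concreteness, a sketch of the block format and the two checks above; all names and the hashing are illustrative, `sanityCheck` is the per-sub-index accuracy check, and `merge` is the deterministic index merge:

```js
const crypto = require('crypto')

// sources: [{ source: <node id>, head: <index head hash> }, ...]
function makeBlock (prevBlockId, sources, newRootId) {
  const body = { prev: prevBlockId, sources, root: newRootId }
  const id = crypto.createHash('sha256').update(JSON.stringify(body)).digest('hex')
  return { id, ...body }
}

async function validateBlock (block, currentRoot, sanityCheck, merge) {
  // 1. sanity check each newly added sub-index
  for (const s of block.sources) {
    if (!(await sanityCheck(s.head))) return false
  }
  // 2. confirm that the deterministic merge yields the advertised root
  let root = currentRoot
  for (const s of block.sources) root = merge(root, s.head)
  return root === block.root // only propagate the block if this holds
}
```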

@davidar
Collaborator Author

davidar commented Sep 14, 2015

@BrendanBenshoof OK, I think I see what you mean, but distributed systems and their associated attacks aren't really my area, and it sounds like you're more familiar with them than I am.

Just to clarify my thoughts (though, as you say, I haven't thought about possible attacks on this): suppose you have two nodes, and they swap information in order to agree on the merge of both their indexes. They then both sign this hash, so that everyone else can verify that they're satisfied this merged index contains their own subindexes. You then do a similar thing with merged indexes, and so on, so that you end up with a global root signed by everyone with a stake in it.
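A sketch of the co-signing step using Node's built-in Ed25519 support (the keys and the root hash are purely illustrative):

```js
const { generateKeyPairSync, sign, verify } = require('crypto')

const nodeA = generateKeyPairSync('ed25519')
const nodeB = generateKeyPairSync('ed25519')

const mergedRoot = Buffer.from('QmMergedRootHash') // agreed merge of both indexes

// Both nodes sign the merged root to attest "this contains my subindex".
const sigA = sign(null, mergedRoot, nodeA.privateKey)
const sigB = sign(null, mergedRoot, nodeB.privateKey)

// Anyone can later check that a given provider endorsed the merged index.
console.log(verify(null, mergedRoot, nodeA.publicKey, sigA)) // => true
console.log(verify(null, mergedRoot, nodeB.publicKey, sigB)) // => true
```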

Thinking about it some more, I agree popular vote wouldn't work in general; I guess what I meant was a vote from nodes that you trust (e.g. the bootstrap nodes), which you assume aren't malicious, although some of them might make mistakes.

But like I said, I have no idea how well these things will work in practice ☺

@jbenet @whyrusleeping thoughts?

@cryptix

cryptix commented Nov 5, 2015

Maybe cothorities are relevant here? Warning: I came across this recently and am still trying to make sense of it.

To summarize from the slides: collective authorities want to make tamper-evident public logs while solving the forking and freshness problems, without falling back on a proof-of-work scheme, the risk of temporary forks, or a 51% attack vulnerability.

I imagine IPFS could offer this, and individual projects could use it to sign their latest versions and indexes?

@rht

rht commented Nov 7, 2015

@cryptix I haven't read the paper fully. Regardless, from its scope and aim, it could be used in two ways:

  1. Indirectly, only to keep track of whitelisted index providers. Providers are certified in the same way as HTTP servers. One issue: what if one requests content from a provider (which returns a hash), along with the content's search index, but that provider is not in the whitelist log? It looks like the log has to be updated manually and witnessed by the cosigners. In this case, maybe the content credential (since one has to trust the provider to supply the correct hash) can be reused for the index.
  2. Directly (provider-independent), where every new index is verified by all cosigners, provided that there is a cheap way to verify an index without recreating it from scratch. One way is to index a few search terms and check them against the full index (much like a Rabin fingerprint); a sketch follows this list. However, this is not as complete as PKI or validating the proof of a theorem.
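Here's a rough sketch of that spot check (names are illustrative): the cosigner indexes just a few sampled terms itself and compares the postings against the provider's full index, instead of rebuilding the whole thing.

```js
// fullIndex: term -> list of docRefs; docs: docRef -> text.
function spotCheck (fullIndex, docs, sampleTerms) {
  for (const term of sampleTerms) {
    const expected = Object.keys(docs)
      .filter(ref => docs[ref].toLowerCase().includes(term.toLowerCase()))
      .sort()
    const claimed = (fullIndex[term] || []).slice().sort()
    if (expected.join(',') !== claimed.join(',')) return false // postings disagree
  }
  return true
}
```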

Also, since it is not crucial to get the timestamps of indexes (of immutable content) in order, I don't think a full-blown blockchain (proof-of-work or not) is necessary. There should be a much faster way than method 2 above.

An alternative to fingerprinting the index is to verify the code used to build it, provided that the code can be verified to have run in a sandbox with known parameters.

(Worth a look: SCP, which uses quorum slices, enabling more organic growth of the set of signing nodes.)

@rht

rht commented Nov 7, 2015

(OK, the fingerprint check is basically @BrendanBenshoof's "proof of truthiness", whose drawbacks have been mentioned. I favor the weak-ordering method described in the CRDT discussion, if there is a protocol for this that has been fleshed out.)

@davidar
Collaborator Author

davidar commented Nov 7, 2015

I think it would make sense for index providers to sign the aggregated index once they've verified that it contains their own index. That way, "verification" is simply checking that it's been signed by the provider of the part of the index you're interested in.

@doesntgolf

I made a prototype search tool here (also available here if it's unavailable on ipfs). The search index is built by scraping JSON search results from a public searx instance (which itself aggregates results from the major search providers). This means the only things being searched are the page title, URL, and a snippet of the page's text. It also means the search index only has about 1100 items. The client-side search is done with lunr.js. (search source, scraper source)
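Roughly, the lunr.js setup looks like this (a simplified sketch, not the actual source; requires `npm install lunr`, and the documents are made up):

```js
const lunr = require('lunr')

const pages = [
  { url: '/ipfs/QmPage1', title: 'IPFS introduction', snippet: 'the content-addressed web' },
  { url: '/ipfs/QmPage2', title: 'Search engines', snippet: 'indexing and ranking' }
]

// Build the index over title/URL/snippet, keyed by URL.
const idx = lunr(function () {
  this.ref('url')
  this.field('title')
  this.field('snippet')
  pages.forEach(p => this.add(p))
})

console.log(idx.search('ipfs')) // matches are returned by their 'url' ref
```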

It's nothing impressive but was fun to hack together!

@ghost

ghost commented Dec 13, 2015

@doesntgolf Very nice to see progress here ;)

@dokterbob

I've just launched an early alpha of a proper search engine, based on Elasticsearch: http://ipfs-search.com

@victorb
Contributor

victorb commented Nov 8, 2016

@dokterbob thanks for sharing, looks very cool :) Maybe it's something you can submit a PR to add to https://github.com/ipfs/awesome-ipfs?

@rcurrie

rcurrie commented Jan 7, 2017

On a related vertical note: for the GA4GH Cancer Gene Trust, an IPFS-based distributed genomic store, I've implemented an Elasticsearch-based crawler with a web UI showing the network and allowing searching by gene and clinical features.

@madorian

madorian commented Oct 9, 2017

Is nobody working on a distributed search engine? Something where searching costs money.

@Kubuxu
Contributor

Kubuxu commented Oct 9, 2017

@magik6k was experimenting with pre-computed sharded search for Wikipedia dumps. Magik, any input on this?

@magik6k

magik6k commented Oct 9, 2017

There is https://github.com/magik6k/distributed-wiki-search. The code there isn't great, though it should give some idea of how it might be implemented.
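Roughly, the sharding idea looks like this (a simplified sketch, not the code from that repo): terms are hashed into a fixed number of shards, each shard is its own object under the index root, and a query fetches only one shard. `fetchShard` is a hypothetical stand-in for resolving something like /ipfs/<indexRoot>/shards/<n>.

```js
const crypto = require('crypto')
const NUM_SHARDS = 256

// Pick a shard from the first byte of the term's hash.
const shardOf = (term) =>
  crypto.createHash('sha256').update(term).digest()[0] % NUM_SHARDS

// shard: term -> list of matching pages; one small object, not the whole index.
async function search (term, fetchShard) {
  const shard = await fetchShard(shardOf(term))
  return shard[term] || []
}
```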

@garyee

garyee commented Jan 12, 2018

I think this is a critical point if there is to be an HTML page-web (a hyperlinked information space, i.e. the whole of the HTML-based web out there in the HTTP world) on IPFS.
The point where the internet as we have it today really took off and became a thing for the average person, and not just for nerds, was (ta-da!) when search engines were invented. So without the means to search through the swarm for information using simple words, IPFS will stay a distributed file system for nerds.
Just remember the effect when you had just discovered IPFS, set it up, ran your daemon, discovered the swarm ... and ... and ... nothing.
In order to find some information or file, you had to refer to the old web or some other kind of index.

So I don't know if there is going to be a web layer on top of IPFS, but this is going to be a tough nut to crack...
I tried "Google Scholar"-ing it and found this:
http://nsfcac.rutgers.edu/TASSL/Papers/squid-hpdc-03.pdf ... which gives an idea of the complexity.

I must admit that I discovered IPFS fairly recently, but I have not found searching acknowledged anywhere as a critical point in the plans to build web 3.0, which it is...

@RangerMauve

Having a decentralized search engine is definitely a difficult challenge, and I think this will be the job of a company rather than the IPFS core team.

Google search already works fine with IPFS through gateways, so centralized search engines should be able to function just fine while a decentralized solution is developed by the community.

@chainify-io

As mentioned by dokterbob above, we are currently developing an IPFS search engine that indexes within IPFS rather than searching through a gateway. Our search engine listens to DHT chatter to index content. We just revived it, and it is in an early (let's say) PoC state, but in general it works. There's a lot to do, though.
We think the world needs an (as much as possible) decentralized search engine for IPFS. The major challenge we currently face is getting people on board to help out.
So, if there is someone who wants to join us in building an open, decentralized search engine for the web 3.0 -- just raise your hand! :-)

@RangerMauve

@chainify-io Would you be interested in another collaborator in your efforts? I've been writing down ideas for distributed search of web pages, and it'd be cool to discuss them and see if we can work together.
