Introduce search across all of HexDocs #1811

Open
josevalim opened this issue Nov 10, 2023 · 33 comments

Comments

@josevalim
Member

The goal of this feature is to provide search and autocompletion across packages. We will add a new configuration, called related_deps, which is a list of package names we find related. We will improve both autocomplete and search to use this, such that:

  • Autocompletion
    • Without related_deps
      • Only autocompletes the current project (current behaviour)
    • With related_deps
      • Autocomplete the current package and all related dependencies
  • Full-text search
    • Without related_deps
      • By default searches the current project (current behaviour)
      • We will show radio buttons that allow you to customize the search. The options are "[ ] Current project" (default) and "[ ] HexDocs"
    • With related_deps
      • By default searches the current project and all related deps
      • We will show radio buttons that allow you to customize the search. The options are "[ ] Current project", "[ ] Current project + Related packages" (default), and "[ ] HexDocs"
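
The behaviour in the list above could be sketched roughly as follows. This is only an illustration: the function names, the option ids, and the convention that `packages: null` means "no package filter" are assumptions, not ExDoc's actual implementation.

```javascript
// Hypothetical helper: labels mirror the proposal, everything else is assumed.
function searchScopes(currentPackage, relatedDeps = []) {
  const hasRelated = relatedDeps.length > 0;
  const options = [
    { id: "current", label: "Current project", packages: [currentPackage] },
  ];
  if (hasRelated) {
    options.push({
      id: "related",
      label: "Current project + Related packages",
      packages: [currentPackage, ...relatedDeps],
    });
  }
  // null means "do not filter by package", i.e. search all of HexDocs.
  options.push({ id: "hexdocs", label: "HexDocs", packages: null });
  return { options, defaultId: hasRelated ? "related" : "current" };
}

// Autocomplete has no radio buttons: it covers the current package,
// plus the related dependencies when any are configured.
function autocompletePackages(currentPackage, relatedDeps = []) {
  return [currentPackage, ...relatedDeps];
}
```

For example, `searchScopes("ash", ["ash_postgres"])` yields three options with "related" as the default, while `searchScopes("ash")` yields two options with "current" as the default.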

To power this feature, we will build a new service that does both autocompletion and search based on a SQLite3 database. We have proofs of concept from:

The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question of whether we want to host the SQLite3 builder on Hex.pm.

@josevalim
Member Author

josevalim commented Nov 10, 2023

Btw, I have a dump of the database already, in case someone wants to use it for a proof of concept. Just ping me elsewhere and I will send a link. We should also skip any license.html and changelog.html files we find.

@ruslandoga
Contributor

ruslandoga commented Nov 11, 2023

@josevalim 🙋‍♂️ I'd like to compare the dump with the data I've scraped.

Also, would it be possible to get access to Fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.

@josevalim
Member Author

Getting access to logs is probably difficult but the Hex team may accept a PR that adds this computation. I cannot answer for them though, so you will have to ask. :)

@jeregrine
Contributor

jeregrine commented Nov 11, 2023

> Also, would it be possible to get access to Fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.

You could look at the dependency graph and weigh by downloads and get a crude measurement of it.

> The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question of whether we want to host the SQLite3 builder on Hex.pm.

The code actually only grabs new packages and re-indexes them since Hex can sort by updated_at. So you could run that daily and it would take seconds.
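
A minimal sketch of that incremental selection, assuming each package record carries a name and an ISO 8601 `updated_at` timestamp. The record shape and the function name are assumptions for illustration, not the Hex API or the hex-search code.

```javascript
// Hypothetical record shape: { name: string, updated_at: ISO 8601 string }.
function packagesToReindex(packages, lastRunAt) {
  const cutoff = Date.parse(lastRunAt);
  return packages
    // Keep only packages updated since the previous indexing run.
    .filter((pkg) => Date.parse(pkg.updated_at) > cutoff)
    // Most recently updated first, matching a sort by updated_at.
    .sort((a, b) => Date.parse(b.updated_at) - Date.parse(a.updated_at))
    .map((pkg) => pkg.name);
}
```

With a daily run, only the handful of packages published since the last cutoff would be fetched and re-indexed.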

One of the reasons it's so slow is that the JS files containing the indexable items, sidebar_items-<rand_id>.js and search_items-<rand-id>.js, have a different name every time, so I need to GET the HTML, find the script src, then GET the JS, and then do the same for the search page. Changing the rand_id to a query string for cache busting, like search_items.js?vsn=<rand-id>, would mean I would only need to make 2 requests and could skip parsing HTML.
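
The difference between the two flows can be sketched like this. The helper names, the regular expression, and the example paths are assumptions for illustration; the first function stands in for the HTML-parsing step, the second for the proposed stable URL.

```javascript
// Current flow: the hashed file name is only discoverable from the page HTML.
// `prefix` is assumed to be a plain name such as "search_items".
function findScriptSrc(html, prefix) {
  const pattern = new RegExp(`<script[^>]*\\bsrc="([^"]*${prefix}-[^"]+\\.js)"`);
  const match = html.match(pattern);
  return match ? match[1] : null;
}

// Proposed flow: a fixed path, with the version moved into the query string.
// A crawler can request the path directly and ignore `vsn`.
function stableUrl(baseUrl, file, vsn) {
  const url = new URL(file, baseUrl);
  if (vsn) url.searchParams.set("vsn", vsn);
  return url.toString();
}
```

With the stable form, an indexer would skip `findScriptSrc` entirely and fetch the two known paths per package.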

> @ruslandoga who shared his notes here: https://gist.github.com/ruslandoga/7f0f5b68d760ec5b3e650e7f73f694f2

@ruslandoga nice idea with the SQLite C function. I did it the lazy way with SQL and it's not too slow: https://github.com/jeregrine/hex-search/blob/main/lib/hex_docs_search/hex.ex#L50

@ruslandoga
Contributor

> You could look at the dependency graph and weigh by downloads and get a crude measurement of it.

Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.

@josevalim
Member Author

@jeregrine oh, so you skip downloading the whole docs tar?

@jeregrine
Contributor

> @jeregrine oh, so you skip downloading the whole docs tar?

Didn't even know it was downloadable. :-) But yeah, I don't do that. It might be faster, at the cost of more disk/memory usage. ¯_(ツ)_/¯

> Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.

Actually, the more I think about it, nvm. It's messy.

@rhcarvalho
Contributor

In the current design, would this require packages to update their ex_doc dependency and release a new version or would it work regardless of which version of ex_doc was used to generate the documentation?

@ruslandoga
Contributor

ruslandoga commented Nov 12, 2023

👋 @rhcarvalho

The new search functionality (assets/js) would only be present in the new ex_doc version, so I think it's more likely that the packages would need to upgrade to get global search from their documentation pages. But for a package to be indexable, they don't need to upgrade.

@zachdaniel
Contributor

👋 hey everyone, just checking in. Is this in progress? If so, any way I can assist? If not, I may be able to help get it off the ground :)

@josevalim
Member Author

There is a delay because we are also investigating if it makes sense to add embeddings to the docs, so we can also use it to provide context for LLMs (such as OpenAI). I will try to post more information soon. :)

@zachdaniel
Contributor

Sounds good! Thanks for your hard work. Not trying to hurry. I'm happy to wait, just want to assist if possible/warranted.

@josevalim
Member Author

That's really good to know. I will reach out once we have an action plan, unless you are also happy to get involved in the "figure it out" process and write some JS too? :)

@zachdaniel
Contributor

Yeah, I'd be very happy to be involved in any way. Cross-package search is a major win for the Ash ecosystem, and is absolutely worth me spending my time on.

@couhajjou

I see 4 search planes:

  • repo - current repo
  • deps - all deps (from mix.deps)
  • framework (set by framework author)
  • pinned repos (set by user)

Please empower the user

@couhajjou

couhajjou commented Jan 12, 2024

I am WIP-ing 'pinned repos' in ex_doc. It's the most versatile option.
The idea is that any repo should have a JSON file, search_data.json.

It's just the JSON version of this file:
https://hexdocs.pm/ash_postgres/dist/search_data-C114CB12.js

Both search_data.js and search_data.json will include the package info like this:

[Screenshot 2024-01-12 at 5 50 50 PM]

That would allow the UI to ingest the search_data.json files of the pinned repos and display the info like this:

[Screenshot 2024-01-12 at 6 06 27 PM]

We need to change the UI a bit, but that idea was already sketched in this post.
It's just a matter of a little UI design.

Pinned repos can be stored in

  • local storage.
  • or chrome extension
  • or an account on hexdocs. (so that hexdocs can have all our emails ;)

It's not a big change to ex_doc.

And of course we need to keep caching and versioning as it is now in search_results_72517.js

@josevalim
Member Author

We explored this, but sometimes those files can be really large, and building an index of all of them in real time would become very expensive. Often the resulting index was so large that it would blow up local storage, which would cause us to index them every time, making it worse.

@couhajjou

couhajjou commented Jan 13, 2024

@josevalim, I am not sure that you read my comment here: #1811 (comment)

here it is again:

I see 4 search planes:

1- repo - current repo
2- deps - all deps (from mix.deps)
3- framework (set by framework author)
4- pinned repos (set by user)

Here I am addressing option 4, pinned repos.

in the local storage we just store the list like this:

'pinned-packages': [
  {
     package: 'ash',
     'search-index-url': 'http://hexdocs/ash/searchIndex.json'
  },
  {
     package: 'ash_postgres',
     'search-index-url': ....
  }
]

It's the user who decides which repos they want to 'pin'.

The Ash search index is 104KB and is cached in the browser cache.
10 Ash repos would be around 1MB.

So for Ash framework users it will be a few bytes in local storage
and 1MB in the cache.

Please correct me if I am missing something, as I am WIP-ing this.
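
The pinned list described in this comment could be persisted along these lines. It is a sketch under assumptions: the key and field names are taken from the snippet in the comment, while the function names and the in-memory stand-in for `localStorage` are hypothetical.

```javascript
const PINNED_KEY = "pinned-packages"; // key name taken from the comment

// `storage` is anything with getItem/setItem, e.g. window.localStorage.
function loadPinned(storage) {
  try {
    const parsed = JSON.parse(storage.getItem(PINNED_KEY) ?? "[]");
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return []; // corrupted entry: treat as nothing pinned
  }
}

function pin(storage, pkg, searchIndexUrl) {
  // Drop any existing entry first so a package is never pinned twice.
  const pinned = loadPinned(storage).filter((entry) => entry.package !== pkg);
  pinned.push({ package: pkg, "search-index-url": searchIndexUrl });
  storage.setItem(PINNED_KEY, JSON.stringify(pinned));
  return pinned;
}

function unpin(storage, pkg) {
  const pinned = loadPinned(storage).filter((entry) => entry.package !== pkg);
  storage.setItem(PINNED_KEY, JSON.stringify(pinned));
  return pinned;
}

// Minimal in-memory stand-in so the sketch also runs outside a browser.
function memoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => void data.set(key, String(value)),
  };
}
```

Only the list of names and URLs lives in storage here; the index files themselves would be fetched separately and left to the browser's HTTP cache, as the comment suggests.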

@couhajjou

Here is the architecture and the UI I propose for search

1- repo - current repo ===> ex_doc feature. Offline and online search
2- deps - all deps ===> hexdocs search engine. Online only. Not available offline
3- framework (set by framework author) ===> ex_doc feature. Offline and online
4- pinned repos (set by user) ===> ex_doc feature. Offline and online

So ex_doc search for 1 3 and 4
And hexdocs search for 2

We have to have one UI.
The search input in ex_doc will be able to make:

  • a call to ex_doc internals (this is how it works now)
  • a call to the hexdocs search API (to be implemented)

I am just WIP-ing 3 and 4.
1 is working.
2 is a hexdocs project. It needs someone like Algolia.

So with 1 3 and 4 I can do some ash and phoenix coding on the plane @zachdaniel ;)

@couhajjou

A complication to discuss later: you can pin online repos and/or local repos (if they are on your HD).

Like mix.deps can have local and remote packages.

Sounds complex but can be simple
....

@josevalim
Member Author

Right. But you can imagine a new user would also want to pin Elixir itself, and we know for a fact Elixir was too big to cache (so we added compression). Ecto and Phoenix are also on the larger side. So I wonder if those three would not be enough to blow up session storage space?

@couhajjou

couhajjou commented Jan 13, 2024

Local cache is 10MB. The Elixir search index is 2MB. On the plane it's not a problem, since we are loading from disk. Online we might have a cache miss, that's life :) Then the browser hits the CDN. If you want to cap everything to 10MB you can, and make it like an Amazon Kindle that tells the user there is no more storage, with a UI like this:

Pinned Repos | Size   | Actions
ash          | 108 KB | [Unpin] [Download local doc]  <- Keep coding while on the plane to ElixirConf
elixir       | 2.1 MB | [Unpin] [Download local doc]  // sessionStorage.removeItem(elixir)
Total        | 2.2 MB |

I am saying let's empower the user. The persona is a dev, so it's OK if the UX is a bit technical.

--- This is a tangent and maybe crazy ---

This is a tangent, but we could also ideate a Chrome extension UX. Don't we need one for Phoenix?
It could be the Phoenix Chrome extension and we could put other things in there.

A level of gamification is to track the most pinned repos, like GitHub (forks/stars). It can create another dynamic, with prizes at ElixirConf. It's a design technique used in building architecture: you take a technical limitation and make it useful and elegant. (Designer trick)


@josevalim
Member Author

josevalim commented Jan 13, 2024

Interesting...

However, I think we have to be a bit less optimistic. We still need session storage for other indexes. For example, imagine you fill in your index with 9MB without Elixir. Now, without additional space, if you try to search Elixir docs, it will go through the slow path and rebuild the index every time. So maybe 7MB of custom search max.
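
The budgeting concern can be made concrete with a small check. The 7MB cap comes from the paragraph above; the function name and the idea of tracking index sizes in bytes are assumptions for illustration only.

```javascript
const MB = 1024 * 1024;

// Hypothetical guard: refuse to pin a package when the user-pinned indexes
// would exceed the cap, leaving the rest of the quota for other indexes.
function canPin(pinnedSizes, candidateSize, capBytes = 7 * MB) {
  const used = pinnedSizes.reduce((sum, size) => sum + size, 0);
  return used + candidateSize <= capBytes;
}
```

For instance, `canPin([108 * 1024], 2.1 * MB)` is true (ash plus elixir, using the sizes quoted earlier in the thread), while `canPin([6.5 * MB], 1 * MB)` is false.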

And you are right about empowering the user... but should we realistically expect users to craft their own search engines? Projects like Elixir, Phoenix, and Ash need the search to work across several repos out of the box and I would focus on that instead. The good news is that I am quite sure your ideas could be fully explored as a separate project!

PS: in any case, I don't think this solves the airplane case either. You have the search contents but not the rendered pages themselves. You could try to rebuild them from the index but not all information is available.

@couhajjou

couhajjou commented Jan 13, 2024 via email

@josevalim
Member Author

I see, that definitely feels out of scope for ExDoc then. :) I recommend exploring this on your own, something that builds the docs in the deps folder and creates a unified search interface. Bonus points if it works both online and offline. Meanwhile, let's please refocus this issue on its original description. Thank you!

@couhajjou

couhajjou commented Jan 15, 2024

@josevalim, in that case I would suggest moving the hexdocs search feature, as you envisioned it, to the hexdocs repo.

Here are my arguments:

  • ex_doc is not hexdocs.
  • ex_doc is an HTML eBook generator. The generated eBook is searchable and self contained. Search feature is part of the generated book.
  • hexdocs is a book library. The book library should have a search engine.
  • ex_doc search is client based
  • hexdocs search is server based
  • hexdocs search architecture is to be done within hexpm/hexdocs team/project effort
  • hexpm should publish a protocol that has to be satisfied by package authors who want their documentation to be searchable by hexdocs.
  • that protocol will be implemented by ex_doc version X, and the upgrade will be seamless: upgrade ex_doc, run mix docs
  • From a business perspective:
    • ex_doc is a product (distributed through GitHub)
    • hexdocs is a service (run by hexpm organisation)

I suggest we figure out the technical design/architecture of the search functionality. We have 2 products/services (ex_doc, hexdocs).

For UX I would suggest the Apple approach: one UX across physically separate, complementary devices.

One search experience through ex_doc and hexdocs, the user will not notice the discontinuity.

@josevalim
Member Author

josevalim commented Jan 15, 2024

That's historically how we have implemented features in Hexdocs that are used by ExDoc, and that's most likely how we plan to implement this one too: Hexdocs provides a generic interface for others to hook into and ExDoc simply acts as one of the clients.

@josevalim
Member Author

josevalim commented Jan 15, 2024

It all depends if the Hexdocs team wants to maintain a search service. If not, then a third service will consume Hexdocs packages (Hexdocs then works as "storage") and ExDoc then talks to said service. The feature is listed here because most of the work will be done by the ExDoc team anyway.

@couhajjou

couhajjou commented Jan 15, 2024 via email

@ruslandoga
Contributor

ruslandoga commented Apr 29, 2024

👋

I'm interested in working on this and would love to collaborate with anyone else currently involved! I'll start by revisiting the SQLite approaches and checking if there are better options available now (Typesense, Meilisearch, etc.).

@josevalim
Member Author

Hi @ruslandoga! At the moment, we are thinking about going with Postgres. We will compute our own text embeddings using machine learning models and store them with pgvector. What are your thoughts?

@ruslandoga
Contributor

ruslandoga commented May 6, 2024

👋 @josevalim oh right, I forgot about your comment above on wanting to add semantic search... Sorry! I should probably reread this thread.

With SQLite I kept the embeddings in a BLOB, loaded them all in memory on startup, and used https://github.com/elixir-nx/hnswlib as the index. That was too complicated and a bit resource-intensive; pgvector would likely make it much simpler and more efficient :)
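
To illustrate the general shape of that in-memory approach, here is a small sketch that decodes float32 embeddings from a byte blob and ranks documents by cosine similarity. It deliberately swaps in a brute-force scan in place of an HNSW index, and the function names, the float32 layout, and the platform byte order are all assumptions rather than the code actually used.

```javascript
// Decode a BLOB (Uint8Array or Node Buffer) of packed float32 values.
// Copying first guarantees the Float32Array view is 4-byte aligned.
function blobToVector(blob) {
  const copy = new Uint8Array(blob);
  return new Float32Array(copy.buffer);
}

function cosine(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: fine for a sketch, but linear in the number of docs,
// which is the cost an approximate index such as HNSW is meant to avoid.
function nearest(query, docs, k = 3) {
  return docs
    .map((doc) => ({ id: doc.id, score: cosine(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With pgvector, the storage, the index, and the similarity ordering would live in the database instead of application memory.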

But I was rather wondering about the basic global search, like a global autocomplete, is that still planned? Would Postgres be used for that as well?

@josevalim
Member Author

Yes, the goal would be to use PG for that as well.
