MarginaliaSearch/PublicData

Data Sets

These are data sets used by Marginalia Search. If you think something is missing, or something is listed that doesn't belong, feel free to open a pull request.

Contributions are welcome.

  • blogs.txt is a list of websites that are blogs (or close enough). Websites on this list receive slightly preferential treatment during processing, and they are processed on the assumption that they are blogs, with all that entails. blogs.txt is also the list of domains that show up in the new 'Blogosphere' filter.

  • docs.txt is not yet in use, but the idea is to gather as many good documentation sites as possible and make a filter for that.

  • random-domains.txt is the list of domains used by the random exploration mode.
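
As a rough illustration of how these lists might be consumed, here is a minimal Python sketch. It assumes each file holds one domain per line, and it skips blank lines and lines starting with `#`; neither of those format details is confirmed here, so check the actual files first. The function names `parse_domain_list` and `load_domain_list` are hypothetical and not part of Marginalia Search.

```python
from pathlib import Path


def parse_domain_list(text: str) -> set[str]:
    """Parse a domain list, ASSUMING one domain per line."""
    domains = set()
    for line in text.splitlines():
        entry = line.strip().lower()
        # Skip blank lines. Treating '#' lines as comments is an
        # assumption, not a documented feature of these files.
        if not entry or entry.startswith("#"):
            continue
        domains.add(entry)
    return domains


def load_domain_list(path: str) -> set[str]:
    """Read a local copy of a list such as blogs.txt."""
    return parse_domain_list(Path(path).read_text(encoding="utf-8"))
```

With a local checkout, `load_domain_list("blogs.txt")` would return the set of listed domains, which makes membership checks and comparisons between lists straightforward.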

The Marginalia Search project also shares data sets and dumps from the search engine that are much larger than anything you can upload to GitHub. These are available at https://downloads.marginalia.nu/exports.
