These are data sets used by Marginalia Search. If you think something is missing from a list, or something is on a list that doesn't belong there, feel free to make a pull request.
Contributions are welcome.
- blogs.txt is a list of websites that are blogs (or close enough). Websites on this list receive slightly preferential treatment in how they are processed, and they are processed with the assumption that they are blogs, with all that entails. blogs.txt is also the list of domains that show up in the new 'Blogosphere' filter.
- docs.txt is not yet in use, but the idea is to gather as many good documentation sites as possible and make a filter for that.
- random-domains.txt is the list of domains that are in the random exploration mode.
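
If you want to use one of these lists in your own tooling, a minimal Python sketch for reading it might look like the following. The file format is an assumption here (one domain per line), as is the handling of blank lines and `#` comment lines; check the actual files before relying on it. The function names `parse_domain_list` and `load_domain_list` are hypothetical and not part of Marginalia Search.

```python
from pathlib import Path


def parse_domain_list(text: str) -> list[str]:
    """Parse a domain list, ASSUMING one domain per line.

    Blank lines and lines starting with '#' are skipped (an assumption,
    not a documented property of the files). Domains are lowercased and
    de-duplicated while keeping their first-seen order.
    """
    domains: list[str] = []
    seen: set[str] = set()
    for raw_line in text.splitlines():
        line = raw_line.strip().lower()
        if not line or line.startswith("#"):
            continue
        if line not in seen:
            seen.add(line)
            domains.append(line)
    return domains


def load_domain_list(path: str) -> list[str]:
    """Read a list file such as blogs.txt from disk and parse it."""
    return parse_domain_list(Path(path).read_text(encoding="utf-8"))
```

For example, `load_domain_list("blogs.txt")` would return the blog domains as a list of strings, provided the file follows the assumed layout.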
The Marginalia Search project also shares data sets and dumps from the search engine that are much larger than anything you can upload to GitHub. They are available at https://downloads.marginalia.nu/exports.