shouya edited this page Aug 6, 2024 · 29 revisions

NOTE: rss-funnel is under development. This document describes the filters as implemented on the latest master branch, so some features described here may not match the latest released version. If a filter documented here is missing from the version you use, try the nightly image.

full_text

This filter fetches the full HTML of each article from its original source and uses it as the body of the article. You will probably want to set the simplify property or add a simplify_html filter after this filter.

Configuration type: Object

Properties:

  • parallelism (optional, number): The number of parallel requests to make. Defaults to 20.
  • simplify (optional, boolean): Whether to simplify the HTML using readability. Defaults to false.
  • append_mode (optional, boolean): Whether to append the full text to the existing content. If false, the content is replaced with the full text. Defaults to false.
  • keep_element (optional, string): A CSS selector; only the matching element is kept in the HTML.
  • client (optional, Client config): Special HTTP settings for the request, such as Cookies or User-Agent.
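
The properties above could be combined as in the following sketch. The path and source URL are placeholders, and the values are illustrative only; only property names listed above are used.

```yaml
# Hypothetical endpoint: fetch each article's full page and simplify it.
- path: /full-text.xml
  source: https://example.com/feed.xml   # placeholder source
  filters:
    - full_text:
        parallelism: 5       # fewer concurrent requests than the default of 20
        simplify: true       # run readability on the fetched HTML
        append_mode: false   # replace the existing content (the default)
```

Setting simplify here should make a separate simplify_html filter unnecessary.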

simplify_html

This filter simplifies the HTML content of an article using readability and replaces the content of the article with the simplified version.

Configuration type: Object. Only an empty object with no properties is accepted.

This filter is typically used after the full_text filter, which fetches the full HTML of the article from its link. Alternatively, you can set the simplify option of the full_text filter to achieve the same effect.
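
A minimal sketch of the two-step form described above, assuming a placeholder path and source URL, and assuming full_text accepts an empty object since all of its properties are optional:

```yaml
- path: /simplified.xml
  source: https://example.com/feed.xml   # placeholder source
  filters:
    - full_text: {}       # fetch the full HTML with default settings
    - simplify_html: {}   # only an empty object is accepted here
```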

remove_element

This filter removes HTML elements matching the CSS selectors.

Configuration type: Array of strings. Each string is a CSS selector.
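
As an illustration, a filter entry might look like the sketch below. The selectors are hypothetical and depend entirely on the markup of the feed in question.

```yaml
filters:
  - remove_element:
      - .ads                      # elements with class "ads"
      - "div.newsletter-signup"   # hypothetical sign-up box
      - "img[width='1']"          # hypothetical 1px tracking images
```

Selectors containing brackets or quotes are wrapped in YAML quotes so they are read as plain strings.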

keep_element

This filter keeps HTML elements matching the given CSS selector.

Configuration type: string. The string is a CSS selector.
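
A short sketch, using a hypothetical selector for the main article body:

```yaml
filters:
  - keep_element: "article .post-body"   # hypothetical selector
```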

split

This filter splits one article into multiple ones. It is useful for splitting aggregated RSS feeds into individual articles (like Hacker News Daily) and for generating a feed from an HTML source (in which case the HTML page is parsed as the singleton article of a feed).

Each article is split by the given CSS selectors. You specify a selector for each field of the resulting articles: the title (required) and, optionally, the link, description, author, and date.

Configuration type: Object

Properties:

  • title_selector (required, string): The CSS selector for the title element. The textContent of each selected element is taken as the title.
  • link_selector (optional, string): The CSS selector for the link to the article. The url is extracted out of the href attribute of the selected elements. If the link selector is not specified, it takes on the value of title_selector.
  • description_selector (optional, string): The CSS selector for the description. The innerHTML of each selected element is taken as the description.
  • author_selector (optional, string): The CSS selector for the author.
  • date_selector (optional, string): The CSS selector for the publication date. The date is parsed from the textContent of the element or any of its attributes. Valid date formats are RFC3339 (e.g. 1996-12-19T16:39:57-08:00) and RFC2822 (e.g. Tue, 19 Dec 1996 16:39:57 -0800).

The selectors are evaluated against the article's description (or content) parsed as HTML.

You must ensure that all selectors match the same number of elements. Otherwise, rss-funnel has no way to pair up the elements matched by the different selectors.
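
The sketch below shows how the selectors might fit together for a digest-style feed. The path, source URL, and all selectors are hypothetical; it assumes each story in the digest is an li.story element containing exactly one link, one summary, and one author.

```yaml
- path: /digest-split.xml
  source: https://example.com/daily-digest.xml   # placeholder source
  filters:
    - split:
        title_selector: "li.story > a"                 # text becomes the title
        # link_selector omitted: it falls back to title_selector,
        # so the href of the same <a> element is used as the link
        description_selector: "li.story > p.summary"
        author_selector: "li.story > span.author"
```

Because every selector is anchored on the same li.story parent, each one should match once per story, which keeps the element counts aligned.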

sanitize

This filter allows you to redact or replace text in the content of the articles. The operations are executed in the order specified.

Configuration type: Array of "operations".

Operations:

  • remove (string): Remove the matched text.
  • remove_regex (string): Remove the text matching the given regular expression.
  • replace (object): Replace the matched text with the given string.
    • keys:
      • from (string): The text to replace.
      • to (string): The replacement.
      • case_sensitive (optional, boolean): Specifies whether the matching should be case-sensitive or not (default: false).
  • replace_regex (object): Replace the text matching the given regular expression with the given string.
    • keys:
      • from (string): The regular expression to match. Use (?<name>...) for named capture groups.
      • to (string): The replacement. Use $name to refer to the named captured groups. Or use $1, $2, etc. to refer to the groups by index.
      • case_sensitive (optional, boolean): Specifies whether the matching should be case-sensitive or not (default: false).

Note that due to syntax limitations, there is no way to specify remove or remove_regex with case sensitivity. If you need to do so, you can use replace or replace_regex with an empty to field.
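
The operations above could be chained as in this sketch. The matched strings and patterns are made up for illustration, and the layout (one single-key entry per operation) is inferred from the operation list above.

```yaml
filters:
  - sanitize:
      - remove: "Sponsored content"
      - remove_regex: 'Read more at \S+'
      - replace:
          from: "colour"
          to: "color"
          case_sensitive: true
      - replace_regex:
          from: '(?<year>\d{4})-(?<month>\d{2})'
          to: "$month/$year"      # refers to the named groups above
      # case-sensitive removal, using replace with an empty "to" field
      - replace:
          from: "ADVERT"
          to: ""
          case_sensitive: true
```

The regular expressions are written in single quotes so that YAML leaves the backslashes untouched.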

keep_only / discard

The keep_only/discard filter enables users to selectively retain or discard posts based on specified keywords or patterns.

Configuration Type: String, Array of Strings, or Object

  • field (optional, enum): One of title, description, or any (default: any). If any is chosen, the filter applies to both the title and description of a post.
  • matches (string, or a list of strings): Regular expressions to match in the selected field of a post.
  • contains (string, or a list of strings): Plain strings to identify in the selected field of a post.
  • case_sensitive (optional, boolean): Specifies whether the matching should be case-sensitive or not (default: false).

To match plain (non-regex) keywords in any field, you can specify a string or a list of strings directly as a shorthand. Example:

- path: /show-or-ask-hn.xml
  source: https://news.ycombinator.com/rss
  filters:
    - discard:
      - crypto
      - blockchain
    - discard: openai
    - keep_only:
        field: title
        matches:
          - '^Show HN:'
          - '^Ask HN:'
        case_sensitive: true

This example discards posts containing the keywords "crypto", "blockchain", or "openai", and then keeps only the posts whose titles start with "Show HN:" or "Ask HN:".

limit

Limit the number of posts. Config type: an integer or a duration string.

This filter can operate in two modes:

  • count mode (integer): only the first n posts are kept
  • duration mode (duration string): only posts published within duration are kept

The duration string uses the same format as duration strings used in other places. See duration_str for more information about the format.

Examples:

  - path: /hackernews-fresh.xml
    source: https://news.ycombinator.com/rss
    filters:
      - limit: 8h

  - path: /hackernews-first-10.xml
    source: https://news.ycombinator.com/rss
    filters:
      - limit: 10

highlight

Highlight matching keywords or any regular expression patterns in the posts' description.

Configuration Type: Object

  • keywords (optional, list of strings): A list of literal strings to match on. Either keywords or patterns must be specified.
  • patterns (optional, list of strings): A list of regular expressions. Either keywords or patterns must be specified.
  • bg_color (optional, string): The background color of the highlighted text (default: "#ffff00").
  • case_sensitive (optional, boolean): Specifies whether the matching should be case-sensitive or not (default: false).
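
A sketch of two highlight filters, one using literal keywords and one using a regular expression. The keywords, pattern, and color are illustrative. Two separate entries are used because the list above does not say whether keywords and patterns may be combined in one filter.

```yaml
filters:
  - highlight:
      keywords:
        - rust
        - webassembly
      bg_color: "#ffd54f"   # custom background instead of the default yellow
  - highlight:
      patterns:
        - '\bv\d+\.\d+\b'   # version-like strings such as v1.2
      case_sensitive: true
```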

merge

Merge articles from other feeds into the current feed. This is useful for merging multiple feeds into one.

Configuration type: Object, or a single source string, or an array of sources

Properties:

  • source (required, string or array of strings/objects): The URL or source of the feed(s) to merge. See the source syntax documentation for more information.
  • parallelism (optional, number): The number of concurrent requests to make for fetching multiple sources (default: 20).
  • client (optional, Client config): You can specify special HTTP settings for the request, like setting Cookies or User-Agent.
  • filters (optional, list of filters): The filters to apply to the merged feed. The filters are applied in the order specified.

Example of merging multiple feeds:

- path: /merge.xml
  source: https://example.com/feed1.xml
  filters:
    - merge:
        source:
          - https://example.com/feed2.xml
          - https://example.com/feed3.xml
          - https://example.com/feed4.xml
        client:
          user_agent: My Custom User-Agent
        parallelism: 10
        filters:
          - remove_element:
            - .ads

# or, if you don't need extra configuration:
- path: /merge.xml
  source: https://example.com/feed1.xml
  filters:
    - merge:
      - https://example.com/feed2.xml
      - https://example.com/feed3.xml
      - https://example.com/feed4.xml

In this example, the feeds from https://example.com/feed2.xml, https://example.com/feed3.xml, and https://example.com/feed4.xml are merged into the current feed (https://example.com/feed1.xml). The merged feed then has the .ads elements removed using the remove_element filter. The parallelism option is set to 10, which means up to 10 feeds will be fetched concurrently.

Example of merging a feed created "from scratch":

- path: /from-scratch.xml
  source:
    format: rss
    title: My Custom Feed
    link: https://example.com
    description: This is a custom feed created from scratch
  filters:
    - merge:
        source:
          - https://example.com/feed1.xml
          - https://example.com/feed2.xml

In this example, a new feed is created "from scratch" with a custom title, link, and description. The articles from https://example.com/feed1.xml and https://example.com/feed2.xml are then merged into this custom feed.

modify_post

The modify_post filter allows you to modify individual posts in the feed using JavaScript code.

Configuration Type: string

The string should be the JavaScript code that modifies the post variable in-place. You can also read from the feed variable in this filter. If you want to remove the article, set post = null or return null.

Example:

- path: /modify-title.xml
  source: https://example.com/feed.xml
  filters:
    - modify_post: post.title = `${post.title} (modified)`

You can use the console.log(string) function to print debugging information to stdout.

You can also early return from the filter by using an if statement and returning. Only the modifications made before the early return will be applied.

- path: /early-return.xml
  source: https://example.com/feed.xml
  filters:
    - modify_post: |
        if (post.title.includes("skip")) {
          return;
        }
        post.title = `${post.title} (modified)`
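
Removing a post, as described earlier, might look like the following sketch. The path, source URL, and the "Sponsored" keyword are placeholders, and it assumes post.title is a string, as in the examples above.

```yaml
- path: /without-sponsored.xml
  source: https://example.com/feed.xml   # placeholder source
  filters:
    - modify_post: |
        // Returning null drops the post from the feed
        if (post.title.includes("Sponsored")) {
          return null;
        }
```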

The actual fields of post can be found at:

You can use the "Json" mode on the inspector UI to view the JSON representation of the posts you're manipulating.

You can also use await inside the code to perform asynchronous operations. See the JavaScript API documentation for more details.

For an example of using await with fetch, check out the DeArrow YouTube feed in the Cookbook.

modify_feed

The modify_feed filter allows you to modify the entire feed using JavaScript code.

Configuration Type: string

The string should be the JavaScript code that modifies the feed variable in-place.

Example:

- path: /set-title.xml
  source: https://tokio.rs/_next/static/feed.xml
  filters:
    - modify_feed: feed.title.value = "My Modified Tokio Blog Feed"

You can use the console.log(string) function to print debugging information to stdout.

You can also early return from the filter by using an if statement and returning. Only the modifications made before the early return will be applied.

The actual fields of feed can be found at:

You can use the "Json" mode on the inspector UI to view the JSON representation of the feed you're manipulating.

You can also use await inside the code to perform asynchronous operations. See the JavaScript API documentation for more details.

js (deprecated)

Note: This filter is deprecated. It is recommended to use the modify_post or modify_feed filters instead, as they provide a more streamlined interface for modifying posts and feeds, respectively.

Configuration type: string. The string is the JavaScript code to run.

You must define either (or both) of the two global functions: modify_feed and modify_post.

convert_to

The convert_to filter allows you to convert the format of the feed from RSS to Atom, or vice versa. This can be helpful when you want to use the modify_post or modify_feed filters on feeds of different formats, as it allows you to write your JavaScript code in a uniform way, targeting a specific feed format.

Configuration type: string

The string should be either rss or atom, specifying the format you want to convert the feed to.

Example:

- path: /rss-feed.xml
  source: https://example.com/atom-feed.xml
  filters:
    - convert_to: rss
    - modify_post: |
        // JavaScript code targeting RSS format
        post.title = `${post.title} (modified)`;

In this example, the original Atom feed from https://example.com/atom-feed.xml is first converted to the RSS format using the convert_to: rss filter. The modify_post filter then modifies the post titles, and the JavaScript code is written for the RSS format.

Note: The conversion between feed formats is a best-effort process and may not be perfect, as there are many misaligned fields between the two formats. Some information or metadata may be lost or transformed during the conversion process.

It's generally recommended to use the convert_to filter before using modify_post or modify_feed filters, as it allows you to write your JavaScript code in a consistent manner, targeting a specific feed format. This can make your code more readable and maintainable, especially when working with feeds from various sources and formats.

image_proxy

Rewrite image URLs to use a proxy, helping bypass image loading restrictions set by some websites.

Configuration type: Object or empty object ({}) for default settings

Properties:

  • domains (optional, list of strings): Domains to apply the proxy to. Supports globbing.
  • selector (optional, string): CSS selector for image tags to rewrite (default: "img").
  • proxy (optional, string): Proxy to use for fetching images (e.g., "socks5://localhost:9150").
  • referer (optional, string): Referer header for image requests. Options: "none", "image_url", "image_url_domain", or a custom string.
  • user_agent (optional, string): User-Agent header for image requests. Options: "none", "transparent", or a custom string.
  • external (optional, object): Use an external proxy service instead of the built-in one.
    • base (required, string): Base URL of the external proxy service.
    • urlencode (optional, boolean): Whether to URL-encode the image URLs.

For more detailed configuration and usage information, refer to the full Image proxy documentation.

Examples:

# Use default settings
- image_proxy: {}

# Internal proxy with custom settings
- image_proxy:
    domains:
      - "*.example.com"
    selector: "img.proxy-me"
    referer: image_url_domain
    user_agent: "Custom User Agent String"

# External proxy
- image_proxy:
    external:
      base: "https://external-proxy.example.com/proxy?url="
      urlencode: true

magnet

Find magnet links in the body of entries and save them in the enclosure (RSS) or link (Atom). The resulting feed can be used in a torrent client.

Configuration type: Object or empty object ({}) for default settings

Properties:

  • info_hash (optional, boolean): Match any [a-fA-F0-9]{40} or [a-fA-F0-9]{68} as the info hash and construct a magnet link (default: false).
  • override_existing (optional, boolean): Whether to override existing magnet links in the enclosure/link (default: false).

Example:

# Use default settings
- magnet: {}

# Custom configuration
- magnet:
    info_hash: true
    override_existing: true

note

The note filter is a special filter that has no effect on the feed or its articles. It serves only documentation purposes, allowing you to add notes or comments to your filter configuration.

Configuration type: string

The string should be the note or comment you want to add.

Example:

- path: /feed.xml
  source: https://example.com/feed.xml
  filters:
    - note: This feed is for demonstration purposes only
    - remove_element:
      - .ads