Help is needed as remove_regex is not working as expected #95

Closed
tillcash opened this issue Mar 23, 2024 · 4 comments · Fixed by #100

Comments

@tillcash

Could you please assist me? I'm attempting to remove some text, but it's not functioning as anticipated.

source: https://www.thenewsminute.com/api/v1/collections/tamil-nadu.rss

  - path: /newsminute.xml
    filters:
      - sanitize:
          - remove_regex: '<p><strong>Read.*?<\/strong><\/p>'


@shouya
Owner

shouya commented Mar 23, 2024

It doesn't work because of a mix-up between the content field and the description field. rss-funnel currently recognizes only the description field, so it performs no filtering on the content field.

Try this:

  - path: /newsminutes.xml
    source: https://www.thenewsminute.com/api/v1/collections/tamil-nadu.rss
    filters:
      - modify_post: |
          post.description = post.content;
          post.content = null;
      - sanitize:
          - remove_regex: "<p><strong>Read.*?<\/strong><\/p>"

That said, I have noted this as a bug that needs fixing. Once it's fixed, the modify_post filter will no longer be needed.
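To illustrate why the workaround helps, here is a minimal JavaScript sketch of the two steps. The `post` object and its sample HTML are made up for illustration, and `String.prototype.replace` only approximates what `remove_regex` does (rss-funnel applies the pattern itself; whether it removes every match or only the first is an assumption here).

```javascript
// Hypothetical post object; in rss-funnel the real post is supplied at runtime.
const post = {
  description: "Short summary",
  content:
    "<p>Story text.</p><p><strong>Read: Another story</strong></p><p>More text.</p>",
};

// Same pattern as the remove_regex entry in the config above.
const readMore = /<p><strong>Read.*?<\/strong><\/p>/g;

// Step 1 (modify_post): move the content into the description field,
// which is the field the sanitize filter currently looks at.
post.description = post.content;
post.content = null;

// Step 2 (sanitize / remove_regex), approximated with a plain replace.
const cleaned = post.description.replace(readMore, "");

console.log(cleaned);
```

Without step 1, the pattern would only ever be tested against the short description text, so the "Read ..." paragraph in the content would be left untouched.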

@tillcash
Author

Thanks, it's working now. I tried a lot before asking for help.

Somehow, the media:thumbnail URL ended up merged into the content field in the RSS reader. I need to strip the ?w=280 from the media:thumbnail URL so that the image shows at full size. Could you please explain how to do this?

@shouya
Owner

shouya commented Mar 23, 2024

Try this:

  - path: /newsminutes.xml
    source: https://www.thenewsminute.com/api/v1/collections/tamil-nadu.rss
    filters:
      - modify_post: |
          post.description = post.content;
          post.content = null;
          let thumbnail = post.extensions.media.thumbnail[0];
          if (thumbnail) {
            thumbnail.attrs.url = thumbnail.attrs.url.replace(/\?w=.*/, "");
          }
      - sanitize:
          - remove_regex: "<p><strong>Read.*?<\/strong><\/p>"

All the raw fields that can be modified from JavaScript are listed in the JSON view, next to the Raw and Rendered radio buttons.
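As a slightly more defensive variant of the thumbnail rewrite, the sketch below guards against posts that have no `media:thumbnail` at all. The `stripWidthQuery` helper and the sample `post` objects are hypothetical, the field layout simply mirrors the snippet above (check the JSON view for the real shape), and it assumes the JavaScript runtime supports optional chaining.

```javascript
// Hypothetical helper: drop a "?w=..." query string from a URL.
function stripWidthQuery(url) {
  return url.replace(/\?w=.*/, "");
}

// Hypothetical helper wrapping the logic from the modify_post snippet.
function fixThumbnail(post) {
  // Optional chaining avoids a TypeError when any level is missing.
  const thumbnail = post.extensions?.media?.thumbnail?.[0];
  if (thumbnail?.attrs?.url) {
    thumbnail.attrs.url = stripWidthQuery(thumbnail.attrs.url);
  }
  return post;
}

// Sample data for illustration only.
const withThumbnail = {
  extensions: {
    media: {
      thumbnail: [{ attrs: { url: "https://example.com/image.jpg?w=280" } }],
    },
  },
};
const withoutThumbnail = { extensions: {} };

fixThumbnail(withThumbnail);
fixThumbnail(withoutThumbnail); // no error, nothing to change
```

Inside `modify_post`, only the body of `fixThumbnail` would be needed, since `post` is already in scope there.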

@tillcash
Author

It's working. Thank you for helping. I really appreciate it.

shouya added a commit that referenced this issue Mar 27, 2024
This PR clarifies the concept of "body" used in code and config.

Fixes #95 and #96.

## Motivation

Previously, I named a generic field in the code "description" to
distinguish it from the title. For the RSS format it refers to the
[`description`
field](https://github.com/shouya/rss-funnel/blob/dc1efac19a96e06143b75e9495adb3f6b013a75f/src/feed.rs#L348)
and for Atom it refers to the [`content`
field](https://github.com/shouya/rss-funnel/blob/dc1efac19a96e06143b75e9495adb3f6b013a75f/src/feed.rs#L368).
The choice of the name and the selected fields was purely arbitrary,
based on the few example feeds I had on hand. Overall, it is supposed to be
the field that ultimately gets displayed in the RSS reader beneath the title.

In this PR I renamed the general term to "body". Unlike the old notion,
a post can have multiple `body` fields. We need this if we want to
handle all the different fields that an RSS reader may treat as the
body. For example, if we consider all the body fields, then we can
correctly filter posts matching a certain keyword using the `keep_only`
and `discard` filters (#95).

In addition, some feeds do not use the typical body fields. One example
is YouTube, which puts the video description in the `media:description`
field under the `media:group` tag
(#92). We hope to support
filtering on this field as well.

## Implementation

First, I removed the single-field accessor for the `Post.description` field.

Then I provided various APIs for accessing the bodies:

  + `Post.bodies_mut`
  + `Post.bodies`
  + `Post.modify_bodies`
  + `Post.first_body`
  + `Post.first_body_mut`
  + `Post.create_body`
  + `Post.ensure_body`

The following fields are considered body fields:

- rss
  + `content`
  + `description`
  + `media:description`
  + `itunes:summary`
- atom
  + `content`
  + `summary`
  + `media:description`
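
The idea of treating several fields as bodies can be sketched in JavaScript as follows. This is not the Rust implementation from the PR: the flat post objects, the `bodies` helper, and the `anyBodyContains` helper are all hypothetical, and only the field names are taken from the list above.

```javascript
// Body field names per feed format, copied from the list above.
const BODY_FIELDS = {
  rss: ["content", "description", "media:description", "itunes:summary"],
  atom: ["content", "summary", "media:description"],
};

// Hypothetical helper: return every non-empty body of a flat post object.
function bodies(format, post) {
  return BODY_FIELDS[format]
    .map((name) => post[name])
    .filter((value) => typeof value === "string" && value.length > 0);
}

// Hypothetical keyword check that looks at all bodies, not a single field.
function anyBodyContains(format, post, keyword) {
  return bodies(format, post).some((body) => body.includes(keyword));
}

// Sample post for illustration: the keyword appears only in `content`.
const rssPost = {
  title: "Example",
  description: "Short teaser",
  content: "<p>Full article mentioning Chennai.</p>",
};

console.log(bodies("rss", rssPost).length);
console.log(anyBodyContains("rss", rssPost, "Chennai"));
```

A check against `description` alone would miss the keyword in this sample post, which is the kind of mismatch the multi-body notion is meant to avoid.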

## Config changes

- For the `keep_only`/`discard` filters, rename the `content` variant of
the `field` field to `body`.
- For the `extract` filter, rename the `description_selector` field to
`body_selector`.

Both changes are backward compatible. The old fields are currently
marked deprecated, and may be removed in a future breaking release.

## Checklist

- [ ] update filter docs
- [x] review all usage of the term "description" in code