Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized feed reader #31

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Generalized feed reader #31

wants to merge 9 commits into from

Conversation

semmons99
Copy link
Contributor

Here's the beginnings of a generalized RSS plugin for Mendibot. @gjp what do you think?
#30

@gjp
Copy link
Contributor

gjp commented Feb 16, 2012

👍 for "epic saga"

How to test this? With one feed and one channel, I set the pull interval much smaller than min_time and confirmed that the posts were periodically sent to the channel. With many-to-many I could see this getting fussy.

@semmons99
Copy link
Contributor Author

I'm currently playing around in #semmons99-test1 and #semmons99-test2 with a handful of different RSS feeds and configurations.

@semmons99
Copy link
Contributor Author

I think I've worked out the kinks. I'm going to leave my test bot sitting in #semmons99-test1 and #semmons99-test2 for a few hours to make sure things go as planned. If everything looks good I'll merge later tonight.

@gjp
Copy link
Contributor

gjp commented Feb 16, 2012

Sorry I didn't have anything to contribute here. I should be around for most of the day if there is something else I can help with.

@jordanbyron
Copy link
Contributor

Sweet. Let me know once this is merged so I can take care of #26

@semmons99
Copy link
Contributor Author

@gjp actually there's a couple things you could help with.

  1. Test more feeds to see if there's any more errors.
  2. Add author back to the message and default it to "a mysterious stranger"

It would also be fun to see multiple templates so it's not always "an epic saga".

@jordanbyron
Copy link
Contributor

It would also be fun to see multiple templates so it's not always "an epic saga".

👍 I would love to see that!

@gjp
Copy link
Contributor

gjp commented Feb 16, 2012

@semmons99 what sort of errors were you running across earlier? You'd mentioned every test feed was failing. Was that due to the RSS namespace?

alternatives to "epic saga": hmm, should we drop that into #mendicant and see how many suggestions we can gather?

@semmons99
Copy link
Contributor Author

@gjp mostly missing pubDates.

Yeah ask #mendicant for suggestions.

@semmons99
Copy link
Contributor Author

@gjp, @jordanbyron, @sandal

I've been playing around with this for the past couple hours and have come to the conclusion that most site's RSS feeds are a pile of 💩. We can't rely on much of anything to be there other than <link>. So we have two options as I see it.

  1. Only allow well formed RSS feeds to be monitored.
  2. Use PStore to store all the links we've announced and use that to determine what's new.

Thoughts?

@jordanbyron
Copy link
Contributor

@semmons99 the two main RSS feeds I want to support are Community and GitHub. It looks like GitHub is using updated and we can easily add an updated attribute to the Community RSS feed.

Would that be enough to get this working?

@gjp
Copy link
Contributor

gjp commented Feb 17, 2012

I'm pretty new to RSS formats but I think I see what's going on here. There's RSS, there's Atom, and then there's a site like Hacker News which strips out every byte they can to save bandwidth, standards be damned, and calls it RSS anyway.

I'd suggest checking both the pubdate and updated attributes to handle both RSS and Atom, but to ignore invalid feeds. Any site which is so busy that they need to strip all dates is probably too spammy for IRC anyway. Or they just suck.

unrelated: @sandal suggested combining synonyms of "epic" and "saga". I'll make a list of overused words from book blurbs. It'll be a stunning tour de force.

@semmons99
Copy link
Contributor Author

@jordanbyron @gjp sounds good.

@gjp think you can update this branch to use pubDate and updated when you're adding our synonyms feature?

@wicz
Copy link
Contributor

wicz commented Feb 17, 2012

Since we want to support just a few feeds, can't we use a 3rd party tool, e.g. feedburner, to normalize this feeds for us? Or maybe Yahoo! pipes?

@semmons99
Copy link
Contributor Author

@wicz I'd rather avoid thirdparty tools and just write the logic to handle the feed types we need. The more we can rely on ourselves at mendicant the better.

@jordanbyron
Copy link
Contributor

@semmons99 👍 I totally agree. Keep it simple

@practicingruby
Copy link
Contributor

@wicz: With the exception of Github, Mendicant University does not use third party services unless they are based on open source and easily replaceable. There are a ton of reasons for this, which reminds me that @jordanbyron and I really need to write up our guidelines...

@wicz
Copy link
Contributor

wicz commented Feb 17, 2012

@semmons99 @jordanbyron Agreed, I don't like to add dependencies either, unless they're extremely necessary. But just to make sure we're on the same page here, I was talking about using the feedburner.com service and not any other library with the same name.

Just pointing it out cause I've used it before when needed to aggregate news from a bunch of shitty feeds.

EDIT: Anyway, it won't work cause it seems it doesn't support HTTPS :boo:

@practicingruby
Copy link
Contributor

@wicz: A service dependency is an even heavier dependency than a third party library. We don't mind using libraries, because we can maintain them ourselves if/when they have problems. Services we have no control over.

@wicz
Copy link
Contributor

wicz commented Feb 17, 2012

@sandal You're absolutely right!

ps: comments ordering got a bit clumsy, but I'm glad we agreed anyway 😉

@gjp
Copy link
Contributor

gjp commented Feb 17, 2012

@semmons99 sure, I can roll up some small additions. Would you like to merge this now or wait for me to experiment on this branch? My schedule is uncertain today.

@semmons99
Copy link
Contributor Author

@gjp got any time to work on this?

@gjp
Copy link
Contributor

gjp commented Feb 20, 2012

I had open tickets on 4 projects before I drove out of state for the weekend. By the time I was home, 💤.

I'll background MoM and Community and start here with pubdate/updated and author. Do you mind if we make wordplay on the announcements a separate patch? I don't want to cause further delay to this one.

@semmons99
Copy link
Contributor Author

@gjp sure, that's fine.

@semmons99
Copy link
Contributor Author

@gjp feel free to work on MoM and Community first too.

@jordanbyron
Copy link
Contributor

@semmons99 @gjp both are blocked by me at this point, so feel free to work on this first :wink2:

@gjp
Copy link
Contributor

gjp commented Feb 20, 2012

Working on this one. It's trial and error with different test feeds, so I might narrow it down to Community and GitHub and not worry about weird variations.

Do you think the refresh interval should remain universal, or set per-feed?

@semmons99
Copy link
Contributor Author

@gjp Just focus on Community and Github. They're all we care about right now.

Also I can see the value in having a per-feed interval. It would allow us to follow github more actively.

@wicz
Copy link
Contributor

wicz commented Feb 20, 2012

Hi guys, sorry for not responding earlier.

This week I'll be working on puzzlenode, but you can count on me to help with this and #32 as well.

@semmons99
Copy link
Contributor Author

@wicz awesome!

@gjp
Copy link
Contributor

gjp commented Feb 20, 2012

I'm feeling the temptation to try a different parser. Doing this on multiple fields feels wrong
item.respond_to?(:pubDate) && item.pubDate

and item.updated is not being cast to a Time but item.pubDate is. 😒

@semmons99 How were you using Pry to help here? I've only noodled around with it. Given how long it's taking me to make one change and test it, I must be doing something wrong.

@gjp
Copy link
Contributor

gjp commented Feb 20, 2012

Feedzirra is making me happy.

@semmons99
Copy link
Contributor Author

@gjp awesome, the description alone warrants inclusion into mendibot.

I was using pry by inserting:

require "pry"
binding.pry

where I kept having trouble and inspecting the items as they came through. I really don't use any of it's fancy features. Most of the time I just inspect the object, look at methods, etc. If you're stuck at all and want to try some remote pairing tomorrow, let me know. I'm pretty sure I can setup my vps to function for remote pairing.

@gjp
Copy link
Contributor

gjp commented Feb 21, 2012

Feedzirra normalizes the feeds nicely. Now I'm stuck on something which is making me feel brain-dead. With a single global timer, we just iterate through all feeds on each refresh. With per-feed timers, we need to know which feed to refresh, when the timers are set up.

      Mendibot::Config::RSS_SETTINGS[:feeds].each do |feed|
        timer feed_interval(feed), method: :pull
      end

Will Cinch accept arguments to the pull method? How?

Also I'd prefer to be able to do feed.interval, but feeds are just a hash.

@semmons99
Copy link
Contributor Author

Let's just define each of the pull methods programmatically.

require 'cinch'
require 'open-uri'
require 'rss'

module Mendibot
  module Plugins
    class RSSRelay
      include Cinch::Plugin

      Mendibot::Config::RSS_SETTINGS[:feeds].each do |feed|
        method = "#{feed[:name]}_pull".to_sym

        timer feed[:interval], method: method

        define_method method do
          open(feed[:url]) do |rss|
            #... regular rss stuff
          end
        end
      end
    end
  end
end

@wicz
Copy link
Contributor

wicz commented Feb 21, 2012

Let's go by parts 😉

@jordanbyron The thing about one using pubDate and the other updated is that Community is RSS 2.0 and GitHub is Atom 1.0, both almost valid. That said, I don't think adding updated is the solution cause it will mix the two formats in a wrong way. Also, I think we could take care to make Community RSS valid. I'll open an issue and can take a look at it later.

@gjp I don't think we need a different parser here. The stdlib's RSS is capable of parsing both formats, all we need is to be able to distinguish between them.

gh = RSS::Parser.parse(open('https://github.com/jordanbyron/mission_of_mercy/commits/master.atom'))
gh.feed_type # => "atom"

cm = RSS::Parser.parse(open('http://community.mendicantuniversity.org/articles.rss'))
cm.feed_type # => "rss"

gh_entry = gh.items.first
cm_entry = cm.items.first

cm_entry.title.class # => String
cm_entry.title # => "Tempr - no fussin' temporal expressions"

cm_entry.pubDate.class # => Time
cm_entry.pubDate # => 2012-02-16 16:07:15 -0200

gh_entry.title.class # => RSS::Atom::Feed::Title
gh_entry.title.content # => "Now that loading zipcodes is fast, we don't need progress_bar"

gh_entry.updated.content.class  # => Time
gh_entry.updated.content # => 2012-02-16 18:32:34 -0200

Note that we have to call content in the Atom feed. But it automatically casts to Time.

About Feedzirra, it seems a nice project, and even though it can make our life easier, there is a heavy tradeoff in game: it depends on ActiveSupport. I think we can cast our own solution based only in the standard RSS parser. Maybe we can create an Entry class to encapsulate the feed_type conditionals?

@gjp @semmons99 You might want take a closer look at pry-nav and pry-stack_explorer

Thoughts?

@semmons99
Copy link
Contributor Author

@wicz and @gjp can you work together to build something using stdlib rss to function as we need it? Perhaps craft a simple gem and release it?

@gjp
Copy link
Contributor

gjp commented Feb 21, 2012

@semmons99 so I was not missing something? Dynamically creating N methods is the simplest solution?

@wicz after a couple hours of seemingly getting a different failure mode each time I changed feeds (grrr), dropping in Feedzirra and having it just work with less code (specific to Mendibot) across community, github, and CNN (a test for missing author) felt like a win. 👍 for abstracting away trivia.

        items.each do |item|
          next unless post_time = item.published

          if post_time >= Time.now - RSSRelay.feed_interval(feed)
            author = item.author || 'a mysterious stranger'

            feed[:channels].each do |chan|
              Channel(chan).send <<-MESSAGE
via #{feed[:name]} comes the epic saga "#{item.title}", by #{author}! #{item.url}
MESSAGE
            end
          end
        end

Note how pubDate and updated are normalized to published, author is author, and the url is url, without any respond_to? checks. Of course that's not free, but this is code that will only run perhaps once every 15 minutes. That leaves the dependency stack, which as you've pointed out, may be unnecessarily large. There are a lot of features we're not using.

How about if I get at least one working version committed to this branch with Feedzirra before breaking it again, and then we can attempt to reach the same or better level of brevity without it?

@semmons99
Copy link
Contributor Author

@semmons99 so I was not missing something? Dynamically creating N methods is the simplest solution?

Simplest that comes to mind.

@gjp
Copy link
Contributor

gjp commented Feb 21, 2012

This amuses me way more than it probably should.

(09:35:01 PM) mendibot-test:   via teh githubs comes the epic saga "gjp pushed to generalize-rss at gjp/mendibot", by gjp! https://github.com/gjp/mendibot/compare/4d79d3de85...6b61370f10

...using the code to monitor changes to itself ✨

@gjp
Copy link
Contributor

gjp commented Feb 21, 2012

infinite regression!

(09:39:58 PM) mendibot-test:   via teh githubs comes the epic saga "gjp commented on pull request 31 on mendicant-university/mendibot", by gjp! https://github.com/mendicant-university/mendibot/issues/31#issuecomment-4066922

@gjp
Copy link
Contributor

gjp commented Feb 21, 2012

I should probably explain why feed_interval exists, since my config file is not tracked here. I experimented with having both a global (default) interval and a per-feed interval. It may make more sense to remove this and always set the interval per-feed.

    RSS_SETTINGS = {
      interval: 23,
      feeds: [
        {
              name: "community feed",
               url: 'http://community.mendicantuniversity.org/articles.rss',
          channels: ["##gjp-mendibot1", "##gjp-mendibot2"]
        },
        {
          interval: 27,
              name: "teh githubs",
               url: 'https://github.com/gjp.atom',
          channels: ["##gjp-mendibot2"]
        },
        {
          interval: 29,
              name: "CNN",
               url: 'http://rss.cnn.com/rss/cnn_topstories.rss',
          channels: ["##gjp-mendibot1"]
        }
      ]
    }

@practicingruby
Copy link
Contributor

tl;dr on the thread but @jordanbyron @semmons99 @gjp, is there any reason why we wouldn't want to switch community to output Atom 1.0 and then not have to consider multiple types of feeds?

@semmons99
Copy link
Contributor Author

tl;dr on the thread but @jordanbyron @semmons99 @gjp, is there any reason why we wouldn't want to switch community to output Atom 1.0 and then not have to consider multiple types of feeds?

If it's easy enough for Jordan, that'd be fine with me.

@semmons99
Copy link
Contributor Author

It may make more sense to remove this and always set the interval per-feed.

@gjp I think it makes more sense to just require each feed to specify it's own interval.

@semmons99
Copy link
Contributor Author

(09:39:58 PM) mendibot-test:   via teh githubs comes the epic saga "gjp commented on pull request 31 on mendicant-university/mendibot", by gjp! https://github.com/mendicant-university/mendibot/issues/31#issuecomment-4066922

I think I might have just awoken my wife having literally lol'd 😂

@jordanbyron
Copy link
Contributor

@sandal: is there any reason why we wouldn't want to switch community to output Atom 1.0 and then not have to consider multiple types of feeds?

Hum ... that sounds familiar to something I suggested a few days ago 😏

@jordanbyron: It looks like GitHub is using updated and we can easily add an updated attribute to the Community RSS feed.

Switching the RSS to Atom or adding an additional Atom feed to Community makes sense to me.

@semmons99
Copy link
Contributor Author

Switching the RSS to Atom or adding an additional Atom feed to Community makes sense to me.

@jordanbyron Think you can do that this week? Or do you want to keep the focus on PuzzleNode?

@jordanbyron
Copy link
Contributor

Think you can do that this week? Or do you want to keep the focus on PuzzleNode?

I don't want to overcommit, so let's say I won't have time to do that until next week.


Created the ticket for adding an Atom feed in community if anyone is interested in picking it up: https://github.com/mendicant-university/community/issues/73

@jordanbyron
Copy link
Contributor

Thanks to @wicz community now has an Atom feed:

http://community.mendicantuniversity.org/articles.atom

Let me know if there is anything else we need to do on the community site to get this patch merged 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants