Generate an RSS feed using the Scrapy framework.
Install scrapy-rss-exporter using pip:
pip install scrapy-rss-exporter
or using setuptools:
python setup.py install
The most convenient way to use the exporter is to return objects of the RssItem class from your spiders. This class derives from scrapy.Item, so it will work with other exporters as well.
You will need to set the following keys:
from scrapy_rss_exporter.items import RssItem, Enclosure

rss_item = RssItem()
rss_item['title'] = 'Item title'
rss_item['link'] = 'Item url'
rss_item['guid'] = 'Item ID'
rss_item['description'] = 'Item Description'
rss_item['pub_date'] = None  # None: the current date is inserted
# img is assumed to hold the URL of the image to attach
rss_item['enclosure'] = [Enclosure(url=img, type='image/jpeg')]
The pub_date field should contain a date in the RFC 822 format. If you use None, the system will insert the current date in the appropriate format. The enclosure field is optional and should contain a (possibly empty) list of Enclosure objects.
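If you want to set pub_date yourself, the following is a minimal sketch of producing such a date string with the Python standard library. It assumes the field accepts a plain string; the helper name rfc822_date and the sample date are made up for illustration. email.utils.format_datetime emits the four-digit-year form of the RFC 822 date, which the RSS 2.0 specification allows.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def rfc822_date(dt):
    """Format a timezone-aware datetime as an RFC 822 style date string.

    Hypothetical helper, not part of scrapy-rss-exporter.
    """
    return format_datetime(dt)


# Sample publication time in UTC (illustrative value only)
published = datetime(2020, 1, 15, 8, 30, 0, tzinfo=timezone.utc)
pub_date = rfc822_date(published)
print(pub_date)
```

The resulting string could then be assigned to rss_item['pub_date'] instead of None.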
To set the exporter up globally, you need to declare it in the FEED_EXPORTERS dictionary in the settings.py file:
FEED_EXPORTERS = {
    'rss': 'scrapy_rss_exporter.exporters.RssItemExporter'
}
You can then use it as a FEED_FORMAT and specify the output file in the FEED_URI:
FEED_FORMAT = 'rss'
FEED_URI = 's3://my-feeds/my-feed.rss'
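On Scrapy 2.1 or later, FEED_FORMAT and FEED_URI are deprecated in favour of the single FEEDS setting. The fragment below is a sketch of what the equivalent settings.py entry might look like, assuming such a Scrapy version and the same 'rss' key registered in FEED_EXPORTERS; it has not been checked against this exporter.

```python
# settings.py (sketch, assumes Scrapy >= 2.1)
# Each key is an output URI; 'format' names an entry in FEED_EXPORTERS.
FEEDS = {
    's3://my-feeds/my-feed.rss': {
        'format': 'rss',
    },
}
```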
Note: Bear in mind that, if you use a local file as output, scrapy will append to an existing file, which results in invalid RSS. You should, therefore, make sure to delete any existing output file before running the spider. The s3 storage does not have this problem because scrapy uploads use the S3 PutObject method.
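One way to clear a stale local feed before a run is a small helper like the sketch below. The function name remove_stale_feed is hypothetical and is not provided by scrapy or scrapy-rss-exporter; where you call it (a wrapper script, for example) is up to you.

```python
import os


def remove_stale_feed(path):
    """Delete a previous feed file if it exists.

    Returns True if a file was removed, False if there was nothing to remove.
    Hypothetical helper for illustration.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # No earlier output, so the spider will start with a fresh file
        return False
```

Calling remove_stale_feed with the local output path just before starting the crawl keeps the exporter from appending a second RSS document to the old one.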
scrapy does not seem to provide a way to pass configuration options to an exporter. Therefore, if you want to customize the feed title and other metadata, you need to create a subclass and update the FEED_EXPORTERS dictionary with the new class name:
from scrapy_rss_exporter.exporters import RssItemExporter


class MyRssExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        # Feed-level metadata passed on to the base exporter
        kwargs['title'] = 'My RSS'
        kwargs['link'] = 'https://www.mywebsite.com'
        kwargs['description'] = 'My RSS Items'
        super(MyRssExporter, self).__init__(*args, **kwargs)
You can, of course, specify a different exporter with different settings for each spider. Just use the custom_settings field to override the global configuration fields:
import scrapy


class MySpider(scrapy.Spider):
    name = "my"
    start_urls = ['https://www.mywebsite.com']
    custom_settings = {
        # Dotted path to the exporter subclass defined in your project
        'FEED_EXPORTERS': {'rss': 'project.spiders.my_spider.MyRssExporter'},
        'FEED_FORMAT': 'rss',
        'FEED_URI': 's3://my-feeds/my-feed.rss',
    }

    def parse(self, response):
        pass