Multiscrape

Need help with Multiscrape?

Personal (paid) support option

I very often get asked for help, for example with finding the right CSS selectors or with a login. Actually more often than I can handle, so I'm running an experiment with a paid support option!

Sponsor me here, and I'll try to assist you with your multiscrape configuration within 1-2 days. The support funds will go towards family time, making up for the hours I spend on Home Assistant ☺️.

Note: Scraping isn't always possible. I'd love to offer a "no cure, no pay" service, but GitHub Sponsoring doesn't support that. If you're concerned about sponsoring without a guarantee, please reach out by email before sponsoring!

Other options

If you don't manage to scrape the value you are looking for, please enable debug logging and log_response. This will give you a lot of information for further investigation. log_response writes all responses to files. If the value you want to scrape is not in the file with the BeautifulSoup output (*-soup.txt), Multiscrape will not be able to scrape it. Most likely the value is retrieved in the background by JavaScript. Your best chance in that case is to inspect the network traffic in the developer tools of your browser and try to find a JSON response containing the value you are looking for.
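As a rough sketch, the two settings can be combined as follows. The resource and sensor are copied from the example configuration in this README; only log_response and the logger block are added.

```yaml
# configuration.yaml (sketch)
logger:
  default: info
  logs:
    custom_components.multiscrape: debug   # debug logging for this component only

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    log_response: true   # write the HTTP responses and the parsed soup to files
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
        value_template: "{{ value | trim }}"
```

Once the value has been found (or confirmed missing) in the *-soup.txt file, log_response can be switched off again.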

If all of this doesn't help, use the Home Assistant forum. I cannot give everyone personal assistance, so please don't create GitHub issues unless you are sure there is a bug. Check the wiki for a scraping guide and other details on the functionality of this component.

Important note: be a good citizen and be aware of your responsibility

You, and you alone, are accountable for your scraping activities. Be a good (web) citizen. Set reasonable scan_interval timings, seek explicit permission before scraping, and adhere to local and international laws. Respect website policies, handle data ethically, mind resource usage, and regularly monitor your actions. Uphold these principles to ensure ethical and sustainable scraping practices.

Introduction

This Home Assistant custom component can scrape multiple fields (using CSS selectors) from a single HTTP request (the existing scrape sensor can scrape a single field only). The scraped data becomes available in separate sensors.

It is based on both the existing Rest sensor and the Scrape sensor. Most properties of the Rest and Scrape sensors apply.

Multiscrape is sponsored by CapSolver!

CapSolver is an AI-powered service that automatically solves a range of CAPTCHAs, helping developers tackle CAPTCHA challenges encountered during web scraping. Whether you're extracting data from e-commerce sites, financial platforms, or social media, CapSolver supports CAPTCHAs like reCAPTCHA V2, reCAPTCHA V3, hCaptcha, ImageToText, DataDome, AWS, Geetest, Cloudflare Turnstile and more. With API integration and browser extensions options, and flexible pricing packages, CapSolver adapts to diverse web scraping needs and scenarios.

Installation

Install via HACS (default store) or install manually by copying the files into a new 'custom_components/multiscrape' directory.

Example configuration (YAML)

multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
        value_template: "{{ value | trim }}"
      - unique_id: ha_release_date
        icon: >-
          {% if is_state('binary_sensor.ha_version_check', 'on') %}
            mdi:alarm-light
          {% else %}
            mdi:bat
          {% endif %}
        name: Release date
        select: ".release-date"
        attribute: "title"
        value_template: "{{ (value.split('released')[1]) }}"
    binary_sensor:
      - unique_id: ha_version_check
        name: Latest version == 2021.7.0
        select: ".release-date"
        value_template: '{{ value | trim == "2021.7.0" }}'
        attributes:
          - name: Release notes link
            select: ".release-date"
            attribute: href

Options

Based on latest (pre) release.

name	description	required	default	type
name	The name for the integration.	False		string
resource	The url for retrieving the site or a template that will output an url. Not required when `resource_template` is provided.	True		string
resource_template	A template that will output an url after being rendered. Only required when `resource` is not provided.	True		template
authentication	Configure HTTP authentication. `basic` or `digest`. Use this with username and password fields.	False		string
username	The username for accessing the url.	False		string
password	The password for accessing the url.	False		string
headers	The headers for the requests.	False		template - list
params	The query params for the requests.	False		template - list
method	The method for the request. Either `POST` or `GET`.	False	GET	string
payload	Optional payload to send with a POST request.	False		template - string
verify_ssl	Verify the SSL certificate of the endpoint.	False	True	boolean
log_response	Log the HTTP responses and the HTML parsed by BeautifulSoup to files. (Will be written to /config/multiscrape/name_of_config)	False	False	boolean
timeout	Defines the maximum time in seconds to wait for data from the endpoint.	False	10	int
scan_interval	Determines how often, in seconds, the url will be requested.	False	60	int
parser	Determines the parser to be used with BeautifulSoup. Either `lxml` or `html.parser`.	False	lxml	string
list_separator	Separator to be used in combination with `select_list` features.	False	,	string
form_submit	See Form-submit	False
sensor	See Sensor	False		list
binary_sensor	See Binary sensor	False		list
button	See Refresh button	False		list
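To illustrate how several of these options fit together, here is a minimal sketch. The URL, header value, query parameter and selector are hypothetical, and headers and params are written as key/value mappings, following the headers example in the Form Variables section.

```yaml
multiscrape:
  - name: Example scraper                  # hypothetical name
    resource: https://example.com/status   # hypothetical URL
    method: GET
    headers:
      User-Agent: Mozilla/5.0              # hypothetical header value
    params:
      lang: en                             # hypothetical query parameter
    timeout: 20                            # wait longer than the default of 10
    scan_interval: 900                     # request the page every 15 minutes
    parser: html.parser                    # instead of the default lxml
    verify_ssl: true
    sensor:
      - unique_id: example_status
        name: Example status
        select: "#status"                  # hypothetical CSS selector
```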

Sensor/Binary Sensor

Configure the sensors that will scrape the data.

name	description	required	default	type
unique_id	Will be used as entity_id and enables editing the entity in the UI	False		string
name	Friendly name for the sensor	False		string
	See Selector fields	True
attributes	See Sensor attributes	False		list
unit_of_measurement	Defines the units of measurement of the sensor	False		string
device_class	Sets the device_class for sensors or binary sensors	False		string
state_class	Defines the state class of the sensor, if any. (measurement, total or total_increasing) (not for binary_sensor)	False	None	string
icon	Defines the icon or a template for the icon of the sensor. The value of the selector (or value_template when given) is provided as input for the template. For binary sensors, the value is parsed into a boolean.	False		string/template
picture	Contains a path to a local image and will set it as entity picture	False		string
force_update	Sends update events even if the value hasn’t changed. Useful if you want to have meaningful value graphs in history.	False	False	boolean
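A sensor that uses the optional fields from this table might look like the sketch below. The URL, selector and entity names are hypothetical; the temperature device class and the °C unit are only an illustration.

```yaml
multiscrape:
  - name: Weather page scraper               # hypothetical name
    resource: https://example.com/weather    # hypothetical URL
    scan_interval: 1800
    sensor:
      - unique_id: example_outside_temperature
        name: Outside temperature
        select: ".temperature"               # hypothetical CSS selector
        # Convert the scraped text to a number; 0 is used if the conversion fails
        value_template: "{{ value | trim | float(0) }}"
        unit_of_measurement: "°C"
        device_class: temperature
        state_class: measurement
        force_update: true                   # send update events even when the value is unchanged
```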

Refresh button

Configure a refresh button to manually trigger scraping.

name	description	required	default	type
unique_id	Will be used as entity_id and enables editing the entity in the UI	False		string
name	Friendly name for the button	False		string
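A minimal sketch of a button next to a sensor is shown below. The button's unique_id and name are hypothetical; the rest is taken from the example configuration.

```yaml
multiscrape:
  - name: HA scraper
    resource: https://www.home-assistant.io
    scan_interval: 3600
    button:
      - unique_id: ha_scraper_refresh   # hypothetical id
        name: Refresh HA scraper        # hypothetical friendly name
    sensor:
      - unique_id: ha_latest_version
        name: Latest version
        select: ".release-date"
        value_template: "{{ value | trim }}"
```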

Sensor attributes

Configure the attributes on the sensor that can be set with additional scraping values.

name	description	required	default	type
name	Name of the attribute (will be slugified)	True		string
	See Selector fields	True

Form-submit

Configure the form-submit functionality which enables you to submit a (login) form before scraping a site. More details on how this works can be found on the wiki.

name	description	required	default	type
resource	The url for the site with the form	False		string
select	CSS selector used for selecting the form in the HTML. When omitted, the input fields are directly posted.	False		string
input	A dictionary with name/values which will be merged with the input fields on the form	False		string - dictionary
input_filter	A list of input fields that should not be submitted with the form	False		string - list
submit_once	Submit the form only once on startup instead of each scan interval	False	False	boolean
resubmit_on_error	Resubmit the form after a scraping error is encountered	False	True	boolean
variables	See Form Variables	False		list
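As an illustration, a login form that is selected from a page and submitted before each scrape could be configured roughly like this. All URLs, selectors and field names are hypothetical and depend entirely on the site being scraped.

```yaml
multiscrape:
  - name: Members area                       # hypothetical name
    resource: https://example.com/account    # hypothetical page to scrape after login
    scan_interval: 3600
    form_submit:
      resource: https://example.com/login    # hypothetical page that contains the form
      select: "#login-form"                  # hypothetical CSS selector for the form
      input:
        username: "<username>"
        password: "<password>"
      input_filter:
        - remember_me                        # hypothetical form field that should not be submitted
      submit_once: false                     # submit the form on every scan interval
      resubmit_on_error: true
    sensor:
      - unique_id: example_balance
        name: Balance
        select: ".balance"                   # hypothetical CSS selector
```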

Form Variables

Configure the variables that will be scraped from the form_submit response. These variables can be used in templates in the main configuration of the current integration, for example in a selector of a sensor or attribute, or in a header. A common use case is to populate an X-Login-Token header with a token that is returned by the login.

name	description	required	default	type
name	Name of the variable	True		string
	See Selector fields	True

Example:

multiscrape:
  - resource: "https://somesiteyouwanttoscrape.com"
    form_submit:
      submit_once: True
      resource: "https://authforsomesiteyouwanttoscrape.com"
      input:
        email: "<email>"
        password: "<password>"
      variables:
        - name: token
          value_template: "{{ ... }}"
    headers:
      X-Login-Token: "{{ token }}"
    sensor: ...

Selector

Used to configure scraping options.

name	description	required	default	type
select	CSS selector used for retrieving the value of the attribute. Only required when `select_list` or `value_template` is not provided.	False		string/template
select_list	CSS selector that matches multiple elements; the values of all matches are returned as a single string, joined by `list_separator`. Only required when `select` or `value_template` is not provided.	False		string/template
attribute	Attribute from the selected element to read as value.	False		string
value_template	Defines a template applied to extract the value from the result of the selector (if provided) or raw page (if selector not provided)	False		string/template
extract	Determines how the result of the CSS selector is extracted. Only applicable to HTML. `text` returns just text, `content` returns the html content of the selected tag and `tag` returns html including the selected tag.	False	text	string
on_error	See On-error	False
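The selector fields can be combined in different ways. The sketch below shows one sensor per variant; the URL and all CSS selectors are hypothetical.

```yaml
multiscrape:
  - name: Headline scraper                 # hypothetical name
    resource: https://example.com/news     # hypothetical URL
    scan_interval: 1800
    list_separator: " | "                  # used to join the select_list results
    sensor:
      # All matching elements, joined into one string
      - unique_id: example_headlines
        name: Headlines
        select_list: "h2.headline"
      # An attribute of the first matching element instead of its text
      - unique_id: example_first_link
        name: First headline link
        select: "h2.headline > a"
        attribute: href
      # The inner HTML of the selected tag instead of just its text
      - unique_id: example_summary_html
        name: Summary HTML
        select: ".summary"
        extract: content
```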

On-error

Configure what should happen in case of a scraping error (the CSS selector does not return a value).

name	description	required	default	type
log	Determines if and how something should be logged in case of a scraping error. Value can be either 'false', 'info', 'warning' or 'error'.	False	error	string
value	Determines what value the sensor/attribute should get in case of a scraping error. The value can be 'last' meaning that the value does not change, 'none' which results in HA showing 'Unknown' on the sensor, or 'default' which will show the specified default value.	False	none	string
default	The default value to be used when the on-error value is set to 'default'.	False		string
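A sketch of both non-default behaviours is shown below, with on_error placed next to the other selector fields. The URL, selectors and the fallback text are hypothetical.

```yaml
multiscrape:
  - name: Example scraper                  # hypothetical name
    resource: https://example.com/status   # hypothetical URL
    scan_interval: 900
    sensor:
      # Keep the previous value and log a warning when the selector finds nothing
      - unique_id: example_status
        name: Example status
        select: "#status"
        on_error:
          log: warning
          value: last
      # Fall back to a fixed value and log at info level
      - unique_id: example_message
        name: Example message
        select: "#message"
        on_error:
          log: info
          value: default
          default: "n/a"
```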

Services

For each multiscrape instance, a service will be created to trigger a scrape run through an automation. (For manual triggering, the button entity can now be configured.) The services are named multiscrape.trigger_{name of integration}.
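For example, an automation could trigger a scrape run once a day. This is a sketch that assumes the instance is named "HA scraper" and that the name ends up as ha_scraper in the service name; check the list of services in the developer tools for the exact name on your system.

```yaml
automation:
  - alias: Scrape the Home Assistant site every morning   # hypothetical automation
    trigger:
      - platform: time
        at: "07:00:00"
    action:
      # Assumed service name for an instance called "HA scraper"
      - service: multiscrape.trigger_ha_scraper
```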

Multiscrape also offers a get_content and a scrape service. get_content retrieves the content of the website you want to scrape. It returns the same data that you would otherwise get by enabling log_response and opening the page_soup.txt file.
scrape does what it says: it scrapes a website and returns the sensors and attributes.

Both services accept the same configuration that you would provide in your configuration YAML (as described above), with a small but important caveat: if the service input contains templates, Home Assistant renders them automatically when the service is called. That is fine for templates such as resource and select, but templates that must be applied to the scraped data itself (such as value_template) cannot be rendered at call time. You therefore need to alter the syntax slightly and add a ! in the middle of each delimiter, e.g. {{ becomes {!{ and %} becomes %!}. Multiscrape then understands that the string needs to be handled as a template after the service has been called.
If someone has a better solution, please let me know!

To call one of those services, go to 'Developer tools' in Home Assistant and then to 'services'. Find the multiscrape.get_content or multiscrape.scrape services and go to yaml mode. There you enter your configuration. Example:

service: multiscrape.scrape
data:
  name: HA scraper
  resource: https://www.home-assistant.io
  sensor:
    - unique_id: ha_latest_version
      name: Latest version
      select: ".release-date"
      value_template: "{!{ value | trim }!}"
    - unique_id: ha_release_date
      name: Release date
      select: ".release-date"
      attribute: "title"
      value_template: "{!{ (value.split('released')[1]) }!}"
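A call to get_content could look like the sketch below. It assumes that the service is satisfied with just a resource (plus an optional parser), which follows from the statement that both services accept the same configuration but has not been verified here.

```yaml
service: multiscrape.get_content
data:
  resource: https://www.home-assistant.io
  parser: lxml   # optional; lxml is the documented default
```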

Debug logging

Debug logging can be enabled as follows:

logger:
  default: info
  logs:
    custom_components.multiscrape: debug

Depending on your issue, also consider enabling log_response.

Contributions are welcome!

If you want to contribute to this project, please read the Contribution guidelines.

Credits

This project was generated from @oncleben31's Home Assistant Custom Component Cookiecutter template.

Code template was mainly taken from @Ludeeus's integration_blueprint template
