Add lexbor as an alternative backend #42

rushter · 2021-08-06T11:20:50Z

No description provided.

BarryThrill · 2021-08-12T10:51:44Z

Just wanted to say thank you for the 0.2.14 update :) I'm currently going through the changed code and it looks nice! Do you think it's worth switching from css to select()?

Also, I know I'm too fast... but will those new features be added to your documentation? :D

Once again, Thank you very much for the awesome update! <3

rushter · 2021-08-12T15:21:03Z

> Do you think it's worth switching from css to select()?

It's better to use it only when you need extra features.

> Also, I know I'm too fast... but will those new features be added to your documentation?

I've added a few examples: https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb
The API is also documented: https://selectolax.readthedocs.io/en/latest/parser.html#selector

ma1ex · 2021-08-13T16:19:48Z

@rushter, hi! Is XPath support planned for the future?

rushter · 2021-08-13T16:32:40Z

> @rushter, hi! Is XPath support planned for the future?

I don't think so, because XPath is a pretty complex query language. It's basically a programming language.

If you need some of the features from XPath, let me know. Some of them can be implemented using the new selector feature.

ma1ex · 2021-08-13T17:47:08Z

> If you need some of the features from XPath, let me know. Some of them can be implemented using the new selector feature.

In new data extraction projects I decided to experiment with replacing the lxml module with yours, because I liked the performance. But it really lacked an analogue of contains() for text, so I had to filter the list of selected items by text separately. With select().text_contains() it definitely became more convenient, thank you very much!

Is there any way to select elements by a substring of an attribute value? In XPath it looks something like this: //div[contains(@data-source, "img-big")] , //table/tr[not(contains(@class, "title"))]

And is there an analogue of normalize-space() for text nodes (normalize-space(//div[@class="offers-form"]/div/text()))? That is, to end up with a single space between words and to trim line feeds, tabs, etc. at both ends.

rushter · 2021-08-13T17:53:56Z

> > If you need some of the features from XPath, let me know. Some of them can be implemented using the new selector feature.
>
> In new data extraction projects I decided to experiment with replacing the lxml module with yours, because I liked the performance. But it really lacked an analogue of contains() for text, so I had to filter the list of selected items by text separately. With select().text_contains() it definitely became more convenient, thank you very much!
>
> Is there any way to select elements by a substring of an attribute value? In XPath it looks something like this: //div[contains(@data-source, "img-big")] , //table/tr[not(contains(@class, "title"))]

Yes, you can do that: div[data-source*="img-big"] and table tr:not([class*="title"]). Note that :not must be attached directly to tr; with a space before it (tr :not(...)) the selector matches descendants of the rows rather than the rows themselves.

> And is there an analogue of normalize-space() for text nodes (normalize-space(//div[@class="offers-form"]/div/text()))? That is, to end up with a single space between words and to trim line feeds, tabs, etc. at both ends.

No
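
Since there is no built-in analogue, the same effect can be approximated in plain Python on whatever string the parser returns. The helper below is a hypothetical sketch, not part of selectolax; it follows the XPath 1.0 definition, which treats only space, tab, carriage return and line feed as whitespace.

```python
import re

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    # After the substitution every run is a single space,
    # so stripping spaces is enough to trim both ends.
    return _XPATH_WS.sub(" ", text).strip(" ")


raw = "\n\t  Best   offer \r\n today \t"
print(normalize_space(raw))  # Best offer today
```

If non-breaking spaces and other Unicode whitespace should be collapsed as well, " ".join(text.split()) is a shorter alternative, at the cost of differing slightly from XPath semantics.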

ma1ex · 2021-08-13T18:19:25Z

> Yes, you can do that: div[data-source*="img-big"] and table tr:not([class*="title"]).

Great! Thank you so much!

God-damnit-all · 2023-06-13T17:27:12Z

(Reposting because I forgot to put graves around the <head> tags and that could mess up the email notification and you would have no idea what I'm talking about)

@rushter The primary upside to using XPath, in my opinion, is being able to parse the <head> tag, which can contain very valuable data for scraping these days. CSS skips over it.

I'm not sure if you could add a way to allow it to parse the <head> tag (preferably with Lexbor if I had to pick one backend only), but that would be swell.

rushter · 2023-06-13T17:31:21Z

> (Reposting because I forgot to put graves around the <head> tags and that could mess up the email notification and you would have no idea what I'm talking about)
>
> @rushter The primary upside to using XPath, in my opinion, is being able to parse the <head> tag, which can contain very valuable data for scraping these days. CSS skips over it.
>
> I'm not sure if you could add a way to allow it to parse the <head> tag (preferably with Lexbor if I had to pick one backend only), but that would be swell.

Can you show me an example that you can't replicate in selectolax?

God-damnit-all · 2023-06-13T17:52:11Z

> > (Reposting because I forgot to put graves around the <head> tags and that could mess up the email notification and you would have no idea what I'm talking about)
> > @rushter The primary upside to using XPath, in my opinion, is being able to parse the <head> tag, which can contain very valuable data for scraping these days. CSS skips over it.
> > I'm not sure if you could add a way to allow it to parse the <head> tag (preferably with Lexbor if I had to pick one backend only), but that would be swell.
>
> Can you show me an example that you can't replicate in selectolax?

Looks like I screwed up, my mistake. You already do have support for this,

I didn't think to try because I didn't see any mention of it in the documentation. Most CSS selection won't parse anything contained within head.

Sorry to bother you.

croqaz · 2024-04-30T17:14:54Z

I know this issue is closed, but maybe it's possible to use something like https://github.com/sissaschool/elementpath
It supports the standard ElementTree library and lxml.etree. If the API is compatible, it would be easy to add selectolax support to elementpath.

rushter closed this as completed Aug 22, 2021
