Add lexbor as an alternative backend #42

Closed
rushter opened this issue Aug 6, 2021 · 11 comments

Comments

@rushter
Owner

rushter commented Aug 6, 2021

No description provided.

@BarryThrill
Contributor

BarryThrill commented Aug 12, 2021

Just wanted to say thank you for the 0.2.14 update :) I'm currently looking through the changed code and it looks nice! Do you think it is worth switching from css() to select()?

Also, I know I'm too fast... but will those new features be added to your documentation? :D

Once again, thank you very much for the awesome update! <3

@rushter
Owner Author

rushter commented Aug 12, 2021

> Do you think it is worth switching from css() to select()?

It's better to use select() only when you need its extra features.

> Also, I know I'm too fast... but will those new features be added to your documentation?

I've added a few examples https://github.com/rushter/selectolax/blob/master/examples/walkthrough.ipynb
The API is also documented: https://selectolax.readthedocs.io/en/latest/parser.html#selector

@ma1ex

ma1ex commented Aug 13, 2021

@rushter, hi! Is XPath support planned for the future?

@rushter
Owner Author

rushter commented Aug 13, 2021

> @rushter, hi! Is XPath support planned for the future?

I don't think so, because XPath is a pretty complex query language. It's basically a programming language.

If you need some of the features from XPath, let me know. Some of them can be implemented using the new selector feature.

@ma1ex

ma1ex commented Aug 13, 2021

> If you need some of the features from XPath, let me know. Some of them can be implemented using the new selector feature.

In new data extraction projects I decided to try replacing the lxml module with yours as an experiment, because I liked the performance. But it really lacked an analogue of contains() for text, so I had to filter the list of selected items by text separately. With select().text_contains() it definitely became more convenient, thank you very much!

Is there any way to select elements by a substring of an attribute value? In XPath it looks something like this: //div[contains(@data-source, "img-big")], //table/tr[not(contains(@class, "title"))]

And is there an analogue of normalize-space() for text nodes (normalize-space(//div[@class="offers-form"]/div/text()))? The goal is to end up with a single space between words and to strip line feeds, tabs, etc. from both ends.

@rushter
Owner Author

rushter commented Aug 13, 2021

> > If you need some of the features from XPath, let me know. Some of them can be implemented using the new selector feature.

> In new data extraction projects I decided to try replacing the lxml module with yours as an experiment, because I liked the performance. But it really lacked an analogue of contains() for text, so I had to filter the list of selected items by text separately. With select().text_contains() it definitely became more convenient, thank you very much!

> Is there any way to select elements by a substring of an attribute value? In XPath it looks something like this: //div[contains(@data-source, "img-big")], //table/tr[not(contains(@class, "title"))]

Yes, you can do that. div[data-source*="img-big"] and table tr :not(.title) or table :not(tr[class*="title"]) (not sure about the correctness of the second one, but you can definitely do that).

> And is there an analogue of normalize-space() for text nodes (normalize-space(//div[@class="offers-form"]/div/text()))? The goal is to end up with a single space between words and to strip line feeds, tabs, etc. from both ends.

No
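
Since there is no built-in equivalent, one workaround is to normalize the extracted string in plain Python. The helper below is a sketch, not part of selectolax: `normalize_space` is a hypothetical name, and it follows the XPath 1.0 definition of whitespace (space, tab, CR, LF).

```python
import re

# XPath 1.0 treats only these four characters as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(text: str) -> str:
    """Collapse whitespace runs to a single space and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(_XPATH_WS)


# Hypothetical text, e.g. what a node's text might look like after extraction.
raw = "\n\t  Special   offer:\r\n  20%  off \t\n"
print(normalize_space(raw))
```

If you do not need strict XPath semantics, `" ".join(text.split())` is a shorter alternative; it also collapses other Unicode whitespace such as non-breaking spaces.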

@ma1ex

ma1ex commented Aug 13, 2021

> Yes, you can do that. div[data-source*="img-big"] and table tr :not(.title) or table :not(tr[class*="title"]) (not sure about the correctness of the second one, but you can definitely do that).

Great! Thank you so much!

rushter closed this as completed Aug 22, 2021

@God-damnit-all

(Reposting because I forgot to put graves around the <head> tags and that could mess up the email notification and you would have no idea what I'm talking about)

@rushter The primary upside to using XPath, in my opinion, is being able to parse the <head> tag, which can contain very valuable data for scraping these days. CSS skips over it.

I'm not sure if you could add a way to allow it to parse the <head> tag (preferably with Lexbor if I had to pick one backend only), but that would be swell.

@rushter
Owner Author

rushter commented Jun 13, 2023

> (Reposting because I forgot to put graves around the <head> tags and that could mess up the email notification and you would have no idea what I'm talking about)

> @rushter The primary upside to using XPath, in my opinion, is being able to parse the <head> tag, which can contain very valuable data for scraping these days. CSS skips over it.

> I'm not sure if you could add a way to allow it to parse the <head> tag (preferably with Lexbor if I had to pick one backend only), but that would be swell.

Can you show me an example that you can't replicate in selectolax?

@God-damnit-all

> > (Reposting because I forgot to put graves around the <head> tags and that could mess up the email notification and you would have no idea what I'm talking about)
> > @rushter The primary upside to using XPath, in my opinion, is being able to parse the <head> tag, which can contain very valuable data for scraping these days. CSS skips over it.
> > I'm not sure if you could add a way to allow it to parse the <head> tag (preferably with Lexbor if I had to pick one backend only), but that would be swell.

> Can you show me an example that you can't replicate in selectolax?

Looks like I screwed up, my mistake. You already do have support for this,

I didn't think to try because I didn't see any mention of it in the documentation. Most CSS selection won't parse anything contained within head.

Sorry to bother you.

@croqaz

croqaz commented Apr 30, 2024

I know this issue is closed, but maybe it's possible to use something like https://github.com/sissaschool/elementpath
It supports the standard ElementTree library and lxml.etree. If the API is compatible, it would be easy to add selectolax support to elementpath.
