Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modifying nodes #9

Open
Skallwar opened this issue Dec 19, 2021 · 12 comments
Open

Modifying nodes #9

Skallwar opened this issue Dec 19, 2021 · 12 comments
Assignees
Labels
enhancement New feature or request

Comments

@Skallwar
Copy link

Skallwar commented Dec 19, 2021

I would be great if tl users where able to modify Nodes.

My use case is a webdownloader, and thus I need to change the target href for images for examples.

From what I see this is not available yet but it will be great as tl seems to be a speedy HTML parser.

I might try to implement this but it seems that a big chunk of code will need to be changed so I might need some help to figure out some of the implications of such a change.

Anyway awesome project keep up the good work.

@y21 y21 added the enhancement New feature or request label Dec 19, 2021
@y21
Copy link
Owner

y21 commented Dec 19, 2021

Hi. Yeah, I've been meaning to do this for a while (making nodes mutable), but I never really got to it because I didn't know if it was needed and was holding off until someone asked for it.
And yep, because of the fact that everything is (immutably) borrowed from the input string, this requires some more changes in the code. I'm not sure how complex the changes would be, but I guess this means making substrings copy-on-write to avoid having to copy them during parsing.

@y21
Copy link
Owner

y21 commented Dec 21, 2021

My initial idea was to change Bytes (struct that holds substrings) to something like:

--- a/src/bytes.rs
+++ b/src/bytes.rs
@@ -3,7 +3,12 @@ use std::borrow::Cow;
 
 #[derive(PartialEq, Eq, PartialOrd, Ord, Hash, Clone)]
-pub struct Bytes<'a>(&'a [u8]);
+pub enum Bytes<'a> {
+    /// Borrowed bytes
+    Borrowed(&'a [u8]),
+    /// An owned string
+    Owned(Box<str>),
+}

That is, make it an enum representing either borrowed (unmodified) data, or an owned string in case it's mutated later after parsing (much like how Cow in std works).
Sadly, this brings the size from 16 bytes to 24 bytes, and this change is visible in benchmarks. Haven't looked at callgraphs but I guess this is because we have a lot of Bytes living on the stack that are moved around quite a bit, and moving means memcpy (unless optimized out, which it likely isn't as I've seen when I did some profiling a few days ago).

change:
  time:   [+30.296% +30.417% +30.540%] (p = 0.00 < 0.10)
  thrpt:  [-23.395% -23.323% -23.252%]

That being said, I have some more ideas on how to implement it without changing the Bytes struct in a way that introduces a lot of overhead.

@y21 y21 self-assigned this Dec 24, 2021
@y21 y21 mentioned this issue Dec 24, 2021
10 tasks
@Skallwar
Copy link
Author

Skallwar commented Jan 2, 2022

Thanks for the effort. I would gladly take a look at you PR, but I have a lot of school project to finish first.

Anyway, happy new year 🎉

@pittengermd
Copy link

Hey guys, what can I work on to help with this effort? I too need to be able to mutate href's.

@y21
Copy link
Owner

y21 commented Jan 4, 2022

Hey guys, what can I work on to help with this effort? I too need to be able to mutate href's.

The PR that implements this (#10) needs a few more things until it can be merged.

  • One of them being that some of the data structures used by the parser (InlineVec and InlineHashMap, which store up to N elements on the stack and manage to avoid heap allocations for the most part that way, at the cost of using more stack space) currently do not run the destructor of its elements if inlined (allocated on the stack) due to how it works - elements in a [MaybeUninit<T>; N] must be manually dropped, since it doesn't know what is and isn't initialized.
    That was fine up to this point, because the elements (T) did not manage any heap allocations themselves, but with this PR, it's possible that some elements do manage allocations, such as strings (this happens when changing the contents of nodes in some way).
    That means that without Drop code in those data structures in their current form, they would leak memory (bad).

  • As I mentioned in my previous comment here, storing a &[u8] in an enum brings the size from 16 to 24, which had huge perf regressions. A workaround is to store a raw pointer to the start of the array and use a u32 for the length: (*const u8, u32) and then reconstruct the slice with its raw parts whenever needed. The problem is potential overflows when a substring contains more than 2^32-1 (u32::MAX) bytes. As far as I could tell, this isn't necessarily bad for borrowed bytes (as in, cannot cause UB or panics), but can lead to confusing bugs, such as the slice length wrapping around, resulting in the slice being smaller than it actually is.
    For owned bytes, checks are definitely needed (and they already exist) because dropping a Box<[u8]> with the wrong length is not okay.
    I'm not sure if it's okay to just consider overflowing length for borrowed bytes intended behavior and document this or have checks in some places to prevent it from happening (this would mean tl::parse() needs to return a Result, which sooner or later we'd have to do anyway). (edit: now that I think about it, we definitely should do the latter)

I was a bit busy the last few days and didn't have a lot of time to spend here, but now I have more time.

@y21
Copy link
Owner

y21 commented Jan 8, 2022

@Skallwar Hey, I'm still not sure I understand your use-case entirely. What happens after modifying href attributes?
Given that it's a scraper that downloads websites, I guess after it's changed all of the href, src, etc attributes it converts the tree to a string again and e.g. saves it to disk? I don't want to assume too many things so I thought I'd ask here before making changes that aren't even needed or changes that miss the point of the feature request.
If that's the case, then it needs a bit more work than "just" changing Bytes to be copy-on-write. For example, HTMLTag::inner_html() right now simply returns the referenced substring, and changes to one of the attributes are not reflected.
Instead of directly returning the referenced substring, we'd need to do this similar to how inner_text() works. That is, it manually builds a string given its "components" (attributes) and recursively calls inner_* on all of the contained subnodes.

@Skallwar
Copy link
Author

Skallwar commented Jan 9, 2022

I guess after it's changed all of the href, src, etc attributes it converts the tree to a string again and e.g. saves it to disk?

Exactly

right now simply returns the referenced substring, and changes to one of the attributes are not reflected.

Yeah I saw that. When I will have more time, I would love to help you on both mutable node and serialization

bors bot added a commit that referenced this issue Jan 10, 2022
10: Allow DOM mutation r=y21 a=y21

This PR implements DOM mutation as suggested in #9. It does so by changing `Bytes` (type that holds substrings) to be Copy-on-Write, roughly like this:
```diff
- #[derive(PartialEq, Eq, PartialOrd, Ord, Hash, Clone)]
- pub struct Bytes<'a>(&'a [u8]);
+ #[derive(PartialEq, Eq, PartialOrd, Ord)]
+ enum Bytes {
+    /// Borrowed bytes
+    Borrowed(*const u8, u32),
+    /// Owned bytes
+    Owned(*mut u8, u32),
+ }
```

Initially, during parsing, it references input strings through `Bytes::Borrowed`. Later, the substring can be mutated by copying the bytes and changing its discriminant to `Bytes::Owned` (this is exposed through the safe method `Bytes::set`).
In `Bytes` Drop code, it deallocates the data *if* (and only if) `Bytes` contains owned data.

Sadly, we can't use `&[u8]` in the enum as that increases the size of the enum to 24. It gets around this by storing a pointer to the start of the slice and using a `u32` for the length. This brings it back to 16 (8 for the pointer + 4 for the length + 4 for discriminant/padding/alignment).

This PR also adds a new fuzzing target for query selectors.

## Todo
- [x] Check for input length in `tl::parse()` (it can't be `> u32::MAX`).
- [x] Check for input length in `Bytes::set` (it can't be `> u32::MAX`).
- [x] InlineVec and InlineHashMap needs `Drop` code to deallocate possibly owned `Byte`s
- [x] Create InlineHashMap/InlineVec mutation methods.
  - [x] `InlineHashMap::insert`
  - [x] `InlineHashMap::remove`
  - [x] `InlineHashMap::get_mut`
  - [x] `InlineVec::push`
  - [x] `InlineVec::remove`
  - [x] `InlineVec::get_mut`


Co-authored-by: y21 <btimo8822@gmail.com>
Co-authored-by: Timo <30553356+y21@users.noreply.github.com>
@kelko
Copy link
Contributor

kelko commented Aug 18, 2022

As far as I can tell most of what was requested here is already implemented. Right? Modifying href even is one of the examples.
But I would like to ask @y21 how extensive you plan to make the DOM manipulation features? Would it be possible in any point in time to also add new nodes to the DOM (e.g. extracted from a different DOM?)

I would need this kind of feature for my fun-project and depending on whether you plan to implement those at all (and maybe even with some idea of "when"?) I would either just stick to tl or would put a second layer of abstraction (e.g. "rctree") on top. Using tl to build up the DOM and rctree to manipulate the heck out of it before then writing it out as string.

So any info you can share would be very appreciated.

@y21
Copy link
Owner

y21 commented Aug 18, 2022

Yeah, most of the things in the DOM can be mutated now.

Would it be possible in any point in time to also add new nodes to the DOM

I think adding nodes to the DOM should be fairly easy to support. I can try to get it implemented later today.

@kelko
Copy link
Contributor

kelko commented Aug 18, 2022

I can try to get it implemented later today.

I greatly appreciate it 🙂

I wasn't so sure if that would be viable, as it appears to me, that you use some aspects of the InnerNodeHandle which might then be not true. Especially some iteration (Children#all) & length calculation (QueryIterable#len) code seemed to expect the InnerNodeHandle of the children to be strictly increasing, without skipping any value

@y21
Copy link
Owner

y21 commented Aug 18, 2022

Oh, that's true. I forgot that the query selector iterator relies on the value of the node ids. That indeed makes it more difficult than I thought.

@kelko
Copy link
Contributor

kelko commented Aug 23, 2022

A thought on this topic I would like to share:

Instead of making the model itself mutable through and through, maybe add special methods on the parser, inspired on CQRS. So you have your still super-fast parsing & iterating over all the model with as little overhead & copying as possible.
But then you also have a method to inject a list of Commands to manipulate the dom.
All the commands are performed as batch, that could take a bit of time and copy stuff around. And once it's done all indices are re-calculated and once again you have your fast model for querying.
One step of this batch processing would be to give each and every node a new, higher ID. Those new IDs once again follow currently restrictions of stricly increasing values without a gap. And you also keep a dictionary of mappings from old IDs to new IDs. If an old ID is not in such a mapping the node was deleted and get returns None.
(And checking if an ID is old should be simple, assuming the root ID always has the lowest 'currently active' ID)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants