Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lowering the memory footprint of parse results #14

Open
mrguiman opened this issue Apr 26, 2024 · 6 comments
Open

Lowering the memory footprint of parse results #14

mrguiman opened this issue Apr 26, 2024 · 6 comments

Comments

@mrguiman
Copy link
Contributor

mrguiman commented Apr 26, 2024

Hello and thank you for the amazing work on the library.

We're working on an iOS forensics app and currently prototyping some code that will allow our researchers to pull specific, pinpointed events from UnifiedLogs.
We based our code on one of your exemples, so it currently revolves around calling the build_log method to assemble LogData and "send" them to be consumed by other parts of our code.
In that context, and because we want our program to be as portable as possible, we're trying to have the lowest possible memory footprint and as such would like to be able to work with iterators rather than allocating an entire Vector of LogData which can be quite heavy.

We were thinking of breaking down the build_logs method so that it can build vectors from the results of a new iter_logs function that itself would return an iterator.

Is there any footgun you can think of that we should be worrying about, and would you be open to us contributing such changes to the main codebase ?

@puffyCid
Copy link
Collaborator

hey @mrguiman thanks for opening this issue. I think creating a iter_logs function is a great idea.
I would definetly be interested in any memory improvemnts you or your team could do.

@jrouaix
Copy link
Contributor

jrouaix commented Apr 30, 2024

@puffyCid FYI, we try to go a little further on granular iteration, but memory consumption didn't improve : shindan-io@62dfb22

I my opinion a really great improvement would be to iterate straight away on a file itself, at nom parser level.

We can try, but it could be a BIG rework of the existing code (even trying to be as conservative as possible).

Would you be open to such a thing ? if yes, we could motivate ourself into this goal.

@puffyCid
Copy link
Collaborator

puffyCid commented May 4, 2024

@jrouaix that does sound like quite a large rewrite.

I think the primary source of the memory usage is after reading the tracev3 file (~10.5 MBs) the library iterates through all the compressed chunks in the file and then returns the data needed for build_log function.

I think the decompressed data is probably going to be the reason for memory usage?

What if instead of decompressing all chunks in the parse_unified_log function it just decompresses and parses one chunk and returns the data and pointer/offset to next chunk?

so instead of parse_unified_log(data: &[u8]) -> nom::IResult<&[u8], UnifiedLogData>

it would be something like parse_unified_log(data: File, offset: usize) -> nom::IResult<&[u8], UnifiedLogData>
The main thing to add would be something to track the next offset.

Thoughts on perhaps that approach?

@mrguiman
Copy link
Contributor Author

mrguiman commented Aug 2, 2024

Sorry @puffyCid it has been a while, we've been busy building our apps ! There's a first pull requests that was created here, do you want to hold off on it until we go further (as discussed above)?

@puffyCid
Copy link
Collaborator

puffyCid commented Aug 6, 2024

no worries @mrguiman!

There is a few merge conflicts between the PR and main branch. If you could update/sync the PR to the latest version. I would be fine with committing.

For further reducing memory usage i think there are two approaches:

  1. As @jrouaix mentioned start the iterator at the nom parser level
  2. Parse only one chunk of the log file at a time (instead of all chunks currently). And keep track of the offset to the next chunk

Both would be pretty large overhauls (1. being the largest).
I think option 1. makes the most sense and keeps the public API/functions easier to use.

If you guys want to take a stab at it that would be great. Otherwise, if your too busy with other projects, no worries.

Time permitting I can try to implement it, with the goal having it merged sometime in September (hopefully)

@jrouaix
Copy link
Contributor

jrouaix commented Aug 7, 2024

I agree the option 1 makes the more sense, could really flatten the memory footprint
(and we can build more iterators over this low level one).

It's a big refactoring, and difficult.
We won't be able to dig anything until late august, If you want to take the lead, feel free to.
If not we'll let you know when/if we deep dive into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants