
Improve syft performance by memoizing filetree #1480

Closed
aeg opened this issue Jan 18, 2023 · 4 comments · Fixed by #1510
Labels
enhancement New feature or request

Comments

@aeg

aeg commented Jan 18, 2023

What would you like to be added:
To speed up `syft packages` processing, it would be effective to memoize `file.Path.Normalize()` in stereoscope/filetree.

Why is this needed:
The filetree in stereoscope, as used by syft, wastes CPU resources.

Additional context:

Profiling `syft packages dir:/` shows that `path.(*lazybuf).append()` and `path.Clean()` consume more than 20% of CPU time at runtime:

      flat  flat%   sum%        cum   cum%
    69.18s 19.52% 19.52%     69.46s 19.60%  path.(*lazybuf).append
    26.48s  7.47% 26.99%     96.49s 27.23%  path.Clean
    21.83s  6.16% 33.15%     75.73s 21.37%  runtime.scanobject
    18.44s  5.20% 38.36%     20.80s  5.87%  runtime.findObject
    13.42s  3.79% 42.15%     13.56s  3.83%  syscall.Syscall
    10.48s  2.96% 45.10%     10.48s  2.96%  aeshashbody
     9.90s  2.79% 47.90%     32.64s  9.21%  runtime.mapaccess1_faststr

I profiled stereoscope and found that the `node()` function in stereoscope's filetree was called far too many times:

func (t *FileTree) node(p file.Path, strategy linkResolutionStrategy) (*filenode.FileNode, error)

In my case, on a system with 55,148 paths, node() was called 58,882,000 times, which is 1,067 calls per path on average.

call flow:
  node() -> file.Path.Normalize()
            -> path.Clean()
               -> path.(*lazybuf).append()

Since file.Path.Normalize() only needs to run once for a given path, I tried a simple memoization of file.Path.Normalize(). Afterwards, path.(*lazybuf).append() no longer appeared at the top of the profile:

      flat  flat%   sum%        cum   cum%
    21.24s  7.55%  7.55%     23.79s  8.45%  runtime.findObject
    20.08s  7.14% 14.68%     75.85s 26.95%  runtime.scanobject
    18.93s  6.73% 21.41%     18.93s  6.73%  runtime.evacuated (inline)
    16.51s  5.87% 27.28%     16.51s  5.87%  aeshashbody
    12.43s  4.42% 31.69%     12.65s  4.50%  syscall.Syscall
    10.41s  3.70% 35.39%     38.29s 13.61%  runtime.mapaccess2_faststr

With this change, execution time drops from 385 seconds to 257 seconds, a speedup of roughly 33%.
Ultimately the filetree processing itself should be improved, but since this workaround alone improves performance by 10%–20% on average, it would be a good idea to include it first.

@Mikcl
Contributor

Mikcl commented Jan 18, 2023

nice change

regarding:

, the filetree process should be improved

I have an attempt at that here: #1463

(not a maintainer)

@aeg
Author

aeg commented Jan 19, 2023

Thanks for your comment, and nice PR!

I think the doublestar search is very costly. In fact, I think a search like `find . -name Pipfile` would be better.

Also, when I run `syft packages dir:/home/myhome`, syft's catalogers search for link information across all directories up to `/home/myhome`. I don't really see the need for this.

@kzantow kzantow added this to OSS Jan 26, 2023
@kzantow
Contributor

kzantow commented Jan 26, 2023

We are addressing this right now #1510

@wagoodman
Contributor

Though #1510 addresses the symptom (a slow cataloging process), it does so not by memoizing Normalize() calls. Instead, we looked at the common call patterns and added indexes to the underlying data structure that can be consulted before searching the tree.
