Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pprof module extension #3773

Merged
merged 7 commits into from
Jul 23, 2021
Merged

Implement pprof module extension #3773

merged 7 commits into from
Jul 23, 2021

Conversation

niklr
Copy link
Contributor

@niklr niklr commented Jun 14, 2021

Issue

@JackyWYX

harmony-one/bounties#46

Test

Unit Test Coverage

go test -cover ./cmd/harmony/

Before:

ok  	github.com/harmony-one/harmony/cmd/harmony	0.266s	coverage: 47.8% of statements

After:

ok  	github.com/harmony-one/harmony/cmd/harmony	0.136s	coverage: 48.5% of statements

Test/Run Logs

When running and stdin is terminal

./harmony -c config_1_0_4.toml 
Old config version detected 1.0.4
Do you want to update config to the latest version: [y/N]
y
Original config backed up to config_1_0_4.toml.backup
Successfully migrated config_1_0_4.toml from 1.0.4 to 2.1.0 

When stdin is piped

echo Y | ./harmony -c config_1_0_4.toml 
Old config version detected 1.0.4
Update saved config with `./harmony config update [config_file]`

Operational Checklist

  1. Does this PR introduce backward-incompatible changes to the on-disk data structure and/or the over-the-wire protocol?. (If no, skip to question 8.)
    NO
  2. Does this PR introduce backward-incompatible changes NOT related to on-disk data structure and/or over-the-wire protocol? (If no, continue to question 11.)
    NO

Manual / cheatsheet

pprof basics

pprof lets you collect CPU profiles, traces, and heap profiles. The normal way to use pprof is:

  • Run ./harmony --pprof
  • Run curl localhost:6060/debug/pprof/{profile_name} --output profile.pb.gz to save a profile
  • Use go tool pprof profile.pb.gz to analyze said profile

pprof profiles

A Profile is a collection of stack traces showing the call sequences that led to instances of a particular event, such as allocation. Packages can create and maintain their own profiles; the most common use is for tracking resources that must be explicitly closed, such as files or network connections.

Each Profile has a unique name. A few profiles are predefined:

  • goroutine - stack traces of all current goroutines
  • heap - a sampling of memory allocations of live objects
  • allocs - a sampling of all past memory allocations
  • threadcreate - stack traces that led to the creation of new OS threads
  • block - stack traces that led to blocking on synchronization primitives
  • mutex - stack traces of holders of contended mutexes

For more information see runtime/pprof docs

A special profile accepted by --pprof.profile.names

  • cpu - profiles the current process

pprof config

  • --pprof enable pprof profiling
  • --pprof.addr string listen address for pprof (default "127.0.0.1:6060")
  • --pprof.folder string folder to put pprof profiles (default "./profiles")
  • --pprof.profile.names strings a list of pprof profile names (separated by ,) e.g. cpu,heap,goroutine
  • --pprof.profile.intervals ints a list of pprof profile interval integer values (separated by ,)
  • --pprof.profile.debug ints a list of pprof profile debug integer values (separated by ,)

Example 1

./harmony --pprof.profile.names cpu,heap --pprof.profile.intervals 0,10

Saves the cpu and heap profiles in the ./profiles folder. The cpu profile will be saved on shutdown and the heap profile will be saved every 10 seconds.

Example 2

./harmony --pprof.profile.names cpu,heap,goroutine --pprof.profile.intervals 10 --pprof.profile.debug 0,0,2

Saves the cpu, heap and goroutine profiles in the ./profiles folder. All profiles will be saved every 10 seconds.
The goroutine profile will use debug=2 which means to print the goroutine stacks in the same form that a Go program uses when dying due to an unrecovered panic.

@niklr
Copy link
Contributor Author

niklr commented Jun 14, 2021

Test case 1 - Starting harmony without any arguments or config

Running:
./harmony

Result:

Staking mode; node key 02c0db3387d1acba29266e087bb45d29ce71d6f9271f7d3c44c3ad346cb999005f0d268850709c91c4d5620770012f99; -> shard 1
Started RPC server at: 127.0.0.1:9500
Started WS server at: 127.0.0.1:9800

Problems Found: None

Test case 2 - Passing --pprof --pprof.folder ./profiles --pprof.profile.names cpu,heap,goroutine --pprof.profile.intervals 10 --pprof.profile.debug 0,0,2

Running:

./harmony --pprof --pprof.folder ./profiles --pprof.profile.names cpu,heap,goroutine --pprof.profile.intervals 10 --pprof.profile.debug 0,0,2

Result:

Staking mode; node key 02c0db3387d1acba29266e087bb45d29ce71d6f9271f7d3c44c3ad346cb999005f0d268850709c91c4d5620770012f99; -> shard 1
Started RPC server at: 127.0.0.1:9500
Started WS server at: 127.0.0.1:9800

Problems Found: None

Test case 3 Running with config 1.0.2

Running:

./harmony -c config_1_0_2.toml

Result:

Old config version detected 1.0.2
Do you want to update config to the latest version: [y/N]

Staking mode; node key 02c0db3387d1acba29266e087bb45d29ce71d6f9271f7d3c44c3ad346cb999005f0d268850709c91c4d5620770012f99; -> shard 1
Started RPC server at: 127.0.0.1:9500
Started WS server at: 127.0.0.1:9800

Problems Found: None

Test case 4 Running with config 1.0.3

Running:

./harmony -c config_1_0_3.toml

Result:

Old config version detected 1.0.3
Do you want to update config to the latest version: [y/N]

Staking mode; node key 02c0db3387d1acba29266e087bb45d29ce71d6f9271f7d3c44c3ad346cb999005f0d268850709c91c4d5620770012f99; -> shard 1
Started RPC server at: 127.0.0.1:9500
Started WS server at: 127.0.0.1:9800

Problems Found: None

Test case 5 Running with config 1.0.4

Running:

./harmony -c config_1_0_4.toml

Result:

Old config version detected 1.0.4
Do you want to update config to the latest version: [y/N]

Staking mode; node key 02c0db3387d1acba29266e087bb45d29ce71d6f9271f7d3c44c3ad346cb999005f0d268850709c91c4d5620770012f99; -> shard 1
Started RPC server at: 127.0.0.1:9500
Started WS server at: 127.0.0.1:9800

Problems Found: None

@JackyWYX JackyWYX requested review from JackyWYX and LeoHChen and removed request for JackyWYX June 29, 2021 22:19
Copy link
Contributor

@JackyWYX JackyWYX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. Some suggestions in the comments. Please fix. Thanks.

api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
cmd/harmony/config.go Outdated Show resolved Hide resolved
cmd/harmony/config_migrations.go Show resolved Hide resolved
cmd/harmony/flags.go Outdated Show resolved Hide resolved
cmd/harmony/flags.go Outdated Show resolved Hide resolved
cmd/harmony/flags.go Outdated Show resolved Hide resolved
@JackyWYX
Copy link
Contributor

JackyWYX commented Jul 7, 2021

Please rebase and solve the conflict introduced by another config upgrade. Thanks!!

@niklr
Copy link
Contributor Author

niklr commented Jul 9, 2021

Thank your for the review. Really good points indeed! I will try to resolve them asap.

@niklr
Copy link
Contributor Author

niklr commented Jul 9, 2021

Setting the default value in config_migrations.go as follows results in the error Can't convert []([]string) to []string(slice)

migrations["2.1.0"] = func(confTree *toml.Tree) *toml.Tree {
  ...
  if confTree.Get("Pprof.ProfileNames") == nil {
    confTree.Set("Pprof.ProfileNames", []string{})
  }
  ...
  
  confTree.Set("Version", "2.2.0")
  return confTree
}

Changing []string{} to make([]string, 0) has the same result when running config_migrations_test.go

Do you know what I'm missing here @JackyWYX ?

@niklr
Copy link
Contributor Author

niklr commented Jul 9, 2021

Changing []string{} to make([]interface{}, 0) did the trick.

@niklr
Copy link
Contributor Author

niklr commented Jul 10, 2021

All mentioned suggestions and remarks should now be covered. Please review again. @JackyWYX @LeoHChen

Thanks a lot!

@niklr niklr requested a review from JackyWYX July 10, 2021 13:23
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
Copy link
Contributor

@JackyWYX JackyWYX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still some comments out there. Please fix. Thanks!

api/service/pprof/service.go Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
api/service/pprof/service.go Outdated Show resolved Hide resolved
cmd/harmony/default.go Outdated Show resolved Hide resolved
cmd/harmony/default.go Outdated Show resolved Hide resolved
cmd/harmony/config_migrations.go Outdated Show resolved Hide resolved
cmd/harmony/config_migrations.go Outdated Show resolved Hide resolved
go.mod Show resolved Hide resolved
@niklr
Copy link
Contributor Author

niklr commented Jul 15, 2021

Ready for round 3 @JackyWYX

@niklr niklr requested a review from JackyWYX July 15, 2021 09:28
}
svc.profiles = profiles

go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind moving these two lines to service.Start?

// stopCpuProfile stops the current CPU profile, if any
func stopCpuProfile() {
pprof.StopCPUProfile()
if cpuFile != nil {
Copy link
Contributor

@JackyWYX JackyWYX Jul 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried testing with go test --race of this line? It seems there is a data race condition of cpuFile.

The race condition can be detected by

  1. implement a unit test with service.Start and last for at least one iteration.
  2. Test with go test --race.

go.mod Show resolved Hide resolved
@JackyWYX
Copy link
Contributor

One last small comments about data racing.

@niklr
Copy link
Contributor Author

niklr commented Jul 19, 2021

Fixed data racing using a mutex and moved the two lines @JackyWYX

@niklr niklr requested a review from JackyWYX July 19, 2021 21:47
Copy link
Contributor

@JackyWYX JackyWYX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@niklr Thanks for all your work! Impressed :)

@rlan35 @LeoHChen Take a look?

api/service/pprof/service.go Outdated Show resolved Hide resolved
@niklr
Copy link
Contributor Author

niklr commented Jul 20, 2021

Thank you @JackyWYX, that review process was rigorous. :) Makes me appreciate the harmony project even more.

With that, do you consider this bounty as completed? The deadline is approaching quite soon (~4 days).

@JackyWYX
Copy link
Contributor

JackyWYX commented Jul 21, 2021

Please resolve the last comment about the variable name lock. Thanks!

@rlan35
Copy link
Contributor

rlan35 commented Jul 21, 2021

looks good to merge if this is backward-compatible. We should do more testing to make sure no race condition and it won't crash the node.

@JackyWYX
Copy link
Contributor

looks good to merge if this is backward-compatible. We should do more testing to make sure no race condition and it won't crash the node.

What kind of testing are you suggesting?

@rlan35
Copy link
Contributor

rlan35 commented Jul 21, 2021

looks good to merge if this is backward-compatible. We should do more testing to make sure no race condition and it won't crash the node.

What kind of testing are you suggesting?

Like extended period of testing longevity test. Like doing frequent pprof every 1s for 2 hours and see if there is any race condition and other problems.

@JackyWYX
Copy link
Contributor

Like extended period of testing longevity test. Like doing frequent pprof every 1s for 2 hours and see if there is any race condition and other problems.

Sound reasonable. @niklr Could you please do this test and give out the machine performance metrics? (Like CPU usage, memory, with / without pprof module)

@niklr
Copy link
Contributor Author

niklr commented Jul 21, 2021

Sure, I will record metrics with the following script tomorrow (getting late here already):

#! /bin/bash
echo "ELAPSED %CPU %MEM" >> output.txt
while :
do
    clear
    ps -p {HARMONY_PID} -o etime,%cpu,%mem | tail -n +2 | tee -a output.txt
    sleep 1s
done

Or do you have any other preferences @JackyWYX ?

@JackyWYX
Copy link
Contributor

Or do you have any other preferences @JackyWYX ?

Could you do performance benchmark on this PR and previous code to see the performance difference?

@niklr
Copy link
Contributor Author

niklr commented Jul 21, 2021

10 minutes running the previous code + 2 hours running this PR with pprof enabled (--pprof.profile.names cpu,heap --pprof.profile.intervals 1)

Something like that?

@JackyWYX
Copy link
Contributor

10 minutes running the previous code + 2 hours running this PR with pprof enabled (--pprof.profile.names cpu,heap --pprof.profile.intervals 1)

Something like that?

Can you do 2 hours running previous code vs 2 hours running with pprof enabled?

@niklr
Copy link
Contributor Author

niklr commented Jul 22, 2021

Harmony performance profiling over 2 hours

LGTM @JackyWYX @rlan35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants