GOPAGESIZE

gopagesize is a small program to explore various concurrency options with the Go programming language. The program downloads a list of web pages, either given as command line parameters or listed in a file (one URL per line, specified with the -f option).

For each web page URL, it downloads the HTML, parses it to find all the link, script and image tags and, if these point to external resources, downloads them too. It then computes the size of the web page plus the resources it references to give a sense of the total page weight.
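
As a rough, hypothetical sketch of what that extraction involves (not the repository's actual code), the golang.org/x/net/html tokenizer can be used to collect the href/src attributes of link, script and img tags:

    package main

    import (
        "fmt"
        "net/http"

        "golang.org/x/net/html"
    )

    // extractResources walks the HTML token stream and collects the URLs
    // referenced by link (href), script (src) and img (src) tags.
    func extractResources(resp *http.Response) []string {
        var urls []string
        z := html.NewTokenizer(resp.Body)
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                return urls // io.EOF or a parse error: stop
            }
            if tt != html.StartTagToken && tt != html.SelfClosingTagToken {
                continue
            }
            tok := z.Token()
            var attr string
            switch tok.Data {
            case "link":
                attr = "href"
            case "script", "img":
                attr = "src"
            default:
                continue
            }
            for _, a := range tok.Attr {
                if a.Key == attr && a.Val != "" {
                    urls = append(urls, a.Val)
                }
            }
        }
    }

    func main() {
        resp, err := http.Get("https://example.com/")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        for _, u := range extractResources(resp) {
            fmt.Println(u)
        }
    }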

Usage

    gopagesize [flags] url [url...]
    flags:
    -H    host level concurrency
    -c    specify to run in concurrent mode
    -f string
            file name to load list of url's from
    -p int
            pool size when using concurrent mode (default 10)
    -r    do resource level concurrency
    -s    print only the slowest resource
    -t    trace network connections

Running it with just the URLs downloads each web page, parses it, and then downloads each resource found synchronously, one after the other. This is the basic mode, with no concurrency whatsoever.
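
A minimal sketch of this synchronous flow, assuming an illustrative fetchSize helper and placeholder URLs rather than the program's real code, might look like:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    // fetchSize downloads a URL and returns the number of body bytes read.
    // It is a simplified stand-in for the program's real fetch logic.
    func fetchSize(url string) (int64, error) {
        resp, err := http.Get(url)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        return io.Copy(io.Discard, resp.Body)
    }

    func main() {
        pages := []string{"https://example.com/", "https://example.org/"}
        for _, page := range pages {
            total, err := fetchSize(page)
            if err != nil {
                fmt.Println(page, err)
                continue
            }
            // The real program parses the page body for resources;
            // here we pretend parsing yielded this fixed list.
            resources := []string{"https://example.com/favicon.ico"}
            for _, r := range resources {
                if n, err := fetchSize(r); err == nil {
                    total += n // one after the other, no concurrency
                }
            }
            fmt.Printf("%s: %d bytes\n", page, total)
        }
    }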

The -c option turns on concurrent mode. With just -c, each URL is processed in its own goroutine, but the download, parse and fetch of resources happen synchronously per URL, just as in the synchronous mode.
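
Roughly, the per-URL concurrency can be pictured as below; processPage is a hypothetical stand-in for the download, parse and resource-fetch sequence:

    package main

    import (
        "fmt"
        "sync"
    )

    // processPage stands in for the synchronous per-URL work:
    // download the page, parse it, fetch resources one after the other.
    func processPage(url string) {
        fmt.Println("processed", url)
    }

    func main() {
        urls := []string{"https://example.com/", "https://example.org/"}
        var wg sync.WaitGroup
        for _, u := range urls {
            wg.Add(1)
            // One goroutine per input URL; inside it, the work is still
            // sequential, exactly as in the synchronous mode.
            go func(u string) {
                defer wg.Done()
                processPage(u)
            }(u)
        }
        wg.Wait()
    }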

The -r option, along with -c, turns on request level concurrency. If specified, a pool of goroutines is created, with the pool size determined by the -p option, and all web requests, both for the original web page and for each of the parsed resources, happen concurrently within this pool. As above, a goroutine is spawned per input web page URL; it first fetches the web page concurrently using the pool, parses it, and then fetches the resources concurrently using the same pool.
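
A simplified sketch of such a worker pool, with illustrative names (request, startPool) and placeholder URLs rather than the repository's actual types, might look like:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
    )

    // request is one unit of work for the pool: fetch url and send the
    // body size back on reply.
    type request struct {
        url   string
        reply chan int64
    }

    // startPool launches poolSize worker goroutines that all consume from
    // the same request channel, bounding the number of in-flight fetches.
    func startPool(poolSize int, reqs <-chan request) {
        for i := 0; i < poolSize; i++ {
            go func() {
                for r := range reqs {
                    var n int64
                    if resp, err := http.Get(r.url); err == nil {
                        n, _ = io.Copy(io.Discard, resp.Body)
                        resp.Body.Close()
                    }
                    r.reply <- n
                }
            }()
        }
    }

    func main() {
        reqs := make(chan request)
        startPool(10, reqs) // pool size, as set by the -p option

        pages := []string{"https://example.com/", "https://example.org/"}
        var wg sync.WaitGroup
        for _, page := range pages {
            wg.Add(1)
            go func(page string) { // still one goroutine per input URL
                defer wg.Done()
                pageReply := make(chan int64, 1)
                reqs <- request{url: page, reply: pageReply} // page fetch goes via the pool
                total := <-pageReply
                // Parsing is skipped here; pretend it produced this list.
                resources := []string{"https://example.com/favicon.ico"}
                // Submit all resource fetches to the same pool, then collect replies.
                replies := make([]chan int64, len(resources))
                for i, r := range resources {
                    replies[i] = make(chan int64, 1)
                    reqs <- request{url: r, reply: replies[i]}
                }
                for _, c := range replies {
                    total += <-c
                }
                fmt.Printf("%s: %d bytes\n", page, total)
            }(page)
        }
        wg.Wait()
        close(reqs)
    }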

The -H option, along with -c and -r, turns on per-host concurrency, which basically means that a pool of goroutines is maintained per host instead of globally. The behavior is similar to the above, except that when a request is handed off by the per-webpage goroutine, it goes to a router goroutine which then multiplexes it to the appropriate pool based on the host name in the request, instead of going directly to the pool. The pool size is again controlled by the -p option. [Note: yes, it's a capital H for this option, because the lowercase -h is already taken by convention to mean print help.]
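
The router-plus-per-host-pools idea can be sketched roughly as below; the names are hypothetical and the reply plumbing is omitted for brevity:

    package main

    import (
        "fmt"
        "net/url"
        "sync"
    )

    // request carries a URL to fetch; reply handling is omitted here.
    type request struct {
        url string
    }

    // startHostPool starts poolSize workers for one host's channel and
    // registers them on wg so main can wait for them to drain.
    func startHostPool(poolSize int, reqs <-chan request, wg *sync.WaitGroup) {
        for i := 0; i < poolSize; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for r := range reqs {
                    fmt.Println("fetching", r.url) // real code would do the HTTP GET here
                }
            }()
        }
    }

    // router receives every request on one channel and multiplexes it to a
    // per-host pool, creating that pool lazily the first time a host is seen.
    func router(poolSize int, in <-chan request, wg *sync.WaitGroup) {
        pools := map[string]chan request{}
        for r := range in {
            u, err := url.Parse(r.url)
            if err != nil {
                continue
            }
            pool, ok := pools[u.Host]
            if !ok {
                pool = make(chan request)
                pools[u.Host] = pool
                startHostPool(poolSize, pool, wg)
            }
            pool <- r
        }
        for _, pool := range pools {
            close(pool)
        }
    }

    func main() {
        var wg sync.WaitGroup
        in := make(chan request)
        routerDone := make(chan struct{})
        go func() {
            router(10, in, &wg) // per-host pool size, as set by the -p option
            close(routerDone)
        }()

        in <- request{url: "https://example.com/a.css"}
        in <- request{url: "https://example.org/b.js"}
        close(in)

        <-routerDone // router has forwarded everything and closed the pools
        wg.Wait()    // workers have drained their per-host channels
    }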

The -s and -t options control how the results are printed. If -s is specified, then for each web page, in addition to the details of fetching the main page and the parse time, only the slowest resource is printed. This is useful in concurrent mode to check whether the total time taken for the page is close to the sum of the page fetch time, the parse time and the time to fetch the slowest resource, thus showing true concurrency.
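
For illustration, picking the slowest resource amounts to tracking the maximum fetch duration; a rough sketch with a placeholder timedFetch helper:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "time"
    )

    // timedFetch downloads a URL and reports how long it took.
    func timedFetch(url string) (time.Duration, error) {
        start := time.Now()
        resp, err := http.Get(url)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        if _, err := io.Copy(io.Discard, resp.Body); err != nil {
            return 0, err
        }
        return time.Since(start), nil
    }

    func main() {
        resources := []string{
            "https://example.com/a.css",
            "https://example.com/b.js",
        }
        var slowestURL string
        var slowest time.Duration
        for _, r := range resources {
            // Keep only the resource with the largest fetch duration.
            if d, err := timedFetch(r); err == nil && d > slowest {
                slowest, slowestURL = d, r
            }
        }
        fmt.Printf("slowest resource: %s (%v)\n", slowestURL, slowest)
    }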

The -t option, when specified, turns on request tracing and prints a summary of the number of connections made per host, along with the address resolved for each host. This is useful for seeing the difference in connection behavior with and without the -H option when fetching concurrently.
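
The standard library's net/http/httptrace package supports this kind of tracing; the following is only an illustrative sketch of counting connections per host for a single request, not the repository's implementation:

    package main

    import (
        "fmt"
        "net/http"
        "net/http/httptrace"
    )

    func main() {
        // connCount tallies connections per host; addrs remembers the
        // resolved peer address. A real program tracing concurrent
        // requests would guard these maps with a mutex.
        connCount := map[string]int{}
        addrs := map[string]string{}

        req, err := http.NewRequest("GET", "https://example.com/", nil)
        if err != nil {
            panic(err)
        }
        trace := &httptrace.ClientTrace{
            // ConnectDone fires once per TCP connection actually dialed,
            // so counting here distinguishes new connections from reuse.
            ConnectDone: func(network, addr string, err error) {
                if err == nil {
                    connCount[req.URL.Host]++
                    addrs[req.URL.Host] = addr
                }
            },
        }
        req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        resp.Body.Close()

        for host, n := range connCount {
            fmt.Printf("%s (%s): %d connection(s)\n", host, addrs[host], n)
        }
    }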
