Produce a sample of lines from files. The sample size is either fixed or proportional to the size of the file. Additionally, the header and footer can be included in the sample.
- no dependencies other than a POSIX system and a C99 compiler.
- licensed under the 3-clause BSD license (BSD-3-Clause)
- proportional sampling of streams and files
- header and footer can be included in the sample
- reservoir sampling (fixed sample size) of streams and files
- stable reservoir sampling, i.e. the input order is preserved (see the sketch after this list)
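As a rough illustration of what stable reservoir sampling does, here is a minimal C99 sketch. It is not `sample`'s actual code: it uses the standard library's `rand()` instead of PCG, reads from stdin only, hard-codes the sample size, and uses `qsort` rather than a radix sort, but the idea is the same: keep a fixed-size reservoir, replace entries with decreasing probability, and restore the original line order at the end.

```c
#define _POSIX_C_SOURCE 200809L   /* for getline() and strdup() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SAMPLE_SIZE 10

struct item {
    size_t line_no;   /* original position; used to keep the output stable */
    char  *text;
};

static int by_line_no(const void *a, const void *b)
{
    const struct item *x = a, *y = b;
    return (x->line_no > y->line_no) - (x->line_no < y->line_no);
}

int main(void)
{
    struct item reservoir[SAMPLE_SIZE];
    char *line = NULL;
    size_t cap = 0;
    size_t seen = 0;

    srand((unsigned)time(NULL));

    while (getline(&line, &cap, stdin) != -1) {
        if (seen < SAMPLE_SIZE) {
            /* fill the reservoir with the first SAMPLE_SIZE lines */
            reservoir[seen].line_no = seen;
            reservoir[seen].text = strdup(line);
        } else {
            /* keep line `seen` with probability SAMPLE_SIZE / (seen + 1) */
            size_t j = (size_t)(rand() / ((double)RAND_MAX + 1.0) * (double)(seen + 1));
            if (j < SAMPLE_SIZE) {
                free(reservoir[j].text);
                reservoir[j].line_no = seen;
                reservoir[j].text = strdup(line);
            }
        }
        seen++;
    }
    free(line);

    /* sort the sample by original line number: this is the "stable" part */
    size_t n = seen < SAMPLE_SIZE ? seen : SAMPLE_SIZE;
    qsort(reservoir, n, sizeof reservoir[0], by_line_no);
    for (size_t i = 0; i < n; i++) {
        fputs(reservoir[i].text, stdout);
        free(reservoir[i].text);
    }
    return 0;
}
```

Note that only the sampled lines are kept in memory, which is why this approach, unlike `shuf`, works on inputs that do not fit in RAM.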
Practically ubiquitous, there's `shuf -n` of GNU coreutils, a tool that, in principle, solves the problem at hand. However, `shuf` buffers all input and is therefore useless for files that don't fit in memory.
So, looking for alternatives, one may come across paulgb's `subsample` or earino's `fast_sample`. They usually do the trick, and everyone seems to agree (judging by GitHub stars). However, both tools have shortcomings: first, they try to make sense of the line data semantically, and second, they are slow!
The first issue is such a major problem that their bug trackers are full of reports. `subsample` needs lines to be UTF-8 strings, and `fast_sample` wants CSV files whose correctness is checked along the way. This project's tool, `sample`, on the other hand, does not care about a line's content; all it needs are the line breaks at the end.
The speed issue is addressed by
- using the most appropriate programming language for the problem
- using radix sort
- using the PCG family to obtain randomness (see the snippet below)
- oversampling
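The PCG family referenced above is not exotic: its minimal 32-bit variant fits in a dozen lines of C. The snippet below reproduces that minimal `pcg32` routine by M. E. O'Neill for reference; how `sample` actually seeds and wires it in is not shown here.

```c
#include <stdint.h>

/* Minimal pcg32 generator from the PCG family (M. E. O'Neill):
   a 64-bit linear congruential step followed by an output permutation.
   `inc` must be odd for the full period. */
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

static uint32_t pcg32_random_r(pcg32_random_t *rng)
{
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;  /* LCG step */
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
    uint32_t rot = (uint32_t)(oldstate >> 59u);
    /* rotate the xorshifted bits by a state-dependent amount */
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31u));
}
```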
To get 10 random words from the `words` file:
$ sample -n 10 -H 0 /usr/share/dict/words
...
benzopyrene
calamondins
cephalothorax
copulate
garbology's
Kewadin
Peter's
reassembly
Vienna's
Wagnerism's
...
The `-H 0` requests 0 lines of header output; the default is 5.
For proportional sampling, use `-r|--rate`:
$ wc -l /usr/share/dict/words
305089
$ sample -r 1% /usr/share/dict/words | wc -l
3080
which is close to the expected value (1% of 305,089 lines is roughly 3,051), bearing in mind that by default the header and footer of the file are printed as well.
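A rate of r simply means that each line is kept independently with probability r, so the output length is only expected to be r times the input's line count, never exact. Below is a minimal C sketch of that idea, with `rand()` standing in for PCG and without any header/footer or percent-sign handling; the real tool's option parsing is an assumption left out here.

```c
#define _POSIX_C_SOURCE 200809L   /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Keep each line of stdin independently with probability `rate`. */
int main(int argc, char **argv)
{
    double rate = (argc > 1) ? atof(argv[1]) : 0.01;  /* e.g. 0.01 for 1% */
    char *line = NULL;
    size_t cap = 0;

    srand((unsigned)time(NULL));
    while (getline(&line, &cap, stdin) != -1) {
        if (rand() / ((double)RAND_MAX + 1.0) < rate)
            fputs(line, stdout);
    }
    free(line);
    return 0;
}
```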
Sampling with a rate of 0 replaces awkward scripts that use multios, `head`, and `tail` to produce the same result.
$ sample -r 0 /usr/share/dict/words
A
AA
AAA
Aachen
aah
...
Zyuganov
Zyuganov's
zyzzyva
zyzzyvas
ZZZ
In no particular order and without any claim to completeness: