Exploring the cost of small sample sizes, using various goodness-of-fit tests.

This is a project to investigate the accuracy cost of small sample sizes when sampling from a categorical distribution.

Currently, the project only implements one very simple case: sampling from a distribution of evenly weighted categories, using the Jaccard index to evaluate the similarity of the sample distribution to the known population distribution.
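The README does not pin down exactly how the Jaccard index is applied to two distributions; the sketch below is one plausible reading, assuming a weighted (Ružička) Jaccard similarity over category proportions, with jaccard.similarity as a hypothetical helper name rather than a function defined in the script.

    # Hypothetical sketch -- not necessarily how R/sample_size_cost.R computes it.
    # Weighted (Ruzicka) Jaccard similarity of two category-proportion vectors:
    # sum of element-wise minima over sum of element-wise maxima.
    jaccard.similarity <- function(sample.props, population.props) {
      sum(pmin(sample.props, population.props)) / sum(pmax(sample.props, population.props))
    }

    # Example: 4 evenly weighted categories vs. an empirical sample of size 20.
    population.props <- rep(1 / 4, 4)
    draws <- sample(1:4, 20, replace = TRUE)
    sample.props <- as.vector(table(factor(draws, levels = 1:4)) / 20)
    jaccard.similarity(sample.props, population.props)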

To run the project, just run R/sample_size_cost.R, either from within an R REPL/IDE or from the command line.
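For example, assuming the working directory is the repository root (the script writes its output into plots/):

    # From within an R session:
    source("R/sample_size_cost.R")

From a shell, the equivalent is Rscript R/sample_size_cost.R.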

There are two variables of interest that the user might want to set. They currently need to be edited directly in the code (see the sketch after this list):

  • bucket.counts: the set of different distributions that will be sampled from
  • sample.sizes: the set of different sample sizes to use when sampling from each distribution
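A minimal sketch of what setting them might look like; the variable names come from the list above, but the example values, and the assumption that each bucket.counts entry is the number of evenly weighted buckets in one distribution, are illustrative only:

    # Illustrative values only -- edit these inside R/sample_size_cost.R.
    # Assumption: each entry of bucket.counts is the number of evenly weighted
    # categories ("buckets") in one distribution to sample from.
    bucket.counts <- c(2, 5, 10, 50)

    # Sample sizes to draw from each distribution.
    sample.sizes <- c(10, 30, 100, 300, 1000)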

The script will create a folder for each distribution within the plots/ directory. There, for each sample size, it will store a histogram of similarity scores generated by sampling from the distribution 1000 times and comparing each sample to the distribution. The same folder also contains a plot called errorbars.jpg, which shows the mean and standard deviation of the similarity scores at each sample size for that distribution.
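The sketch below mirrors that procedure for a single distribution and sample size, reusing the hypothetical jaccard.similarity() from the earlier sketch; the actual script may differ in details such as plot formatting and file naming.

    # One (distribution, sample size) cell: 1000 resamples, a histogram of scores,
    # and the mean/sd that would feed the error-bar plot.
    n.buckets <- 5
    sample.size <- 30
    population.props <- rep(1 / n.buckets, n.buckets)

    scores <- replicate(1000, {
      draws <- sample(seq_len(n.buckets), sample.size, replace = TRUE)
      sample.props <- as.vector(table(factor(draws, levels = seq_len(n.buckets))) / sample.size)
      jaccard.similarity(sample.props, population.props)
    })

    hist(scores, main = sprintf("%d buckets, sample size %d", n.buckets, sample.size))
    c(mean = mean(scores), sd = sd(scores))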

Also in the plots/ directory, the script creates a "cross-section" plot, which shows the mean and standard deviation of similarity scores for each distribution size (number of buckets) at a fixed sample size.
