Merge pull request #304 from nadiaenh/glm

Add GLM with design-based standard errors
xKDR · Nov 23, 2023 · c425057 · c425057
2 parents a916697 + c677962
commit c425057
Show file tree

Hide file tree

Showing 20 changed files with 716 additions and 162 deletions.
diff --git a/Project.toml b/Project.toml
@@ -9,6 +9,7 @@ CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
 CairoMakie = "13f3f980-e62b-5c42-98c6-ff1f3baf88f0"
 CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 Missings = "e1d29d7a-bbdc-5cf2-9ac0-f12de2c33e28"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

diff --git a/README.md b/README.md
@@ -92,7 +92,7 @@ to compute the standard errors.
 
 ```julia
 julia> bootsrs = bootweights(srs; replicates=1000)
-ReplicateDesign:
+ReplicateDesign{BootstrapReplicates}:
 data: 200×1047 DataFrame
 strata: none
 cluster: none

diff --git a/docs/make.jl b/docs/make.jl
@@ -23,6 +23,7 @@ makedocs(;
             "Replicate weights" => "man/replicate.md",
             "Plotting" => "man/plotting.md",
             "Comparison with other survey analysis tools" => "man/comparisons.md",
+            "Generalised linear models" => "man/glm.md",
         ],
         "API reference" => "api.md",
     ],

diff --git a/docs/src/api.md b/docs/src/api.md
@@ -17,11 +17,12 @@ JackknifeReplicates
 load_data
 bootweights
 jackknifeweights
-variance
+stderr
 mean
 total
 quantile
 ratio
+glm
 plot
 boxplot
 hist

diff --git a/docs/src/man/glm.md b/docs/src/man/glm.md
@@ -0,0 +1,99 @@
+# Generalized Linear Models in Survey
+The `glm()` function in the Julia Survey package is used to fit generalized linear models (GLMs) to survey data. It incorporates survey design information, such as sampling weights, stratification, and clustering, to produce valid estimates and standard errors that account for the type of survey design.
+
+As of June 2023, the [GLM.jl documentation](https://juliastats.org/GLM.jl/stable/) lists the supported distribution families and their link functions as:
+```txt
+Bernoulli (LogitLink)
+Binomial (LogitLink)
+Gamma (InverseLink)
+InverseGaussian (InverseSquareLink)
+NegativeBinomial (NegativeBinomialLink, often used with LogLink)
+Normal (IdentityLink)
+Poisson (LogLink)
+```
+
+Refer to the GLM.jl documentation for more information about the GLM package.
+
+## Fitting a GLM to a Survey Design object
+
+You can fit a GLM to a Survey Design object the same way you would fit it to a regular data frame. The only difference is that you need to specify the survey design object as the second argument to the `glm()` function.
+
+```julia
+using Survey
+apisrs = load_data("apisrs") 
+
+# Simple random sample survey
+srs = SurveyDesign(apisrs, weights = :pw) 
+
+# Survey stratified by stype
+dstrat = SurveyDesign(apistrat, strata = :stype, weights = :pw) 
+
+# Survey clustered by dnum
+dclus1 = SurveyDesign(apiclus1, clusters = :dnum, weights = :pw) 
+```
+
+Once you have the survey design object, you can fit a GLM using the `glm()` function. Specify the formula for the model and the distribution family. 
+
+The `glm()` function supports all distribution families supported by GLM.jl, i.e. Bernoulli, Binomial, Gamma, Geometric, InverseGaussian, NegativeBinomial, Normal, and Poisson. 
+
+For example, to fit a GLM with a Bernoulli distribution and a Logit link function to the `srs` survey design object we created above:
+```julia
+formula = @formula(api00 ~ api99)
+my_glm = glm(formula, srs, family = Normal())
+
+# View the coefficients and standard errors
+my_glm.Coefficients
+my_glm.SE
+```
+
+## Examples
+
+The examples below use the `api` datasets, which contain survey data collected about California schools. The datasets are included in the Survey.jl package and can be loaded by calling `load_data("name_of_dataset")`.
+
+### Bernoulli with Logit Link
+
+A school being eligible for the awards program (`awards`) is a binary outcome (0 or 1). Let's assume it follows a Bernoulli distribution. Suppose we want to predict `awards` based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:
+
+```julia
+using Survey
+apisrs = load_data("apisrs")
+srs = SurveyDesign(apisrs, weights = :pw) 
+
+# Convert yes/no to 1/0
+apisrs.awards = ifelse.(apisrs.awards .== "Yes", 1, 0)
+
+# Fit the model
+model = glm(@formula(awards ~ meals + ell), apisrs, Bernoulli(), LogitLink())
+```
+
+### Poisson with Log Link
+
+Let us assume that the number of students tested (`api_stu`) follows a Poisson distribution, which models the number of successes out of a fixed number of trials. Suppose we want to predict the number of students tested based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:
+
+```julia
+using Survey
+apisrs = load_data("apisrs")
+srs = SurveyDesign(apisrs, weights = :pw) 
+
+# Rename api.stu to api_stu
+rename!(apisrs, Symbol("api.stu") => :api_stu)
+
+# Fit the model
+model = glm(@formula(api_stu ~ meals + ell), apisrs, Poisson(), LogLink())
+```
+
+### Gamma with Inverse Link
+
+Let us assume that the average parental education level (`avg_ed`) follows a Gamma distribution, which is suitable for modeling continuous, positive-valued variables with a skewed distribution. Suppose we want to predict the average parental education level based on the percentage of students eligible for subsidized meals (`meals`) and the percentage of English Language Learners (`ell`). We can fit this GLM using the code below:
+
+```julia
+using Survey
+apisrs = load_data("apisrs")
+srs = SurveyDesign(apisrs, weights = :pw) 
+
+# Rename api.stu to api_stu
+rename!(apisrs, Symbol("avg.ed") => :avg_ed)
+
+# Fit the model
+model = glm(@formula(avg_ed  ~ meals + ell), apisrs, Gamma(), InverseLink())
+```
diff --git a/docs/src/man/replicate.md b/docs/src/man/replicate.md
@@ -4,9 +4,9 @@ Replicate weights are a method for estimating the standard errors of survey stat
 
 The basic idea behind replicate weights is to create multiple versions of the original sample weights, each with small, randomly generated perturbations. The multiple versions of the sample weights are then used to calculate the survey statistic of interest, such as the mean or total, on multiple replicate samples. The variance of the survey statistic is then estimated by computing the variance across the replicate samples.
 
-Currently, the Rao-Wu bootstrap[^1] is the only method in the package for generating replicate weights. 
+Currently, the Rao-Wu bootstrap[^1] and the Jackknife [^2] are the only methods in the package for generating replicate weights. In the future, the package will support additional types of inference methods, which will be passed when creating a `ReplicateDesign` object.
 
-The `bootweights` function of the package can be used to generate a `ReplicateDesign` from a `SurveyDesign`
+The `bootweights` function of the package can be used to generate a `ReplicateDesign` using the Rao-Wu bootstrap method from a `SurveyDesign`.
 For example: 
 ```@repl bootstrap
 using Survey
@@ -15,7 +15,16 @@ dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
 bstrat = bootweights(dstrat; replicates = 10)
 ```
 
-For each replicate, the `DataFrame` of `ReplicateDesign` has an additional column. The of the column is `replicate_` followed by the replicate number.  
+The `jackknifeweights` function of the package can be used to generate a `ReplicateDesign` using the Jackknife method from a `SurveyDesign`.
+For example: 
+```@repl bootstrap
+using Survey
+apistrat = load_data("apistrat")
+dstrat = SurveyDesign(apistrat; strata=:stype, weights=:pw)
+bstrat = jackknifeweights(dstrat; replicates = 10)
+```
+
+For each replicate, the `DataFrame` of `ReplicateDesign` has an additional column. The name of the column is `replicate_` followed by the replicate number.  
 
 ```@repl bootstrap
 names(bstrat.data)
@@ -38,4 +47,5 @@ For each replicate weight, the statistic is calculated using it instead of the w
 
 ## References
 
-[^1]: [Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.](https://journals.sagepub.com/doi/abs/10.1177/096228029600500305?journalCode=smma)
+[^1]: [Rust, Keith F., and J. N. K. Rao. "Variance estimation for complex surveys using replication techniques." Statistical methods in medical research 5.3 (1996): 283-310.](https://journals.sagepub.com/doi/abs/10.1177/096228029600500305?journalCode=smma)
+[^2]: [Miller, Rupert G. “The Jackknife--A Review.” Biometrika 61, no. 1 (1974): 1–15. https://doi.org/10.2307/2334280.](https://www.jstor.org/stable/2334280)
diff --git a/src/Survey.jl b/src/Survey.jl
@@ -1,8 +1,9 @@
 module Survey
 
 using DataFrames
+import DataFrames: rename!
 using Statistics
-import Statistics: quantile
+import Statistics: std, quantile
 using StatsBase
 import StatsBase: mean, quantile
 using CSV
@@ -12,6 +13,8 @@ using AlgebraOfGraphics
 using CategoricalArrays
 using Random
 using Missings
+using GLM
+import GLM: @formula, glm
 
 include("SurveyDesign.jl")
 include("bootstrap.jl")
@@ -26,17 +29,19 @@ include("boxplot.jl")
 include("show.jl")
 include("ratio.jl")
 include("by.jl")
+include("reg.jl")
 
 export load_data
 export AbstractSurveyDesign, SurveyDesign, ReplicateDesign
 export BootstrapReplicates, JackknifeReplicates
 export dim, colnames, dimnames
-export mean, total, quantile
+export mean, total, quantile, std
 export plot
 export hist, sturges, freedman_diaconis
 export boxplot
 export bootweights
 export ratio
 export jackknifeweights, variance
+export @formula, glm
 
 end
diff --git a/src/bootstrap.jl b/src/bootstrap.jl
@@ -61,45 +61,75 @@ function bootweights(design::SurveyDesign; replicates = 4000, rng = MersenneTwis
 end
 
 """
-    variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates})
+stderr(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...)
 
+Compute the standard error of the estimated mean using the bootstrap method.
 
-Use replicate weights to compute the standard error of the estimated mean using the bootstrap method. The variance is calculated using the formula
+# Arguments
+- `x::Union{Symbol, Vector{Symbol}}`: Symbol or vector of symbols representing the variable(s) for which the mean is estimated.
+- `func::Function`: Function used to calculate the mean.
+- `design::ReplicateDesign{BootstrapReplicates}`: Replicate design object.
+- `args...`: Additional arguments to be passed to the function.
+- `kwargs...`: Additional keyword arguments.
+
+# Returns
+- `df`: DataFrame containing the estimated mean and its standard error.
+
+The standard error is calculated using the formula
 
 ```math
 \\hat{V}(\\hat{\\theta}) = \\dfrac{1}{R}\\sum_{i = 1}^R(\\theta_i - \\hat{\\theta})^2
 ```
 
 where above ``R`` is the number of replicate weights, ``\\theta_i`` is the estimator computed using the ``i``th set of replicate weights, and ``\\hat{\\theta}`` is the estimator computed using the original weights.
 
-```jldoctest
-julia> using Survey, StatsBase;
-
-julia> apiclus1 = load_data("apiclus1");
-
-julia> dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw);
+# Examples
 
-julia> bclus1 = dclus1 |> bootweights;
+```jldoctest; setup = :(using Survey, StatsBase, DataFrames; apiclus1 = load_data("apiclus1"); dclus1 = SurveyDesign(apiclus1; clusters = :dnum, weights = :pw); bclus1 = dclus1 |> bootweights;)
 
-julia> weightedmean(x, y) = mean(x, weights(y));
+julia> mean(df::DataFrame, column, weights) = StatsBase.mean(df[!, column], StatsBase.weights(df[!, weights]));
 
-julia> variance(:api00, weightedmean, bclus1)
+julia> stderr(:api00, mean, bclus1)
 1×2 DataFrame
  Row │ estimator  SE
      │ Float64    Float64
 ─────┼────────────────────
    1 │   644.169  23.4107
-
 ```
 """
-function variance(x::Symbol, func::Function, design::ReplicateDesign{BootstrapReplicates})
-    θ̂ = func(design.data[!, x], design.data[!, design.weights])
-    θ̂t = [
-        func(design.data[!, x], design.data[!, "replicate_"*string(i)]) for
-        i = 1:design.replicates
+function stderr(x::Union{Symbol, Vector{Symbol}}, func::Function, design::ReplicateDesign{BootstrapReplicates}, args...; kwargs...)
+
+    # Compute the estimators
+    θs = func(design.data, x, design.weights, args...; kwargs...)
+
+    # Compute the estimators for each replicate
+    θts = [
+        func(design.data, x, "replicate_" * string(i), args...; kwargs...) for i in 1:design.replicates
     ]
-    variance = sum((θ̂t .- θ̂) .^ 2) / design.replicates
-    return DataFrame(estimator = θ̂, SE = sqrt(variance))
+
+    # Convert θs and θts to a vector if they are not already
+    θs = (θs isa Vector) ? θs : [θs]  
+    θts2 = Vector[]
+    if !(θts[1] isa Vector)
+        for θt in θts
+            push!(θts2, [θt])
+        end
+        θts = θts2
+    end
+
+    # Calculate variances for each estimator
+    variance = Float64[]
+
+    θts = hcat(θts...)
+    for i in 1:length(θs)
+        θ = θs[i]
+        θt = θts[i, :]
+        θt = filter(!isnan, θt)
+        num = sum((θt .- θ) .^ 2) / length(θt)
+        push!(variance, num)
+    end
+
+    return DataFrame(estimator = θs, SE = sqrt.(variance))
 end
 
 function _bootweights_cluster_sorted!(cluster_sorted,

diff --git a/src/by.jl b/src/by.jl
@@ -1,28 +1,23 @@
-function bydomain(x::Symbol, domain, design::SurveyDesign, func::Function)
-    gdf = groupby(design.data, domain)
-    X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic)
-    return X
+function subset(group, design::SurveyDesign)
+    return SurveyDesign(DataFrame(group);clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights)   
+end 
+
+function subset(group, design::ReplicateDesign)
+    return ReplicateDesign{typeof(design.inference_method)}(DataFrame(group), design.replicate_weights;clusters = design.cluster, strata = design.strata, popsize = design.popsize, weights = design.weights)   
 end
 
-function bydomain(x::Symbol, domain, design::ReplicateDesign, func::Function)
+function bydomain(x::Union{Symbol, Vector{Symbol}}, domain,design::Union{SurveyDesign, ReplicateDesign}, func::Function, args...; kwargs...)
+    domain_names = unique(design.data[!, domain])
     gdf = groupby(design.data, domain)
-    nd = length(gdf)
-    X = combine(gdf, [x, design.weights] => ((a, b) -> func(a, weights(b))) => :statistic)
-    Xt_mat = Array{Float64,2}(undef, (nd, design.replicates))
-    for i = 1:design.replicates
-        Xt_mat[:, i] =
-            combine(
-                gdf,
-                [x, Symbol("replicate_" * string(i))] =>
-                    ((a, c) -> func(a, weights(c))) => :statistic,
-            ).statistic
+    domain_names = [join(collect(keys(gdf)[i]), "-") for i in 1:length(gdf)]
+    vars = DataFrame[]
+    for group in gdf
+        push!(vars, func(x, subset(group, design), args...; kwargs...))
     end
-    ses = Float64[]
-    for i = 1:nd
-        filtered_dx = filter(!isnan, Xt_mat[i, :] .- X.statistic[i])
-        push!(ses, sqrt(sum(filtered_dx .^ 2) / length(filtered_dx)))
+    estimates = vcat(vars...)
+    if isa(domain, Vector{Symbol})
+        domain = join(domain, "_")
     end
-    replace!(ses, NaN => 0)
-    X.SE = ses
-    return X
-end
+    estimates[!, domain] = domain_names
+    return estimates
+end