The Hawkins-Young Uncertainty Method #240

dt-woods · 2024-05-08T21:16:39Z

dt-woods
May 8, 2024
Collaborator

tldr;

Below, I present the Hawkins-Young method for building a log-normal distribution to serve as the error/uncertainty distribution for Monte Carlo simulations where the expected value of the distribution is the emission factor and the 95th percentile of the distribution is the 90% confidence interval upper limit.

It took a while, but I think I am finally coming to grips with the uncertainty calculation used in eLCI's generation.py. Most of the origins were lost in the evolution of _geometric_mean and _calc_geom_std sub-modules found in the aggregate_data function. This seems reasonable having waded through all the approximation methods, value substitutions, and distribution assumptions that serve as the background.

import numpy as np
from scipy.optimize import least_squares
from scipy.special import erfinv
from scipy.special import erf
from scipy.stats import uniform
from scipy.stats import lognorm
from scipy.stats import t

def hawkins_young_sigma(x, **kwargs):
    # From Young et al. (2019) <https://doi.org/10.1021/acs.est.8b05572>,
    # to ensure non-negative releases in Monte Carlo simulations, the error
    # is set to a log-normal distribution with the expected value assigned
    # to the emission factor and the 95th percentile of the cumulative
    # distribution function (CDF) set to the 90% confidence interval upper
    # limit (CIU).
    # Based on the CDF for lognormal distribution, D(x), set to 0.95 for
    # x = EF*(1+PI), the 90% CIU based on a given emission factor, EF, and
    # prediction/confidence interval expressed as a fraction (or percentage);
    # hence the 1+CIU. If CIU is undefined, a default value of 50% is used.
    if 'alpha' in kwargs.keys():
        alpha = kwargs['alpha']
    else:
        alpha = 0.9

    if 'ciu' in kwargs.keys():
        ciu = kwargs['ciu']
    else:
        ciu = 0.5

    a = 0.5
    z = erfinv(alpha)
    b = -2**0.5*z
    c = np.log(1 + ciu)
    r = a*x**2 + b*x + c
    return r


def hawkins_young(data, ef, alpha):
    # From Young et al. (2019) <https://doi.org/10.1021/acs.est.8b05572>,
    # the prediction interval is expressed as the percentage of the expected
    # release factor; Eq 3. expresses it as:
    # P = s * sqrt(1 + 1/n)*z/y_hat
    # where:
    #   s is the standard error of the expected value, SEM
    #   n is the sample size
    #   z is the critical value for 90% confidence
    #   y_hat is the expected value
    # Note that there is no assumed log-normal distribution here.
    n = len(data)
    z = t.ppf(q=alpha, df=n-1)
    se = np.std(data)/np.sqrt(n)
    y_hat = data.mean()
    ciu = se*np.sqrt(1 + 1/n)*z/y_hat

    # Use least-squares fitting for the quadratic.
    # NOTE: remember, we are fitting sigma, the standard deviation of the
    #       underlying normal distribution. A 'safe' assumption is to
    #       expect sigma to be between 1 and 5. So run a few fits and
    #       get the one that isn't negative (most positive).
    #       Alternativley, we could take std(ddof=1) of the log of the data
    #       to get an estimate of the standard deviation and search across
    #       4x's of it. See snippet code for method:
    #       `s_std = np.round(4*np.log(data).std(ddof=1), 0)`
    all_ans = []
    for i in uniform.rvs(0, 6, size=10):
        ans = least_squares(
            hawkins_young_sigma, i, kwargs={'alpha': alpha, 'ciu': ciu})
        all_ans.append(ans['x'][0])

    sigma = np.max(all_ans)
    mu = np.log(ef) - 0.5*sigma**2

    mu_g = np.exp(mu)
    sigma_g = np.exp(sigma)

    return {
        'mu': mu,
        'sigma': sigma,
        'mu_g': mu_g,
        'sigma_g': sigma_g,
        'ci_%': ciu*100,
    }

First, I had to dispel the error function approximation methods that were used at the heart of this method.
As a reminder, the cumulative distribution function, D(x), is given by:

$$ D(x)=\frac{1}{2}\left[1+\mathrm{erf}\left(\frac{\ln x - \mu}{\sqrt{2}\sigma}\right)\right] $$

and let, $x=EF\times(1+CIU)$ and $\mu=\ln(EF)-0.5\times\sigma^2$, where $EF$ is the emission factor and $CIU$ is the 90% confidence interval expressed as a fraction (or percentage), such that $EF\times(1+CIU)$ gives you the upper limit value. By substitution, you get:

$$ \mathrm{erf}^{-1}(0.9) = \frac{\ln(1+CIU)+0.5\sigma^2}{\sqrt{2}\sigma} $$

So we are looking for the $x$ that gives us the error-function value 0.9.

Here are the two approximate methods of Abramowitz & Stegun's equations (.html), which have been designed to be minimized for a value of 0.9 (hence the return 0.9-r).

def estMethod1(x):
    # Abramowitz & Stegun. Handbook of Mathematical Equations.
    # Eq. 7.1.27; setting the error function to 0.9.
    # Basically returns scipy.special.erfinv(0.9) = 1.163087
    a1 = 0.278393
    a2 = 0.230389
    a3 = 0.000972
    a4 = 0.078108
    r = 1.0 - 1.0/(1 + a1*x + a2*x**2 + a3*x**3 + a4*x**4)**4
    return 0.9 - r


def estMethod2(x):
    # Abramowitz & Stegun. Handbook of Mathematical Equations.
    # Eq. 7.1.27; setting the error function to 0.9.
    # Basically returns scipy.special.erfinv(0.9) = 1.163087
    a1 = 0.0705230784
    a2 = 0.0422820123
    a3 = 0.0092705272
    a4 = 0.0001520143
    a5 = 0.0002765672
    a6 = 0.0000430638
    r = 1.0
    r -= 1.0/(1 + a1*x + a2*x**2 + a3*x**3 + a4*x**4 + a5*x**5 + a6*x**6)**16
    return 0.9 - r

Just using scipy's least squares method, we can find the x value associated with 0.9.

em1 = least_squares(estMethod1, 1.5)
em2 = least_squares(estMethod2, 1.5)
e90 = erfinv(0.9)
print("Estimate 1: %0.5f" % em1['x'][0])
print("Estimate 2: %0.5f" % em2['x'][0])
print("Inv. erf:   %0.5f" % e90)

And we see the output:

Estimate 1: 1.16183
Estimate 2: 1.16309
Inv. erf:   1.16309

Okay, we can use scipy's special inverse error function, erfinv in place of the approximations.

Let's looks at example data, which has been purposefully log-normally oriented, but doesn't have to be:

data = np.random.lognormal(0.25, 1.25, size=19)
data *= 250

In eLCI, the emission factor tends to be the regional sum and our confidence interval level is 90% (inexcusably labeled here as alpha, which is in all the textbooks, confidence = 1 - alpha).

ef = data.sum()
alpha = 0.9
results = hawkins_young(data, ef, alpha)

The goal was to assign log-normal distribution values for the geometric mean and geometric standard deviation where the expected value is the emission factor and the 95th percentile is the 90% CIU.

# Extract the results and observe:
mu = results['mu']
sigma = results['sigma']
ci = results['ci_%']
hss = 0.5*sigma**2   # half sigma squared (hss)
srt = np.sqrt(2)     # square-root of two (srt)
error = 0.5*(1 + erf((np.log(1 + ci/100.) + hss)/(srt*sigma)))
print("E(x) = %0.3f (%0.3f)" % (np.exp(mu + hss), ef))
print("D(x) = %0.3f (0.95)" % error)

The output is:

E(x) = 14305.705 (14305.705)
D(x) = 0.950 (0.95)

We see that the expected value of the error distribution matches the emission factor and the CDF is 0.95!

If you're feeling it, you can check these values against their fitted alternatives (this assumes the data are log-normal, which is not the assumption held within the Hawkins-Young method).

Start with a handy function that computes geometric mean and geometric standard deviation from a given dataset.

def geoSD(x):
    """Compute geometric standard deviation and the associated decimal
    coefficients to provide high and low range values about the mean.

    For a series of values X, which are log-normal, such that Z = log(X)
    is normally distributed about (mu, sigma).

    The geometric mean (mu_g) of X is e^(mu) and geometric standard
    deviation (std_g) is e^(sigma).

    Unlike standard deviation, the geometric standard deviation is a
    non-negative, unitless multiplicative factor that acts as a measure of
    the spread of logarithmic values around the mean. Therefore, the spread
    is given as: mu_g/sigma_g to mu_g*sigma_g. This contrasts standard
    deviation, which is mu +/- sdev.

    Analogous to Pearson's coefficient of variation, the geometric
    coefficient of variation (GCV) provides the multiplicative factor to
    be used against X is:

    GCV (%) = (GSD - 1)*100

    with the range (100*100/(100+GSC), 100+GCV) in percentages.

    Parameters
    ----------
    x : list, numpy.array, or iterable
        A list of numeric values representing X.

    Returns
    -------
    dict
        Geometric mean (mu_g), geometric standard deviation (sigma_g),
        geometric coefficient of variation (gcv), the upper/lower
        coefficients (to multiply against mu_g), and upper/lower limits
        (after multiplying against mu_g).

    Notes
    -----
    Provides several variants of the geometric standard deviation
    calculation, all providing the same value.

    Reference:
        Kirdwood (1970), "Geometric means and measures of dispersion,"
        Biometrics, 35(4), 908--909. https://www.jstor.org/stable/2530139

    Examples
    --------
    >>> rv = scipy.stats.lognorm.rvs(s=0.95, scale=0.5, loc=693, size=5000)
    >>> params = geoSD(rv)
    >>> print("%0.3f,%0.3f,%0.3f" % (
    ...     params['lower_limit'], params['mu_g'], params['upper_limit']))
    692.843,693.786,694.730
    """
    # Convert to floats in a numpy array
    x = np.array(x, dtype=np.float64)

    # Remove negatives and non-finite values
    x[x <= 0] = np.nan
    x_finite = np.isfinite(x)
    x = x[x_finite]
    n = len(x)

    # Transform data
    xlog = np.log(x)

    # Compute geometric mean and standard dev.
    mu_g = np.exp(xlog.mean())

    # NOTE: sigma_g's lower bound is 1 because exp to any positive
    # number is >=1 and stdev (i.e., square root of variance) is
    # always positive.

    # Scipy's method; note that ddof is zero,
    # where ddof determines the degress of freedom = N-ddof
    sigma_g3 = np.exp(np.std(xlog, ddof=0))
    # BIO-RSG/oceancolouR method
    # https://rdrr.io/github/BIO-RSG/oceancolouR/src/R/math_funcs.R
    sigma_g2 = np.exp(np.sqrt(np.sum(np.log(x/mu_g)**2)/n))
    # codellama's translation (yep, that's the same):
    sigma_g1 = np.exp(np.sqrt(np.var(xlog, ddof=0, axis=0)))
    # Short-hand of sigma_g2, note the ddof needs to be 1
    sigma_g = np.exp(np.sqrt((n-1)/n) * np.std(xlog, ddof=1))

    # Geometric coefficient of variation, GCV, expressed as a
    # percentage. This provides you with the limits of the range
    # of values about the mu_g.
    gcv = 100*(sigma_g - 1)
    r1 = 100*100/(100 + gcv)
    r2 = 100+gcv

    return ({
        'mu_g': mu_g,
        'sigma_g': (sigma_g3, sigma_g2, sigma_g1, sigma_g),
        'gcv_%': gcv,
        'gcv': gcv/100.0,
        'lower_coef_%': r1,
        'lower_coef': r1/100.0,
        'upper_coef_%': r2,
        'upper_coef': r2/100.0,
        'lower_limit': mu_g*r1/100.0,
        'upper_limit': mu_g*r2/100.0,
    })

Then get the results.

g = geoSD(data)
print("Geo Mean: %0.3f (%0.3f)" % (results['mu_g'], g['mu_g']))
print("Geo Std: %0.3f (%0.3f)" % (results['sigma_g'], g['sigma_g'][0]))

Which, in my case, returned the following:

Geo Mean: 177.868 (273.456)
Geo Std: 19.341 (4.434)

I expect them to be different and they are. The assumptions are laid out now for the Hawkins-Young method and it serves its purpose. Please feel free to comment.

bl-young · 2024-05-08T23:02:44Z

bl-young
May 8, 2024
Maintainer

Thanks for laying this out so clearly. I don't know your implementation plans here in eLCI, but it could be valuable to have this function available more broadly within esupy if you are open to considering it.

0 replies

dt-woods · 2024-05-09T00:38:01Z

dt-woods
May 9, 2024
Collaborator Author

I'm open to it!

…

On Wed, May 8, 2024, 19:03 Ben Young (ERG) ***@***.***> wrote: Thanks for laying this out so clearly. I don't know your implementation plans here in eLCI, but it could be valuable to have this function available more broadly within esupy if you are open to considering it. — Reply to this email directly, view it on GitHub <#240 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHCFB5MBIVX46SDWMQHXXHLZBKVKTAVCNFSM6AAAAABHNWOABOVHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM4TGNRRGI3DS> . You are receiving this because you authored the thread.Message ID: ***@***.***>

0 replies

dt-woods · 2024-05-09T17:53:49Z

dt-woods
May 9, 2024
Collaborator Author

The one suggestion I would make would be to allow a single data point to be evaluated. Currently, it would fail and there's no checks to prevent it (that's another point). For sensitivity analysis using Monte Carlo and log-normal error distributions, it might be nice to provide a value of data that's the same as ef to hawkins_young and still generate a distribution (e.g., if len(data) == 0, data = ef). It just needs a parameterized ciu value (e.g., the 0.5 used in hawkins_young_sigma). This would let a user who has a single data point for an EF create distribution parameters for sensitivity analysis and all the background methodology is accounted for here.

0 replies

dt-woods · 2024-05-14T22:06:52Z

dt-woods
May 14, 2024
Collaborator Author

After testing the 2020 baseline inventory in eLCI, I've made a few updates/improvements to the main function.
Notably, the max search on the answers returned from the least_squares was always >3.0, so I took a closer looks and saw there were two positive roots. Now, I'm grabbing the smallest positive root, which is given more expected results (particularly for groups with 300 data points). There's a lot of bad data groups (e.g., only 1 data point, all zero data, all highly negative data), so I added an error flag that gets caught in the return dictionary for quality checking the results.

def hawkins_young(data, ef, alpha):
    """Model a log-normal uncertainty distribution to a dataset.

    The (geometric) mean and (geometric) standard deviation are fitted to an
    assumed distribution that has an expected value of the emission factor,
    `ef`, and the 95th percentile at the 90% confidence upper interval.

    This modeled log-normal distribution is for use with Monte-Carlo
    simulations to guarantee non-negative emission values with an expected
    value that matches a given emission factor.

    This method does not fit a log-normal distribution to the given data!

    Parameters
    ----------
    data : numpy.array
        A data array.
    ef : float
        The emission factor, the expected value.
    alpha : float
        The confidence level, expressed as a fraction
        (e.g., 90% confidence = 0.9).

    Returns
    -------
    dict
        A dictionary of results. Keys include the following.

        -   'mu' (float): The mean of a normally distributed values of
            Y = log(X)
        -   'sigma' (float): The standard deviation of the normally distributed
            values of Y = log(X)
        -   'mu_g' (float): The geometric mean for the log-normal distribution.
        -   'sigma_g' (float): The geometric standard deviation for the
            log-normal distribution.
        -   'ci_%' (float): The upper end of the 90% confidence interval for X
            (expressed as a percentage).
        -   'error' (bool): Whether the method failed (e.g., too few data
            points, ci < -1, ef < 0). To be used to quality check results.
    """
    # From Young et al. (2019) <https://doi.org/10.1021/acs.est.8b05572>,
    # the prediction interval is expressed as the percentage of the expected
    # release factor; Eq 3. expresses it as:
    # P = s * sqrt(1 + 1/n)*z/y_hat
    # where:
    #   s is the standard error of the expected value, SEM
    #   n is the sample size
    #   z is the critical value for 90% confidence
    #   y_hat is the expected value
    # Note that there is no assumed log-normal distribution here.
    # HOTFIX nans in z and ciu calcs [2024-05-14; TWD]
    is_error = True
    n = len(data)
    z = 0.0
    if n > 1:
        is_error = False
        z = t.ppf(q=alpha, df=n-1)
    se = np.std(data)/np.sqrt(n)
    y_hat = data.mean()
    ciu = 0.0
    if y_hat != 0:
        ciu = se*np.sqrt(1 + 1/n)*z/y_hat
    if ciu <= -1:
        is_error = True
        ciu = -9.999999e-1  # makes log(0.0000001) in hawkins_young_sigma

    # Use least-squares fitting for the quadratic.
    # NOTE: remember, we are fitting sigma, the standard deviation of the
    #       underlying normal distribution. A 'safe' assumption is to
    #       expect sigma to be between 1 and 5. So run a few fits and
    #       get the one that isn't negative (most positive).
    #       Alternatively, we could take std(ddof=1) of the log of the data
    #       to get an estimate of the standard deviation and search across
    #       4x's of it. See snippet code for method:
    #       `s_std = np.round(4*np.log(data).std(ddof=1), 0)`
    all_ans = []
    for i in uniform.rvs(0, 6, size=10):
        ans = least_squares(
            hawkins_young_sigma, i, kwargs={'alpha': alpha, 'ciu': ciu})
        all_ans.append(ans['x'][0])

    # Find the minimum positive root:
    all_ans = np.array(all_ans)
    sigma = all_ans[np.where(all_ans > 0)].min()

    if ef < 0:
        is_error = True
        mu = np.nan
    else:
        mu = np.log(ef) - 0.5*sigma**2

    mu_g = np.exp(mu)
    sigma_g = np.exp(sigma)

    return {
        'mu': mu,
        'sigma': sigma,
        'mu_g': mu_g,
        'sigma_g': sigma_g,
        'ci_%': ciu*100,
        'error': is_error,
    }

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The Hawkins-Young Uncertainty Method #240

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

The Hawkins-Young Uncertainty Method #240

dt-woods May 8, 2024 Collaborator

Replies: 4 comments

bl-young May 8, 2024 Maintainer

dt-woods May 9, 2024 Collaborator Author

dt-woods May 9, 2024 Collaborator Author

dt-woods May 14, 2024 Collaborator Author

dt-woods
May 8, 2024
Collaborator

bl-young
May 8, 2024
Maintainer

dt-woods
May 9, 2024
Collaborator Author

dt-woods
May 9, 2024
Collaborator Author

dt-woods
May 14, 2024
Collaborator Author