Ratio estimator

The ratio estimator is a statistical estimator for the ratio of the means of two random variables. Ratio estimates are biased, and corrections must be made when they are used in experimental or survey work. Because the distribution of a ratio estimate is asymmetrical, symmetric procedures such as the t test should not be used to generate confidence intervals.

The bias is of order O(1/n) (see big O notation), so as the sample size (n) increases the bias asymptotically approaches 0. The estimator is therefore approximately unbiased for large sample sizes.

Definition

Assume there are two characteristics – x and y – that can be observed for each sampled element in the data set. The ratio R is

R = \frac{\mu_y}{\mu_x},

where \mu_x and \mu_y are the population means of the x and y variates. The ratio estimate of a value of the y variate (\theta_y) is

\theta_y = R\,\theta_x,

where \theta_x is the corresponding value of the x variate. \theta_y is known to be asymptotically normally distributed. [1]

Statistical properties

The sample ratio (r) is estimated from the sample as

r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{m_y}{m_x},

where m_x and m_y are the sample means of the x and y variates. That the ratio is biased can be shown with Jensen's inequality as follows (assuming independence between m_x and m_y):

\operatorname{E}(r) = \operatorname{E}\left(\frac{m_y}{m_x}\right) = \operatorname{E}(m_y)\operatorname{E}\left(\frac{1}{m_x}\right) \ge \frac{\operatorname{E}(m_y)}{\operatorname{E}(m_x)} = \frac{\mu_y}{\mu_x} = R,

where \mu_x is the mean of the variate x and \mu_y is the mean of the variate y; the inequality holds because 1/x is convex, so \operatorname{E}(1/m_x) \ge 1/\operatorname{E}(m_x).

Under simple random sampling the bias is of the order O(n^{-1}). An upper bound on the relative bias of the estimate is provided by the coefficient of variation (the ratio of the standard deviation to the mean). [2] Under simple random sampling the relative bias is O(n^{-1/2}).
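
The shrinking bias is easy to see numerically. The following minimal Python sketch (the distributions and parameter values are illustrative assumptions, not taken from the sources above) draws repeated independent samples of x and y and compares the average of the naive ratio r = m_y/m_x with the true ratio R:

```python
import numpy as np

rng = np.random.default_rng(42)
mu_x, mu_y = 10.0, 4.0              # assumed population means, so R = 0.4
R = mu_y / mu_x
reps = 40_000                       # Monte Carlo replications per sample size

for n in (5, 20, 80):
    # independent x and y samples; any distributions with these means would do
    x = rng.normal(mu_x, 3.0, size=(reps, n))
    y = rng.normal(mu_y, 1.0, size=(reps, n))
    r = y.mean(axis=1) / x.mean(axis=1)      # naive ratio estimate per replication
    print(f"n={n:3d}  average bias of r: {r.mean() - R:+.5f}")
```

The printed bias falls roughly in proportion to 1/n, as the O(n^{-1}) result predicts.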

Correction of the mean's bias

The correction methods differ in their efficiency depending on the distributions of the x and y variates, making it difficult to recommend a single best method. Because the estimates of r are biased, a corrected version should be used in all subsequent calculations.

A correction of the bias accurate to the first order is[citation needed]

r_{corr} = r - \frac{r\,s_x^2 - s_{xy}}{n\,m_x^2},

where m_x is the mean of the variate x, s_x^2 is the sample variance of the variate x, and s_{xy} is the covariance between x and y.

To simplify the notation, s_{xy} will be used subsequently to denote the covariance between the variates x and y.

Another estimator based on the Taylor expansion is [3]

where n is the sample size, N is the population size, m_x is the mean of the x variate, and s_x^2 and s_y^2 are the sample variances of the x and y variates respectively.

A computationally simpler but slightly less accurate version of this estimator is

where N is the population size, n is the sample size, m_x is the mean of the x variate, and s_x^2 and s_y^2 are the sample variances of the x and y variates respectively. These versions differ only in a factor of (N − 1) in the denominator; for large N the difference is negligible.
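
As a concrete illustration, here is a small Python sketch of a first-order (delta-method) bias correction of the kind described above; the function name and the optional finite-population argument N are assumptions for this example rather than an implementation from the cited papers:

```python
import numpy as np

def ratio_corrected(x, y, N=None):
    """First-order bias-corrected ratio estimate (a delta-method sketch).

    If the population size N is given, the finite-population factor
    theta = 1/n - 1/N is used; otherwise theta = 1/n.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    mx, my = x.mean(), y.mean()
    r = my / mx                          # naive ratio estimate
    sx2 = x.var(ddof=1)                  # sample variance of x
    sxy = np.cov(x, y)[0, 1]             # sample covariance of x and y
    theta = 1.0 / n - (1.0 / N if N else 0.0)
    # E(r) ~ R * (1 + theta*(sx2/mx^2 - sxy/(mx*my))); remove the estimated bias
    return r * (1.0 - theta * (sx2 / mx**2 - sxy / (mx * my)))
```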

If x and y are unitless counts with a Poisson distribution, a second-order correction is [4]

Other methods of bias correction have also been proposed. To simplify the notation, the following variables will be used:

\theta = \frac{1}{n} - \frac{1}{N}, \qquad c_x^2 = \frac{s_x^2}{m_x^2}, \qquad c_{xy} = \frac{s_{xy}}{m_x m_y}.

Pascual's estimator: [5]

Beale's estimator: [6]

r_{Beale} = r\,\frac{1 + \theta\,c_{xy}}{1 + \theta\,c_x^2}

Tin's estimator: [7]

r_{Tin} = r\left(1 + \theta\,(c_{xy} - c_x^2)\right)

Sahoo's estimator: [8]

Sahoo has also proposed a number of additional estimators: [9]

If x and y are unitless counts with a Poisson distribution and m_x and m_y are both greater than 10, then the following approximation is correct to order O(n^{-3}). [4]

An asymptotically correct estimator is [3]

Jackknife estimation

A jackknife estimate of the ratio is less biased than the naive form. A jackknife estimator of the ratio is

r_J = n\,r - \frac{n-1}{n}\sum_{i=1}^{n} r_i,

where n is the size of the sample and the r_i are estimated with the omission of one pair of variates at a time. [10]
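
A minimal Python sketch of this estimator follows (the helper name is an assumption for illustration); it also returns the usual jackknife variance, computed from the spread of the leave-one-out ratios:

```python
import numpy as np

def jackknife_ratio(x, y):
    """Jackknife ratio estimate r_J = n*r - (n-1)/n * sum(r_i) (a sketch)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    r = y.sum() / x.sum()                    # naive ratio from the full sample
    r_i = (y.sum() - y) / (x.sum() - x)      # leave-one-out ratios, pair i omitted
    r_j = n * r - (n - 1) / n * r_i.sum()    # bias-reduced jackknife estimate
    var_j = (n - 1) / n * ((r_i - r_i.mean()) ** 2).sum()  # jackknife variance
    return r_j, var_j
```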

An alternative method is to divide the sample into g groups, each of size p, with n = pg. [11] Let r_i be the estimate of the ith group. Then the estimator

r_{corr} = \frac{g\,r - \bar{r}}{g - 1},

where \bar{r} is the mean of the ratios r_i of the g groups, has a bias of at most O(n^{-2}).

Other estimators based on the division of the sample into g groups have also been proposed: [12] one uses \bar{r}, the mean of the ratios r_i of the g groups, while another uses the r_i', the values of the sample ratio computed with the ith group omitted.

Other methods of estimation

Other approaches to estimating the ratio include maximum likelihood and bootstrapping. [10]

Estimate of total

The estimated total of the y variate (\tau_y) is

\hat{\tau}_y = r\,\tau_x,

where \tau_x is the total of the x variate.

Variance estimates

The variance of the sample ratio is approximately

\operatorname{var}(r) \approx \frac{r^2}{n}\left(\frac{s_x^2}{m_x^2} + \frac{s_y^2}{m_y^2} - \frac{2\,s_{xy}}{m_x m_y}\right),

where s_x^2 and s_y^2 are the variances of the x and y variates respectively, m_x and m_y are the means of the x and y variates respectively, and s_{xy} is the covariance of x and y.

Although the approximate variance estimator of the ratio given below is biased, the bias is negligible if the sample size is large:

\operatorname{var}(r) = \left(1 - \frac{n}{N}\right)\frac{1}{n\,m_x^2}\,\frac{\sum_{i=1}^{n}(y_i - r\,x_i)^2}{n-1},

where N is the population size, n is the sample size and m_x is the mean of the x variate.
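
In code, this design-based estimator can be sketched as follows (the function name and argument order are assumptions for this example):

```python
import numpy as np

def ratio_variance(x, y, N):
    """Approximate variance of the sample ratio r under simple random
    sampling from a population of size N (a sketch of the estimator above)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    mx = x.mean()
    r = y.mean() / mx
    resid = y - r * x                    # deviations from the fitted ratio line
    s2 = (resid ** 2).sum() / (n - 1)
    return (1 - n / N) * s2 / (n * mx ** 2)
```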

Another estimator of the variance, based on the Taylor expansion, is

\operatorname{var}(r) = \left(\frac{1}{n} - \frac{1}{N}\right)\frac{s_y^2 + r^2 s_x^2 - 2\,r\,s_{xy}}{m_x^2},

where n is the sample size, N is the population size and s_{xy} is the covariance of x and y.

An estimate accurate to O(n^{-2}) is [3]

If the probability distribution is Poissonian, an estimator accurate to O(n^{-3}) is [4]

A jackknife estimator of the variance is

\operatorname{var}(r_J) = \frac{n-1}{n}\sum_{i=1}^{n}\left(r_i - r_J\right)^2,

where r_i is the ratio with the ith pair of variates omitted and r_J is the jackknife estimate of the ratio. [10]

Variance of total

The variance of the estimated total is

\operatorname{var}(\hat{\tau}_y) = \tau_x^2\,\operatorname{var}(r).

Variance of mean

The variance of the estimated mean of the y variate is

\operatorname{var}(\hat{m}_y) = \frac{1}{n}\left(1 - \frac{n}{N}\right)\left(s_y^2 + r^2 s_x^2 - 2\,r\,s_{xy}\right),

where r = m_y/m_x, m_x is the mean of the x variate, s_x^2 and s_y^2 are the sample variances of the x and y variates respectively, and s_{xy} is the covariance of x and y.

Skewness

The skewness and the kurtosis of the ratio depend on the distributions of the x and y variates. Estimates of these parameters have been made for normally distributed x and y variates, but for other distributions no expressions have yet been derived. It has been found that, in general, ratio variables are skewed to the right and leptokurtic, and that their non-normality increases as the magnitude of the denominator's coefficient of variation increases.

For normally distributed x and y variates the skewness of the ratio is approximately [7]

where

Effect on confidence intervals

Because the ratio estimate is generally skewed, confidence intervals created with the variance and symmetric tests such as the t test are incorrect. [10] Such confidence intervals tend to overestimate the size of the left confidence interval and underestimate the size of the right.

If the ratio estimator is unimodal (which is frequently the case) then a conservative estimate of the 95% confidence intervals can be made with the Vysochanskiï–Petunin inequality.
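
For example, a conservative two-sided interval can be computed as in the sketch below: the Vysochanskij–Petunin bound P(|r − μ| ≥ kσ) ≤ 4/(9k²) gives k = sqrt(4/(9α)) for coverage 1 − α, valid for α ≤ 1/6 (k ≈ 2.98 at the 95% level, versus 1.96 for a normal interval).

```python
import math

def vp_interval(r_hat, var_r, alpha=0.05):
    """Conservative (1 - alpha) interval for a unimodal estimator,
    from the Vysochanskij-Petunin inequality (a sketch)."""
    if alpha > 1 / 6:
        # the bound 4/(9 k^2) only applies for k >= sqrt(8/3), i.e. alpha <= 1/6
        raise ValueError("alpha must be at most 1/6")
    k = math.sqrt(4.0 / (9.0 * alpha))   # ~2.98 for alpha = 0.05
    half = k * math.sqrt(var_r)
    return r_hat - half, r_hat + half
```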

Alternative methods of bias reduction

An alternative method of reducing or eliminating the bias in the ratio estimator is to alter the method of sampling. The variance of the ratio under these methods differs from the estimates given previously. Note that while many applications, such as those discussed in Lohr, [13] are intended to be restricted to positive integers only, such as sizes of sample groups, the Midzuno-Sen method works for any sequence of positive numbers, integral or not. Since Lahiri's method returns a biased result, however, it is unclear in what sense it can be said to work.

Lahiri's method

The first of these sampling schemes is a double use of a sampling method introduced by Lahiri in 1951. [14] The algorithm here is based upon the description by Lohr. [13]

  1. Choose a number M = max(x_1, ..., x_N), where N is the population size.
  2. Choose i at random from a uniform distribution on [1, N].
  3. Choose k at random from a uniform distribution on [1, M].
  4. If k ≤ x_i, then unit i is retained in the sample. If not, it is rejected.
  5. Repeat this process from step 2 until the desired sample size is obtained.

The same procedure for the same desired sample size is carried out with the y variate.
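
A Python sketch of the scheme follows; it uses 0-based indices and a continuous draw for k (so non-integer sizes also work), and the function name and seeding are assumptions for this example:

```python
import random

def lahiri_sample(x, sample_size, seed=None):
    """Lahiri's rejection scheme: unit i is accepted with probability
    x[i] / M, i.e. with probability proportional to its size (a sketch)."""
    rng = random.Random(seed)
    N, M = len(x), max(x)
    sample = []
    while len(sample) < sample_size:
        i = rng.randrange(N)          # step 2: pick a unit at random
        k = rng.uniform(0, M)         # step 3: pick a threshold in [0, M]
        if k <= x[i]:                 # step 4: retain unit i with probability x[i]/M
            sample.append(i)          # step 5: repeat until the sample is full
    return sample
```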

Lahiri's scheme as described by Lohr is biased high and so is of interest only for historical reasons; the Midzuno-Sen technique described below is recommended instead.

Midzuno-Sen's method

In 1952 Midzuno and Sen independently described a sampling scheme that provides an unbiased estimator of the ratio. [15] [16]

The first sample is chosen with probability proportional to the size of the x variate. The remaining n − 1 samples are chosen at random without replacement from the remaining N − 1 members of the population. The probability of selecting a sample under this scheme is

P(\text{sample}) = \frac{\sum_{i=1}^{n} x_i}{X\,\binom{N-1}{n-1}},

where X is the sum of the N x variates and the x_i are the n members of the sample. Then the ratio of the sum of the y variates to the sum of the x variates chosen in this fashion is an unbiased estimate of the ratio.

In symbols, we have

r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i},

where the x_i and y_i are chosen according to the scheme described above.

The ratio estimator given by this scheme is unbiased.
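
A Python sketch of the scheme (names assumed for illustration): the first unit is drawn with probability proportional to x, the remaining n − 1 units by simple random sampling without replacement, and the sample ratio is returned.

```python
import random

def midzuno_sen_ratio(x, y, n, seed=None):
    """Ratio estimate from one Midzuno-Sen sample (a sketch)."""
    rng = random.Random(seed)
    N = len(x)
    first = rng.choices(range(N), weights=x, k=1)[0]              # PPS draw of the first unit
    rest = rng.sample([i for i in range(N) if i != first], n - 1) # SRS for the rest
    s = [first] + rest
    return sum(y[i] for i in s) / sum(x[i] for i in s)
```

Averaging this estimate over many repeated draws converges to R, reflecting the unbiasedness of the scheme.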

Särndal, Swensson, and Wretman credit Lahiri, Midzuno and Sen for the insights leading to this method, [17] but Lahiri's technique is biased high.


Other ratio estimators

Tin (1965) [18] described and compared ratio estimators proposed by Beale (1962) [19] and Quenouille (1956) [20] and proposed a modified approach (now referred to as Tin's method). These ratio estimators are commonly used to calculate pollutant loads from sampling of waterways, particularly where flow is measured more frequently than water quality. For example, see Quilbé et al. (2006). [21]


Ordinary least squares regression

If a linear relationship between the x and y variates exists and the regression equation passes through the origin, then the estimated variance of the regression equation is always less than that of the ratio estimator.[citation needed] The precise relationship between the variances depends on the linearity of the relationship between the x and y variates: when the relationship is other than linear, the ratio estimate may have a lower variance than that estimated by regression.
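
This comparison is easy to probe by simulation. In the sketch below (all parameter values are illustrative assumptions), y is generated as a line through the origin plus homoskedastic noise, and the Monte Carlo variances of the ratio estimate and the no-intercept OLS slope are compared:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n, reps = 0.4, 30, 20_000
ratio_est, slope_est = [], []
for _ in range(reps):
    x = rng.uniform(5.0, 15.0, n)
    y = beta * x + rng.normal(0.0, 1.0, n)           # linear through the origin
    ratio_est.append(y.mean() / x.mean())            # ratio estimator of the slope
    slope_est.append((x * y).sum() / (x * x).sum())  # OLS slope with no intercept
print("variance of ratio estimate:", np.var(ratio_est))
print("variance of OLS slope     :", np.var(slope_est))
```

Under these homoskedastic conditions the OLS slope has the smaller variance; when the error variance instead grows in proportion to x, the ratio estimator is the better choice.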

Uses

Although the ratio estimator may be of use in a number of settings, it is of particular use in two cases: when the interest is in the ratio itself, as with per-capita rates, and when the total or mean of the x variate is known for the whole population and can be used to sharpen the estimate of the total or mean of the y variate.

History

The first known use of the ratio estimator was by John Graunt in England, who in 1662 estimated the ratio y/x, where y represented the total population and x the known total number of registered births in the same areas during the preceding year.

Later, Messance (~1765) and Moheau (1778) published carefully prepared estimates for France based on enumeration of the population in certain districts and on the counts of births, deaths and marriages reported for the whole country. The districts from which the ratio of inhabitants to births was determined constituted only a sample.

In 1802, Laplace wished to estimate the population of France. No population census had been carried out and Laplace lacked the resources to count every individual. Instead he sampled 30 parishes whose total number of inhabitants was 2,037,615. The parish baptismal registrations were considered to be reliable estimates of the number of live births so he used the total number of births over a three-year period. The sample estimate was 71,866.333 baptisms per year over this period giving a ratio of one registered baptism for every 28.35 persons. The total number of baptismal registrations for France was also available to him and he assumed that the ratio of live births to population was constant. He then used the ratio from his sample to estimate the population of France.

Karl Pearson noted in 1897 that ratio estimates are biased and cautioned against their use. [22]

References

  1. Scott AJ, Wu CFJ (1981) On the asymptotic distribution of ratio and regression estimators. JASA 76: 98–102
  2. Cochran WG (1977) Sampling Techniques. New York: John Wiley & Sons
  3. van Kempen GMP, van Vliet LJ (2000) Mean and variance of ratio estimators used in fluorescence ratio imaging. Cytometry 39: 300–305
  4. Ogliore RC, Huss GR, Nagashima K (2011) Ratio estimation in SIMS analysis. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 269(17): 1910–1918
  5. Pascual JN (1961) Unbiased ratio estimators in stratified sampling. JASA 56(293): 70–87
  6. Beale EML (1962) Some uses of computers in operational research. Industrielle Organisation 31: 27–28
  7. Tin M (1965) Comparison of some ratio estimators. JASA 60: 294–307
  8. Sahoo LN (1983) On a method of bias reduction in ratio estimation. J Statist Res 17: 1–6
  9. Sahoo LN (1987) On a class of almost unbiased estimators for population ratio. Statistics 18: 119–121
  10. Choquet D, L'Ecuyer P, Léger C (1999) Bootstrap confidence intervals for ratios of expectations. ACM Transactions on Modeling and Computer Simulation 9(4): 326–348. doi:10.1145/352222.352224
  11. Durbin J (1959) A note on the application of Quenouille's method of bias reduction to estimation of ratios. Biometrika 46: 477–480
  12. Mickey MR (1959) Some finite population unbiased ratio and regression estimators. JASA 54: 596–612
  13. Lohr S (2010) Sampling: Design and Analysis, 2nd edition
  14. Lahiri DB (1951) A method of sample selection providing unbiased ratio estimates. Bull Int Stat Inst 33: 133–140
  15. Midzuno H (1952) On the sampling system with probability proportional to the sum of the sizes. Ann Inst Stat Math 3: 99–107
  16. Sen AR (1952) Present status of probability sampling and its use in the estimation of a characteristic. Econometrica 20: 103
  17. Särndal C-E, Swensson B, Wretman J (1992) Model Assisted Survey Sampling. Springer, §7.3.1 (iii)
  18. Tin M (1965) Comparison of some ratio estimators. Journal of the American Statistical Association 60(309): 294–307. https://doi.org/10.1080/01621459.1965.10480792
  19. Beale EML (1962) Some uses of computers in operational research. Industrielle Organisation 31: 27–28
  20. Quenouille MH (1956) Notes on bias in estimation. Biometrika 43(3–4): 353–360
  21. Quilbé R, Rousseau AN, Duchemin M, Poulin A, Gangbazo G, Villeneuve J-P (2006) Selecting a calculation method to estimate sediment and nutrient loads in streams: application to the Beaurivage River (Québec, Canada). Journal of Hydrology 326(1–4): 295–310. https://doi.org/10.1016/j.jhydrol.2005.11.008
  22. Pearson K (1897) On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc Roy Soc Lond 60: 489–498