Estimating the prevalence of rare events — theory and practice



Importance sampling is used to improve precision when estimating the prevalence of some rare event in a population. In this post, we explain how we use variants of importance sampling to estimate the prevalence of videos that violate community standards on YouTube. We also cover many practical challenges encountered in implementation when the requirement is to produce fresh and regular estimates of prevalence.


Every single day, millions of videos are uploaded to YouTube. While most of these videos are safe for everyone to enjoy, some violate YouTube's community guidelines and must be removed from the platform. There is a wide range of policy violations, from spammy videos, to videos containing nudity, to those with harassing language. We want to estimate the prevalence of violations of each individual policy category (we call them policy verticals) by sampling videos and manually reviewing the sampled videos.

Naturally, we get an unbiased estimate of the overall prevalence of violations if we sample videos uniformly from the population and have them reviewed by human raters to estimate the proportion of violating videos. We also get an unbiased estimate of the violation rate in each policy vertical. But given the low probability of violation and our desire to use rater capacity wisely, this is not an adequate solution — we typically have too few positive labels in uniform samples to achieve an accurate estimate of the prevalence, especially for the sensitive policy verticals. To obtain a relative error of no more than 20%, we need roughly 100 positive labels, and more often than not, we have zero violating videos in the uniform samples for rarer policies.
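To see where the 100 comes from: with $n\mu$ expected positives in a uniform sample of size $n$ from a population with prevalence $\mu$, the relative half-width of a 95% interval is roughly
$$
\frac{1.96\sqrt{\mu(1-\mu)/n}}{\mu} \approx \frac{1.96}{\sqrt{n\mu}}
$$
and requiring this to be at most 20% gives $n\mu \approx (1.96/0.2)^2 \approx 96$ positive labels.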

Our goal is a better sampling design to improve the precision of the prevalence estimates. This problem can be phrased as an optimization problem — given some fixed review capacity

  • how should we sample videos?
  • how do we calculate the prevalence rate from these samples (with confidence intervals)?

Of course, any errors by the reviewers would propagate to the accuracy of the metrics, and the metric calculation should account for human errors. Moreover, the metrics of interest may involve the prevalence of more than one policy vertical, and the choice of metrics would affect how we design our sampling and objective function. We will defer rater accuracy and multiple policy verticals to future posts. In this post, we will assume that the raters always give the correct answer, and we only consider the binary label case.

Importance sampling

In the binary label problem, a video will be labeled either good or bad. Usually, bad videos make up only a tiny proportion of the population, so the two labels are highly imbalanced. Our goal here is an accurate and precise estimate of the proportion of bad videos in the population via sampling.

As we noted, uniform sampling is unlikely to yield enough positive samples to draw inference about the proportion. But importance sampling in statistics is a variance reduction technique for improving inference about the rate of rare events, and it seems natural to apply it to our prevalence estimation problem. We refer readers to [1] (cited here with the author's permission) for the details of importance sampling, and we follow the notation of [1] in what follows.

Suppose there are $N$ videos in the population $V = \{v_1, \ldots, v_N\}$, and $p(v)$ is the probability mass function (PMF) of the video $v$ in $V$, such that $p(v) > 0$, $\sum_i p(v_i) = 1$. The PMF depends on how prevalence is defined. For example, if we care about prevalence among uploaded videos, every video has equal weight $p(v) = \frac{1}{N}$, whereas for prevalence among views, the weight of a video is its share of views during some given period.

Furthermore, let $B$ be the set of all bad videos. Define $f(v) = [v \in B]$ where $[\cdot]$ maps the boolean value of the enclosed expression to $0$ or $1$. Then the proportion of bad videos is
$$
\mu = P_p(B) = \sum_{i = 1}^N f(v_i) p(v_i) = \mathbb{E}_{p}[f(V)]
$$
If $q$ is another PMF defined on the videos such that $q(v_i) > 0$, $\sum_{i=1}^N q(v_i) = 1$, then we can also write the proportion as
$$
\mu = \sum_{i=1}^N f(v_i) q(v_i) \cdot \frac{p(v_i)}{q(v_i)} = \mathbb{E}_q\left[ f(V) \frac{p(V)}{q(V)}\right]
$$

Instead of uniformly sampling from the target distribution $p$, importance sampling lets us draw $n$ items $v_1, \ldots, v_n$ from the population with replacement (more on this later) using the importance distribution $q$, and correct the sampling bias by weighting each sampled item by the inverse sampling ratio $p / q$ in the estimator:
$$
\hat{\mu}_q = \frac{1}{n} \sum_{i = 1}^n f(v_i) \cdot \frac{p(v_i)}{q(v_i)}
$$
It is easy to show that the importance sampling estimator is unbiased for $\mu$. The variance of $\hat{\mu}_q$ is
\begin{align*}
\texttt{Var}(\hat{\mu}_q) &= \frac{1}{n}\left[\sum_{i=1}^N \frac{\left(f(v_i)p(v_i) \right)^2}{q(v_i)} - \mu^2\right] \\
&= \frac{1}{n}\sum_{i=1}^N \frac{\left(f(v_i)p(v_i) - \mu q(v_i)\right)^2}{q(v_i)}
\end{align*}
The variance of our estimator depends on the proposed importance distribution $q$, and with an appropriate choice of $q$, we can obtain a better estimator than uniform sampling gives.
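As a concrete illustration, here is a small self-contained simulation of the estimator $\hat{\mu}_q$. The population size, prevalence, and 100x oversampling factor are all invented for the sketch, and we cheat by building $q$ from the true labels — which is exactly the role a classifier score plays in practice.

```python
import random

random.seed(0)

# Toy population: 100,000 videos, 0.2% bad; target distribution p is uniform.
N = 100_000
f = [1] * 200 + [0] * (N - 200)          # true labels (unknown in practice)
p = [1.0 / N] * N

# Importance weights: oversample the bad region 100x (built from true labels
# only for this toy; a classifier score stands in for them in reality).
w = [100.0 if fi else 1.0 for fi in f]
total_w = sum(w)
q = [wi / total_w for wi in w]

def importance_estimate(n):
    """Draw n videos with replacement from q and reweight by p/q."""
    idx = random.choices(range(N), weights=w, k=n)
    return sum(f[i] * p[i] / q[i] for i in idx) / n

mu = sum(fi * pi for fi, pi in zip(f, p))    # true prevalence: 0.002
est = importance_estimate(2_000)
```

With 2,000 samples the importance estimate typically lands within a few times $10^{-4}$ of the truth, whereas a uniform sample of the same size would often contain too few bad videos for a stable estimate.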

From the equation above, the variance is minimized when $$q \propto f \cdot p$$ that is, when $q$ is proportional to the target distribution $p$ on the set $B$, and zero outside the set $B$. Unfortunately, this ideal importance distribution requires us to know $f$. If we could separate bad videos from good videos perfectly, we could simply calculate the metrics directly without sampling. But in a world without perfect knowledge of $f$, the heuristic still holds — if we have an approximate rule for separating bad videos from good ones, we can use it to design our importance distribution $q$. A machine learning classifier serves this task perfectly.

Formally, let $S(v)$ be the real-valued classifier score for video $v$ (if the score is categorical, we assume it has some number of distinct levels). The classifier score $S(v)$ contains whatever information we have about the probability that the video is bad — if two videos have the same score $S(v_1) = S(v_2)$, then the ratio in the importance distribution $q$ should be the same as the ratio in the target distribution $p$, i.e.
$$
S(v_1) = S(v_2) \implies \frac{q(v_1)}{p(v_1)} = \frac{q(v_2)}{p(v_2)}
$$
The ratio between the importance distribution and the target distribution is thus a function of $S(v)$:
$$
\frac{q(v)}{p(v)} = \frac{\tilde{q}(S(v))}{\tilde{p}(S(v))}
$$
where $\tilde{p}$ and $\tilde{q}$ are the PMFs of $S(v)$ under the target distribution and importance distribution respectively.

As noted, we do not know $f$ exactly, but we have a good estimate of it in the form of a score $S$. For now, assume that we have a well-calibrated function $g$ such that
$$
g(S(v)) = \mathbb{E}_p\left[f(V)|S(V)=S(v)\right]
$$
That is, $g$ maps the score to the fraction of videos having that score which are bad. To choose $\tilde{q}$, we first observe that when $f$ is binary, the formula for $\texttt{Var}(\hat{\mu}_q)$ given earlier is minimized by minimizing
$$
\sum_{i=1}^N \frac{f(v_i) p(v_i)^2}{q(v_i)}
$$
While we cannot minimize this quantity directly, we can minimize our best guess of it. Thus, the optimal $q$ is given by
$$
q^* = \arg\min_{q} \sum_{i=1}^N \frac{g(S(v_i))p(v_i)^2}{q(v_i)}
\mbox{ such that } \sum_{i=1}^N q(v_i)=1
$$
Applying Lagrange multipliers, this constrained optimization leads to the choice
$$
q^*(v) \propto p(v)\sqrt{g(S(v))}
$$
Thus, the importance distribution that minimizes the infinite-population variance is
\begin{align*}
\frac{q(v)}{p(v)} &\propto \sqrt{g(S(v))} \\
&= \sqrt{ \mathbb{E}_p\left[f(V)|S(V)=S(v)\right]}
\end{align*}
where $g(S(v))=\mathbb{E}_p\left[f(V)|S(V)=S(v)\right]$ is the conditional prevalence given the classifier score.
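For readers who want the intermediate step, the Lagrangian argument runs as follows (a standard calculation, spelled out here for convenience):
$$
\mathcal{L}(q, \lambda) = \sum_{i=1}^N \frac{g(S(v_i))\,p(v_i)^2}{q(v_i)} + \lambda\left(\sum_{i=1}^N q(v_i) - 1\right)
$$
Setting $\frac{\partial \mathcal{L}}{\partial q(v_i)} = -\frac{g(S(v_i))\,p(v_i)^2}{q(v_i)^2} + \lambda = 0$ gives $q(v_i) = p(v_i)\sqrt{g(S(v_i))/\lambda}$, and normalizing over the population yields $q^*(v_i) = p(v_i)\sqrt{g(S(v_i))} \,\big/ \sum_{j=1}^N p(v_j)\sqrt{g(S(v_j))}$.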


When the probability of sampling video $v$ depends only on its score $S(v)$, importance sampling amounts to random sampling within each score class. We can leverage the score information and use post-stratification to reduce the between-strata variability introduced by sampling:
$$
\hat{\mu}_{q, PS} =
\sum_s \tilde{p}(s)
\frac{\sum_{i=1}^n f(v_i) [S(v_i) = s]}{\sum_{i=1}^n [S(v_i) = s]}
$$
To see the connection between $\hat{\mu}_{q, PS}$ and $\hat{\mu}_q$, we can rewrite $\hat{\mu}_q$:
$$
\hat{\mu}_q = \frac{1}{n}\sum_{i=1}^n f(v_i)\frac{\tilde{p}(S(v_i))}{\tilde{q}(S(v_i))} = \sum_{s}\tilde{p}(s)\frac{\sum_{i=1}^n f(v_i) [S(v_i) = s] }{n\tilde{q}(s)}
$$
This shows that $\hat{\mu}_q$ computes the per-stratum prevalence rate using the expected sample size $n\tilde{q}(s)$ in the denominator, whereas $\hat{\mu}_{q,PS}$ divides by the actual number of samples in each stratum.
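The difference between the two denominators is easy to see in a small simulation. The stratum weights, prevalence rates, and sampling rates below are all made up for illustration:

```python
import random

random.seed(1)

# Two score strata: p_tilde are the target-distribution stratum weights,
# g the per-stratum prevalence, q_tilde the sampling weights (all invented).
p_tilde = {"low": 0.98, "high": 0.02}
g       = {"low": 0.001, "high": 0.30}
q_tilde = {"low": 0.50, "high": 0.50}

n = 10_000
strata = random.choices(["low", "high"],
                        weights=[q_tilde["low"], q_tilde["high"]], k=n)
labels = [1 if random.random() < g[s] else 0 for s in strata]

# Plain importance sampling: divide by the *expected* stratum size n*q_tilde(s).
mu_q = sum(y * p_tilde[s] / (n * q_tilde[s]) for s, y in zip(strata, labels))

# Post-stratification: divide by the *realized* stratum size instead.
mu_ps = 0.0
for s in p_tilde:
    n_s = sum(1 for t in strata if t == s)
    pos = sum(y for t, y in zip(strata, labels) if t == s)
    mu_ps += p_tilde[s] * pos / n_s

true_mu = sum(p_tilde[s] * g[s] for s in p_tilde)   # 0.98*0.001 + 0.02*0.30
```

Both estimators land close to the true prevalence; the post-stratified one removes the extra noise coming from the random split of the sample across strata.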

With the estimator $\hat{\mu}_{q, PS}$, the optimal sampling weight is proportional to the stratum standard error (see [7]):
\begin{align*}
\frac{q(v)}{p(v)} &\propto \sqrt{g(S(v)) \left(1 - g(S(v)) \right) } \\
&= \sqrt{ \mathbb{E}_p\left[f(V)|S(V)=S(v)\right] \left( 1 - \mathbb{E}_p\left[f(V)|S(V)=S(v)\right]\right) }
\end{align*}
The biggest difference between post-stratification and importance sampling is that the optimal sampling weight for post-stratification peaks at a conditional prevalence rate of $\frac{1}{2}$, whereas the optimal weight for importance sampling is monotonically increasing in the conditional prevalence rate. In our case, where the events are rare and high conditional prevalence rates are unlikely under the target distribution, the difference between the methods is minor. But in other applications, using post-stratification could further reduce the variance to a significant degree.

We discuss other practical benefits of the post-stratification estimator $\hat{\mu}_{q,PS}$ in the next section.

Problems of implementation

So far, we have seen how importance sampling and post-stratification can provide an unbiased, more precise estimate of the prevalence of bad videos. In reality, there are many factors that can affect the unbiasedness and precision of the estimator. In this section, we focus on the practical issues of sampling and estimation, especially how different circumstances can affect our estimator. We also discuss how to choose $q$ with respect to the conditional prevalence rate $g(S(v))=\mathbb{E}_p\left[f(V)|S(V)=S(v)\right]$.

Sampling algorithm

If the sample size $n$ is fixed, sampling algorithms can be classified as sampling with replacement or sampling without replacement. The difference between the two is whether the same item can appear in the sample multiple times. Our estimators $\hat{\mu}_{q}$ and $\hat{\mu}_{q, PS}$ rely on the assumption of sampling with replacement. The measurement may be biased if our samples are generated by a procedure that samples without replacement, such as reservoir sampling, especially if some items have disproportionate weight, i.e., $q(v_i) \cdot n$ is large. For videos and other web content, it is common for the most popular items to account for a significant fraction of views.

To see how much of a difference this makes, we can compare the number of unique items under the two sampling schemes. Clearly there are $n$ unique items when sampling without replacement. If the number of unique items when sampling with replacement is close to $n$, the influence of individual items on the metric is likely small, and we can use sampling without replacement to approximate the procedure of sampling with replacement.

The expected number of unique items when sampling with replacement is
\begin{align*}
\sum_{i=1}^N\left[1 - (1 - q(v_i))^n\right] &\approx \sum_{i=1}^N\left[ n q(v_i) - \frac{n^2}{2} q(v_i)^2 \right] \\
&= n - \frac{n^2}{2} \sum_{i=1}^N q(v_i)^2
\end{align*}
So the expected number of unique items is controlled by $n^2 \sum_{i=1}^N q(v_i)^2$, and we should be able to use sampling without replacement when $\sum_{i=1}^N q(v_i)^2$ is small.

With this insight, we can artificially split each video into $k$ independent copies $(v_i, q_i) \rightarrow (v_i, q_i/k), \ldots, (v_i, q_i/k)$, so each item can appear in the sample at most $k$ times. The sum of squared weights after splitting is $\sum_{i=1}^N q_i^2 / k$, which mitigates the problem of skewness in the weights.

Alternatively, if we allow the sample size to vary, we can employ Poisson sampling. Poisson sampling borrows an idea from the Poisson bootstrap — sampling with replacement is sampling from a multinomial distribution $\texttt{Multinom}(n, (q_1,\ldots, q_N))$. With fixed size $n$, the number of times an individual video $v_i$ occurs in the sample is marginally $\texttt{Binom}(n, q_i)$, and this can be approximated by the Poisson distribution $\texttt{Pois}(n q_i)$ when $nq_i$ is not large.
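A minimal sketch of Poisson sampling follows; all numbers are illustrative, and the Poisson draw uses Knuth's multiplication method, which is adequate for the small $nq_i$ values involved here.

```python
import math
import random

def draw_poisson(lam, rng):
    """Knuth's method: multiply uniforms until the product drops below e^-lam."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(2)

# Importance weights q over a toy population of 1,000 items (uniform here just
# to keep the sketch short), with target expected sample size n = 500.
N, n = 1_000, 500
q = [1.0 / N] * N

# Each item is included an independent Pois(n * q_i) number of times; the
# total sample size is random but concentrates around n.
counts = [draw_poisson(n * qi, rng) for qi in q]
sample_size = sum(counts)
```

Because each item is drawn independently, the sample can be materialized in a single parallel pass over the population, which is what makes this attractive at scale.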

Missing reviews

One challenge we have with human review data is that we do not have the individual verdict $f(v_i)$ immediately after sampling; it takes time before we can use it to calculate the metric. Quite often, when we calculate the prevalence metrics, the number of samples with a verdict, $n^*$, is smaller than the total sample size. The missing verdicts create two problems.

First, we cannot assume the metric we calculate is a good approximation of the “true metric”. If missingness of the verdict is correlated with the label, ignoring missing items can bias the metric. For example, if positive items take longer to verify because of extra steps, then ignoring those items would underestimate the prevalence metric. We cannot correct such biases in our estimators, so to limit the impact of missing verdicts, we should wait to compute the metrics until the missing verdict rate is small.

Second, even when the missing verdict rate is small, we may still need to modify our estimators to adjust for it. For the importance sampling estimator $\hat{\mu}_q$, we divide the weighted sum by the number of samples $n$; with missing verdicts, it seems natural to compute the metric using only reviewed items and divide the weighted sum by the number of reviewed samples $n^*$. However, if the missing verdict rate depends on the classifier score, then we may have
$$
\mathbb{E}_q \left[\frac{p(V)}{q(V)} \,\middle|\, V \text{ is reviewed} \right] \neq 1
$$
Because our derivation of importance sampling required that $\mathbb{E}_q \left[\frac{p(V)}{q(V)} \right] = 1$, using $n^*$ as the denominator may introduce bias. And this can easily happen in practice. For example, if we want to remove bad content from the platform as soon as possible, we may prioritize reviewing the content with the highest classifier scores, and the items missing review are then more likely to be videos with low classifier scores.

Suppose we accept the missing at random (MAR) assumption: that the probability the video is missing a verdict is independent of its label given the classifier score. In this case, we can remove the bias by using the post-stratification estimator $\hat{\mu}_{q, PS}$. Under the MAR assumption, both the strata weights in the target distribution $p$ and the conditional prevalence rates are estimated without bias. Even when the MAR assumption does not hold, post-stratification appears to be a more robust estimator than the importance sampling estimator.

Choosing the importance distribution

We have seen that the optimal importance distribution $q^*$ depends on the conditional prevalence rate and the target distribution (it also depends on whether we use post-stratification), but the conditional prevalence rate is unknown and must be estimated. There are many strategies for estimating this quantity, and we discuss each option in detail.

First, if the classifier prediction is a bona fide probability of the video being bad, we can use the score directly. This is straightforward and easy to implement. But we find that the predicted score is often calibrated only to the training data. When training a classifier with few positives in the population, one common strategy is to oversample items with positive labels and/or downsample items with negative labels. The ratio between positive and negative samples in the training data directly affects the probability score of the classifier, so the classifier score is usually not calibrated to the population probability.

In this scenario, we can use historical data to estimate the conditional prevalence rate. Say we observe $(S(v_i), f(v_i))$ pairs from past samples, and we want to calculate $g(s) = \mathbb{E}_p\left[f(V)|S(V)= s \right]$ from the observations. With finite observations on a continuous classifier score $S(v_i)$, the sample conditional prevalence within a stratum would be a volatile estimate. There are two ideas for improving the estimates:

  1. We can smooth the estimate by fitting a regression to the given data. For example, we could fit a logistic regression on a smooth transform of $S(v_i)$ (e.g. polynomial or spline) to estimate the conditional prevalence. This reduces the number of parameters to estimate.
  2. We can reduce the number of parameters to estimate by bucketing $S(v_i)$ into several discrete buckets and estimating the prevalence rate within each bucket. By reducing the number of parameters, the sparse observations become dense, and we can get a better estimate for each classifier bucket.
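A sketch of option 2, with invented bucket boundaries and synthetic historical data; the one-positive/one-negative pseudo counts are an illustrative smoothing choice that keeps every bucket's estimated rate strictly positive:

```python
import bisect
import random

random.seed(3)

# Synthetic historical (score, verdict) pairs: higher scores more likely bad.
history = []
for _ in range(5_000):
    s = random.random()
    history.append((s, 1 if random.random() < s ** 4 else 0))

edges = [0.25, 0.5, 0.75]          # interior bucket boundaries (illustrative)
pos = [0] * (len(edges) + 1)
tot = [0] * (len(edges) + 1)
for score, verdict in history:
    b = bisect.bisect_right(edges, score)
    pos[b] += verdict
    tot[b] += 1

# Add one pseudo positive and one pseudo negative per bucket so a bucket with
# zero observed positives still gets a nonzero estimated prevalence rate.
g_hat = [(p + 1) / (t + 2) for p, t in zip(pos, tot)]
```

The resulting `g_hat` feeds directly into the optimal-weight formulas from the previous section.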

It is worth comparing the two strategies — both estimate the conditional prevalence with fewer parameters. Regression implicitly uses information (borrows strength) from other scores to estimate the outcome at a given score, whereas discretizing and merging buckets does this in a more explicit way while only using local information. The bucketing strategy also turns importance sampling into a stratified sampling setting, and allows us to use binomial confidence intervals to estimate the uncertainty of our estimate (more on that later). But the sampling rate is discontinuous at the score bucket boundaries, and not smooth as with the regression strategy.

Whether or not we borrow strength from other scores also affects the estimate. In our case, where the population prevalence rate is low, it is quite common to have few positive items in historical samples where the classifier score is low. The regression strategy extrapolates the prevalence from high-score regions and extends the estimate to low-score regions. On the other hand, the bucketing strategy could report zero as the prevalence rate if no positive items occur in a bucket, and the importance sampling distribution built from this prevalence rate would not sample any new items from that region (we could adjust the bucket boundaries to avoid that).

When we fit the historical data, it is important to be aware of the following caveats:

  • The estimates depend on the distribution of the historical samples, which need not be identical to the target distribution. If we use importance sampling on historical samples, we should inversely weight each observation in the sample and fit a weighted regression or weighted average to form a better estimate.
  • In regression, the scale of the classifier score $S(v_i)$ matters. Because binary classifiers often use a softmax to normalize the predicted score, it makes sense to invert the predicted score $S(v_i)$ with the logit function and fit the regression on the real line.

All methods discussed in this section rely on past data (either training data or historical samples). However, the distribution of bad videos might not remain constant over time because of the adversarial nature of the problem. For example, if video violations are correlated with the classifier score, bad actors could upload more videos that are harder for the classifiers to detect, resulting in a shift of the conditional prevalence over time. To mitigate this issue, we may limit ourselves to using only recent data, or add a time decay function to down-weight items further in the past.

More generally, we can apply defensive importance sampling [8], in which the importance sampling distribution is a mixture of our target distribution $p$ and the estimated importance distribution $\hat{q}$, to deal with shifts in either the conditional prevalence or the region where the prevalence is most uncertain:
$$
\hat{q}_{\lambda}(v) = \lambda p(v) + (1 - \lambda)\hat{q}(v)
$$
With the mixture distribution, we are guaranteed that the inverse sampling ratio
$$
\frac{p(v)}{\hat{q}_{\lambda}(v)} \leq 1 / \lambda
$$
is bounded above, and we can allocate enough samples even when $\hat{q}$ is very small in the regions with few positives (or few negatives, using the optimal weight from post-stratification). The choice of $\lambda$ can be a small value between 0.1 and 0.5, as estimated from the data (details in [8]).
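For instance (with made-up numbers), the mixture is a one-liner, and the cap on the inverse sampling ratio can be checked directly:

```python
# Target distribution p and an estimated importance distribution q_hat over
# four score regions; q_hat nearly starves the first two regions.
p = [0.25, 0.25, 0.25, 0.25]
q_hat = [0.0001, 0.0009, 0.049, 0.95]
lam = 0.2

# Defensive mixture: q_lambda = lam * p + (1 - lam) * q_hat.
q_lam = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q_hat)]

# The inverse sampling ratio p / q_lambda can never exceed 1 / lam = 5.
max_ratio = max(pi / qi for pi, qi in zip(p, q_lam))
```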

Confidence Intervals

Besides the point estimate of the prevalence metric, we are also interested in constructing a confidence interval to indicate the accuracy of the estimate. Typically, we have the variance formula for $\hat{\mu}_q$ or $\hat{\mu}_{q, PS}$, and we could use a z-score to calculate the confidence interval of the metric.

However, for binomial proportions far from $\frac{1}{2}$, the variance estimates of $\hat{\mu}_q$ and $\hat{\mu}_{q, PS}$ on finite samples converge poorly to the true uncertainty. As a result, confidence intervals based on the z-distribution have poor coverage (see [2]).

To demonstrate this issue, consider the following toy example with categorical scores:



(Table: population proportion and sample proportion for the Low Risk and High Risk strata.)




The sampling proportions are chosen to be optimal for post-stratification. The following chart shows the point estimates and z-confidence intervals for this toy example across 400 Monte Carlo simulations. From the chart, as the sample size increases from 1,000 to 100,000, the length of the confidence interval becomes more consistent, and the mis-coverage cases become more evenly distributed on both sides.

Confidence intervals ordered by the point estimate. Uncovered intervals are red.

If we zoom in on the case with 1,000 samples, we see that the CI width is highly correlated with the number of positive items in the low-risk bucket. When there are no positives in that bucket, the sample standard error is on average 40% smaller than the one based on the theoretical standard deviation, resulting in poor coverage.

The situation is slightly better when we have a larger sample, but even with 10,000 samples (an expected 6.7 positives in the low-risk bucket), having 5 or fewer positive samples (about a 1/3 chance) results in a standard error at least 9% smaller than the theoretical standard deviation.

Under-estimation of the standard error essentially happens when we do not have enough positives in certain regions. But here is the catch — importance sampling makes the problem worse by down-sampling exactly those regions, resulting in even sparser positive items!

Fortunately, the binomial proportion estimation literature contains several methods besides the normal approximation interval (the Wald interval), and [2] suggests using the Jeffreys interval, the Agresti-Coull interval, or the Wilson score interval to estimate the proportion. The Jeffreys interval is a Bayesian posterior interval with prior $\texttt{Beta}(\frac{1}{2}, \frac{1}{2})$. The Agresti-Coull interval adds 2 pseudo counts to both successes and failures before applying the normal approximation, and the center of the Wilson score interval can also be interpreted as adding pseudo counts. Thus all these methods are connected to adding pseudo counts and shrinking the point estimate toward $\frac{1}{2}$.
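For reference, here is the (standard) Wilson score interval in code; the example call at the end is just a sanity check:

```python
import math

def wilson_interval(pos, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    phat = pos / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (center - half, center + half)

# With zero positives the interval still has a sensible nonzero upper end,
# unlike the Wald interval, which collapses to a single point at zero.
lo0, hi0 = wilson_interval(0, 100)
```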

When we use post-stratification in our importance sampling, we can apply a similar idea to estimate the standard error (or posterior interval for Bayesian methods) in each stratum before combining them into an overall CI. But none of these methods yields a satisfactory result in our case: because positive events are rare, shrinking the prevalence estimate toward $\frac{1}{2}$ overestimates the true prevalence. An alternative is the stratified Wilson interval [3], in which the pseudo counts added to each stratum depend on the overall interval width.

Of course, which CI method works best depends on many factors, and no method is universally better than the others. The stratified Wilson interval works well for our video sampling problem.

To demonstrate the performance of different CI methods, we created synthetic data with 5 strata that is similar to our video sampling problem. The following chart shows the coverage of the 95% CI for each method under different sample sizes, using 4,000 simulations.

Coverage of 95% CIs with 95% confidence bands.

We see that confidence intervals using the normal approximation perform poorly until we have about 100,000 samples. The Jeffreys CI under-covers when the number of samples is small, whereas the stratified Wilson CI is a bit more conservative, with coverage better than nominal.

If we drill down into the cases where the CIs do not cover the true prevalence rate, it is clear that the Jeffreys CI tends to overshoot (the lower end of the CI is above the ground truth). This happens because the prior shrinks the estimate toward $\frac{1}{2}$. The CI using the normal approximation has the opposite issue — because of zero counts, the point estimate is more likely to be too small than too big. In contrast, the mis-coverage of the stratified Wilson CI is more balanced once we have about 25,000 samples.

Mis-coverage rate with 95% confidence bands.

How Many Strata?

If post-stratification and the stratified Wilson confidence interval can provide better uncertainty estimates, it is natural to ask how many strata we need. It really depends on the problem, but a rule of thumb is that there should be enough positive and negative items in each stratum. If two strata with different sampling rates do not have any positive items, merging them into a single stratum without changing the number of samples will improve the effective sample size.

There is an additional benefit to having coarse strata — it makes it harder for videos to move into different buckets when we retrain a new version of the classifier. This makes aggregating the metrics over multiple periods more robust, since we may sample items on different days with different versions of the classifier.

Cochran recommends no more than 6 or 7 strata [4], while others [5] suggest there is little to be gained from going beyond 5 to 10 strata. If you fix the number of strata, you may consider using the Dalenius-Hodges method [6] to select the strata boundaries.
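A sketch of the cum $\sqrt{f}$ rule behind the Dalenius-Hodges method, on synthetic scores; the bin count, strata count, and score distribution are all illustrative choices:

```python
import math
import random

random.seed(4)

# Synthetic classifier scores, heavily skewed toward zero as in rare-event work.
scores = [random.betavariate(0.5, 5) for _ in range(20_000)]

# Dalenius-Hodges cum-sqrt(f) rule: histogram the score into fine bins,
# accumulate sqrt(frequency), and cut the cumulative curve into equal parts.
n_bins, n_strata = 100, 5
freq = [0] * n_bins
for s in scores:
    freq[min(int(s * n_bins), n_bins - 1)] += 1

cum, running = [], 0.0
for f_j in freq:
    running += math.sqrt(f_j)
    cum.append(running)

step = cum[-1] / n_strata
boundaries, k = [], 1
for j, c in enumerate(cum):
    if k < n_strata and c >= k * step:
        boundaries.append((j + 1) / n_bins)   # right edge of bin j as a cut
        k += 1
```

The square root compresses the skewed frequency curve, so the resulting strata are narrower where the score density is high and wider in the sparse tail.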


In this post, we discussed the theory of importance sampling for improving the precision of estimates of the binomial proportion of rare events. We also discussed ways to design a sampling scheme and choose the appropriate aggregation metric to address practical concerns. The improvement in the metric depends on the precision/recall of the classifier, as well as many other factors. In our problem, when we apply post-stratification to importance sampling, we are able to more than triple the count of positive items in the sample while reducing the CI width of the estimator by more than 30%. We hope you see similar gains in your application.


[1] Art Owen. Unpublished book chapter on importance sampling.

[2] Lawrence Brown, Tony Cai, Anirban DasGupta (2001). Interval Estimation for a Binomial Proportion. Statistical Science, 16(2): 101–133.

[3] Xin Yan, Xiao Gang Su (2010). Stratified Wilson and Newcombe Confidence Intervals for Multiple Binomial Proportions. Statistics in Biopharmaceutical Research.

[4] William Cochran (1977). Sampling Techniques, pp. 132–134.

[5] Ray Chambers, Robert Clark (2012). An Introduction to Model-Based Survey Sampling with Applications.

[6] Tore Dalenius, Joseph Hodges (1959). Minimum Variance Stratification. Journal of the American Statistical Association, 54: 88–101.

[7] Jerzy Neyman (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97: 558–606.

[8] Tim Hesterberg (1995). Weighted Average Importance Sampling and Defensive Mixture Distributions. Technometrics, 37(2): 185–192.

