Mind Your Units

Table 4: Bucketed data that enables separate analysis for first observations.

While this approach requires more data storage than the straight bucketed data approach, it allows us to compute the metrics that are most relevant for evaluating the product. Again, we can apply the jackknife, but now to the subset of the data where the counter value is "first" or to the subset where it is "subsequent". In other words, we have filtered our data by a dimension value. Similarly, you can add any filter to the bucketed data approach, as long as the filtering field is available in the raw data logs.
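To make the bucketed jackknife concrete, here is a minimal R sketch of a leave-one-bucket-out jackknife for the mean of a filtered metric. The data frame columns (bucket, counter.type, value) and the function name are illustrative assumptions for this example, not the tooling described in the post.

ComputeJackknifeMeanCI <- function(bucketed.df, filter.value) {
  # Sketch: leave-one-bucket-out jackknife for the mean of `value`,
  # computed only over rows whose counter.type matches filter.value
  # (e.g. "first").  Column names here are assumptions.
  filtered <- bucketed.df[bucketed.df$counter.type == filter.value, ]

  # Per-bucket sums and counts; buckets are the units we leave out.
  bucket.sums <- tapply(filtered$value, filtered$bucket, sum)
  bucket.counts <- tapply(filtered$value, filtered$bucket, length)
  n.buckets <- length(bucket.sums)

  # Leave-one-bucket-out estimates of the overall mean.
  loo.means <- (sum(bucket.sums) - bucket.sums) /
               (sum(bucket.counts) - bucket.counts)

  # Jackknife standard error and the half-width of a 95% CI.
  jack.se <- sqrt((n.buckets - 1) / n.buckets *
                  sum((loo.means - mean(loo.means))^2))
  return(1.96 * jack.se)
}

The same pattern extends to ratio metrics by jackknifing the per-bucket numerator and denominator sums together rather than the per-bucket means.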

In summary, there are many different ways to account for the group structure when the experimental unit differs from the unit of observation. Regardless of how you do it, do remember to mind your units.

Acknowledgments

Many thanks to Dancsi Percival for his work on the impact of group-size heterogeneity on confidence intervals and to the many colleagues at Google who have evolved many of the ideas and tools described in this post.


Appendix: R code

ComputeBootstrapMeanCI <- function(data, data.grouping, num.replicates = 2000) {
  # Function to calculate confidence intervals for the mean via bootstrap.
  # Must pass in a grouping variable to do the sampling on the grouping units.
  # For bootstrap, we sample with replacement and each replicate is same size
  # as original data; 95% CI obtained empirically by taking percentiles.
  #
  # Args:
  #   data:           Numeric vector of data.
  #   data.grouping:  Numeric vector of grouping variable for data. Must be
  #                   the same length as data.
  #   num.replicates: Number of replicates.
  #
  # Returns:
  #   Half-width of a 95% CI.

  # Split the data by the grouping variable.
  assertthat::assert_that(length(data) == length(data.grouping))
  data.split <- split(data, data.grouping)

  # Create replicates.
  n.groups <- length(unique(data.grouping))
  data.samples <-
    replicate(num.replicates,
              unlist(sample(data.split, replace = TRUE, size = n.groups)),
              simplify = FALSE)

  # Get sampling distribution and summarize empirically.
  bootstrap.means <- vapply(data.samples, mean, numeric(1))
  half.width <- unname(diff(quantile(bootstrap.means, c(0.025, 0.975))) / 2)
  return(half.width)
}


set.seed(1237)

kNumGroups <- 50000  # Total number of groups for all simulations.
kGroupSd <- 1  # This is the standard deviation for group level means.
kNumReplicates <- 2000  # Number of replicates for bootstrap.
kWithinGroupShrink <- 0.25  # Shrinks standard dev. of w/in group observations.
kLambdaVector <- seq(0, 1.2, by = 0.2)  # Poisson parameter for group size.

# Generate the mean for each group and use this for all runs of simulation.
group.means <- rnorm(kNumGroups, 0, kGroupSd)

# Loop over simulation parameters and store results in a summary data frame.
summary.df <- data.frame()
for (param in kLambdaVector) {
  group.sizes <- rpois(kNumGroups, param) + 1

  # Make data frame with group size & group mean for ease of computation.
  group.df <- data.frame(id = 1:kNumGroups, group.means, group.sizes)
  group.info <- split(group.df, group.df$id)

  # Helper function to generate per-group observations.
  SimulateGroupObservations <- function(data) {
    rnorm(data$group.sizes, data$group.means, kWithinGroupShrink * kGroupSd)
  }

  # Generate vector of per-group observations for all groups.
  full.obs <- unlist(sapply(group.info, SimulateGroupObservations))

  # Compute CI pretending that each observation is independent.
  ignore.group.est.half.width <-
      ComputeBootstrapMeanCI(full.obs, 1:length(full.obs), kNumReplicates)

  # Get a vector of group ids to incorporate dependence structure.
  group.id <- unlist(sapply(group.info, function(x) rep(x$id, x$group.sizes)))

  # Compute CI accounting for the group structure.
  include.group.est.half.width <-
      ComputeBootstrapMeanCI(full.obs, group.id, kNumReplicates)

  # Formulate data frame to summarize results.
  my.row <- data.frame(lambda = param,
                       n.multigroups = sum(group.sizes > 1) / kNumGroups,
                       n.observations = sum(group.sizes),
                       ignore.group.est.half.width,
                       include.group.est.half.width)
  summary.df <- rbind(summary.df, my.row)
}
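
One quick way to inspect the simulation results (not part of the original code; the ci.ratio column name is an assumption for illustration) is to compare the two half-widths for each value of lambda:

# Ratio of group-aware to naive CI half-widths, per lambda.
summary.df$ci.ratio <- summary.df$include.group.est.half.width /
                       summary.df$ignore.group.est.half.width
print(summary.df)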
