4  Egregious Aggregations (WIP)

Once armed with an understanding of the data and tools available for analysis, a common starting point is exploring data with aggregation. At its heart, any sort of data analysis is the process of condensing raw data into something more manageable and useful while giving up as little of the information as possible. From linear regressions and hypothesis testing to random forests and beyond, much of data analysis could truly be called “applied sums and averages”.

Many elementary tools for this task are much better at the comprehension task than the preservation one. We learn rigorous assumptions to consider and validate when studying linear regression, but basic arithmetic aggregation presents itself as agnostic and welcoming to any type of data. However, the underlying distributions of our variables and the relationships between them have a significant impact on how informative and interpretable various summarizations are.

In this chapter, we will explore different ways that univariate and multivariate aggregations can be naive or uninformative.

4.1 Motivating Example: Similar in Summary

To begin, we will look at a whimsical toy example. This may feel trite or manufactured, but the subsequent sections will aim to convince you that these issues are not just esoteric. Consider the “datasaurus dozen” dataset (Matejka and Fitzmaurice 2017), which is available in the datasauRus R package.

This dataset contains 12 sets of data stacked on top of one another and identified by the dataset column. Besides the identifier column, the data is fairly small and contains only two variables, x and y.

df <- datasauRus::datasaurus_dozen
head(df)
  dataset       x       y
1    dino 55.3846 97.1795
2    dino 51.5385 96.0256
3    dino 46.1538 94.4872
4    dino 42.8205 91.4103
5    dino 40.7692 88.3333
6    dino 38.7179 84.8718

A quick analysis reveals that the 12 datasets are remarkably consistent in their summary statistics. The means and variances of x and y, and even their correlations, are nearly identical.

Summary statistics for Datasaurus Dozen datasets

dataset     mean(x)  mean(y)   var(x)   var(y)  cor(x, y)
away         54.266   47.835  281.227  725.750     -0.064
bullseye     54.269   47.831  281.207  725.533     -0.069
circle       54.267   47.838  280.898  725.227     -0.068
dino         54.263   47.832  281.070  725.516     -0.064
dots         54.260   47.840  281.157  725.235     -0.060
h_lines      54.261   47.830  281.095  725.757     -0.062
high_lines   54.269   47.835  281.122  725.763     -0.069
slant_down   54.268   47.836  281.124  725.554     -0.069
slant_up     54.266   47.831  281.194  725.689     -0.069
star         54.267   47.840  281.198  725.240     -0.063
v_lines      54.270   47.837  281.232  725.639     -0.069
wide_lines   54.267   47.832  281.233  725.651     -0.067
x_shape      54.260   47.840  281.231  725.225     -0.066

However, as shown in Figure 4.1, when we visualize this data, we find that the 12 datasets reveal remarkably different patterns.

Figure 4.1: Scatterplots for Datasaurus Dozen datasets

This dataset is a more elaborate version of Anscombe’s Quartet, a well-known set of four datasets which exhibit similar properties. Examining a similar plot for Anscombe’s Quartet (with data from the anscombe dataset which ships in R’s datasets package), we can get better intuition for how the phenomenon is manufactured. Figure 4.2 shows a similar plot to Figure 4.1. Comparing datasets 1 and 3, for example, we can see a trade-off between a semi-strong trend with a consistent-seeming amount of noise and a nearly perfect linear trend with a single outlier.

Figure 4.2: Scatterplots for Anscombe’s Quartet

Figure 4.2 also plots the simple linear regression line for each dataset. Like the summary statistics, these lines are also (nearly) identical. We know this because the slope of a simple linear regression is given by cov(x, y)/var(x) and the intercept by mean(y) minus the slope times mean(x), so the fitted line is determined entirely by the summary statistics above. You’ll notice I do not write “we can see that…” because, in fact, we can only see similarity, not equality. The message of this section may seem to be “don’t summarize your data without plotting it”, but conducting “visual analytics” without looking at the numbers is also problematic. We’ll explore the latter topic more in Chapter 5 (Vexing Visualizations).
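
As a quick check of this claim (a minimal sketch using only the anscombe data frame from base R; the fitting code here is mine, not part of the original analysis), we can fit a simple regression to each of the four datasets and compare the coefficients:

# fit a simple linear regression to each of Anscombe's four datasets ----
# all four recover (approximately) the same line: y = 3.0 + 0.5 * x
sapply(1:4, function(i) {
  coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
})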

While this is clearly a contrived example (and, if you choose to check out the “Datasaurus Dozen” paper, a very cleverly contrived one!), it is also a cautionary tale. Summary statistics are not just insufficient when they focus on central tendency (e.g. the mean) instead of spread. In this example, even an examination of variation and covariation led to an overly simplistic view of the underlying data.

4.2 Averages (WIP)

4.2.1 Implicit assumptions (TODO)

When statistics students study linear regression, they are introduced to a number of canonical assumptions including:

  • The true functional form between the dependent and independent variables is linear / additive
  • Errors are independent
  • Errors have constant variance (that is, they are homoskedastic)
  • Errors have a normal distribution

Of course, whether or not these assumptions hold, there is nothing stopping anyone from mechanically fitting a linear regression. Instead, these assumptions are required to make the output of a linear regression meaningful and, more specifically, to conduct correct inference.

Similarly, there are no limitations on mechanically computing an average, but implicit assumptions determine whether that average is a meaningful summary.

4.2.2 Averaging skewed data

Arithmetic average versus colloquial meaning of average as “typical”

Skewed data (sketched below)

Multimodal data / mixture models
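
As a minimal sketch of the skewed-data note above (the lognormal distribution is an arbitrary choice, purely for illustration), the arithmetic mean can land well away from a “typical” observation:

# simulate a right-skewed variable ----
set.seed(123)
x <- rlnorm(10000, meanlog = 0, sdlog = 1)

# the long right tail pulls the mean (~1.6) above the median (~1.0)
mean(x)
median(x)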

4.2.3 No “average” observation

In the previous section, the average represented a point within the relevant data range, even if it was perhaps not the most representative of a “typical” observation. We discussed how, in some situations, this quantity may be a reasonable answer to certain types of questions and an aid for certain types of decisions.

However, when we seek an average profile over multiple variables, the problems of averages are further compounded. We may end up with a set of “average” summary statistics that are not representative of any part of our population.

To see this, let’s assume we are working with data for a company with a subscription business model. We might be interested in profiling the age of each account (how long they have been a subscriber) and their activity (measured by amount spent on an e-commerce platform, files downloaded on a streaming service, etc.)

The following code simulates a set of observations: 80% of accounts are between 0 to 3 years in age and have an average activity level of 100 while 20% of accounts are older than 3 years in age and have an average activity level of 500. (Don’t over-think the specific probability distributions used here. We are concerned with interrogating the properties of the average, not with simulating a realistic data generating process. Giving ourselves permission to be wrong or “lazy” about unimportant things gives us more energy to focus on what matters.)

set.seed(123)

# define simulation parameters ----
## n: total observations
## p: proportion of observations in group 1
n <- 5000
p <- 0.8
n1 <- n*p
n2 <- n*(1-p)

# generate fake dataset with two groups ----
df <- 
  data.frame(
    age = c(runif(n1,   0,  3), runif(n2,   3, 10)),
    act = c(rnorm(n1, 100, 10), rnorm(n2, 500, 10))
  )

Figure 4.3 shows a scatterplot of the relationship between account age (x-axis) and activity level (y-axis), while the marginal rug plots show the univariate distribution of each variable. The sole red dot denotes the coordinates of the average age and average activity. Notably, this dot sits in a region of “zero density”; that is, it is not representative of any customer. Strategic decisions that treat this point as the “typical” customer might not be destined for success.

Figure 4.3: A scatterplot of two variables and their averages
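
One rough way to make the “zero density” observation concrete is to count how many simulated accounts fall anywhere near the average point (the window widths below are arbitrary choices for illustration):

# count accounts near the (average age, average activity) point ----
# given the two simulated groups, this should come out to zero
sum(abs(df$age - mean(df$age)) < 0.5 & abs(df$act - mean(df$act)) < 25)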

4.2.4 The product of averages

As the above example shows, averages of multivariate data can produce poor summaries – particularly when these variables are interrelated.

A second implication of this observation is that deriving additional quantities from pre-averaged numbers is likely to produce inaccurate results.

For example, suppose we wish to estimate the average dollar amount of returns per e-commerce order. Orders may generally be a mixture of low-price orders (around $50 on average) and high-price orders (around $250 on average). Low-price orders may have a 10% probability of being returned while high-price orders have a 20% probability. (Again, are these numbers, distributions, or relationships hyper-realistic? Not at all. However, once again we are telling ourselves a story just to reason about numerical properties, so we give ourselves permission not to focus on irrelevant details.)

set.seed(123)

# define simulation parameters ----
## n: observations per group
## pr[1|2]: mean price per group
n <- 100
pr1 <- 50
pr2 <- 250
pr_sd <- 5
re1 <- 0.1
re2 <- 0.2

# simulate spend amounts and return indicators ----
amt_spend  <- c(rnorm(n, pr1, pr_sd), rnorm(n, pr2, pr_sd))
ind_return <- c(rbinom(n, 1, re1),    rbinom(n, 1, re2))

# compute summary statistics ----
average_of_product <- mean(amt_spend * ind_return)
product_of_average <- mean(amt_spend) * mean(ind_return)

The true average amount returned per order is 36.0438333 (from the average_of_product variable). However, if instead we already knew only an average spend amount and an average return proportion, we might be inclined to take the product_of_average approach, which yields 26.9922866. (This is a difference of roughly 9.05 relative to an average purchase amount of about 150.)

At first, this may seem unintuitive until we write out the formulas and realize that these metrics are, in fact, two very different quantities:

\[ \frac{ \sum_{i=1}^{n} \text{Spend}_i }{ \sum_{i=1}^{n} 1 } \times \frac{ \sum_{i=1}^{n} I(\text{Return}_i) }{ \sum_{i=1}^{n} 1 } \] over all \(n\) orders

versus

\[ \frac{ \sum_{i=1}^{n} \text{Spend}_i \times I(\text{Return}_i) }{ \sum_{i=1}^{n} 1 } \]

If this still feels counterintuitive, we can see how much of the difference is accounted for by the interrelation between our two variables. In the following code, we break that relationship by randomly reordering the ind_return variable so that it no longer has any true relationship to the corresponding amt_spend values.

# randomly reorder one of two variables to break relationships ----
ind_return <- sample(ind_return, size = 200)

# recompute variables ----
average_of_product <- mean(amt_spend * ind_return)
product_of_average <- mean(amt_spend) * mean(ind_return)

After redoing the calculations, we find that the two values are much closer: average_of_product is now 24.1041313 and product_of_average remains 26.9922866. Notably, these are still not the same number, so this does not mean the two formulas are equivalent whenever the variables are unrelated; however, this second result once again illustrates the extent to which interrelations can defy our naive intuitions.

4.2.5 Average over what? (TODO)

no such thing as an unweighted average (just sometimes weights are equal)

formal definition of expected value forces you to pick a probability distribution

eg avg mpg by time vs by mileage? (sketched below)

not strictly an error but our language allows an ill-defined problem
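
As a placeholder sketch of the fuel-economy note above (every number here is invented for illustration), the “average mpg” depends entirely on what we choose to average over:

# hypothetical fuel economy across three trips of very different lengths ----
mpg   <- c(30, 20, 10)
miles <- c(10, 100, 300)

mean(mpg)                      # equal weight per trip
weighted.mean(mpg, w = miles)  # weight each trip by miles driven
sum(miles) / sum(miles / mpg)  # total miles divided by total gallons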

4.2.6 Dichotomization and distributions
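
The following simulation keeps the same underlying relationship (y equals 5 + 3x plus noise) but changes the distribution of x: in the first case x is uniform over (0, 1), while in the second most of its mass is concentrated just around 0.5. Dichotomizing x at 0.5 and comparing group means makes the relationship look much weaker in the second case, and the correlation also drops, simply because x varies less.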

n <- 1000

# simulate x and y: uniformly distributed x ----
x1 <- runif(n)
y1 <- 5 + 3*x1 + rnorm(n)

# simulate x and y: same relationship, more concentrated distribution of x ----
x2 <- c( runif(n*0.1, 0.00, 0.44), 
         runif(n*0.8, 0.45, 0.55), 
         runif(n*0.1, 0.55, 1.00) 
        )
y2 <- 5 + 3*x2 + rnorm(n)

# dichotomize x at 0.5 and compare group means of y ----
g1 <- ifelse(x1 < 0.5, 0, 1)
means1 <- c(mean(y1[g1 == 0]), mean(y1[g1 == 1]))
means1
[1] 5.813698 7.254438
g2 <- ifelse(x2 < 0.5, 0, 1)
means2 <- c(mean(y2[g2 == 0]), mean(y2[g2 == 1]))
means2
[1] 6.218347 6.762943

# correlations also weaken when x is concentrated near the cut point ----
cor(x1, y1)
[1] 0.6424276
cor(x2, y2)
[1] 0.3687667

4.2.7 Small sample sizes

4.3 Proportions (WIP)

Note that proportions are of course just a type of average (an average of indicators), but it is helpful to examine their challenges separately.

4.3.1 Picking the right denominator

4.3.2 Sample size effects

4.4 Variation (TODO)
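
As a preliminary illustration, the two small vectors below have identical means and similar variances, yet they are distributed quite differently across their ranges: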

x <- c(-20, -10, -5, 0, 5, 10, 20)
mean(x)
[1] 0
var(x)
[1] 175
x <- c(-15, -15, -5, 0, 5, 15, 15)
mean(x)
[1] 0
var(x)
[1] 158.3333

4.5 Correlation (WIP)

As shown in Figure 4.4, we will explore how different correlation measures behave using a handful of common functions of x evaluated over the range 1 to 10.

Figure 4.4: Plots of x from 1 to 10 over a range of common functions

4.5.1 Linear relationships only
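
Pearson correlation measures the strength of a linear relationship, so even a perfectly deterministic, monotone relationship registers a correlation below 1 when it is nonlinear: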

x <- 1:10
y <- list(linear = x, quadratic = x**2, cubic = x**3, quartic = x**4)

vapply(y, FUN = function(z) cor(x, z, method = "pearson"), FUN.VALUE = numeric(1))
   linear quadratic     cubic   quartic 
1.0000000 0.9745586 0.9283912 0.8816779 

4.5.2 Multiple forms

Traditional (Pearson) correlation depends on the specific values of each variable, whereas Spearman and Kendall correlations depend only on their ranks (order statistics).

# polynomials ----
x <- 1:10
y <- list(linear = x, quadratic = x**2, cubic = x**3, quartic = x**4)

vapply(y, FUN = function(z) cor(x, z, method = "pearson"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "spearman"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "kendall"), FUN.VALUE = numeric(1))
   linear quadratic     cubic   quartic 
1.0000000 0.9745586 0.9283912 0.8816779 
   linear quadratic     cubic   quartic 
        1         1         1         1 
   linear quadratic     cubic   quartic 
        1         1         1         1 

We see similar results with a different set of functions. Note that sin(x) is not monotone over this range, so even the rank-based measures are near zero for it:

# other functional forms ----
x <- 1:10
y <- list(sin(x), sqrt(x), exp(x), log(x))

vapply(y, FUN = function(z) cor(x, z, method = "pearson"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "spearman"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "kendall"), FUN.VALUE = numeric(1))
[1] -0.1705046  0.9891838  0.7168704  0.9516624
[1] -0.1393939  1.0000000  1.0000000  1.0000000
[1] -0.1111111  1.0000000  1.0000000  1.0000000

4.5.3 Sensitivity to domain

The “strength” of the relationship (completely deterministic) is the same in both cases below; however, the summarization of that relationship changes with the domain of x.

Here’s the same case as before:

# polynomials ----
x <- 1:10
y <- list(linear = x, quadratic = x**2, cubic = x**3, quartic = x**4)

vapply(y, FUN = function(z) cor(x, z, method = "pearson"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "spearman"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "kendall"), FUN.VALUE = numeric(1))
   linear quadratic     cubic   quartic 
1.0000000 0.9745586 0.9283912 0.8816779 
   linear quadratic     cubic   quartic 
        1         1         1         1 
   linear quadratic     cubic   quartic 
        1         1         1         1 

And here’s a different range:

# polynomials, diff range ----
x <- -10:10
y <- list(linear = x, quadratic = x**2, cubic = x**3, quartic = x**4)

vapply(y, FUN = function(z) cor(x, z, method = "pearson"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "spearman"), FUN.VALUE = numeric(1))
vapply(y, FUN = function(z) cor(x, z, method = "kendall"), FUN.VALUE = numeric(1))
   linear quadratic     cubic   quartic 
1.0000000 0.0000000 0.9179069 0.0000000 
   linear quadratic     cubic   quartic 
        1         0         1         0 
   linear quadratic     cubic   quartic 
        1         0         1         0 
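
Over the symmetric domain from -10 to 10, the quadratic and quartic curves are even functions: they take the same value at x and -x, so y is no longer a monotone function of x and both the Pearson and the rank-based correlations come out to exactly zero.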

4.5.4 Partial correlation

A lot of EDA starts with some sort of correlation matrix. This won’t always account for the fact that some variables (such as a grouping variable) can mask the correlation between others.

Consider two groups with trends in opposite directions:

n_obsvs <- 10
n_group <-  2

group <- rep(1:n_group, each = n_obsvs)
var1  <- rep(1:n_obsvs, times = n_group)
var2  <- var1 * rep(c(5, -5), each = n_obsvs)
var3  <- var1 * rep(c(1,  5), each = n_obsvs)

As Figure 4.5 shows, each group has a perfectly linear relationship between var1 and var2 (and between var1 and var3), but the slopes differ: opposite signs for var2 and different positive slopes for var3.

Figure 4.5: Subgroups demonstrating opposing linear relationships

Because of the opposing trends, the overall correlation between var1 and var2 is zero:

cor(var1, var2, method = "pearson")
cor(var1, var2, method = "spearman")
cor(var1, var2, method = "kendall")
[1] 0
[1] 0
[1] 0

However, within each group the relationship is perfectly linear (correlations of 1 and -1, respectively):

cor(var1[group == 1], var2[group == 1])
cor(var1[group == 2], var2[group == 2])
[1] 1
[1] -1

A similar attenuation happens when the relationships have the same sign but different slopes:

cor(var1, var3, method = "pearson")
[1] 0.5703518

while the correlation within each group is still exactly 1:

cor(var1[group == 1], var3[group == 1])
cor(var1[group == 2], var3[group == 2])
[1] 1
[1] 1

Even partial correlation does not help in the case of opposing signs:

library(ppcor)
pcor(data.frame(var1, var2, group))$estimate
      var1       var2      group
var1     1  0.0000000  0.0000000
var2     0  1.0000000 -0.8864053
group    0 -0.8864053  1.0000000

It improves the strength of the estimated correlation in the case of var3 but still fails to estimate it correctly. Partial correlation assumes a form like var3 ~ var1 + group and not var3 ~ var1 * group.

pcor(data.frame(var1, var3, group))$estimate
            var1      var3      group
var1   1.0000000 0.8320503 -0.7375338
var3   0.8320503 1.0000000  0.8864053
group -0.7375338 0.8864053  1.0000000

4.7 Comparisons (TODO)

4.7.1 Percents versus percentage points

4.7.2 Changes with small bases
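
The same 0.5 percentage-point increase is a 50% relative change on a 1% base but well under a 1% relative change on an 80% base: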

(0.015 - 0.010) / 0.010
[1] 0.5
(0.805 - 0.800) / 0.800
[1] 0.00625
0.015 / 0.010
[1] 1.5
0.805 / 0.800
[1] 1.00625

4.8 Segmentation (TODO)

transient segmentation means can’t track trends within

4.9 Outliers (TODO)

4.10 Strategies (TODO)

4.11 Real World Disasters (TODO)

Straight vs weighted averages in COVID positivity rates (May 2020)

The changes could result in real-world differences for Hoosiers, because the state uses a county’s positivity rate as one of the numbers to determine which restrictions that county will face. Those restrictions determine how many people may gather, among other items.

Some Hoosiers may see loosened restrictions because of the changes. While Box said the county-level impact will be mixed, she predicted some smaller counties will see a decline in positivity rate after the changes.

“The change to the methodology is how we calculate the seven-day positivity rate for counties. In the past, similar to many states, we’ve added each day’s positivity rate for seven days and divided by seven to obtain the week’s positivity rate. Now we will add all of the positive tests for the week and divide by the total tests done that week to determine the week’s positivity rate. This will help to minimize the effect that a high variability in the number of tests done each day can have on the week’s overall positivity, especially for our smaller counties.”

There are three issues here.

First, straight versus weighted averages:

# with equal denominators, the average of ratios equals the ratio of sums ----
avg_of_ratios <- (10/100 + 90/100) / 2
ratio_of_sums <- (10 + 90) / (100 + 100)
avg_of_ratios == ratio_of_sums
[1] TRUE

# with unequal denominators, the two quantities diverge ----
avg_of_ratios_uneq <- (10/100 + 180/200) / 2
ratio_of_sums_uneq <- (10 + 180) / (100 + 200)
avg_of_ratios_uneq == ratio_of_sums_uneq
[1] FALSE

# weighting each ratio by its share of the denominator recovers the ratio of sums ----
weightavg_of_ratios_uneq <- (100/300)*(10/100) + (200/300)*(180/200)
ratio_of_sums_uneq == weightavg_of_ratios_uneq
[1] TRUE

Then, back to the data for why it matters.

If the data are from the same distribution, this could increase variance but shouldn’t affect the mean.

Recall that the standard deviation of a sample proportion is \(\sqrt{p(1-p)/n}\)
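
For the sample sizes used in the simulation below, that works out to roughly 0.16, 0.05, and 0.02:

# standard deviation of a sample proportion with p = 0.5 ----
sqrt(0.5 * 0.5 / c(10, 100, 500))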

link to discussions of sample size and different types of averages

set.seed(123)

# define simulation parameters ----
## n: total draws from binomial distribution
## p: proportion of successes
p <- 0.5
n <- 1000

# sample from binomials of different size ----
s010 <- rbinom(n,  10, p) /  10
s100 <- rbinom(n, 100, p) / 100
s500 <- rbinom(n, 500, p) / 500

# set results as dataframe for inspection ----
df <- data.frame(
  s = rep(c(10, 100, 500), each = n),
  x = c(s010, s100, s500)
)
library(ggplot2)

ggplot(data = df) +
  aes(x = x, col = as.character(s)) +
  geom_density() +
  geom_vline(xintercept = p, col = 'darkgrey', linetype = 2) +
  labs(
    title = "Sampling Distribution for p = 0.5",
    col = "Sample Size"
  ) +
  scale_x_continuous(breaks = seq(0, 1, 0.1), labels = seq(0, 1, 0.1)) +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

But low-sample days in the real world are probably also a sign of a different distribution (only very urgent cases get tested?).