Data Disasters

Navigating the complexities of real-world data analysis
Author

Emily Riederer

Preface

Training in data analysis often begins with a Statistics 101 course. Students learn the “happy path” of analyzing data that adheres to specific assumptions (such as “independent and identically distributed with a Normal density”) and answering pre-specified questions (most notably, the infamous null hypothesis significance test). Then, they venture out into the world of real-world data analysis, where non-experimental data is rarely so well behaved and the questions asked of it are far more nuanced.

No one course should aim to teach students everything they should know about statistics. In fact, one of the best parts about a career in statistics is the responsibility and privilege of life-long learning. However, the flaw of introductory statistics is not that it is incomplete, but that it is not obvious how it is incomplete. Statistics is a bad salesman. There’s no season finale, no cliff hanger, no teasing and hinting and promising more and better to come. Students may leave their studies believing that answering more complex data analysis questions is trivially easy (by relying on the one-size-fits-all “panacea” that they learned) or intractably difficult (when the assumptions of that method are not met).

This book attempts to add more color to the many dimensions of data analysis, showcasing the nuances that arise throughout its true life cycle, using two strategies.

First, it attempts to highlight common pitfalls in all parts of data analysis: from data management and computation to visualization, interpretation, and modeling, and even to communication and collaboration. Data analysis is fundamentally a creative task, so there are rarely canonical one-size-fits-all solutions. Curiously, however, there are plenty of canonical issues, even if they require different solutions in different settings. Thus, the goal of this book is to highlight common data disasters and, in doing so, help students cultivate an intuition for how to detect common problems before they occur in an important analysis.

Second, while exploring these data disasters, we humbly put forth a (woefully incomplete!) literature review of more advanced methods from statistics and other quantitative disciplines (e.g., economics, epidemiology) to help learners build a “mental index” of terms to search and techniques to study should they encounter a relevant problem.

The content in this book is currently being developed and is all subject to change.

Chapters and sections tagged as WIP (work-in-progress) have substantial content and are suitable for reading.

Chapters and sections tagged as TODO have minimal outlines or code examples (if that).

Main Topics

In particular, we will aim to help you avoid twelve types of data disasters:

  • Data Dalliances: Misinterpreting or misusing data based on how it was collected or what it represents
  • Computational Quandaries: Letting computers do what you said and not what you meant
  • Egregious Aggregations: Losing critical information when information is condensed
  • Vexing Visualization: Confusing ourselves or others with plotting choices
  • Incredible Inferences: Drawing incorrect conclusions from analytical results
  • Cavalier Causality: Falling prey to spurious correlations masquerading as causality
  • Mindless Modeling: Failing to get the most value out of models by not tailoring the features, targets, and performance metrics
  • Alternative Algorithms: Lacking an understanding of alternative methods which may be better suited for the problem at hand
  • Futile Findings: Asking and answering questions that aren’t useful
  • Complexifying Code: Making projects unwieldy or more difficult to understand than necessary
  • Rejecting Reproducibility: Working inefficiently instead of adopting an efficient, reproducible, and shareable workflow
  • Mourning Mistakes: Letting the perfect be the enemy of the good

Common Themes

In each chapter, we will see numerous examples of each disaster and consider strategies to help us mitigate them. Along the way, we’ll emphasize:

  • The importance of domain knowledge and the data-generating process to decide what it is you want to do
  • The utility of simulation as a tool to explore if, in fact, you are doing it
  • The exploration of counterexamples to build intuition for common patterns of problems even where common solutions don’t exist

As we go, we will notice three common themes that challenge the focus of introductory statistics:

  • Summary statistics mask interesting stories that emerge when we focus on the variation (see the sketch after this list)
  • Similarly, observations and variables are rarely independent; the story is in the covariance
  • Assumptions of Normality, or more broadly symmetry, are often inappropriate in a wonky, highly skewed world
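To make the first and third themes concrete, here is a minimal, hypothetical sketch (not drawn from the book’s examples) that uses simulation: two samples share essentially the same mean, yet their shapes tell very different stories. The distributions and seed below are arbitrary choices for illustration.

```python
# Hypothetical illustration: nearly identical means, very different stories.
import numpy as np

rng = np.random.default_rng(2023)  # arbitrary seed for reproducibility

# Symmetric, "well-behaved" data vs. highly skewed data, both with mean ~10
symmetric = rng.normal(loc=10, scale=2, size=10_000)
skewed = rng.exponential(scale=10, size=10_000)

for name, x in [("symmetric", symmetric), ("skewed", skewed)]:
    q05, q50, q95 = np.percentile(x, [5, 50, 95])
    print(f"{name:>9}: mean={x.mean():5.1f}  median={q50:5.1f}  "
          f"5th pct={q05:6.1f}  95th pct={q95:6.1f}")

# Both means are close to 10, but the skewed sample's median and tails
# diverge sharply, so the mean alone hides most of the story.
```

Reporting only the mean would make these two samples look interchangeable; the quantiles reveal how differently they behave.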