Data Disasters

Navigating the complexities of real-world data analysis
Author

Emily Riederer

Preface

Training in data analysis often begins with a Statistics 101 course. Students learn the “happy path” of analyzing data that adheres to specific assumptions (such as “independent and identically distributed with a Normal density”) and answering pre-specified questions (most notably, the infamous null hypothesis significance test). Then, they venture out into the world of real-world data analysis, where non-experimental data is rarely so well behaved and the questions asked of it are far more nuanced.

No one course should aim to teach students everything they should know about statistics. In fact, one of the best parts about a career in statistics is the responsibility and privilege of life-long learning. However, the flaw of introductory statistics is not that it is incomplete, but that it is not obvious how it is incomplete. Statistics is a bad salesman. There’s no season finale, no cliff hanger, no teasing and hinting and promising more and better to come. Students may leave their studies believing that answering more complex data analysis questions is trivially easy (by relying on the one-size-fits-all “panacea” that they learned) or intractably difficult (when the assumptions of that method are not met).

This book attempts to add more color to the many dimensions of data analysis, showcasing the nuances that arise throughout its true life cycle, using two strategies.

First, it attempts to highlight common pitfalls in all parts of data analysis: from data management and computation to visualization, interpretation, and modeling, and even to communication and collaboration. Data analysis is fundamentally a creative task, so there are rarely canonical one-size-fits-all solutions. Curiously, however, there are plenty of canonical issues, even if they require different solutions in different settings. Thus, the goal of this book is to highlight common data disasters and, in doing so, help students cultivate an intuition for how to detect common problems before they occur in an important analysis.

Second, while exploring these data disasters, we humbly put forth a (woefully incomplete!) literature review of more advanced methods from statistics and other quantitative disciplines (e.g., economics, epidemiology) to help learners build a “mental index” of terms to search and techniques to study should they encounter a relevant problem.

The content in this book is currently being developed and is all subject to change.

Chapters and sections tagged as WIP (work-in-progress) have substantial content and are suitable for reading.

Chapters and sections tagged as TODO have minimal outlines or code examples (if that).

Main Topics

In particular, we will aim to help you avoid twelve types of data disasters:

  • Data Dalliances: Misinterpreting or misusing data based on how it was collected or what it represents
  • Computational Quandaries: Letting computers do what you said and not what you meant
  • Egregious Aggregations: Losing critical information when information is condensed
  • Vexing Visualization: Confusing ourselves or others with plotting choices
  • Incredible Inferences: Drawing incorrect conclusions from analytical results
  • Cavalier Causality: Falling prey to spurious correlations masquerading as causality
  • Mindless Modeling: Failing to get the most value out of models by not tailoring the features, targets, and performance metrics
  • Alternative Algorithms: Lacking an understanding of alternative methods which may be better suited for the problem at hand
  • Futile Findings: Asking and answering questions that aren’t useful
  • Complexifying Code: Making projects unwieldy or more difficult to understand than necessary
  • Rejecting Reproducibility: Working inefficiently instead of adopting an efficient, reproducible, and shareable workflow
  • Mourning Mistakes: Letting the perfect be the enemy of the good

Common Themes

In each chapter, we will see numerous examples of each disaster and consider strategies to help us mitigate them. Along the way, we’ll emphasize:

  • The importance of domain knowledge and the data-generating process to decide what it is you want to do
  • The utility of simulation as a tool to explore if, in fact, you are doing it
  • The exploration of counterexamples to build intuition for common patterns of problems even where common solutions don’t exist

As we go, we will notice three common themes that challenge the focus of introductory statistics:

  • Summary statistics mask interesting stories that emerge when we focus on the variation (see the sketch after this list)
  • Similarly, observations and variables are rarely independent; the story is in the covariance
  • Assumptions of Normality, or more broadly symmetry, are often inappropriate in a wonky, highly skewed world
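To make the first and third themes concrete, here is a minimal, hypothetical sketch (not drawn from the book’s examples) that uses simulation: two samples share essentially the same mean, yet their shapes tell very different stories. The distributions and seed below are arbitrary choices for illustration.

```python
# Hypothetical illustration: nearly identical means, very different stories.
import numpy as np

rng = np.random.default_rng(2023)  # arbitrary seed for reproducibility

# Symmetric, "well-behaved" data vs. highly skewed data, both with mean ~10
symmetric = rng.normal(loc=10, scale=2, size=10_000)
skewed = rng.exponential(scale=10, size=10_000)

for name, x in [("symmetric", symmetric), ("skewed", skewed)]:
    q05, q50, q95 = np.percentile(x, [5, 50, 95])
    print(f"{name:>9}: mean={x.mean():5.1f}  median={q50:5.1f}  "
          f"5th pct={q05:6.1f}  95th pct={q95:6.1f}")

# Both means are close to 10, but the skewed sample's median and tails
# diverge sharply, so the mean alone hides most of the story.
```

Reporting only the mean would make these two samples look interchangeable; the quantiles reveal how differently they behave.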