1: Monash University, Department of Econometrics and Business Statistics
nicholas.tierney@gmail.com
2: Monash University, Department of Econometrics and Business Statistics
dicook@monash.edu
3: Queensland University of Technology, ARC Centre of Excellence for Statistical and Mathematical Frontiers
milesmcbain@gmail.com
Keywords
- Missing Data
- Exploratory Data Analysis
- Imputation
- Data Visualization
- Data Mining
- Statistical Graphics
Missing values are ubiquitous in data and need to be explored and handled carefully in the initial stages of analysis to avoid bias. However, exploring why and how values are missing is typically an inefficient process. For example, ggplot2 (Wickham 2009) omits missing values from a plot with a warning, and base R omits them silently. Additionally, imputed values are not typically distinguished from observed values in visualisations and data summaries. Tidy data structures (Wickham 2014), in which each row is an observation and each column is a variable, provide an efficient, easy and consistent approach to data manipulation and wrangling. There are currently no guidelines for representing missing data structures in a tidy format, nor simple approaches to visualising missing values. This paper describes an R package, naniar, for exploring missing values in data with minimal deviation from the common workflows of ggplot2 and tidy data. The package provides data structures and functions that ensure missing values are handled effectively when plotting and summarising data, and when examining the effects of imputation.
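To make the idea of a tidy representation of missingness concrete, the following is a minimal sketch in base R only. It does not use naniar's functions and should not be read as the package's implementation. The toy data, the `_NA` column suffix and the `"!NA"`/`"NA"` labels are assumptions chosen for illustration.

```r
# Toy data frame with missing values (illustrative data, not from the paper)
df <- data.frame(
  temp = c(21.5, NA, 19.0, 23.1),
  wind = c(7.4, 8.0, NA, NA)
)

# One indicator column per variable, recording whether each value is missing.
# The labels and the "_NA" suffix are naming choices made for this sketch.
shadow <- as.data.frame(lapply(df, function(x) {
  factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(df), "_NA")

# Column-bind the indicators to the data, so that each row is still one
# observation and each column is still one variable.
df_shadow <- cbind(df, shadow)

# Per-variable counts of missing values
n_miss <- colSums(is.na(df))
```

Because the indicators are ordinary columns, they can be used for grouping, summarising or plot aesthetics in the same way as any other variable, which is one way missingness can be kept visible instead of being dropped.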
References
Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23.