Exploratory Data Analysis (EDA) is an approach to analyzing data that emphasizes exploring datasets for patterns and insights without any predetermined hypotheses.
The goal is to let the data “speak for themselves” and guide analysis, rather than imposing rigid structures or theories.
Purpose
Exploratory data analysis (EDA) is mainly used to investigate datasets in order to discover patterns, spot anomalies, form and tentatively test hypotheses, and check assumptions.
EDA has several key goals:
- Quickly summarize and describe the characteristics of a dataset. By visualizing data distributions and calculating descriptive statistics, we can get an overview of the salient properties of variables in the dataset.
- Check the quality of the data and identify any issues. Data visualization and summary statistics often reveal missing values, errors, and outliers that may need to be addressed before proceeding with analysis.
- Formulate hypotheses and derive insights by exploring interesting aspects of the data. Patterns may suggest causal hypotheses to test via statistical modeling. Outliers often contain useful domain insights.
- Understand relations between variables. Visualizations can uncover the nature of bivariate relationships – shape, direction, form, outliers, etc. This guides correlation and regression modeling choices.
- Test assumptions for statistical models you intend to apply later. Histogram shapes indicate whether distributional assumptions, such as normality, are plausible. Scatterplots check linearity assumptions. Identifying where assumptions break down shows which data transformations or alternative models are needed.
In essence, EDA entails active investigation of what our data contains even before formal modeling to guide choices, reveal issues needing resolution, and ensure we squeeze all potential value from our data resources.
The flexibility and lack of stringent assumptions make EDA invaluable for open-ended understanding.
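The data-quality goal above can be illustrated with a minimal sketch, using only the Python standard library. The records, column names, and the helper functions `is_missing` and `quality_report` are hypothetical, invented here for illustration. The idea is simply to count missing values per column and look at the range of what remains, so that implausible entries stand out.

```python
import math

# Hypothetical records; None and NaN stand in for missing values.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 29, "income": float("nan")},
    {"age": 41, "income": 61000.0},
    {"age": 230, "income": 58000.0},  # implausible age, likely an entry error
]

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def quality_report(rows):
    """Per column: row count, missing count, and min/max of the present values."""
    report = {}
    for col in rows[0].keys():
        present = [r[col] for r in rows if not is_missing(r[col])]
        report[col] = {
            "n": len(rows),
            "missing": len(rows) - len(present),
            "min": min(present),
            "max": max(present),
        }
    return report

report = quality_report(records)
for col, stats in report.items():
    print(col, stats)
```

Here the maximum age of 230 is the kind of value a quick range check surfaces. Whether it is a typo or something else is a question for the analyst, not the code.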
Examples
Exploratory Data Analysis (EDA) emphasizes flexibility and exploring different approaches to let key aspects of datasets emerge, rather than rigidly testing hypotheses from the start.
It is an iterative cycle where we analyze, visualize, and transform data to extract meaning.
EDA principles underlie “data science” and complement traditional statistical inference. Smart use of EDA provides a rich understanding of phenomena that can guide the construction of causal theories and models.
Visualizations
Visualization means creating graphs, charts, and plots to visually inspect data distributions, relationships between variables, outliers, and other features.
Pictures allow our powerful visual perception to notice things numerical summaries may miss.
For example, a scatterplot may reveal that income and education level have a curvilinear relationship, challenging the assumption that the relationship is linear.
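As a rough illustration of the scatterplot idea, the sketch below draws a character-grid scatterplot using only the Python standard library. In practice a plotting library would normally be used. The function `text_scatter` and the education and income figures are made up for this example and are not real data.

```python
def text_scatter(xs, ys, width=40, height=10):
    """Render (x, y) points on a character grid and return the rows as strings.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["".join(line) for line in grid]

# Made-up data: income rises with years of education, but faster at the top end.
education_years = list(range(8, 21))
income = [20 + 0.5 * (e - 8) ** 2 for e in education_years]

rows = text_scatter(education_years, income)
print("\n".join(rows))
```

The printed points bend upward rather than following a straight line, which is the kind of curvature a single correlation coefficient would not show.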
Summarizing
Describing key statistics of datasets to understand central tendency (mean, median), spread (variance, percentiles), shape (skewness), outliers, and so on.
These numerical summaries complement visual inspection.
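A minimal sketch of such a summary, using Python's built-in `statistics` module, might look like the following. The function name `summarize` and the sample values are hypothetical, and the skewness is computed as the average cubed z-score (a population formula), which is one common definition among several.

```python
import statistics

def summarize(values):
    """Central tendency, spread, and shape for a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    # Population skewness: the average cubed z-score.
    skew = sum(((v - mean) / sd) ** 3 for v in values) / len(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),
        "q1": q1,
        "q3": q3,
        "skewness": skew,
    }

# Made-up sample with one large value pulling the mean upward.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 30]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean well above the median, together with positive skewness, signals a long right tail that a histogram of the same values would also show.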
Data transformations
Applying mathematical functions to “reshape” datasets to simplify observed patterns.
For example, taking the logarithm of a strongly right-skewed variable like income may make its distribution more symmetric and closer to normal, which can ease problems with statistical tests that assume normality.
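The effect of a log transformation on skewness can be checked directly. The sketch below uses invented income figures and a hypothetical helper `skewness` (average cubed z-score). It assumes all values are strictly positive, since the logarithm is undefined otherwise.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the average cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Invented incomes with a long right tail; all values must be positive.
incomes = [18_000, 22_000, 25_000, 31_000, 40_000,
           55_000, 90_000, 250_000, 1_200_000]
log_incomes = [math.log10(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skewness before: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```

With these values the transformed data remain somewhat right-skewed, only less so. A transformation reduces asymmetry but does not guarantee normality, so the result is worth re-inspecting.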
Outlier detection
Identifying anomalies that distort overall patterns, and either correcting erroneous values or analyzing outliers specifically since they often reveal useful insights about the phenomenon under study.
Fence methods, studentized residuals, Cook’s distance, and other techniques help detect outliers.
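The fence method can be sketched in a few lines. The version below flags values more than 1.5 interquartile ranges beyond the quartiles, the conventional multiplier for Tukey's inner fences. The function name `fence_outliers` and the reaction-time values are hypothetical, and quartile definitions differ slightly between software packages, so the exact fences may vary.

```python
import statistics

def fence_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up reaction times in milliseconds, with one suspicious value.
reaction_times = [310, 295, 342, 305, 298, 330, 315, 1200, 301, 322]
outliers = fence_outliers(reaction_times)
print(outliers)
```

Flagging a value is only the first step. Whether it is an error to correct or an observation worth studying depends on the context.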
Interrogating with different analysis techniques
Trying various statistical and machine learning techniques to understand different facets of datasets.
Rather than sticking to predetermined analysis plans, the focus is on using diverse tools suited to the particular dataset.
Techniques could include clustering algorithms, decision trees, linear regression, ANOVA, etc., based on the data characteristics and research goals.
Descriptive vs. Exploratory Analysis
Descriptive analysis focuses on summarizing what the data shows on the surface. Exploratory analysis digs deeper to uncover subtle patterns and non-obvious trends in the data.
Descriptive analysis might report a dataset’s mean, median, and standard deviation. Exploratory analysis would use visualizations, transformations, and a range of analytical techniques to examine the relationships between variables beyond summary statistics alone.
So descriptive analysis describes what the data shows, while exploratory analysis explores nuances in the data to extract deeper meaning.
But good data analysis uses both techniques – summary statistics to complement the graphs and visuals revealing relationships.
Descriptive Analysis
- Summarizes and presents the data without making inferences or models
- Uses simple graphics like histograms, bar charts, summary statistics
- Goal is to describe patterns in the data
Exploratory Analysis
- Makes inferences about patterns, relationships, effects in the data
- Relies heavily on graphics and visualization
- Transforms/manipulates the data to extract meaning
- Iterative cycle to understand the data
- Goal is to extract deeper insights from the data