Before becoming a data analyst, I worked at a Big Four accounting firm as a risk assurance associate. My team and I consulted for FS (financial services) and P&S (products & services) companies, helping them address risks and test the efficiency and effectiveness of the controls they had in place. For bulge bracket banks like JPM or GS, one such risk is fat finger error: a buy or sell market order far larger in volume than intended. Controls for that risk include capping the market value of an order by employee level and requiring approval from a manager or more senior employee.

Dealing with data on a day-to-day basis, I can't help but think about the controls data people currently have in place. For instance, when analyzing basketball players' weights, how much EDA (exploratory data analysis) is involved? How does EDA even begin? If some rows contain the values 17, 18, and 19, will any flags be raised? (The true weights might be 170, 180, and 190.) The person who generated the data matters: maybe their "0" key broke, or maybe the data simply wasn't entered correctly.
There are a number of EDA options available to data professionals that help catch fat-finger-like errors, and integrating a few of them into your workflow can keep you from drawing the wrong conclusions. Below, I use R's built-in USArrests dataset to illustrate three of them:
1) Use describe() from the Hmisc package, which conveniently returns n, the number of missing values, the number of distinct values, the mean, the 5 lowest and 5 highest values, and a range of percentiles for each variable.
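Assuming the Hmisc package is installed from CRAN, a minimal sketch of the call that produces the output below:

```r
# install.packages("Hmisc")   # one-time install, if needed
library(Hmisc)

# describe() summarizes every column of the data frame:
# counts, missing/distinct values, mean, percentiles, and extremes
describe(USArrests)
```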
USArrests

 4  Variables      50  Observations
---------------------------------------------------------------------------
Murder
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       43        1    7.788    5.022    2.145    2.560
     .25      .50      .75      .90      .95
   4.075    7.250   11.250   13.320   15.400

lowest :  0.8  2.1  2.2  2.6  2.7, highest: 13.2 14.4 15.4 16.1 17.4
---------------------------------------------------------------------------
Assault
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       45        1    170.8    96.44    50.25    56.90
     .25      .50      .75      .90      .95
  109.00   159.00   249.00   279.60   297.30

lowest :  45  46  48  53  56, highest: 285 294 300 335 337
---------------------------------------------------------------------------
UrbanPop
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       36    0.999    65.54    16.74    44.00    45.00
     .25      .50      .75      .90      .95
   54.50    66.00    77.75    83.20    86.55

lowest : 32 39 44 45 48, highest: 85 86 87 89 91
---------------------------------------------------------------------------
Rape
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       48        1    21.23    10.48     8.75    10.67
     .25      .50      .75      .90      .95
   15.08    20.10    26.17    32.40    39.74

lowest :  7.3  7.8  8.3  9.3  9.5, highest: 35.1 38.7 40.6 44.5 46.0
---------------------------------------------------------------------------
2) Use summary() from base R, which provides a high-level overview of the range of values across the variables.
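This one is a single line, with no packages needed:

```r
# summary() prints the min, quartiles, mean, and max for each numeric column
summary(USArrests)
```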
     Murder          Assault         UrbanPop          Rape
 Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
 1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
 Median : 7.250   Median :159.0   Median :66.00   Median :20.10
 Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
 3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
 Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
3) Use head() or tail() from base R to inspect the first or last n rows.
Tip: you can specify how many rows to show instead of relying on the default of 6.
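For instance, asking for seven rows rather than the default six (matching the seven states shown below):

```r
# head(x, n) returns the first n rows; tail(x, n) returns the last n
head(USArrests, n = 7)
```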
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Podcast Episode Recommendations:
In episodes 51 and 52 of Not So Standard Deviations, Roger Peng and Hilary Parker talk about A) why talking to the people who generate the data matters (E52: https://bit.ly/2odmosk) and B) whether it's possible to evaluate a data analysis without knowing who the author is (E51: https://bit.ly/2PE8Z8Y).