Before becoming a data analyst, I worked at a Big Four accounting firm as a risk assurance associate. My team and I consulted for FS (financial services) and P&S (products & services) companies, helping them address risks and test the efficiency and effectiveness of the controls they had in place. For bulge bracket banks like JPM or GS, one such risk is fat finger error: a buy or sell market order that is far larger in volume than intended. Controls for that risk include limiting the market value of an order by employee level and requiring approval from managers or more senior employees.

Dealing with data on a day-to-day basis, I can't help but think about the controls data people currently have in place. For instance, when analyzing basketball players' weights, how much EDA (exploratory data analysis) is involved? How does EDA even begin? If some rows contain the values 17, 18, and 19, will any flags be raised? (The true weights were probably 170, 180, and 190.) The person who generated the data matters: maybe his or her "0" key broke, or maybe the data simply wasn't entered correctly.
There are a number of EDA options available to data professionals that help catch fat finger-like errors, and integrating some of them into your workflow can keep you from drawing the wrong conclusions. Below, I use the USArrests dataset to illustrate a few:
1) Use the describe() function from the Hmisc package, which conveniently returns n, the number of missing values, the number of distinct values, the mean, the 5 lowest and 5 highest values, and several percentile levels for each variable.
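The output below comes from a call along these lines (a minimal sketch, assuming the Hmisc package is installed from CRAN):

# Load Hmisc and profile every variable in the built-in USArrests dataset
# install.packages("Hmisc")   # uncomment if Hmisc is not yet installed
library(Hmisc)
describe(USArrests)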
USArrests

 4  Variables      50  Observations
-------------------------------------------------------------------------
Murder
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       43        1    7.788    5.022    2.145    2.560
     .25      .50      .75      .90      .95
   4.075    7.250   11.250   13.320   15.400

lowest :  0.8  2.1  2.2  2.6  2.7, highest: 13.2 14.4 15.4 16.1 17.4
-------------------------------------------------------------------------
Assault
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       45        1    170.8    96.44    50.25    56.90
     .25      .50      .75      .90      .95
  109.00   159.00   249.00   279.60   297.30

lowest :  45  46  48  53  56, highest: 285 294 300 335 337
-------------------------------------------------------------------------
UrbanPop
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       36    0.999    65.54    16.74    44.00    45.00
     .25      .50      .75      .90      .95
   54.50    66.00    77.75    83.20    86.55

lowest : 32 39 44 45 48, highest: 85 86 87 89 91
-------------------------------------------------------------------------
Rape
       n  missing distinct     Info     Mean      Gmd      .05      .10
      50        0       48        1    21.23    10.48     8.75    10.67
     .25      .50      .75      .90      .95
   15.08    20.10    26.17    32.40    39.74

lowest :  7.3  7.8  8.3  9.3  9.5, highest: 35.1 38.7 40.6 44.5 46.0
-------------------------------------------------------------------------
2) Use summary() from base R, which provides a high-level overview of the range of values across variables.
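Something like the following one-liner produces the table below (using the same built-in USArrests dataset):

# Quartiles, min, max, and mean for every column
summary(USArrests)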
     Murder          Assault         UrbanPop          Rape
 Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
 1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
 Median : 7.250   Median :159.0   Median :66.00   Median :20.10
 Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
 3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
 Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
3) Use head() or tail() from base R to inspect the first or last n rows.
Tip: you can specify how many rows to display instead of relying on the default of 6.
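The printout below shows seven rows, so the call was presumably something like this (the n = 7 is my guess based on the output; any number works):

# First 7 rows of USArrests instead of the default 6
head(USArrests, n = 7)
# tail(USArrests, n = 7) would show the last 7 rows instead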
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Podcast Episode Recommendations:
In episodes 51 and 52 of Not So Standard Deviations, Roger Peng and Hilary Parker talk about:
A) why talking to the people who generate data matters (E52: https://bit.ly/2odmosk)
B) whether it's possible to evaluate a data analysis without knowing who the author is (E51: https://bit.ly/2PE8Z8Y)
What's wrong with this?
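For reference, the two lines I'm talking about usually look something like this (a hypothetical example; the path is made up, but the pattern should be familiar):

setwd("C:/Users/me/my_analysis")   # hard-codes a path that only exists on one machine
rm(list = ls())                    # clears the environment, but not loaded packages or options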
First, I want to start by saying it's OK if you've used these two lines at the start of your R scripts! I used to. I didn't know it was bad practice, or that there was a better way. Some online R courses even build these two lines into their lessons! If self-proclaimed data experts see nothing wrong with teaching new students these bad practices, how can students be expected to correct their habits, or even become aware of Projects?
When I first used R, I didn't realize that Projects existed or that they would make my workflow simpler and more effective. My favorite use case: instead of opening new R scripts for different tasks or assignments, integrate Projects into your workflow. Projects let you work in multiple workspaces, with data and variables unique to each project. So, no more figuring out whether a dataframe belongs to task 1 or task 2! The dataframes in each project's global environment are generated by that project's own R scripts. For more information about the additional benefits of Projects, check out the tidyverse article below!
tidyverse.org article: https://bit.ly/2F1YcRl
Credits to Jenny Bryan & Hadley Wickham
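For a concrete feel of the project-oriented alternative, here is a minimal sketch, assuming you have opened an RStudio Project and that a file named data/arrests.csv lives inside it (the file name is made up for illustration):

# Inside a Project, the working directory is the project root,
# so relative paths work on any machine that opens the project.
arrests <- read.csv("data/arrests.csv")

# The here package (also discussed in the linked tidyverse article) builds
# paths from the project root explicitly:
# library(here)
# arrests <- read.csv(here("data", "arrests.csv"))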
When I started graduate school in August 2016, data science was the talk of the town. What is it? This GIF sums it up.
I didn't know where to start. I spent some time learning Python with Treehouse, SQL with Udemy, and Stata in graduate statistics courses, but with an overwhelming number of resources available, I didn't know how to best spend my time between classes and internships. With that in mind, I created a list of blogs, Twitter profiles, and courses in the Resources section of Dataclubb that helped me jumpstart my data journey.
If anyone interested in data asked me today, "Where do I start?", I'd point them here.