How Data Outliers Can Change Conclusions

James Cousins | Senior Statistical Analyst

This year, I set a goal for myself to read 24 books and, unlike some of those hard-to-reach New Year’s resolutions, I stuck with it. So I was thrilled when Goodreads emailed to ask if I’d like to see my “year in review.” Duh! Then it hit me that this is a perfect example of why descriptive statistics can be so fun (yes, I said fun).

For context, I’ve provided summaries below of my last two years of reading. First, I’d like some credit for finally removing Dr. Seuss books from my annual library. Second, most of what you see here is actually standard descriptive statistics! There’s the sum of all pages I read, the number of distinct books those pages came from, my shortest book, my longest book, and the average page count per book.

2016 Goodreads Year in Review

2017 Goodreads Year in Review

These are all routinely reported numbers in statistical analyses. This is how exciting descriptive statistics can be! Once you take a moment and realize that these numbers have real meaning, they become so much more compelling. And for the keen-eyed among you, yes, I still count comic books in my list.
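If you want to compute the same five numbers for your own shelf, here is a minimal Python sketch. The page counts are made up for illustration (they are not my Goodreads data), and the function name `summarize` is just something I picked for this example.

```python
from statistics import mean

def summarize(page_counts):
    """Return five descriptive statistics for a list of per-book page counts."""
    return {
        "total_pages": sum(page_counts),   # sum of all pages read
        "books": len(page_counts),         # number of books (one entry per book)
        "shortest": min(page_counts),      # minimum page count
        "longest": max(page_counts),       # maximum page count
        "average": mean(page_counts),      # mean pages per book
    }

# Hypothetical page counts, one entry per book
example_shelf = [120, 250, 310, 480]
summary = summarize(example_shelf)
print(summary)
```

Nothing fancy is going on here: every one of these is a single built-in call, which is part of why these numbers get reported so routinely.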

But there’s a whole extra layer to this little revelation of mine. Out of curiosity, I looked at last year’s review as well and noticed that in 2016 I had read 8 fewer books, but only 900 fewer pages.

This seemed really bizarre to me because I would expect 8 more books to add up to more than a 900-page difference. Then I realized that, according to Goodreads, my longest book in 2016 was 1,796 pages (The Adventures of Sherlock Holmes). The version I have of the book is only 200 pages. That data discrepancy drove my average page count up significantly for 2016. This is easier to understand with a picture, so I put together a visual representation using Veera:

Tracking my progress from 2016 to 2017, my average book length appeared to drop dramatically. With a little sleuthing and clue gathering, I discovered the incorrect data came from just one book, Sherlock Holmes. After this clever deduction, I removed the extra 1,596 pages from my 2016 total, recalculated my average page count, and saw a more accurate result.
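To make the recalculation concrete, here is a small Python sketch of the same idea: fix the one bad record, then recompute the average. Only the 1,796 and 200 page figures come from my story; the other three books and their page counts are placeholders I invented, and `apply_corrections` is a hypothetical helper name.

```python
from statistics import mean

# Page counts as reported. Only the Sherlock Holmes figure is from the post;
# "Book B" through "Book D" are made-up placeholders.
pages_as_reported = {
    "The Adventures of Sherlock Holmes": 1796,
    "Book B": 300,
    "Book C": 250,
    "Book D": 350,
}

# The page count of the edition actually read
corrections = {"The Adventures of Sherlock Holmes": 200}

def apply_corrections(pages, fixes):
    """Return a new dict with corrected page counts; the input is left unchanged."""
    return {**pages, **fixes}

pages_corrected = apply_corrections(pages_as_reported, corrections)

avg_reported = mean(list(pages_as_reported.values()))
avg_corrected = mean(list(pages_corrected.values()))
pages_removed = sum(pages_as_reported.values()) - sum(pages_corrected.values())

print(f"Average as reported: {avg_reported}")
print(f"Average after fix:   {avg_corrected}")
print(f"Pages removed:       {pages_removed}")
```

Keeping the corrections in their own dictionary, instead of overwriting the reported numbers, leaves a record of exactly what was changed and why.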

Critically, that means my average is 3 pages longer this year, not 81 pages shorter! Correcting the dirty data turns my year-to-year comparison completely around. So cool. This was a harmless conclusion; I was just curious whether I’ve been reading longer books or not. Imagine if this were an organization-critical number, though. Going from thinking your key performance indicator decreased almost 21% to discovering it increased 0.99%? That potentially has groundbreaking implications.
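The percentages here are plain year-over-year percent changes: the difference between the two years divided by the earlier year. A minimal Python sketch follows, with hypothetical round-number averages standing in for my real ones (I chose them only so the direction of the change flips the same way mine did). The name `percent_change` is mine.

```python
def percent_change(previous, current):
    """Percent change from previous to current, relative to previous."""
    if previous == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    return (current - previous) / previous * 100

# Hypothetical average page counts, not my actual Goodreads numbers
change_with_bad_record = percent_change(400, 316)  # earlier year inflated by the bad record
change_after_fix = percent_change(300, 303)        # earlier year after the correction

print(f"With the bad record: {change_with_bad_record:+.2f}%")
print(f"After the fix:       {change_after_fix:+.2f}%")
```

Notice that the correction changes the baseline (the earlier year), so it moves both the difference and the number you divide by.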

Do any of you have Goodreads numbers you want to share? Any stories of dirty data getting discovered, and conclusions changing? I’d love to hear about your exciting stories of the unsung impact of descriptive statistics!