5 Ways To Verify What Your Big Data Is Telling You

This past Monday I watched the Bowl Championship Series title game between the (American-style) football teams from Auburn University and Florida State University.

As is my custom when watching a game between teams for which I have no rooting interest, I spend time looking for statistics to help me decide where to place my loyalty for the next few hours. With pro sports I root for whichever roster has the fewest criminals (accused or convicted) or the fewest league suspensions for performance-enhancing drugs. But with college sports I do things differently: I focus on the schools, not the players.

I’m not talking about statistics for the teams. No, I want to know things like graduation rates and athletic budgets. Having worked in higher education for a handful of years in a previous life, I know which standard metrics are the important ones. As the game got started, that is what I found myself doing: digging up education stats.

Here are the links to the data I was using while watching the game: http://spendingdatabase.knightcommission.org/reports and http://collegecompletion.chronicle.com. I needed to run the reports more than a few times to get the info I was looking for. You can call it data mining if you want, but I think it was more akin to spelunking.

Data Spelunking

I started with graduation rates:

I then thought about tuition costs:

Comparing the two schools directly is not necessarily fair. You need to make certain you put your data into context. So I decided to learn more about these schools in relation to their conferences:

Next, the spending for a scholarship football player:

Now we are getting somewhere. What about the two schools playing tonight? This is how they fare against their conferences:

Wow. Auburn is way above the SEC average (167% more), while FSU is spending 116% more than other schools in the ACC. That alone is telling. And here is where we get to the most important distinction between these two schools:
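If you want to check that kind of arithmetic yourself, it is easy to reproduce. Here is a minimal sketch in Python; the dollar figures are hypothetical placeholders chosen to land near the percentages above, not the actual Knight Commission numbers:

```python
# Minimal sketch of the "percent above conference average" arithmetic.
# All dollar figures are hypothetical placeholders, not real data.
spending = {
    "Auburn": 160_000,       # hypothetical per-athlete spending
    "SEC average": 60_000,   # hypothetical conference average
    "FSU": 92_000,
    "ACC average": 42_600,
}

def pct_above(school: float, average: float) -> float:
    """Percent by which a school's spending exceeds its conference average."""
    return (school / average - 1) * 100

print(f"Auburn vs. SEC: {pct_above(spending['Auburn'], spending['SEC average']):.0f}% above average")  # 167%
print(f"FSU vs. ACC: {pct_above(spending['FSU'], spending['ACC average']):.0f}% above average")        # 116%
```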

Yeah, that’s right. It is very clear to me that the SEC, and particularly Auburn, is focused on athletic success much more than academic success.

What I loved about this (besides how much it enhanced the game for me, especially when Auburn lost) is how easy it was to get the data, draw a conclusion, arrive at a new question, and then go get more data to see if my conclusions still held. It was self-service BI, and it was awesome. OK, maybe it was self-service Sports Intelligence, but it was still awesome.

Correlation versus Causation

An easier way for you (or anyone) to get started with self-service BI is Google Trends, which shows the level of interest in different search terms over time. You can use this data to help draw some conclusions as well. For example, let’s look at the trends for the terms “HD DVD” and “Blu Ray”:

[Google Trends chart: “HD DVD” vs. “Blu Ray”]
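If you would rather pull those numbers programmatically than through the web page, here is a minimal sketch using the unofficial pytrends package (an assumption on my part; Google offers no official Trends API, so this can change or get rate-limited):

```python
# Minimal sketch using the unofficial pytrends package (pip install pytrends).
# Google offers no official Trends API, so treat this as illustrative only.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(["HD DVD", "Blu Ray"], timeframe="2004-01-01 2013-12-31")

df = pytrends.interest_over_time()  # weekly interest, scaled 0-100
print(df[["HD DVD", "Blu Ray"]].tail())
```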

In February of 2008, interest in those two search terms was nearly identical. By Christmas of 2008, Blu-Ray was clearly in the lead, and it has not looked back since. You might say that Blu-Ray won the battle between the two formats. I think you’d be wrong, though. As this article states, the future is in streaming, not in which player you have. Does that mean the data returned is lying to us? No, it’s not lying. We were drawing a conclusion from the data without understanding the additional context that surrounded it. Let’s look at another example:

[Google Trends chart: “Large Hadron Collider” vs. “Sarah Palin”]

Some people believed that bringing the Large Hadron Collider online at CERN would cause the end of the Earth. The graph above shows that at about the time the LHC came online, Sarah Palin was introduced as the running mate to Senator John McCain.

Case closed, I’d say. The LHC has doomed us all.

Again, our data isn’t lying to us here. Our analysis and our conclusion are wrong. This is the classic “correlation does not imply causation” warning we’ve all heard before. The increase in consumption of organic fruit is not responsible for the increase in autism diagnoses, and the decline in Internet Explorer’s market share did not cause the murder rate to fall, no matter how closely the two lines track each other.
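To make the warning concrete, here is a minimal sketch showing how strongly two unrelated, made-up series can correlate:

```python
# Two fabricated series that both happen to trend downward over time.
# They correlate strongly, yet neither one causes the other.
import numpy as np
import pandas as pd

years = np.arange(2006, 2014)
ie_share = np.array([80, 75, 68, 62, 55, 48, 40, 33])              # illustrative, not real data
murder_rate = np.array([5.8, 5.7, 5.4, 5.0, 4.8, 4.7, 4.7, 4.5])   # illustrative, not real data

df = pd.DataFrame({"ie_share": ie_share, "murder_rate": murder_rate}, index=years)
r = df["ie_share"].corr(df["murder_rate"])  # Pearson correlation coefficient
print(f"correlation: {r:.2f}")  # close to 1.0, and still not causation
```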

Confusing correlation and causation seems to be the most common error when dealing with data. Self-service BI is great until you put it into the hands of people who don’t understand the underlying nature of data or statistics, or the quality of the data.

Data analytics needs to be done by someone you trust who also knows what they are doing. Just because people know what they are doing doesn’t mean you should trust them, and just because you trust someone doesn’t mean they know what they are doing.

Lots of people don’t know what they are doing, so while they have good intentions, their analysis is faulty.

So, what can you do to verify that the analysis is correct? How do you check their work? Tools like Master Data Services and Data Quality Services only go so far. Remember, it is not necessarily the data that is the issue; it is the human interpreting the data.

5 Ways To Verify What Your Big Data Is Telling You

Here are five questions you can ask to help verify the conclusions being drawn from your big data:

1. Where did the data come from?
You want to know if the data came from a reliable source. A data set from Data.gov or the National Weather Service may be considered more trustworthy than Fred’s Data Mart. You may really like Fred, but you may want to get market data from Bloomberg instead.

2. What was done to the data after it was obtained?
Was it scrubbed by an internal ETL process, perhaps something similar to Data Quality Services? Or is the data you are discussing the same raw data that arrived? When I looked at graduation rates I was told that not all undergraduate students are included. Part-time freshmen, for example, are excluded from the 4-year graduation rates. This may be an important factor, depending on the questions you have, and is something you would want to know; the sketch below shows how much an exclusion like that can move the metric.
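Here is a minimal sketch, with fabricated cohort numbers:

```python
# Fabricated cohort showing how excluding part-time freshmen moves the metric.
cohort = {
    "full_time": {"enrolled": 4000, "graduated_in_4": 2600},
    "part_time": {"enrolled": 1000, "graduated_in_4": 150},
}

def grad_rate(groups):
    """4-year graduation rate, in percent, across the given groups."""
    enrolled = sum(g["enrolled"] for g in groups)
    graduated = sum(g["graduated_in_4"] for g in groups)
    return graduated / enrolled * 100

print(f"full-time only: {grad_rate([cohort['full_time']]):.1f}%")  # 65.0%
print(f"all students: {grad_rate(cohort.values()):.1f}%")          # 55.0%
```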

3. What else is known about the data?
What metadata came with your data? Make certain you know what each data point is meant to represent. The easiest thing to overlook here is units of measure; even NASA gets this wrong from time to time. Knowing who has curated the data is vital as well. Have they manipulated it in any way prior to sending it? The organization that curates the data (for example, the FBI or NSA) may have its own agenda. As a result, the data set they offer may not be as complete as you might like. One cheap defense on the units front is sketched below.
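Here is a minimal sketch of that defense: carry the unit along with the value instead of assuming it. The field names and figures are hypothetical; the NASA incident I have in mind is the Mars Climate Orbiter, lost in 1999 to a pound-force versus newton mix-up:

```python
# Carry units with values instead of assuming them. NASA's Mars Climate
# Orbiter was lost in 1999 because one team supplied pound-force seconds
# where another expected newton seconds.
LBF_S_TO_N_S = 4.44822  # 1 pound-force second expressed in newton seconds

def to_newton_seconds(value: float, unit: str) -> float:
    """Normalize an impulse reading to newton seconds, or fail loudly."""
    if unit == "N*s":
        return value
    if unit == "lbf*s":
        return value * LBF_S_TO_N_S
    raise ValueError(f"unknown unit: {unit}")

reading = {"impulse": 10.0, "unit": "lbf*s"}  # hypothetical data point with its metadata
print(to_newton_seconds(reading["impulse"], reading["unit"]))  # 44.4822
```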

4. How old is the data?
You’ll want to know if the data you are analyzing is relevant for the time period you are investigating. Just ask any sales person who does cold calling how important it is to have current data.

5. Has the data been leveled?
This is the apples-to-apples comparison you need. It helps to frame the discussion in a way that gives others an “a-ha!” moment. The example above, where Auburn spends 31 times more on a scholarship football player than on an average student, puts their spending in perspective against other schools, including Florida State. It helps everyone understand the value Auburn places on athletics over academics. (For fun, read about what fellow SEC school Florida is doing with their Computer Science program and then go check what they spend on their athletes.)
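Here is a minimal sketch of that leveling arithmetic, with hypothetical dollar amounts:

```python
# Minimal sketch of "leveling" to a common basis: dollars spent per person.
# The dollar amounts are hypothetical placeholders, not the real figures.
spend_per_football_player = 310_000  # hypothetical athletic spending per player
spend_per_student = 10_000           # hypothetical academic spending per student

ratio = spend_per_football_player / spend_per_student
print(f"spending per football player is {ratio:.0f}x spending per student")  # 31x
```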

At the PASS BA Conference last April, Steven Levitt mentioned that there was (and is) a dearth of good people who can analyze data properly. This holds true for many professions, and it is one of the skills most needed in today’s business world. Based upon the troubleshooting skills (or lack thereof) I see these days, I’d argue it is a skill lacking in many server administrators and database administrators.

Everyone needs to be able to analyze data properly. Everyone.

Do yourself a favor and start learning now, before you find yourself murdered by an autistic Internet Explorer that has been eating nothing but organic fruit.

10 thoughts on “5 Ways To Verify What Your Big Data Is Telling You”

  1. I might also ask the data:
    - Where have you been? Running data through processes can change it. Datatypes can truncate data, bad application code can hurt it, etc.
    - How much data is this finding based on? Statistical anomalies need to be understood.
    - Was the dataset cherry-picked? Scott Ambler likes to throw out data with “I don’t know” responses in his surveys, under the guise of it not being applicable. That’s a huge change to the dataset.

    I picture data sitting on a chair with a hot spot light focused on its forehead (headers?) being interrogated by a good cop and a bad cop.

  2. This is a really nice post! Great job sharing your approach and some of the obstacles with data-driven conclusions.
