SQL 2008 Data Mining

I often hear people use the terms “Data Mining” and “Data Analysis” as if they are interchangeable. To me, at least, they mean two very distinct things. Hopefully this post will help to explain what I believe to be the difference in their meanings.

So, let’s say you are opening up a mine. First things first, do you know what you are mining for? If it were a real mine, say an old-fashioned gold mine, then the answer is yes, you do know what you are looking for. You are looking for all the yellow rocks you can find. Data mining is different in the sense that you have no idea what you are looking for, but you will know it when you find it. Make sense? No?

Think of it this way. You know you have data, you can see rows and columns and numbers and letters. It’s all right there in front of you. But what does it all mean? Well, like a gold mine, you would have all sorts of dirt and rock around you as you tunnel into a mountain or along a stream. With a gold mine you have an idea of what the final product is like, but you are surrounded by tons of material that is fairly useless. Now, let’s say I told you to go into the gold mine and keep digging until you found something unusual. You come back out with a few shiny rocks, some pink, some white, some clear, and some yellow.

Yeah, it kinda feels like this...

Why those rocks? Well, because they were unusual, they stood out from the rest of the rocks you were breaking apart. It was easy to see that something was different about them. Data mining is the exact same thing, you go looking for pieces of data (rock) that stand out from the rest. You may not know what you are looking for, except to say that you are looking for something unusual. Often times people will say they are trying to find hidden patterns in the data, but it could also be the case that finding no pattern can be just as useful. In short, you are looking for those rows and columns that seem to be slightly different than the rest.

So, your mine is open for business, you are excavating tons of rocks, and a bunch of them are yellow. But how do you know they are gold? Well, someone needs to do some analysis on those rocks to verify what they are. So, if you are mining for gold, the steps would be as follows:

  1. Go digging through dirt and rocks.
  2. Pull out the ones that look different.
  3. Tell someone which pile of rocks is the one you think contains gold.
  4. Have someone analyze the rocks in the pile to confirm.
  5. Profit.

With data mining the same principles apply.

  1. Go digging through your data.
  2. Pull out the data that looks different.
  3. Propose a hypothesis about why the data is different.
  4. Perform some analysis on a larger set of data to see if your hypothesis is valid.
  5. Profit.

Now, I am simplifying things here greatly. I am not talking about confidence intervals, or different statistical models, or anything in depth. I am just trying to explain that there is a difference in the mining of data versus the analysis of data. To me, at least, the analysis phase comes later, after you have formed a hypothesis, which is brought about by some mining.

I have seen some people describe data mining as a particular data analysis technique. I suppose that is true as well. For example, you wouldn’t start mining for gold unless you already had a hypothesis that there was gold in ‘dem ‘thar hills. That is also a fair definition, but it does highlight that the mining is still something not entirely equal to the analysis.

They are different tools, and different methods, but with a common purpose. In the case of Business Intelligence it is to get the right information, to the right people, at the right time, so they can make the right decisions.

Data Mining Tools

Microsoft has two different flavors of data mining tools that you can add easily into your office products, specifically for use with Excel.

If you have access to an instance of SSAS at all times, then go ahead and download the tools found here:

http://www.microsoft.com/downloads/details.aspx?FamilyID=7C76E8DF-8674-4C3B-A99B-55B17F3C4C51&displaylang=en

And for those times when you are not connected to an actual SSAS instance, but are connected to the intertubz, then you will want to download this set of tools:

http://www.sqlserverdatamining.com/cloud/

Feel free to load both of them if you want, they can be run side by side. Why would you want to do this? That’s a great question. Let me explain by continuing the gold mine analogy.

So, if you were serious about going to mine for gold, would you grab a pick-axe and make way for the Yukon Trail? Probably not. And even for those people that did do exactly that during the various Gold Rush eras in history, they only did so because someone else had already found some gold. In other words, the preliminary exploration had been done, gold was discovered, word spread, and before you know it we have enough Yukon Jack to last a lifetime.

If you were going to open a gold mine today you would look to do some preliminary exploration as well before you ever broke ground somewhere. Think of the tools that you download as exactly that: tools that help you (or others) do some preliminary exploration. In this case, exploration of data, no dirt. The tools help to highlight chunks of data that you may find to be of some interest.

So, the tools help you with some preliminary exploration, at which point you can really start to mine for more information, then you form a hypothesis, perform some analysis, and a-bing, a-bang, a-bung you are on your way!

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.