Data is the new oil.

Does the weather affect your sales, either by location or by season?

Maybe it does, maybe not. The only way to know for certain is to collect the data. But collecting such data may prove to be more expensive than it is worth. After all, you can’t control the weather.

How about happiness and stock prices? What if you could show that the stock market goes up when people are happy, and nosedives when we are miserable? You don’t control happiness, but you can control the timing of your purchases and sales of stocks. How much would *that* data be worth to you?

Done right, it could be the same amount that Rockefeller made off of the Pennsylvania oil fields in the mid-nineteenth century. That’s how much.

And that’s why data is the new oil.

The title of this post is a direct quote from the book *Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die* by Eric Siegel. I found the book to be a well-written overview of the field of predictive analytics (PA). You can find the book in my bookshelf, but I also wanted to share some thoughts on the book itself.

First, it’s an overview of PA as a whole, so if you are looking for a deep technical book with samples and examples, this isn’t the one for you. And yet, if you read this book, you do come away with a feeling that you could go off and start doing PA for yourself. All you need is some data, some math, and some logical reasoning. I enjoy books that are engaging, and this one certainly was for me.

The book does a wonderful job of weaving in stories about PA use cases that are familiar to many. Examples include retail giant Target’s effort to predict which customers are likely pregnant, the creation of the Netflix Prize, and IBM’s creation of the Watson supercomputer. Each story helps to reinforce a specific point or challenge about PA.

With Target, the issue was one of ethics and privacy. With all of this data available, it can be possible to determine which customers may be pregnant, but does that mean you should send them coupons for baby-related products? There is also work being done with PA to help police try to predict crime. Even Hewlett-Packard is using PA to determine if employees are likely to quit, but is that data that should be made known to everyone, to just a select few, or simply not known at all? PA is not supposed to invade privacy. In fact, much of the data is supposed to remain free of personal identification. The idea is to build a predictive model that works for any resulting data set.

The creation of the Netflix Prize serves as an excellent example of how having multiple predictive models working together can boost accuracy over a single model, something Siegel refers to as the Ensemble Effect. This also leads to a discussion of crowdsourcing in general, and the book shows examples where crowdsourcing works only up to a point, after which you get diminishing returns (*COUGH* Wikipedia *COUGH*).
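
If you want a feel for why combining models can help, here is a toy sketch of my own. It is not from the book and it is not how the Netflix Prize teams blended their models. It assumes something unrealistic, namely that every model is right with the same probability and that the models make their mistakes independently, and it then computes how often a simple majority vote would land on the right answer. The function name `majority_vote_accuracy` and the 60% figure are made up for illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Chance that a strict majority of n_models is right.

    Assumes (unrealistically) that each model is correct with the same
    probability p_correct and that the models err independently.
    """
    if n_models < 1 or n_models % 2 == 0:
        raise ValueError("use a positive, odd number of models to avoid ties")
    needed = n_models // 2 + 1  # smallest number of votes that forms a majority
    # Sum the binomial probabilities of getting `needed` or more correct votes.
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


if __name__ == "__main__":
    # Each hypothetical model is right only 60% of the time on its own.
    for n in (1, 3, 5, 11, 51):
        print(f"{n:>2} models -> {majority_vote_accuracy(n, 0.6):.3f}")
```

With these assumptions, three models at 60% each vote their way to roughly 65%, and the number keeps climbing as you add more. Real models tend to make correlated mistakes, so the gain in practice is smaller, which fits the diminishing-returns point above.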

IBM’s Watson computer is a triumph of predictive analytics. All of the work that went into Watson is astounding. The end result is a machine (essentially an Ensemble of Ensembles) that is able to come back with a textual reply that is most likely the answer to a written question. Well, this being Jeopardy, it’s really the question to a written answer (which makes it even more difficult). Watson even has its own Twitter account, where you can find out that it is answering questions you didn’t even think to ask.

Lastly, I’ve talked before about how PA can’t predict the future. I think PA has a poorly chosen name because businesses falsely believe that they can use PA to predict future events. Siegel does a great job in this book of explaining the same idea: PA doesn’t predict the future; PA gives you data-driven insight into trends that may (or may not) be useful for you to act upon.

I highly recommend this book for any data geek who is eager to learn more about PA, predictive models, machine learning, and how to effectively use data to make better decisions.