26 Jul Why I’m Learning Data Science
To be fair, it is more a case of me re-learning data science, as the concepts are familiar already. With an MS in Mathematics, I have dabbled in statistics for more than 30 years now. So when Microsoft announced they were partnering with edX to offer a certificate in Data Science, I decided it was the perfect time to dust off my Z-tables and get back to my roots. Launched just over a year ago, the online program allows anyone to take classes for free. There is an option to pay individual course fees of $49 and $99 to earn a verified certificate. With ten courses in total, your final cost is about $540. That’s more than a fair price for this content.
The quality and quantity of the online content were a perfect mix for me. Besides the math and statistics you would expect, there was also PowerBI, Excel, R, Python, as well as SQL Server. Here’s a partial list of topics:
- Query relational data using T-SQL
- Analyze and visualize data using PowerBI (or Excel, if preferred)
- Understanding statistics
- Exploring data with code (using R or Python)
- Understanding core data science concepts
- Principles of machine learning
- Using code to manipulate and model data (again, using R or Python)
- Applied machine learning
For a data geek like myself, this was heaven. You can see a full list of the courses here: https://academy.microsoft.com/en-us/tracks/data-science
One of the course highlights for me was finding Wayne Winston as an instructor. Imagine being able to learn statistics from Wayne Winston! It costs thousands of dollars to attend Indiana University, where he is a professor. His books cost money, too. But in this edX course THIS KNOWLEDGE CAN BE YOURS FOR FREE.
I started the courses towards the end of 2016, but in February I made it a priority to get them all done. I finished all but the final project by mid-April. The final project is offered every three months, and I missed the April deadline, so I had to wait until July to try again. Last week, as I returned from a family beach vacation, I pushed aside everything on my schedule to work on the final project. When I woke last Friday, my certificate was waiting for me.
Honestly, this certificate means more to me than my SQL Server MCM. However, I liked the structure of this course so much that I wish Microsoft would construct something similar for SQL Server and do a reboot of the MCM program. (If anyone from Microsoft Learning is reading this, email me, I’d love to help.)
So, besides this coursework being a way for me to turn back the clock, why would I want to spend the time to learn data science? Let’s break it down.
The traditional role of the DBA is being automated away, right in front of our keyboards. It’s easy to throw hardware at a query, or a database, and make things run faster. Platforms such as CosmosDB are the beginning of the end for DBAs, as the machines get closer to automating the job away.
It won’t be long before fiscally minded people use cloud platforms as a gauge for DBA salaries. When companies understand that the systems can tune themselves, it’s going to be harder to earn a dollar tuning queries. Sure, there might be a need for a sysadmin to help configure that system, but the pool of DBAs is shrinking, not growing.
The reason for this trend is easy to comprehend when you understand that computers are only good at providing answers. It is up to us to ask the questions. Humans are better at understanding if the answers make sense. There is a dearth of people in this world who can analyze data well. Data science and analytics is a growth field. Data administration is not. Hitch your career path to something on the rise, not to something that can be replaced by a handful of PowerShell scripts.
We know that the world of tech moves quickly, and just gets faster. In the world of data science, there are new tools and integrations introduced weekly. The acceleration of new tools to the market is a good thing. As new tools come on the market it becomes easier for everyone to have access to insights that data will bring. As an example, after my project was complete Buck Woody (blog | @buckwoodymsft) told me about XGBoost. I didn’t know it existed, and now I can’t wait to see if that will help make my predictive model even better.
Getting tools to the masses so that everyone can work with data has ancillary benefits as well. If everyone practiced or learned data science, we would be building systems that treated data like a first-class citizen. Right now, data gets overlooked as a critical asset. But data lasts longer than code. It’s about time we treat data as the most critical asset your company owns, because most companies don’t until it is too late.
The volume and velocity of data available today can make the simplest data science project difficult. The result is that data cleansing is 93.7% of any data science project. Nobody goes to school to become a data janitor. If you decide to dive into data science you must understand how data gets cleaned. As a DBA for many years the idea of replacing missing data with zeros seemed…well…flat out wrong. Until I saw my Root Mean Square Error drop down into the .23 range, and then those zeros didn’t matter at all to me.
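For anyone who hasn't met these two ideas before, here is a minimal sketch in Python of what "replace missing data with zeros" and "Root Mean Square Error" mean. This is not the code from my project: the function names `fill_missing_with_zero` and `rmse` are hypothetical helpers, and every number below is made up purely for illustration.

```python
import math


def fill_missing_with_zero(values):
    """Return a copy of values with missing entries (None) replaced by 0.0."""
    return [0.0 if v is None else float(v) for v in values]


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up feature column with gaps (None stands in for a missing value).
raw_feature = [1.5, None, 3.0, None]
clean_feature = fill_missing_with_zero(raw_feature)

# Made-up targets and model predictions, only to show how the score is computed.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 2.0, 8.0]
score = rmse(actual, predicted)
print(clean_feature, round(score, 4))
```

A lower RMSE means the predictions sit closer to the actual values. Whether zero is a sensible stand-in for a missing value depends on the column, so treat this as a starting point to test against your own model rather than a rule.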
Speaking of that, I loved trying to tune my model and improve upon my final score. Data science isn’t just sexy, it’s addictive. I spent far too much time trying to work my way to the top of the leaderboard even after my grade was complete. Part of that was my general competitive nature, but it was also part of the learning process. Another part of the process is this:
The final course also required us to submit a report, and then to peer review the reports of other students. Reading those reports, I found myself thinking about what I could have done differently to approach a solution. Reading the reviews of my own report, I also got a sense of the collaborative nature of the project. For me, data science invokes an almost philosophical discussion of the data and the problem we are trying to solve. The focus is more on the process, what the data was telling us, and the outcome. And it all appeared to be far less combative than what you would find in any forum on query tuning.
Getting back on the data science train is a smart move for anyone these days. I’m not saying you need to quit your job. What I *am* saying is that you should look to augment your current job with some data science skills. Microsoft has shifted their data platform offerings for a reason.
And as any good Microsoft MVP knows, it’s not a bad thing to keep pace with trends as Microsoft shifts.
For me, that area is data science.