Data Analytics Archives - Thomas LaRock
https://thomaslarock.com/category/data-analytics/
Thomas LaRock is an author, speaker, data expert, and SQLRockstar. He helps people connect, learn, and share. Along the way he solves data problems, too.

Microsoft Fabric is the New Office
https://thomaslarock.com/2024/07/microsoft-fabric-is-the-new-office/
Thu, 11 Jul 2024

At Microsoft Build in 2023, the world first heard about a new offering from Microsoft called Microsoft Fabric. Reactions to the announcement ranged from “meh” to “what is this?” To be fair, this is the typical reaction most people have when you talk data with them.

Many of us had no idea what to make of Fabric. To me, it seemed as if Microsoft was doing a rebranding of sorts, changing the name of Azure Synapse Analytics (also called a Dedicated SQL Pool, and previously known as Azure SQL Data Warehouse). Microsoft excels (ha!) at renaming products every 18 months, keeping customers guessing whether anyone is actually leading product marketing.

Microsoft Fabric also came with this thing called OneLake, a place for all your company data. Folks with an eye on data security, privacy, and governance thought OneLake was madness; combining all your company data into one big bucket seemed like a lot of administrative overhead. But OneLake also offers a way to separate storage and compute, allowing for greater scalability. This is a must-have when you are competing with companies like Databricks and Snowflake, and with other cloud service providers such as AWS and Google.

After Some Thought…

After the dust had settled and time passed, the launch and concept of Fabric started to make more sense. For the past 15+ years, Microsoft has been building the individual pieces of Fabric. Here’s a handful of features and services Fabric contains:

  • Data Warehouse/Lakehouse – the storing of large volumes of structured and unstructured data in OneLake, which separates storage and compute
  • Real-time analytics – the ability to stream data into OneLake, or pull data from external sources such as Snowflake
  • Data Engineering – the ability to extract, load, and transform data, including the use of notebooks
  • Data Science – leverage machine learning to gain insights from your data
  • Power BI – create interactive reports and dashboards

Many of these services were built to support traditional data storage, retrieval, and analytical processing. This type of data processing focuses on data at rest, as opposed to streaming event data. That is not to say you couldn’t use these services for streaming; you could try if you wanted. After all, the building blocks for real-time analytics go back to SQL Server 2008, with the release of StreamInsight, a fancy way to build pipelines for refreshing dashboards with up-to-date data.

Streaming event data is where the real data race is taking place today. According to IDC, by 2025 nearly 30% of all data will need real-time processing. This is the market Microsoft, among others, is targeting, a market roughly 54 ZB in size.

So, it seems the more data collected, the more likely it is to be used for real-time processing. Therefore, if you are a cloud company, it is rather important to your bottom line to find a way to make it easy for your customers to store their data in your cloud. The next best thing, of course, is making it easy for your customers to use your tools and services to work with data stored elsewhere. This is part of the brilliance of Fabric, as it allows easy access to real-time data you are already using in places like Databricks, Confluent, and Snowflake.

The Bundle

Now, if you are Microsoft, with a handful of data services ready to meet the needs of a growing market, you have some choices to make. You could continue to do what you have done for 15+ years, selling individual products and services and hoping you earn some of the market going forward. Or you could bundle the products and services, unifying them into one platform, and make it easy for users to ingest, transform, analyze, and report on their data.

Well, if you want to gain market share, bundling makes the most sense. And Microsoft is uniquely positioned to pull this off for two reasons. First, they have a comprehensive data platform which is second to none. Sure, you can point to other companies who might do one of those services better, but there is no company on Earth, or in the Cloud, which offers a complete end-to-end data platform like Fabric.

Second, bundling software is something Microsoft has a history of doing, and doing quite well in some cases. People reading this post in 2024 may not be old enough to recall a time when you purchased individual software products like Excel and Word. But I do recall the time before Microsoft Office existed. Bundling everything into Fabric allows users to work with their data anywhere and, most importantly to Microsoft’s bottom line, the result is more data flowing to Azure servers.

I am not here to tell you everything is perfect with Fabric. In the past year I have seen a handful of negative comments about Fabric, most of them nitpicking about things like brand names, data type support, and file formats. There is always going to be a person upset about how Widget X isn’t the Most Perfect Thing For Them at This Moment and They Need to Tell the World. I think most people believe that when a product is released, even if it is marked as “Preview”, it should be able to meet the demands of every possible user. That is just not practical.

Summary

At Build this year, Microsoft announced Fabric is now generally available (GA), which also leads users to believe it should meet the demands of every possible user. The fastest way for Microsoft to grab as much market share as possible is to focus on the customer experience and remove those barriers. You can find roadmap details here, giving you an idea about the effort going on behind the scenes with Fabric today. For example, for everyone who has raised issues with security and governance, you can see the list of what has shipped and what is planned here.

It is clear Microsoft is investing in Fabric, much like they invested in Office 30+ years ago. If there is one thing Microsoft knows how to do, it is creating value for shareholders.

Since the announcement of Fabric last May, Microsoft stock is up over 25%. I am not going to say the increase is the direct result of Fabric. What I am saying is Microsoft might have an idea about what they are doing, and why.

Microsoft Fabric is the new Office – a bundle of data products meant to boost productivity for data professionals and dominate the data analytics landscape, much in the same way Office dominates the business world.

Book Review: The AI Playbook
https://thomaslarock.com/2024/02/book-review-the-ai-playbook/
Tue, 27 Feb 2024

Imagine you conceive an idea that will save your company millions of dollars, reduce workplace injuries, and increase sales. Now imagine company executives dislike the idea because it seems difficult to implement, and the implementation details are not well understood. Despite the stated benefits of saving money, reducing injuries, and increasing sales, your idea hits a brick wall and falls flat.

Welcome to the world of artificial intelligence (AI) and machine learning (ML), where the struggle is real.

At some point in your career, you have experienced a failed project. If not, don’t worry, you will. Projects fail for all sorts of reasons. Unclear objectives. Unrealistic expectations. Poor planning. Lack of resources. Scope creep. Just to name a few of the more common reasons.

When it comes to projects with AI/ML at the core, all those same reasons apply, plus a few new ones. AI/ML is perhaps the most important piece of general-purpose technology today, which means we are bombarded with AI/ML solutions to solve random or ill-defined problems in much the same way we are bombarded by blockchain solutions for tracking fruit trucks or visiting the dentist.

The overhype of AI/ML has left people skeptical regarding the promises made through project proposals. Even if you manage to get a project funded, the initial results produced by your model may be difficult to explain, leading to apprehension about deploying solutions which cannot be understood. Nobody wants to blindly follow the decisions and predictions produced by machine learning models no one understands.

It is clear the business world needs a way to build, deploy, and maintain AI/ML models in a consistent manner, with a higher rate of success than failure, on time, and within budget.

bizML

Thankfully, there exists a modern approach to AI/ML projects. It is called bizML, and it is the core subject inside the new book by Dr. Eric Siegel – The AI Playbook.

For any project, not just AI/ML projects, to succeed there must be a rigorous and systematic approach for real-world deployments. Every successful project has similar characteristics – measurable goals, stakeholder involvement, risk management, resource allocation, fighting scope creep, effective communication, and monitoring project progress before, during, and after deployment.

The AI Playbook breaks this down into digestible sections for anyone with business experience to understand. It outlines bizML as a six-step process for guiding AI/ML projects from conception to deployment: define, measure, act, learn, iterate, and deploy. Using stories from familiar companies such as UPS, FICO, and various dot-coms, Dr. Siegel leans on his experience to help the reader understand how and why even the best ideas often fail.

I don’t want to give away the surprise ending, so I will just say the real secret behind bizML is starting with the end state in mind. Many projects fail because stakeholders are not aligned on the reality of deployment versus their expectations. bizML attempts to remove this roadblock by getting everyone aligned on what the end state will look like, and then building towards the agreed-upon state.

I read through the book in less than a couple of days, absorbing the material as fast as possible. The use of personal stories made it easier to read than a purely technical book focused on code and examples. I cannot emphasize enough that this book is not a technical manual, but a business guide for business professionals, executives, managers, consultants, and anyone else wanting to learn how to capitalize on AI/ML tech and collaborate with data professionals.

Summary

As AI/ML solutions continue to gain traction in the market, this book provides the right framework (bizML) for successful AI/ML deployments at the right time. Anyone, or any company, looking to deploy (or that has already deployed) AI/ML projects should buy copies of this book for all stakeholders.

I’m putting this onto my bookshelf and 15/10 would recommend.

Export to CSV in Azure ML Studio
https://thomaslarock.com/2024/01/export-to-csv-in-azure-ml-studio/
Wed, 17 Jan 2024

The most popular feature in any application is an easy-to-find button saying “Export to CSV.” If this button is not visibly available, a simple right-click of your mouse should present such an option. You really should not be forced to spend any additional time on this Earth looking for a way to export your data to a CSV file.

Well, in Azure ML Studio, exporting to a CSV file should be simple, but is not, unless you already know what you are doing and where to look. I was reminded of this recently, and decided to write a quick post in case a person new to ML Studio was wondering how to export data to a CSV file.

When you are working inside the ML Studio designer, it is likely you will want to export data or outputs from time to time. If you are starting from a blank template, the designer does not make it easy for you to know what module you need (similar to my last post on finding sample data). It would be great if Copilot were available!

Now, if you are similar to 99% of data professionals in the world, you will navigate to the section named Data Input and Output, because that’s what you are trying to do: export data from the designer. The description even says “Writes a dataset to…”, so it seems very clear what will happen.

So, using the IMDB sample data, we add a module to select all columns, then attach the module to the Export Data module. So easy!

When you attach it, you need to configure some details for the module. Again, so easy!

We save our configuration options and submit the job to run. When the job is complete, we navigate to view the dataset.

Uh-oh, I was expecting a different set of options here. Viewing the log and various outputs does not reveal any CSV file either. Maybe I need to choose the select columns module:

Ah, that’s better.

Except it isn’t. Instead of showing me the location of the expected CSV file, what I find is this:

I can preview the data from the select columns module, but there isn’t a way to access the CSV file I was expecting. I suspect this export module is really meant to pass data between pipelines or services. But the purpose and description of the export module is not clear, and a novice user would be unhappy to head down this path only to be disappointed and frustrated.

What we really want to use here is the Convert to CSV module:

Viewing the results will display this:

Which has what we are looking for, a download button:

Selecting Download will either default to your browser settings, or you can do a Save As.

As I wrote at the beginning of this post, exporting to a CSV file from within Azure ML Studio is easy to do, if you already know what you are doing. If you are new to Azure ML Studio, you may find yourself frustrated if you expect the Export Data module to produce a CSV file. You will want to use the Convert to CSV module instead.

You Can’t Marry Your Database, But You Can Have Relations
https://thomaslarock.com/2021/02/you-cant-marry-your-database-but-you-can-have-relations/
Mon, 01 Feb 2021

There’s something you should know about relational databases.

They were designed to store data efficiently, protecting the quality of the data written and stored to disk. I’ve written before about relational engines favoring data quality and integrity, and how relational databases were not designed for the reading of data.

Of course, if you are going through the trouble of writing the data into a relational database, it makes sense that you would want to retrieve the data at some point. Otherwise, why go through the exercise of storing the data inside the database?

The trouble with reading data from a relational database is that the data is not stored in a format friendly for viewing, reading, or retrieving. That’s why we have data professionals, like me, to help you write queries that return the correct data, in the correct format, for you to analyze.

I’m here today to tell you we’ve been doing data wrong the whole damn time.

Let me show you what I mean.

Traditional Data Flow Patterns

Here’s what companies do, every day:

Step 1 – Identify useful data

Step 2 – Import that data into a database

Step 3 – Analyze the data

Step 4 – Export the data into dashboards

Step 5 – Profit (maybe)

The trouble with this process is Step 3, the analyzing of the data. Relational databases were not designed for analytical processing. Relational databases do not store data in a way that is readable, or friendly, for human analysis.

That’s not to say you can’t do analytics inside of a relational database. What I am saying is that it could be better for you not to spin the CPU cycles there, and instead do the analytics somewhere else.

For example, data warehouses help with data storage, retrieval, and analytics. But even a data warehouse can fall short when it comes to the use of unstructured data sources. As a result, we’ve spent decades building ETL processes to curate, collate, consolidate, and consume data.

And we’ve been doing it wrong.

Data Mining for the Data Expert

So, we understand that people find data, store it, and try to use it later. They are engaging in the process of data mining, hoping to find gold in the form of insights, leading to better business decisions.

But as I mentioned before, the data isn’t in a readable format. Let’s look at an example.

Following the common data flow pattern, I found some useful data at NFL Savant https://nflsavant.com/about.php, and imported the data into a SQL Server database:
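The import itself can be done any number of ways. Here is a minimal sketch using BULK INSERT, assuming SQL Server 2017 or later, a target table that already exists, and a hypothetical file path and table name:

```sql
-- A sketch only: the file path and table name are hypothetical
BULK INSERT dbo.pbp_2018
FROM 'C:\Data\pbp-2018.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2);  -- FIRSTROW = 2 skips the header row
```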

It looks like any other table in a relational database. In this case, it is a table containing time-series data pertaining to play-by-play records for the 2018 NFL season. Each row represents an entity (a play at a point in time of an NFL game), and the columns represent attributes of the play (down, distance, yards to go, etc.).

Nothing out of place here, this is how data is written to a relational database. In an orderly fashion. As a DBA, I love this type of orderly storage. It’s efficient, and efficient is good.

As a data analyst, I’m not a fan. At least, not yet. I have a bunch of data, but what I want are some answers. So, it’s up to me to ask some questions of the data, find some answers, and use that to help make better business decisions.

For this data, here’s an example of a simple question: What are the average yards to go for NFL teams in 2018? I can get that answer with some simple T-SQL:
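Something along these lines would do it. The table and column names below (dbo.pbp_2018, OffenseTeam, ToGo) are assumptions based on the NFL Savant file layout, not a copy of the original query:

```sql
-- A sketch only: table and column names are assumed from the NFL Savant CSV
SELECT OffenseTeam,
       AVG(CAST(ToGo AS DECIMAL(5, 2))) AS AvgYardsToGo  -- cast avoids integer averaging
FROM dbo.pbp_2018
GROUP BY OffenseTeam
ORDER BY OffenseTeam;
```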

This is great! I was able to take my data, ask a question, and get an answer. What could be better, right?

Of course, now I have more questions about my data. And here’s the first issue you will discover when trying to analyze data stored in a traditional relational database.

T-SQL is excellent at answering one question at a time, but not as great when you need more than one question answered.

So, if we have more questions, we will need to write more queries.

Here’s a good follow-up question we might want answered: Can we examine this data broken down by each quarter?

Fortunately, the answer is yes, because T-SQL comes with a bunch of statements and functions that will help. In this case, I am going to use the PIVOT operator, as follows:
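Here is a sketch of what such a PIVOT query might look like, using the same assumed table and column names as before:

```sql
-- A sketch only: dbo.pbp_2018, OffenseTeam, Quarter, and ToGo are assumed names
SELECT OffenseTeam, [1], [2], [3], [4], [5]
FROM (
    SELECT OffenseTeam, Quarter, CAST(ToGo AS DECIMAL(5, 2)) AS ToGo
    FROM dbo.pbp_2018
) AS src
PIVOT (
    AVG(ToGo) FOR Quarter IN ([1], [2], [3], [4], [5])
) AS pvt
ORDER BY OffenseTeam;
```

Even as a sketch, there is a lot going on here: a derived table, the PIVOT syntax itself, and bracketed quarter numbers doing double duty as column names.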

Easy, right?

No, not easy.

And not readable, either. What’s with that row saying NULL? Why do I not have a result for some teams in that last column?

As it turns out, you need a lot of experience writing T-SQL to get to that query. And you need more experience understanding the result set, too. You don’t start on Day 0 as a data professional writing PIVOT queries against a SQL Server database.

Here’s the good news: You don’t need to write PIVOT queries, ever.

Data Mining for the Masses

The data import from NFL Savant was in the form of a CSV file, which I then imported into my database. Because that’s how ETL is done (see above for the common data flow process).

What if…now hear me out…we skipped Step 2? Forget about doing the import process. Instead, let’s open that CSV file in Excel.

Here’s what it would look like:

Back to our football questions. We’ve seen examples in T-SQL, let’s look at how to do this in Excel using a Pivot table.

I click on one cell in Excel, insert a pivot table, drag the offense teams as a row, and the downs to go as a value, change it to an average, and we are done. Have a look:

It took but a few seconds to get this magic to happen. Here’s what I want you to know:

1. No T-SQL is necessary. None. Not one line of code.

2. I have the entire table as a pivot table, allowing me to answer more questions WITHOUT needing to write more T-SQL.

3. There is no code. None.

Let’s say that I want to know the yards to go broken down by quarter. With T-SQL, I would need to write a new query. With the pivot table, it’s a simple drag and drop, like this:

Fin.

There is no need to rewrite code to get this result. Because there is no code, it’s drag and drop, and then I have my answer.

And that’s why I believe the inclusion of pivot tables inside Excel is the greatest advancement of the 21st century for data professionals.

Fight me.

Summary

I did not come here to bury relational databases. I came here to help you understand relational databases may not be the right place to do analytical processing.

When it comes to curating and consuming data, I have three simple rules for you to follow:

Rule #1 – Only collect data that you need. Don’t collect data “just in case you may need it later.” The data you collect must be relevant for your needs right now.

Rule #2 – Understand that all data is dirty. You could build a perfect analytical solution, only to have it based on inaccurate data. Know the risks involved in making business decisions based on dirty data.

Rule #3 – Before you collect any data, consider where the data will be processed. Don’t just assume that your database will do everything you need. Take time to list out all the available tools and systems at your disposal. The result may be a simpler solution than first imagined.

I wrote this post to help you understand Rule #3. Analysis of NFL play-by-play data is best done in a tool such as Excel or Power BI, and not (necessarily) inside of SQL Server.

SQL Server is a robust relational database engine, containing integrations with data science-y stuff such as R and Python. Just because you could do your analysis inside the SQL Server engine doesn’t mean you should.

This post originally appeared on PowerPivotPro and I was reminded about its existence while talking with Rob Collie during our Raw Data podcast. I asked Rob if I could repost here. He said yes. True story.

Book Review: Calling Bullshit
https://thomaslarock.com/2020/10/book-review-calling-bullshit/
Mon, 19 Oct 2020

Each year, I try to find a good book to bring with me to the beach. A few months ago, I came across Calling Bullshit: The Art of Skepticism in a Data-Driven World while doom scrolling Twitter one night. I ordered the book and did not wait for the beach to get started reading.

Written by Carl Bergstrom and Jevin West, Calling Bullshit is their effort at helping everyone develop the necessary skills for critical thinking. The book reads as if you were following a college lecture. And this makes sense, since the authors are professors at the University of Washington in Seattle. You can see the course syllabus here. The book’s organization closely follows the syllabus.

The authors start strong with “The world is awash with bullshit, and we’re drowning in it.” They point out how creating bullshit is easier and often simpler than speaking the truth. You have likely heard the phrase “the amount of energy required to refute bullshit is an order of magnitude bigger than [that needed] to produce it” from Italian software engineer Alberto Brandolini.

Everyone knows this, everyone wishes the bullshit would go away, and yet we seem to accept nothing can be done.

Well, one of the main purposes of education is to teach students to think critically. And the authors want to help people separate fact from fiction. Therefore, the need for the course, and this book.


I thoroughly enjoyed this book. My favorite chapter was ‘Calling Bullshit on Big Data’. I want to personally thank the authors for those 25 wonderful pages. This chapter alone is worth the price of the book. Anyone currently using, or considering, data science projects at their company will want to read it.

The chapters on numbers and nonsense, selection bias, and data visualization all struck a chord with me. The authors do a wonderful job of detailing their thoughts and using practical examples. And they don’t just tell you how to call bullshit, they remind you to do so in a respectful way, again with examples. Part of the problem when trying to refute bullshit tossed at you by your crazy uncle at Thanksgiving involves confirmation bias, a topic also discussed in The Social Dilemma. You must find a way to separate identity from the topic being debunked.

I’ve added this book to my bookshelf. With the holidays coming up, you may want to consider buying a few copies for your friends and relatives. Might make holiday dinners a bit more palatable.

Your Dashboards Still Suck
https://thomaslarock.com/2019/10/your-dashboards-still-suck/
Tue, 29 Oct 2019

I’ve already written a post about how dashboards are a horrible way to communicate. I’m here today to remind you that your dashboards still suck. Let’s start with the most recent example.

This image is a useless piece of information. I’m certain somewhere there is a developer proud of how they took a donut chart and made it prettier. And I would agree it is pretty…it’s pretty useless.

Let’s break it down.

This graphic doesn’t tell me anything about the amount of fuel, in gallons (or liters for my non-US readers). And that’s really the most important piece of information. A close second is displaying the range (number of miles/km remaining before empty). Telling me I have 35% fuel remaining has no value unless I also know (1) how much fuel is left or (2) how far I can travel before empty.

This is why your dashboards still suck. Right now, you’ve built something, some chart, and the chart is hiding data behind an aggregate, a summation, or a percentage. And I bet it is leading to bad business decisions.

When I pull up to the pump, the question I have is “how many gallons can I put in my Jeep”, not “how much percentage”. (As an aside, Jeep does not provide the size of my tank in the operating manual. I needed to Google that information; it’s 21.5 gallons.) If the app can tell me I am at 35% full, it can also tell me I have 7.525 gallons remaining, or that I need 13.975 gallons to fill my tank.
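The arithmetic behind those numbers is trivial, which is exactly the point. A quick sketch, using the 21.5-gallon tank size mentioned above:

```sql
-- Illustrative arithmetic only, based on a 21.5 gallon tank at 35% full
DECLARE @TankGallons DECIMAL(5, 2) = 21.5,
        @PercentFull DECIMAL(5, 2) = 0.35;

SELECT @TankGallons * @PercentFull       AS GallonsRemaining, -- 7.525
       @TankGallons * (1 - @PercentFull) AS GallonsToFill;    -- 13.975
```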

You can only see 35% of my new Jeep.

Stop building images that take good data and make it useless.

You’re better than this.

Three Ways to Become Data-Centric
https://thomaslarock.com/2019/01/three-ways-to-become-data-centric/
Wed, 16 Jan 2019

The conservation of quantum information states that information can neither be created nor destroyed. Stephen Hawking used this theory to explain how a black hole does not consume photons like a giant cosmic eraser. It is clear to me that neither Stephen Hawking, nor any quantum physicist, has ever worked in IT.

Outside the realm of quantum mechanics, we have the physical world of corporate offices. And in the physical world, information is generated, curated, and consumed at an accelerating pace with each passing year. The similarity between both realms? Data is never destroyed.

We are now a nation, and a world, of data hoarders.

Thanks to popular processes such as DevOps, we obsess over telemetry and observability. System administrators are keen to collect as much diagnostic information as possible to help troubleshoot servers and applications when they fail. And the Internet of Things has a billion devices broadcasting data to be easily consumed into Azure and AWS.

All of this data hoarding is leading to an ever-growing amount of ROT (Redundant, Outdated, Trivial information).

Stop the madness.

It’s time to shift our way of thinking about how we collect data. We need to become more data-centric and do less data-hoarding.

Becoming data-centric means you define goals or problems to solve BEFORE collecting or analyzing data. Once defined, you begin the process of collecting the necessary data. You want to collect the right data to help you make informed decisions about what actions are necessary.

Three Ways to Become Data-Centric

Here are three things you can start today in an effort to become data-centric. No matter what your role, these three ways will help put you on the right path.

Start with the question you want answered. This doesn’t have to be a complicated question. Something as simple as “How many times was this server rebooted?” is a fine question to ask. You could also ask, “How long does it take for a server to reboot?” These examples are simple questions, yes. But I bet your current data collections do not allow for simple answers without a bit of data wrangling (see the sketch after these three items for what such an answer might look like).

Have an end-goal statement in mind. Once you have your question(s) and you have settled on the correct data to be collected, you should think about the desired output. For example, perhaps you want to put the information into a simple slide deck. Or maybe build a real-time dashboard inside of Power BI. Knowing the end goal may influence how you collect your data.

Learn to ask good questions. Questions should help to uncover facts, not opinions. Don’t let your opinions affect how you collect or analyze your data. It is important to understand how assumptions form the basis for many questions. It’s up to you to decide if those assumptions are safe. To me, assumptions based upon something measurable are safe. For example, your gut may tell you that server reboots are a result of OS patches applied too often. Instead of asking, “How often are patches applied?” a better question would be, “How many patches need a reboot?” Then compare that number to the total number of server reboots.
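To make the first question concrete, here is a minimal sketch of the kind of query that answers it. The table and column names (dbo.ServerEvents, EventType, EventTime) are hypothetical, standing in for whatever telemetry you already collect:

```sql
-- Illustrative only: the table and columns here are hypothetical
SELECT ServerName,
       COUNT(*) AS RebootCount
FROM dbo.ServerEvents
WHERE EventType = 'Reboot'
  AND EventTime >= DATEADD(DAY, -30, SYSUTCDATETIME())  -- last 30 days
GROUP BY ServerName
ORDER BY RebootCount DESC;
```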

Summary

When it comes to data, no one is perfect. These days, data is easy to come by, making it a cheap commodity. When data is cheap, attention becomes a premium. By shifting to a data-centric mindset, you can avoid data hoarding and reduce the amount of ROT in your enterprise. With just a little bit of effort, you can make things better for yourself and your company, and help set the example for everyone else.
