SQL MVP Archives - Thomas LaRock
https://thomaslarock.com/category/sql-mvp/

Microsoft Fabric is the New Office
https://thomaslarock.com/2024/07/microsoft-fabric-is-the-new-office/
Thu, 11 Jul 2024 21:35:38 +0000

At Microsoft Build in 2023 the world first heard about a new offering from Microsoft called Microsoft Fabric. Reactions to the announcement ranged from “meh” to “what is this?” To be fair, this is the typical reaction most people have when you talk data with them.

Many of us had no idea what to make of Fabric. To me, it seemed as if Microsoft were doing a rebranding of sorts: they changed the name of Azure Synapse Analytics, also called a Dedicated SQL Pool, and previously known as Azure SQL Data Warehouse. Microsoft excels (ha!) at renaming products every 18 months, leaving customers guessing whether anyone is actually leading product marketing.

Microsoft Fabric also came with this thing called OneLake, a place for all your company data. Folks with an eye on data security, privacy, and governance thought the idea of OneLake was madness. The idea of combining all your company data into one big bucket seemed like a lot of administrative overhead. But OneLake also offers a way to separate storage and compute, allowing for greater scalability. This is a must-have when you are competing with companies like Databricks and Snowflake, and other cloud service providers such as AWS and Google.

After Some Thought…

After the dust had settled and time passed, the launch and concept of Fabric started to make more sense. For the past 15+ years, Microsoft has been building the individual pieces of Fabric. Here’s a handful of features and services Fabric contains:

  • Data Warehouse/Lakehouse – the storing of large volumes of structured and unstructured data in OneLake, which separates storage and compute
  • Real-time analytics – the ability to stream data into OneLake, or pull data from external sources such as Snowflake
  • Data Engineering – the ability to extract, load, and transform data including the use of notebooks
  • Data Science – leverage machine learning to gain insights from your data
  • PowerBI – create interactive reports and dashboards

Many of these services were built to support traditional data storage, retrieval, and analytical processing. This type of data processing focuses on data at rest, as opposed to streaming event data. This is not to say you couldn’t use these services for streaming; you could try if you wanted. After all, the building blocks for real-time analytics go back to SQL Server 2008 and the release of StreamInsight, a fancy way to build pipelines for refreshing dashboards with up-to-date data.

Streaming event data is where the real data race is taking place today. According to the IDC, by 2025 nearly 30% of data will need real-time processing. This is the market Microsoft, among others, is targeting, which is roughly 54 ZB in size.

So, it seems the more data collected, the more likely it is to be used for real-time processing. Therefore, if you are a cloud company, it is rather important to your bottom line to make it easy for your customers to store their data in your cloud. The next best thing, of course, is making it easy for your customers to use your tools and services to work with data stored elsewhere. This is part of the brilliance of Fabric: it allows easy access to real-time data you are already using in places like Databricks, Confluent, and Snowflake.

The Bundle

Now, if you are Microsoft, with a handful of data services ready to meet the needs of a growing market, you have some choices to make. You could continue to do what you have done for 15+ years and keep selling individual products and services and hope you earn some of the market going forward. Or you could bundle the products and services, unifying them into one platform, and make it easy for users to ingest, transform, analyze, and report on their data.

Well, if you want to gain market share, bundling makes the most sense. And Microsoft is uniquely positioned to pull this off for two reasons. First, they have a comprehensive data platform which is second to none. Sure, you can point to other companies who might do one of those services better, but there is no company on Earth, or in the Cloud, which offers a complete end-to-end data platform like Fabric.

Second, bundling software is something Microsoft has a history of doing, and doing it quite well in some cases. People reading this post in 2024 may not be old enough to recall a time when you purchased individual software products like Excel and Word. But I do recall the time before Microsoft Office existed. Bundling everything into Fabric allows users to work with their data anywhere and, most importantly to Microsoft’s bottom line, the result is more data flowing to Azure servers.

I am not here to tell you everything is perfect with Fabric. In the past year I have seen a handful of negative comments about Fabric, most of them nitpicking about things like brand names, data type support, and file formats. There is always going to be a person upset about how Widget X isn’t the Most Perfect Thing For Them at This Moment and They Need to Tell the World. I think most people believe that when a product is released, even if it is marked as “Preview”, it should be able to meet the demands of every possible user. That is just not practical.

Summary

At Build this year Microsoft announced Fabric is generally available (GA), which also leads users to believe it should meet the demands of every possible user. The fastest way for Microsoft to grab as much market share as possible is to focus on the customer experience and remove those barriers. You can find roadmap details here, giving you an idea of the effort going on behind the scenes with Fabric today. For example, for everyone who has raised issues with security and governance, you can see the list of what has shipped and what is planned here.

It is clear Microsoft is investing in Fabric, much like they invested in Office 30+ years ago. If there is one thing Microsoft knows how to do, it is creating value for shareholders.

Since the announcement of Fabric last May, Microsoft stock is up over 25%. I am not going to say the increase is the direct result of Fabric. What I am saying is Microsoft might have an idea about what they are doing, and why.

Microsoft Fabric is the new Office – a bundle of data products meant to boost productivity for data professionals and dominate the data analytics landscape, much in the same way Office dominates the business world.

Microsoft Data Platform MVP – Fifteen Years
https://thomaslarock.com/2023/08/microsoft-data-platform-mvp-fifteen-years/
Thu, 17 Aug 2023 19:53:36 +0000
I am happy, honored, and humbled to receive the Microsoft Data Platform MVP award for the fifteenth (15th) straight year.

Receiving the MVP award during my unforced sabbatical this summer was a bright spot, no question. It reinforced the belief I have in myself – my contributions have value. Microsoft puts this front and center on the award by stating (emphasis mine):

“We recognize and value your exceptional contributions to technical communities worldwide.”

I’m running out of room.

I recall the aftermath of my first award, when I was told I was the “least technical SQL Server MVP ever awarded”. Talk about feeling you have no value! And that was certainly the feeling I had two months ago.

It’s amazing how something as simple as being recognized by your peers can go so far in making a person feel valued. We should all strive to go out of our way daily to help another human feel valued.

There are plenty of people in the world who are recognized as experts in the Microsoft Data Platform. I’d like to think I am one of them. I also happen to be fortunate enough to know Microsoft recognizes me as one as well.

But MVPs advocate for Microsoft because we want to, not because we want an award. After all these years I’m still crazy for Microsoft, and I am happy to help promote the best data platform on the planet.

For my fellow MVPs renewed this year, I offer this suggestion – say thank you. Then say it again. Email the person on the product team who made the widget you enjoy using over and over and tell them how much you appreciate their effort. Email your MVP lead(s) and thank them for all their hard work as well.

A little kindness goes a long way. You never know how much reaching out could mean to that person at that moment.

Stop Using Production Data For Development
https://thomaslarock.com/2022/01/stop-using-production-refresh-development/
Mon, 31 Jan 2022 19:33:43 +0000
A common software development practice is to take data from a production system and restore it to a different environment, often called “test”, “development”, “staging”, or even “QA”. This allows for support teams to troubleshoot issues without making changes to the true production environment. It also allows for development teams to build new versions and features of existing products in a non-production environment. Using production to refresh development is just one of those things everyone accepts and does, without question.

Of course, the idea of testing in a non-production environment isn’t anything new. Consider haggis. No way someone thought to themselves “let me just shove everything I can into this sheep’s stomach, boil it, and serve it for dinner tonight.” You know they first fed it to the neighbor nobody liked. Probably right after they shoved a carton of milk in their face and asked “does this smell bad to you?”

For decades software development has made it a standard practice to create copies of production data and restore it to other non-production environments. It was not without issues, however. For example, as data sizes grew so did the length of time to do a restore. This also clogged network bandwidth, not to mention the costs associated with storage.

And then there is this:

If you read that tweet and thought “yeah, what’s your point?” then you are part of the problem.

As an industry we focus on access to specific environments, but not on the assets in those environments. This is wrong. The royal family knows where the Crown Jewels are stored, but if the Jewels are moved to another location, you know they are heavily guarded at all times. Access to the jewels matters no matter where the jewels are located. The same should be true of your production data.

(Image caption: Use production to refresh development. Then again, that stick might be pointy enough to fend off any attacker.)

Data is the most critical asset your company owns. If you make efforts to lock down production but allow production data to flow to less-secure environments, then you haven’t locked down production.

It is ludicrous to think about the billions of dollars spent to lock down physical access to data centers only to allow junior developers to stuff customer data on a laptop they will then leave behind on a bus. Or senior developers leaving S3 buckets open. Or forgetting they pushed credentials to a GitHub repo.

If you are still moving production data between environments you are a data breach waiting to happen. I don’t care what the auditors say, you are at an elevated and unnecessary risk. Like when Obi-Wan decides to protect baby Luke by keeping his name and taking him to Darth Vader’s home planet. Nice job, Ben, no way this ends up with you dying, naked, in front of a few dozen onlookers.

I think what frustrates me most is this entire system is unnecessary. You have options when moving production data. You can use data masking, obfuscation, and encryption in order to reduce your risk. But the best method is to not move your data at all.
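To make the first of those options concrete, here is a minimal sketch using SQL Server Dynamic Data Masking. The dbo.Customers table, its columns, and the DevReader role are all hypothetical, and masking is a convenience layer, not encryption:

-- Mask sensitive columns so non-privileged users see obfuscated values
ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE dbo.Customers
    ALTER COLUMN PhoneNumber ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');

-- Only principals granted UNMASK see the real values (DevReader is a hypothetical role)
GRANT UNMASK TO DevReader;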

After years of being told “don’t test in production” it’s time to think about testing in production. Continuous integration and continuous delivery/deployment (CI/CD) allow for you to achieve this miracle. And for those that say “No, you dummy, CI/CD is what you do in test before you push to production,” I offer the following.

Use dummy data.

You don’t need production data, you need data that looks like production data. You don’t need actual customer names and addresses, you need similar names and addresses. And there are ways to simulate the statistics in your database, too, so your query plans have the same shape as production without the actual volume of data.
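One hedged way to pull off that statistics trick in SQL Server is the ROWCOUNT and PAGECOUNT options of UPDATE STATISTICS, which lie to the optimizer about table size. The dbo.Orders table and the numbers below are made up for illustration:

-- Make the optimizer believe the development copy of dbo.Orders is production-sized,
-- so query plans take a production-like shape without the actual volume of data
UPDATE STATISTICS dbo.Orders
WITH ROWCOUNT = 50000000, PAGECOUNT = 1000000;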

It’s possible for you to develop software code against simulated production data, as opposed to actual production data. But doing so requires more work, and nobody likes more work.

Until you are breached, of course. Then the extra work won’t be optional.

Microsoft Data Platform MVP – A Baker’s Dozen
https://thomaslarock.com/2021/07/microsoft-data-platform-mvp-a-bakers-dozen/
Thu, 29 Jul 2021 16:28:28 +0000
No Satya, thank you. And you’re welcome. Let’s do lunch next time I’m in town.

This past week I received another care package from Satya Nadella. Inside was my Microsoft Data Platform MVP award for 2021-2022. I am happy, honored, and humbled to receive the Microsoft Data Platform MVP award for the thirteenth straight year. I still recall my first MVP award and how it got caught in the company spam folder. Good times.

I am not able to explain why I am considered an MVP and others are not. I have no idea what it takes to be an MVP. And neither does anyone else. Well, maybe Microsoft does since they are the ones that bestow the award on others. But there doesn’t seem to be any magical formula to determine if someone is an MVP or not.

I do my best to help others. I value people and relationships over money. And I play around with many Microsoft data tools and applications, and blog about the things I find interesting. Sometimes those blog posts are close to fanboi level, other times they are not. But I do my best to remember that there are people over in Redmond that work hard on delivering quality. Sometimes they miss, and I do my best to help them stay on target. Maybe that’s why they keep me around.

Looking back to 2009 and my first award, I recognize the activities I was doing 13 years ago which earned me my first MVP award are not the same activities I am doing today. And I think that is to be expected, as I’m not the same person today as I was then. I have a different role, different responsibilities, and different priorities. I’ve branched out into the world of data security and privacy as well as data science.

I now spend time writing for multiple online publications instead of here on my personal blog. Often those articles are not about SQL Server, but they are almost always about data. I remain an advocate for Microsoft technologies, and continue to do my best to influence others to see the value in the products and services coming out of Redmond.

This past year was a bit different, of course, as COVID affected different parts of the world in different ways, at different times. As such, the Microsoft MVP program made the decision to auto-renew MVPs without evaluating our current activity. This year’s award is the equivalent of Free Parking in Monopoly. Sure, you’re still in the game, but you aren’t doing much, things are happening around you, and everything is OK.

Summary

I’m going to enjoy this ride while it lasts. I’m also going to do my part to make certain that the ride lasts as long as possible for everyone. Here’s a #ProTip for those of us renewed this year:

Say thank you. Then say it again. Be grateful for what we have. Email the person that made the widget that you enjoy using over and over and tell them how much you appreciate their effort. Email your MVP lead and thank them for all their hard work as well.

MVPs do advocate for Microsoft because we want to, not because we want an award. After all these years I’m still crazy for Microsoft, and I am happy to help promote the best data platform on the planet.

SET NOCOUNT For SQL Server
https://thomaslarock.com/2021/03/set-nocount-for-sql-server/
Tue, 30 Mar 2021 01:05:55 +0000
Last week I was reviewing an article and found myself needing information on the use of NOCOUNT as a standard for writing stored procedures. A quick internet search found this old post of mine, written back when I used to work for a living. Apparently, I was once asked to enable NOCOUNT for a specific SQL Server database. As the post suggests, this is not possible. The options for NOCOUNT are to set it for the entire instance, for your specific connection, or within your T-SQL code.

Since the post was written well before the new-ish ALTER DATABASE SCOPED CONFIGURATION statement, I was hopeful enabling NOCOUNT for a database was now possible. Turns out you cannot, as the set options listed here do not include NOCOUNT. Sad trombone music.
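If you want to check for yourself, the catalog view below lists every database scoped configuration available on an instance, and NOCOUNT is not among them:

-- List the database scoped configuration options; NOCOUNT does not appear
SELECT configuration_id, name, value
FROM sys.database_scoped_configurations
ORDER BY name;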

But of course I tried anyway.

And I failed.

Really failed.

I tried to enable NOCOUNT for my instance of SQL 2019 and it wouldn’t take. At all.

Let me explain.

The Flop

Using the code from my previous post, you enable NOCOUNT for the instance by configuring the user option to 512, like this:

-- Set the instance-wide 'user options' bitmask to 512, the NOCOUNT bit
EXEC sys.sp_configure 'user options', '512';
GO

RECONFIGURE;
GO

Now, open a new query window in SQL Server Management Studio (SSMS), set the results to text to make the output easier to see, and run a query. If you are like me, you will see this:
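Any trivial query works for the test; here is a sketch (not the exact query from the screenshot) along with the behavior I observed:

-- A simple test query
SELECT TOP (5) name
FROM sys.databases;
-- Expected with the instance-level user option set: no "(5 rows affected)" message
-- Actual in SSMS: the rows affected message still appears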

Not exactly the expected behavior! My initial reaction is to assume I have screwed this up somehow. I decide to try Azure Data Studio (ADS) to connect and run the query:

Same result. Two tools, and the result set is showing a count of rows affected, despite the user option clearly having been set.

And the SSMS GUI verifies this as well:

The Turn

Before I go any further, I want to note that SET NOCOUNT OFF is one of those horrible phrases we come across in tech where our brains are forced to think twice about what we are doing. Whoever named it this way should be sacked. A simple SET ROWRESULTS ON|OFF would be far simpler to comprehend. </rant>

Anyway, I spend time trying to debug what is happening. I am able to manually set NOCOUNT on and off inside of T-SQL and see a count of rows affected returned (or not). I check and recheck everything I can think of and feel as if I have lost my mind. I’m starting to question how I ever became certified in SQL Server.
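For reference, the manual toggle mentioned above is plain T-SQL and works exactly as advertised:

SET NOCOUNT ON;   -- suppresses the "(N rows affected)" messages for this session
SELECT TOP (5) name FROM sys.objects;

SET NOCOUNT OFF;  -- the messages come back
SELECT TOP (5) name FROM sys.objects;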

I mean, it’s a simple configuration change. This isn’t rocket surgery.

So I do what anyone else in this situation would do.

I turn off my laptop and forget about everything for a few days.

The River

Eventually I decide to reopen my laptop and try again. I am able to reproduce everything. So I ask some friends if they are also seeing similar issues. One friend, Karen López (@datachick), asked me a few follow-up questions. These questions get me thinking about other ways to test behavior and debug. I suddenly recall I can check the options set for my current connection:

-- Decode the @@OPTIONS bitmask to see which SET options are active for the current connection
DECLARE @options INT
SELECT @options = @@OPTIONS

PRINT @options
IF ( (1 & @options) = 1 ) PRINT 'DISABLE_DEF_CNST_CHK'
IF ( (2 & @options) = 2 ) PRINT 'IMPLICIT_TRANSACTIONS'
IF ( (4 & @options) = 4 ) PRINT 'CURSOR_CLOSE_ON_COMMIT'
IF ( (8 & @options) = 8 ) PRINT 'ANSI_WARNINGS'
IF ( (16 & @options) = 16 ) PRINT 'ANSI_PADDING'
IF ( (32 & @options) = 32 ) PRINT 'ANSI_NULLS'
IF ( (64 & @options) = 64 ) PRINT 'ARITHABORT'
IF ( (128 & @options) = 128 ) PRINT 'ARITHIGNORE'
IF ( (256 & @options) = 256 ) PRINT 'QUOTED_IDENTIFIER'
IF ( (512 & @options) = 512 ) PRINT 'NOCOUNT'
IF ( (1024 & @options) = 1024 ) PRINT 'ANSI_NULL_DFLT_ON'
IF ( (2048 & @options) = 2048 ) PRINT 'ANSI_NULL_DFLT_OFF'
IF ( (4096 & @options) = 4096 ) PRINT 'CONCAT_NULL_YIELDS_NULL'
IF ( (8192 & @options) = 8192 ) PRINT 'NUMERIC_ROUNDABORT'
IF ( (16384 & @options) = 16384 ) PRINT 'XACT_ABORT'

Running the above code returns the following result for my connection:

And then it hits me. My connection does not have NOCOUNT enabled! I mean, I’m not really surprised, but it is helpful to see that it is missing. I then decide to open a connection with SQLCMD and observe the default behavior for the connection. Sure enough, NOCOUNT is enabled, as expected:
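The check from the sqlcmd session is the same bitmask test as above (I connected with defaults, something like sqlcmd -S localhost -E; adjust the server name and authentication for your environment):

-- Run inside the sqlcmd session
SELECT CASE WHEN (@@OPTIONS & 512) = 512
            THEN 'NOCOUNT is set'
            ELSE 'NOCOUNT is not set'
       END AS nocount_status;
-- From a bare sqlcmd connection this reports NOCOUNT is set,
-- and no rows affected message comes back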

My connection string has no additional options, and NOCOUNT is respected. This is the expected behavior for the instance.

Now I need to verify what is happening under the hood when you connect to SQL Server using SSMS or ADS. Using the default xEvents session I capture the connection string sent when connecting from ADS and find this gem:

The NOCOUNT user option configuration item is not recognized when you connect using those tools. Other user options appear to be respected, but for some reason NOCOUNT is ignored. This explains why I was seeing the unexpected behavior.

I will keep my certifications for now.

Summary

I don’t know if this is a bug or a feature, but it is certainly a frustrating experience for an end user like me. If I SET NOCOUNT for SQL Server, I expect it to apply to all users connecting from that point forward. Since other user options appear to be respected, there must be something different about NOCOUNT.

It should not matter how users are connecting. SSMS and ADS should both respect the server settings. I suspect other tools likely use the same code as SSMS and ADS, meaning you should double-check the actual connection string used from your application. It could explain unexpected behavior.

You Can’t Marry Your Database, But You Can Have Relations
https://thomaslarock.com/2021/02/you-cant-marry-your-database-but-you-can-have-relations/
Mon, 01 Feb 2021 23:15:29 +0000
There’s something you should know about relational databases.

They were designed to store data efficiently, protecting the quality of the data written and stored to disk. I’ve written before about relational engines favoring data quality and integrity, and how relational databases were not designed for the reading of data.

Of course, if you are going through the trouble of writing the data into a relational database, it makes sense that you would want to retrieve the data at some point. Otherwise, why go through the exercise of storing the data inside the database?

The trouble with reading data from a relational database is that the data is not stored in a format that is friendly for viewing, reading, or retrieving. That’s why we have data professionals, like me, to help you write queries that return the correct data, in the correct format, for you to analyze.

I’m here today to tell you we’ve been doing data wrong the whole damn time.

Let me show you what I mean.

Traditional Data Flow Patterns

Here’s what companies do, every day:

Step 1 – Identify useful data

Step 2 – Import that data into a database

Step 3 – Analyze the data

Step 4 – Export the data into dashboards

Step 5 – Profit (maybe)

The trouble with this process is Step 3, the analyzing of the data. Relational databases were not designed for analytical processing. Relational databases do not store data in a way that is readable, or friendly, for human analysis.

That’s not to say you can’t do analytics inside of a relational database. What I am saying is that it could be better for you not to spin the CPU cycles there, and instead do the analytics somewhere else.

For example, data warehouses help with data storage, retrieval, and analytics. But even a data warehouse can fall short when it comes to the use of unstructured data sources. As a result, we’ve spent decades building ETL processes to curate, collate, consolidate, and consume data.

And we’ve been doing it wrong.

Data Mining for the Data Expert

So, we understand that people find data, store it, and try to use it later. They are engaging in the process of data mining, hoping to find gold in the form of insights, leading to better business decisions.

But as I mentioned before, the data isn’t in a readable format. Let’s look at an example.

Following the common data flow pattern, I found some useful data at NFL Savant https://nflsavant.com/about.php, and imported the data into a SQL Server database:

It looks like any other table in a relational database. In this case, it is a table containing time series data pertaining to play by play records for the 2018 NFL season. Each row represents an entity (a play at a point in time of an NFL game), and the columns represent attributes of the play (down, distance, yards to go, etc.)
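If it helps to picture the table, here is a rough sketch of its shape. The column names and types are assumptions based on the NFL Savant export, not the exact schema:

-- Rough sketch of the play-by-play table (names and types are assumptions)
CREATE TABLE dbo.pbp_2018 (
    GameId      INT,
    GameDate    DATE,
    Quarter     INT,
    Down        INT,
    ToGo        INT,          -- yards to go for a first down
    OffenseTeam VARCHAR(10),
    DefenseTeam VARCHAR(10),
    PlayType    VARCHAR(50),
    Yards       INT
);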

Nothing out of place here, this is how data is written to a relational database. In an orderly fashion. As a DBA, I love this type of orderly storage. It’s efficient, and efficient is good.

As a data analyst, I’m not a fan. At least, not yet. I have a bunch of data, but what I want are some answers. So, it’s up to me to ask some questions of the data, find some answers, and use that to help make better business decisions.

For this data, here’s an example of a simple question: What are the average yards to go for NFL teams in 2018? I can get that answer with some simple T-SQL:
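Something along these lines does the job, a sketch written against the table shape above (your column names may differ):

-- Average yards to go per offense team for the 2018 season
SELECT OffenseTeam,
       AVG(1.0 * ToGo) AS AvgYardsToGo
FROM dbo.pbp_2018
GROUP BY OffenseTeam
ORDER BY OffenseTeam;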

This is great! I was able to take my data, ask a question, and get an answer. What could be better, right?

Of course, now I have more questions about my data. And here’s the first issue you will discover when trying to analyze data stored in a traditional relational database.

T-SQL is excellent at answering one question at a time, but not as great when you need more than one question answered.

So, if we have more questions, we will need to write more queries.

Here’s a good follow-up question that we might want to be answered: Can we examine this data broken down by each quarter?

Fortunately, the answer is yes, because T-SQL comes with a bunch of statements and functions that will help. In this case, I am going to use the PIVOT operator, as follows:
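The query ends up looking something like this, again a sketch against the assumed table shape (quarter 5 being overtime):

-- Average yards to go per offense team, broken out by quarter
SELECT OffenseTeam, [1], [2], [3], [4], [5]
FROM (
    SELECT OffenseTeam, Quarter, 1.0 * ToGo AS ToGo
    FROM dbo.pbp_2018
) AS src
PIVOT (
    AVG(ToGo) FOR Quarter IN ([1], [2], [3], [4], [5])
) AS p
ORDER BY OffenseTeam;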

Easy, right?

No, not easy.

And not readable, either. What’s with that row saying NULL? Why do I not have a result for some teams in that last column?

As it turns out, you need a lot of experience writing T-SQL to get to that query. And you need more experience understanding the result set, too. You don’t start on Day 0 as a data professional writing PIVOT queries against a SQL Server database.

Here’s the good news: You don’t need to write PIVOT queries, ever.

Data Mining for the Masses

The data import from NFL Savant was in the form of a CSV file, which I then imported into my database. Because that’s how ETL is done (see above for the common data flow process).

What if…now hear me out…we skipped step 2? Forget about doing the import process. Instead, let’s open that CSV file in Excel.

Here’s what it would look like:

Back to our football questions. We’ve seen examples in T-SQL, let’s look at how to do this in Excel using a Pivot table.

I click on one cell in Excel, insert a pivot table, drag the offense teams as a row and the yards to go as a value, change it to an average, and we are done. Have a look:

It took but a few seconds to get this magic to happen. Here’s what I want you to know:

1. No T-SQL is necessary. None. Not one line of code.

2. I have the entire table as a pivot table, allowing me to answer more questions WITHOUT needing to write more T-SQL.

3. There is no code. None.

Let’s say that I want to know the yards to go broken down by quarter. With T-SQL, I would need to write a new query. With the pivot table, it’s a simple drag and drop, like this:

Fin.

There is no need to rewrite code to get this result, because there is no code. It’s drag and drop, and then I have my answer.

And that’s why I believe the inclusion of pivot tables inside Excel is the greatest advancement in the 21st century for data professionals.

Fight me.

Summary

I did not come here to bury relational databases. I came here to help you understand relational databases may not be the right place to do analytical processing.

When it comes to curating and consuming data, I have three simple rules for you to follow:

Rule #1 – Only collect data that you need. Don’t collect data “just in case you may need it later.” The data you collect must be relevant for your needs right now.

Rule #2 – Understand that all data is dirty. You could build a perfect analytical solution that is nonetheless based on inaccurate data. Know the risks involved in making business decisions based on dirty data.

Rule #3 – Before you collect any data, consider where the data will be processed. Don’t just assume that your database will do everything you need. Take time to list out all the available tools and systems at your disposal. The result may be a simpler solution than first imagined.

I wrote this post to help you understand Rule #3. Analysis of NFL play by play data is best done in a tool such as Excel, or PowerBI, and not (necessarily) inside of SQL Server.

SQL Server is a robust relational database engine, containing integrations with data science-y stuff such as R and Python. Just because you could do your analysis inside the SQL Server engine doesn’t mean you should.

This post originally appeared on PowerPivotPro and I was reminded about its existence while talking with Rob Collie during our Raw Data podcast. I asked Rob if I could repost here. He said yes. True story.

Book Review: Calling Bullshit
https://thomaslarock.com/2020/10/book-review-calling-bullshit/
Mon, 19 Oct 2020 21:32:19 +0000
Each year, I try to find a good book to bring with me to the beach. A few months ago, I came across Calling Bullshit: The Art of Skepticism in a Data-Driven World while doom scrolling Twitter one night. I ordered the book and did not wait for the beach to get started reading.

Written by Carl Bergstrom and Jevin West, Calling Bullshit is their effort at helping everyone develop the necessary skills for critical thinking. The book reads as if you were following a college lecture. And this makes sense, since the authors are professors at the University of Washington in Seattle. You can see the course syllabus here. The book is organized close to what you see in the syllabus.

The authors start strong with “The world is awash with bullshit, and we’re drowning in it.” They point out how creating bullshit is easier and often simpler than speaking the truth. You have likely heard the phrase “the amount of energy required to refute bullshit is an order of magnitude bigger than [that needed] to produce it” from Italian software engineer Alberto Brandolini.

Everyone knows this, everyone wishes the bullshit would go away, and yet we seem to accept nothing can be done.

Well, one of the main purposes of education is to teach students to think critically. And the authors want to help people separate fact from fiction. Therefore, the need for the course, and this book.


I thoroughly enjoyed this book. My favorite chapter was ‘calling bullshit on big data’. I want to personally thank the authors for those 25 wonderful pages. This section alone is worth the price of the book. Anyone currently using, or considering, data science projects at their company will want to read this chapter.

The chapters on numbers and nonsense, selection bias, and data visualization all struck a chord for me. The authors do a wonderful job of detailing their thoughts and using practical examples. And they don’t just tell you how to call bullshit, they remind you to do so in a respectful way, again with examples. Part of the problem when trying to refute bullshit tossed at you from your crazy uncle at Thanksgiving involves confirmation bias, a similar topic discussed in The Social Dilemma. You must find a way to separate identity from the topic being debunked.

I’ve added this book to my bookshelf. With the holidays coming up, you may want to consider buying a few copies for your friends and relatives. Might make holiday dinners a bit more palatable.
