<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://sqlrockstar.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://sqlrockstar.github.io/" rel="alternate" type="text/html" /><updated>2026-04-09T17:20:27-04:00</updated><id>https://sqlrockstar.github.io/feed.xml</id><title type="html">Thomas LaRock</title><subtitle>Data engineering, developer advocacy, and technical leadership insights
</subtitle><entry><title type="html">Fail Better, Then Finish 21st</title><link href="https://sqlrockstar.github.io/2026/04/fail-better-then-finish-21st/" rel="alternate" type="text/html" title="Fail Better, Then Finish 21st" /><published>2026-04-09T12:35:38-04:00</published><updated>2026-04-09T12:35:38-04:00</updated><id>https://sqlrockstar.github.io/2026/04/fail-better-then-finish-21st</id><content type="html" xml:base="https://sqlrockstar.github.io/2026/04/fail-better-then-finish-21st/"><![CDATA[<link rel="icon" type="image/x-icon" href="/gravatar.png" />

<p>Every March, data scientists and sports fans collide on Kaggle. This year, I came out ahead of 3,464 of them. 21st place out of 3,485 entries. Top 1%. I'll take it.</p>

<p><img src="https://sqlrockstar.github.io/wp-content/uploads/2026/04/kaggle_silver_cert.png" alt="Kaggle 2026 Silver Certificate" style="width:75%; display:block; margin: 0 auto;" /></p>

<p>This is how I did it.</p>

<h2>Why Am I Even Doing This?</h2>

<p>I have an MS in Mathematics, way before some hipster decided to rename "statistics" as "data science". I spent the better part of a decade as a database administrator before deciding I wanted to do more with data than store it. About ten years ago I started consuming and <a href="https://thomaslarock.com/2017/07/why-im-learning-data-science/">learning all things data science</a>, and lurking around other data folks at places like <a href="https://www.kaggle.com/">Kaggle</a>. As you read this I am finishing my second MS degree, this one in Data Analytics from Georgia Tech; I graduate in a few weeks.</p>

<p>Now, I am not saying I spent $10,000+ and three-plus years of my life just to get better at Kaggle competitions.</p>

<p>But, I am not not saying that either.</p>

<p>The <a href="https://www.kaggle.com/competitions/march-machine-learning-mania-2026">Kaggle March Machine Learning Mania competition</a> is a good proving ground. It is time-boxed, the data is clean, the scoring is well-defined, and the competition is deep. This year there were 3,485 entries, which is not a trivial field. There are serious data scientists in that pool. People who do this professionally, people with larger compute budgets, people with more elegant solutions than mine. Finishing in the top 1% means something.</p>

<p>Well, it does to me, at least.</p>

<p>It also means I have to write this post, now, because when I finish 2000th next year I will want people to forget I ever mentioned it.</p>

<h2>The Approach</h2>

<p>Let me walk you through how I built this, and more importantly, why I made the decisions I did.</p>

<p>The competition asks you to submit a prediction for every possible matchup between two teams in both the Men's and Women's tournaments. The submission covers every possible pairing of teams, not just the 64 that make each bracket, which is how it ends up with 132,133 entries. Each entry has a matchup identifier and a probability between 0 and 1, representing the likelihood that the team with the lower team ID wins the game, should those two teams meet. You do not pick teams to advance; you estimate the probability of each team winning any given game. Honestly, this is more fun than a traditional bracket challenge.</p>

<p>The competition is scored using the Brier score, which measures the accuracy of probabilistic predictions. Lower is better. A perfect model scores 0. A model predicting 50/50 for every game scores 0.25. My final score was 0.1203942, which was the average of the two results for Men (0.1416504) and Women (0.0991382).</p>
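<p>The Brier score itself is just the mean squared error between your predicted probabilities and the observed 0/1 outcomes. A minimal sketch (the function name is mine, not the competition's code):</p>

```python
import numpy as np

def brier_score(predicted_probs, outcomes):
    """Mean squared error between predicted win probabilities
    and the observed 0/1 outcomes. Lower is better."""
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((predicted_probs - outcomes) ** 2)

# A coin-flip model scores 0.25 regardless of what actually happens.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```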

<p>The easiest approach is to train a model, output some predictions, and call it a day. Some people do exactly that and get decent results. The problem is simple: a model trained carelessly will learn things which are not true, and you only find out after the games are played.</p>

<p>A few deliberate choices made the difference for me.</p>

<h3>Symmetric Training Data</h3>

<p>Every game appears twice in my training data — once with Team A listed first, once with Team B. This doubled the training data and eliminated perspective bias. I also focused exclusively on differentials rather than raw values. For example, instead of using each team's average points scored, I used the scoring gap between the two teams. This gives the model the right context to find meaningful patterns in a head-to-head matchup.</p>
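<p>A minimal sketch of the idea, with hypothetical column names (team_a, score_diff, and so on are illustrative, not my actual schema):</p>

```python
import pandas as pd

def symmetrize(games: pd.DataFrame) -> pd.DataFrame:
    """Duplicate each game with the two teams' roles swapped, so the
    model learns nothing from which team happens to be listed first.
    Features are stored as differentials (team A minus team B), so
    swapping just flips their sign, and the label flips with it."""
    flipped = games.copy()
    flipped[["team_a", "team_b"]] = games[["team_b", "team_a"]].values
    flipped["score_diff"] = -games["score_diff"]     # example differential feature
    flipped["team_a_won"] = 1 - games["team_a_won"]  # label flips too
    return pd.concat([games, flipped], ignore_index=True)

games = pd.DataFrame({
    "team_a": [1101], "team_b": [1202],
    "score_diff": [7.5], "team_a_won": [1],
})
print(symmetrize(games))
```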

<h3>Walk-Forward Validation</h3>

<p>Standard cross-validation does not work here. If you train on 2024 data and validate on 2022, the model has already seen the future. Walk-forward validation trains on all data before a given season and validates on that season only — which mirrors exactly what the prediction task requires. I validated across six seasons: 2019, 2021, 2022, 2023, 2024, and 2025, skipping 2020 because the tournament was cancelled that year for reasons I am sure everyone has forgotten by now. Recent seasons were weighted more heavily since the game evolves year to year.</p>
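<p>The scheme is simple to express in code. The season list and holdout years below mirror the ones described above:</p>

```python
def walk_forward_splits(seasons, holdout_seasons):
    """Yield (train, validate) splits: for each holdout season, train
    only on seasons strictly before it. The model never sees the
    future it is asked to predict."""
    for season in holdout_seasons:
        train = [s for s in seasons if s < season]
        yield train, season

seasons = list(range(2003, 2026))
seasons.remove(2020)  # no tournament that year
for train, val in walk_forward_splits(seasons, [2019, 2021, 2022, 2023, 2024, 2025]):
    print(f"validate on {val}: train on {len(train)} earlier seasons")
```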

<h3>Feature Engineering</h3>

<p>All features were computed as differentials — Team A minus Team B — to keep the model focused on relative team quality. Key features included:</p>

<ul>
<li>MOV-adjusted Elo, using a FiveThirtyEight-style margin-of-victory correction</li>
<li>Pythagorean win percentage and last-10-game momentum</li>
<li>Strength of schedule, consistency, and volatility</li>
<li>Seed matchup win rates — historical, recent 5-year, and recent 10-year windows</li>
<li>Multi-dimensional performance gap analysis across key seasonal metrics</li>
</ul>

<p>None of these features are secret. What matters is how they work together to describe the relative quality and momentum of two teams heading into a neutral-site elimination game.</p>
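<p>As an example, the MOV-adjusted Elo can be sketched as follows. The multiplier is the FiveThirtyEight-style correction mentioned above; the constants (3, 7.5, 0.006) and the K factor are illustrative defaults, not necessarily the exact values from my pipeline:</p>

```python
def mov_multiplier(margin, winner_elo_diff):
    """FiveThirtyEight-style margin-of-victory multiplier: bigger wins
    move ratings more, damped when the favorite wins as expected
    (winner_elo_diff = winner's rating minus loser's)."""
    return ((margin + 3) ** 0.8) / (7.5 + 0.006 * winner_elo_diff)

def update_elo(winner, loser, margin, k=32.0):
    """One Elo update with the MOV correction applied to K."""
    expected_win = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    shift = k * mov_multiplier(margin, winner - loser) * (1.0 - expected_win)
    return winner + shift, loser - shift

# A 20-point upset moves ratings far more than a 2-point favorite win.
print(update_elo(1500, 1600, margin=20))
print(update_elo(1600, 1500, margin=2))
```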

<h2>Where the Magic Happened</h2>

<p>Five models. One ensemble. And a lot of compute time.</p>

<p>I trained five models independently: LightGBM, XGBoost, HistGradientBoosting, Logistic Regression, and a Neural Network. Each model saw the same features and the same walk-forward validation scheme. The question was not which model to use but rather how much to trust each one.</p>

<p>Rather than averaging the predictions equally, I used <a href="https://optuna.org/">Optuna</a> to optimize the ensemble weights separately for men and women, minimizing Brier score on the walk-forward predictions. The results were illuminating.</p>
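<p>The weight search itself is a small constrained optimization: find non-negative weights summing to 1 that minimize the blended Brier score on held-out games. I used Optuna for this; the sketch below swaps in scipy's SLSQP solver to stay dependency-light, and the toy "models" are synthetic stand-ins for the real walk-forward predictions:</p>

```python
import numpy as np
from scipy.optimize import minimize

def fit_ensemble_weights(model_preds, outcomes):
    """Find non-negative weights (summing to 1) that minimize the
    Brier score of the blended prediction on held-out games."""
    model_preds = np.asarray(model_preds)  # shape: (n_models, n_games)
    outcomes = np.asarray(outcomes, dtype=float)

    def brier(w):
        blend = w @ model_preds
        return np.mean((blend - outcomes) ** 2)

    n = model_preds.shape[0]
    result = minimize(
        brier, x0=np.full(n, 1.0 / n), method="SLSQP",
        bounds=[(0.0, 1.0)] * n,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    )
    return result.x

# Two toy models: one roughly calibrated, one pure noise.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, 200)
good = np.clip(truth * 0.8 + 0.1 + rng.normal(0, 0.05, 200), 0, 1)
bad = rng.uniform(0, 1, 200)
weights = fit_ensemble_weights([good, bad], truth)
print(weights)  # nearly all the weight lands on the calibrated model
```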

<table>
<thead>
<tr><th>Model</th><th>Men</th><th>Women</th></tr>
</thead>
<tbody>
<tr><td>LightGBM</td><td>58.4%</td><td>34.0%</td></tr>
<tr><td>Logistic Regression</td><td>21.9%</td><td>8.5%</td></tr>
<tr><td>XGBoost</td><td>18.4%</td><td>10.7%</td></tr>
<tr><td>Neural Network</td><td>0.1%</td><td>43.9%</td></tr>
<tr><td>HistGradientBoosting</td><td>1.3%</td><td>2.8%</td></tr>
</tbody>
</table>


<p>A few things stand out here.</p>

<p>LightGBM dominated the men's side at 58.4%. No surprise, as it was the best individual model throughout tuning. Logistic Regression contributed a meaningful 21.9%, which tells you something about the value of a simple linear model when your features are well constructed.</p>

<p>The Neural Network result is the most interesting story in the table. For men, Optuna effectively ignored it; a 0.1% weight is a rounding error. For women, it carried 43.9% of the load. This suggests the women's game has non-linear patterns the tree models simply do not capture. Whether this is due to differences in pace, efficiency, or seed dominance, I cannot say with certainty. But the data was clear.</p>

<p>HistGradientBoosting was essentially noise at 1-3% across both genders. Next year I would either force it to find a different signal or drop it entirely in favor of more tuning time on LightGBM.</p>

<p>The ensemble's walk-forward Brier score of 0.15515 beat every individual model. That is the point of an ensemble. Not to find the best single model, but to combine models in a way where their collective judgment outperforms any one of them alone.</p>

<h2>The Insight — Or, How I Learned to Stop Being Clever</h2>

<p>I played basketball. I coached basketball. And when I first entered this competition years ago I was certain my domain expertise would give me an edge over the data scientists who had never watched a college game in their lives.</p>

<p>I was wrong. Very, very wrong.</p>

<p>It turns out knowing basketball does not help you predict basketball outcomes any better than someone who has never seen a game, provided they know how to collect, transform, and analyze data properly. The sport has too much variance. The tournament has too much chaos. And the things I thought I knew, such as which teams were dangerous, which matchups favored underdogs, and when to trust a double-digit seed, were just noise dressed up as intuition.</p>

<p>The humbling moment came when I started looking at my predictions for upset-prone matchups. My instinct, informed by years of watching March Madness, was to adjust my model's predictions to account for the possibility of an upset. If my model said a 1-seed had a 93% chance of winning, and I knew from experience upsets happen, why not nudge that number down a bit? Give the 9-seed a fighting chance in the predictions, just like they sometimes get in real life!</p>

<p>Remember the MS in mathematics I have? Yeah, so I did the math on this idea.</p>

<p>It turns out this instinct, however well-intentioned, is mathematically guaranteed to make your Brier score worse. Every time. Without exception. I wrote a short explanation of this on Kaggle which you can read <a href="https://www.kaggle.com/competitions/march-machine-learning-mania-2026/discussion/684446">here</a>, but the short version is this: for any well-calibrated model, the expected cost of adjusting your prediction away from the true probability is exactly (p̂ - p)² per game. That number is always greater than zero. You are always making things worse. It does not matter if you are trying to account for an upset, or injuries, or anything. If you alter the model predictions after they are output, you hurt yourself in expectation, every single time.</p>
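<p>The identity is easy to verify numerically. For a game with true win probability p, predicting q costs p(1-q)² + (1-p)q² in expectation, which expands to the irreducible p(1-p) plus a penalty of exactly (p̂ - p)²:</p>

```python
def expected_brier(q, p):
    """Expected Brier contribution of predicting q when the true win
    probability is p: p*(1-q)**2 + (1-p)*q**2, which expands to the
    irreducible p*(1-p) plus a penalty of (p - q)**2."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p = 0.93                          # calibrated probability for the 1-seed
honest = expected_brier(p, p)     # predict what the model says
nudged = expected_brier(0.85, p)  # "give the 9-seed a chance"
print(round(nudged - honest, 6))  # penalty = (0.93 - 0.85)**2 = 0.0064
```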

<p>But the lesson was not just about upset predictions. It was bigger than that. Being a subject matter expert is not enough. Sure, it helps you ask better questions and interpret results more intelligently. But it is no substitute for knowing how to build models properly. Once I stopped trying to be clever and started trusting the data, my results improved.</p>

<p>Domain expertise and analytical rigor are not competing advantages. They are complementary ones. It just took me a few years of finishing in the middle of the pack to figure that out.</p>

<h2>A Word on Tools</h2>

<p>I used large language models throughout this project. For coding assistance, debugging, and as a sounding board when I was working through methodology decisions. I am not going to pretend otherwise, and I do not think there is anything to be ashamed of in saying so. Kaggle themselves published a starter notebook this year built around Gemini API calls, so the use of LLMs in this competition was hardly a secret.</p>

<p>The key is knowing enough to have a productive conversation. I shared context about my data, my validation scheme, and my methodology decisions. In return the models offered approaches I had not considered, caught errors I had missed, and helped me think through problems from angles I would not have found on my own. That is not a workflow where the human is in charge and the AI is a servant. It is a genuine collaboration, and knowing when to trust the output versus when to push back is itself a skill.</p>

<p>That skill comes from experience. Graduate-level coursework in statistics, machine learning, and data engineering gave me the foundation to evaluate what the tools were telling me. I knew when the output was good, when it was plausible but wrong, and when to ask again.</p>

<p>LLMs did not finish 21st at Kaggle. I did. The tools helped me get there faster.</p>

<h2>Lessons Learned and What's Next</h2>

<p>A few things I want to do differently next year.</p>

<p>HistGradientBoosting earned 1-3% ensemble weight across both genders. That is not a model contributing signal, that is a model taking up compute time. Next year it gets dropped or rebuilt from scratch with a fundamentally different approach. Five models sounds impressive. Four well-tuned models beat five redundant ones every time.</p>

<p>Massey Ordinals are sitting in the Kaggle data and I never used them. They represent a rich source of cross-season team quality rankings compiled by people who think about this more carefully than I do. That is low-hanging fruit I plan to pick next year.</p>

<p>Multiple seeds per model is something I want to explore. The 10th place solution this year used six random seeds per model and averaged the results. If LightGBM is performing well, how much of that is the model and how much is a fortunate random seed? Running multiple seeds adds diversity without adding new models, and the compute cost is manageable.</p>
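<p>The mechanics of seed averaging are trivial; the payoff is variance reduction. A toy sketch, where the "model" is a synthetic stand-in that adds seed-dependent noise to the true probabilities, mimicking run-to-run training variance:</p>

```python
import numpy as np

def seed_averaged_preds(train_and_predict, seeds):
    """Train the same model once per random seed and average the
    predicted probabilities: cheap ensemble diversity without
    adding a new model family."""
    return np.mean([train_and_predict(seed) for seed in seeds], axis=0)

true_probs = np.linspace(0.1, 0.9, 50)

def noisy_model(seed):
    """Stand-in for one training run: truth plus seed-dependent noise."""
    rng = np.random.default_rng(seed)
    return np.clip(true_probs + rng.normal(0, 0.08, true_probs.size), 0, 1)

single = np.mean((noisy_model(0) - true_probs) ** 2)
averaged = np.mean((seed_averaged_preds(noisy_model, range(6)) - true_probs) ** 2)
print(single > averaged)  # averaging across seeds shrinks the noise
```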

<p>I also want to incorporate Polymarket data. Prediction markets aggregate the collective judgment of people with money on the line, which is a different and potentially complementary signal to anything a model trained on box scores can produce. Whether that signal survives feature selection is an open question, but it is worth finding out.</p>

<p>And the upset injection proof is now a permanent part of my toolkit. Not because I needed a mathematical proof to tell me to trust my model, but because having one means I will never second-guess it again in the heat of the tournament. There is something freeing about knowing the math is on your side.</p>

<p>21st place is a good result. It is not a finished result. I will be back next March with better features, a cleaner ensemble, and one fewer model nobody needed.</p>]]></content><author><name></name></author><category term="Data Analytics" /><summary type="html"><![CDATA[Every March, data scientists and sports fans collide on Kaggle. This year, I came out ahead of 3,464 of them. 21st place out of 3,485 entries. Top 1%. I'll take it. This is how I did it. Why Am I Even Doing This? I have an MS in Mathematics, way before some hipster decided to rename "statistics" as "data science". I spent the better part of a decade as a database administrator before deciding I wanted to do more with data than store it. About ten years ago I started consuming and learning all things data science, and lurking around other data folks at places like Kaggle. As you read this I am finishing my second MS degree, this one in Data Analytics from Georgia Tech, I graduate in a few weeks. Now, I am not saying I spent $10,000+ and three-plus years of my life just to get better at Kaggle competitions. But, I am not not saying that either. The Kaggle March Machine Learning Mania competition is a good proving ground. It is time-boxed, the data is clean, the scoring is well-defined, and the competition is deep. This year there were 3,485 entries; not a trivial field. There are serious data scientists in that pool. People who do this professionally, people with larger compute budgets, people with more elegant solutions than mine. Finishing in the top 1% means something. Well, it does to me, at least It also means I have to write this post, now, because when I finish 2000th next year I will want people to forget I ever mentioned it. The Approach Let me walk you through how I built this, and more importantly, why I made the decisions I did. The competition asks you to submit a prediction for every possible matchup between two teams in both the Men's and Women's tournaments. 
With 64 teams in each field, the submission contains 132,133 entries. Each entry has a team identifier and a probability between 0 and 1, representing the likelihood that the lower-seeded team ID wins the game, should these two teams meet. You do not pick teams to advance, you measure their probability of winning any given game. Honestly, this is more fun than a traditional bracket challenge. The competition is scored using the Brier score, which measures the accuracy of probabilistic predictions. Lower is better. A perfect model scores 0. A model predicting 50/50 for every game scores 0.25. My final score was 0.1203942, which was the average of the two results for Men (0.1416504) and Women (0.0991382). The easiest approach is to train a model, output some predictions, and call it a day. Some people do exactly that and get decent results. The problem is simple: a model trained carelessly will learn things which are not true, and you only find out after the games are played. A few deliberate choices made the difference for me. Symmetric Training Data Every game appears twice in my training data — once with Team A listed first, once with Team B. This doubled the training data and eliminated perspective bias. I also focused exclusively on differentials rather than raw values. For example, instead of using each team's average points scored, I used the scoring gap between the two teams. This gives the model the right context to find meaningful patterns in a head-to-head matchup. Walk-Forward Validation Standard cross-validation does not work here. If you train on 2024 data and validate on 2022, the model has already seen the future. Walk-forward validation trains on all data before a given season and validates on that season only — which mirrors exactly what the prediction task requires. I validated across six seasons: 2019, 2021, 2022, 2023, 2024, and 2025, skipping 2020 because the tournament was cancelled that year for reasons I am sure everyone has forgotten by now. 
Recent seasons were weighted more heavily since the game evolves year to year. Feature Engineering All features were computed as differentials — Team A minus Team B — to keep the model focused on relative team quality. Key features included: MOV-adjusted ELO, using a FiveThirtyEight-style margin-of-victory correction Pythagorean win percentage and last-10-game momentum Strength of schedule, consistency, and volatility Seed matchup win rates — historical, recent 5-year, and recent 10-year windows Multi-dimensional performance gap analysis across key seasonal metrics None of these features are secret. What matters is how they work together to describe the relative quality and momentum of two teams heading into a neutral-site elimination game. Where the Magic Happened Five models. One ensemble. And a lot of compute time. I trained five models independently: LightGBM, XGBoost, HistGradientBoosting, Logistic Regression, and a Neural Network. Each model saw the same features and the same walk-forward validation scheme. The question was not which model to use but rather how much to trust each one. Rather than averaging the predictions equally, I used Optuna to optimize the ensemble weights separately for men and women, minimizing Brier score on the walk-forward predictions. The results were illuminating. ModelMenWomen LightGBM58.4%34.0% Logistic Regression21.9%8.5% XGBoost18.4%10.7% Neural Network0.1%43.9% HistGradientBoosting1.3%2.8% &lt;/br&gt; A few things stand out here. LightGBM dominated the men's side at 58.4%. No surprise, as it was the best individual model throughout tuning. Logistic Regression contributed a meaningful 21.9%, which tells you something about the value of a simple linear model when your features are well constructed. The Neural Network result is the most interesting story in the table. For men, Optuna effectively ignored it, 0.1% weight is a rounding error. For women, it carried 43.9% of the load. 
This suggests the women's game has non-linear patterns the tree models simply do not capture. Whether this is due to differences in pace, efficiency, or seed dominance, I cannot say with certainty. But the data was clear. HistGradientBoosting was essentially noise at 1-3% across both genders. Next year I would either force it to find a different signal or drop it entirely in favor of more tuning time on LightGBM. The ensemble Brier score of 0.15515 beat every individual model. That is the point of an ensemble. Not to find the best single model, but to combine models in a way where their collective judgment outperforms any one of them alone. The Insight — Or, How I Learned to Stop Being Clever I played basketball. I coached basketball. And when I first entered this competition years ago I was certain my domain expertise would give me an edge over the data scientists who had never watched a college game in their lives. I was wrong. Very, very wrong. It turns out knowing basketball does not help you predict basketball outcomes any better than someone who has never seen a game, provided they know how to collect, transform, and analyze data properly. The sport has too much variance. The tournament has too much chaos. And the things I thought I knew, which teams were dangerous, which matchups favored underdogs, and when to trust a double-digit seed was just noise dressed up as intuition. The humbling moment came when I started looking at my predictions for upset-prone matchups. My instinct, informed by years of watching March Madness, was to adjust my model's predictions to account for the possibility of an upset. If my model said a 1-seed had a 93% chance of winning, and I knew from experience upsets happen, why not nudge that number down a bit? Give the 9-seed a fighting chance in the predictions, just like they sometimes get in real life! Remember the MS in mathematics I have? Yeah, so I did the math on this idea. 
It turns out this instinct, however well-intentioned, is mathematically guaranteed to make your Brier score worse. Every time. Without exception. I wrote a short explanation of this on Kaggle which you can read here, but the short version is this: for any well-calibrated model, the expected cost of adjusting your prediction away from the true probability is exactly (p̂ - p)² per game. That number is always greater than zero. You are always making things worse. It does not matter if you are trying to account for an upset, or injuries, or anything. If you alter the model predicitons after they are output, you hurt yourself more often than not. But the lesson was not just about upset predictions. It was bigger than that. Being a subject matter expert is not enough. Sure, it helps you ask better questions and interpret results more intelligently. But it is no substitute for knowing how to build models properly. Once I stopped trying to be clever and started trusting the data, my results improved. Domain expertise and analytical rigor are not competing advantages. They are complementary ones. It just took me a few years of finishing in the middle of the pack to figure that out. A Word on Tools I used large language models throughout this project. For coding assistance, debugging, and as a sounding board when I was working through methodology decisions. I am not going to pretend otherwise, and I do not think there is anything to be ashamed of in saying so. Kaggle themselves published a starter notebook this year built around Gemini API calls, so the use of LLMs in this competition was hardly a secret. The key is knowing enough to have a productive conversation. I shared context about my data, my validation scheme, and my methodology decisions. In return the models offered approaches I had not considered, caught errors I had missed, and helped me think through problems from angles I would not have found on my own. 
That is not a workflow where the human is in charge and the AI is a servant. It is a genuine collaboration, and knowing when to trust the output versus when to push back is itself a skill. That skill comes from experience. Graduate-level coursework in statistics, machine learning, and data engineering gave me the foundation to evaluate what the tools were telling me. I knew when the output was good, when it was plausible but wrong, and when to ask again. LLMs did not finish 21st at Kaggle. I did. The tools helped me get there faster. Lessons Learned and What's Next A few things I want to do differently next year. HistGradientBoosting earned 1-3% ensemble weight across both genders. That is not a model contributing signal, that is a model taking up compute time. Next year it gets dropped or rebuilt from scratch with a fundamentally different approach. Five models sounds impressive. Four well-tuned models beat five redundant ones every time. Massey Ordinals are sitting in the Kaggle data and I never used them. They represent a rich source of cross-season team quality rankings compiled by people who think about this more carefully than I do. That is low-hanging fruit I plan to pick next year. Multiple seeds per model is something I want to explore. The 10th place solution this year used six random seeds per model and averaged the results. If LightGBM is performing well, how much of that is the model and how much is a fortunate random seed? Running multiple seeds adds diversity without adding new models, and the compute cost is manageable. I also want to incorporate Polymarket data. Prediction markets aggregate the collective judgment of people with money on the line, which is a different and potentially complementary signal to anything a model trained on box scores can produce. Whether that signal survives feature selection is an open question, but it is worth finding out. And the upset injection proof is now a permanent part of my toolkit. 
Not because I needed a mathematical proof to tell me to trust my model, but because having one means I will never second-guess it again in the heat of the tournament. There is something freeing about knowing the math is on your side. 21st place is a good result. It is not a finished result. I will be back next March with better features, a cleaner ensemble, and one fewer model nobody needed.]]></summary></entry><entry><title type="html">Microsoft Fabric is the New Office</title><link href="https://sqlrockstar.github.io/2024/07/microsoft-fabric-is-the-new-office/" rel="alternate" type="text/html" title="Microsoft Fabric is the New Office" /><published>2024-07-11T12:35:38-04:00</published><updated>2024-07-11T12:35:38-04:00</updated><id>https://sqlrockstar.github.io/2024/07/microsoft-fabric-is-the-new-office</id><content type="html" xml:base="https://sqlrockstar.github.io/2024/07/microsoft-fabric-is-the-new-office/"><![CDATA[<link rel="icon" type="image/x-icon" href="/gravatar.png" />

<p>At Microsoft Build in 2023 the world first heard about a new offering from Microsoft called <a href="https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview?WT.mc_id=twitter&amp;sharingId=DP-MVP-4025219" target="_blank" rel="noreferrer noopener">Microsoft Fabric</a>. Reactions to the announcement ranged from “meh” to “what is this?” To be fair, this is the typical reaction most people have when you talk data with them.</p>
<p>Many of us had no idea what to make of Fabric. To me, it seemed as if Microsoft was doing a rebranding of sorts: a new name for Azure Synapse Analytics (the Dedicated SQL Pool, previously known as Azure SQL Data Warehouse). Microsoft excels (ha!) at renaming products every 18 months, leaving customers to wonder whether anyone is actually leading product marketing.</p>
<p>Microsoft Fabric also came with this thing called <a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview?WT.mc_id=twitter&amp;sharingId=DP-MVP-4025219" target="_blank" rel="noreferrer noopener">OneLake</a>, a place for all your company data. Folks with an eye on data security, privacy, and governance thought the idea of OneLake was madness. The idea of combining all your company data into one big bucket seemed like a lot of administrative overhead. But OneLake also offers a way to separate storage and compute, allowing for greater scalability. This is a must-have when you are competing with companies like Databricks and Snowflake, and other cloud service providers such as AWS and Google.</p>
<h2 class="wp-block-heading">After Some Thought...</h2>
<p>After the dust had settled and time passed, the launch and concept of Fabric started to make more sense. For the past 15+ years, Microsoft has been building the individual pieces of Fabric. Here’s a handful of features and services Fabric contains:</p>
<ul> <li>Data Warehouse/Lakehouse – the storing of large volumes of structured and unstructured data in OneLake, which separates storage and compute</li>   <li>Real-time analytics – the ability to stream data into OneLake, or pull data from external sources such as Snowflake</li>   <li>Data Engineering – the ability to extract, load, and transform data, including the use of notebooks</li>   <li>Data Science – leverage machine learning to gain insights from your data</li>   <li>Power BI – create interactive reports and dashboards</li> </ul>
<p>Many of these services were built to support traditional data storage, retrieval, and analytical processing. This type of data processing focuses on data at rest, as opposed to streaming event data. This is not to say you couldn't use these services for streaming; you could try if you wanted. After all, the building blocks for real-time analytics go back to SQL Server 2008, with the release of <a href="https://learn.microsoft.com/en-us/archive/msdn-magazine/2012/march/microsoft-streaminsight-building-the-internet-of-things?WT.mc_id=twitter&amp;sharingId=DP-MVP-4025219" target="_blank" rel="noreferrer noopener">StreamInsight</a>, a fancy way to build pipelines for refreshing dashboards with up-to-date data.</p>
<p>Streaming event data is where the real data race is taking place today. According to IDC, by 2025 <a href="https://www.zdnet.com/article/by-2025-nearly-30-percent-of-data-generated-will-be-real-time-idc-says/">nearly 30% of data generated will need real-time processing</a>. This is the market Microsoft, among others, is targeting, and it is roughly 54 ZB in size.</p>
<figure class="wp-block-image size-large"><img src="https://sqlrockstar.github.io/wp-content/uploads/2024/07/image-600x389.png" alt="" class="wp-image-29270" /></figure>
<p>So, it seems the more data that is collected, the more of it will be used for real-time processing. Therefore, if you are a cloud company, it is rather important to your bottom line to make it easy for your customers to store their data in your cloud. The next best thing, of course, is making it easy for your customers to use your tools and services to work with data stored elsewhere. This is part of the brilliance of Fabric, as it allows easy access to real-time data you are already using in places like Databricks, Confluent, and Snowflake.</p>
<h2 class="wp-block-heading">The Bundle</h2>
<p>Now, if you are Microsoft, with a handful of data services ready to meet the needs of a growing market, you have some choices to make. You could continue to do what you have done for 15+ years and keep selling individual products and services and hope you earn some of the market going forward. Or you could bundle the products and services, unifying them into one platform, and make it easy for users to ingest, transform, analyze, and report on their data.</p>
<p>Well, if you want to gain market share, bundling makes the most sense. And Microsoft is uniquely positioned to pull this off for two reasons. First, they have a comprehensive data platform which is second to none. Sure, you can point to other companies who might do one of those services better, but there is no company on Earth, or in the Cloud, which offers a complete end-to-end data platform like Fabric.</p>
<p>Second, bundling software is something Microsoft has a history of doing, and doing it quite well in some cases. People reading this post in 2024 may not be old enough to recall a time when you purchased individual software products like Excel and Word. But I do recall the time before Microsoft Office existed. Bundling everything into Fabric allows users to work with their data anywhere and, most importantly to Microsoft’s bottom line, the result is more data flowing to Azure servers.</p>
<p>I am not here to tell you everything is perfect with Fabric. In the past year I have seen a handful of negative comments about Fabric, most of them nitpicking about things like brand names, data type support, and file formats. There is always going to be a person upset about how Widget X isn’t the Most Perfect Thing For Them at This Moment and They Need to Tell the World. I think most people believe a product, even one marked as “Preview”, should meet the demands of every possible user the moment it is released. That expectation is just not practical.</p>
<h2 class="wp-block-heading">Summary</h2>
<p>At Build this year, Microsoft announced Fabric is now generally available (GA), which only reinforces the belief it should meet the demands of every possible user. The fastest way for Microsoft to grab as much market share as possible is to focus on the customer experience and remove the remaining barriers. You can find roadmap details <a href="https://learn.microsoft.com/en-us/fabric/release-plan/overview?WT.mc_id=twitter&amp;sharingId=DP-MVP-4025219" target="_blank" rel="noreferrer noopener">here</a>, giving you an idea about the effort going on behind the scenes with Fabric today. For example, for everyone who has raised issues with security and governance, you can see the list of what has shipped and what is planned <a href="https://learn.microsoft.com/en-us/fabric/release-plan/admin-governance?WT.mc_id=twitter&amp;sharingId=DP-MVP-4025219" target="_blank" rel="noreferrer noopener">here</a>.</p>
<p>It is clear Microsoft is investing in Fabric, much like they invested in Office 30+ years ago. If there is one thing Microsoft knows how to do, it is creating value for shareholders:</p>
<figure class="wp-block-image size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/07/image-1-600x513.png" alt="" class="wp-image-29271" /></figure>
<p>Since the announcement of Fabric last May, Microsoft stock is up over 25%. I am not going to say the increase is the direct result of Fabric. What I am saying is Microsoft might have an idea about what they are doing, and why.</p>
<p>Microsoft Fabric is the new Office – it is a bundle of data products, meant to boost productivity for data professionals and dominate the data analytics landscape. Much in the same way Office dominates the business world.</p>]]></content><author><name></name></author><category term="Azure" /><category term="Data Analytics" /><category term="SQL MVP" /><summary type="html"><![CDATA[At Microsoft Build in 2023 the world first heard about a new offering from Microsoft called Microsoft Fabric. Reactions to the announcement ranged from “meh” to “what is this?” To be fair, this is the typical reaction most people have when you talk data with them. Many of us had no idea what to make of Fabric. To me, it seemed as if Microsoft were doing a rebranding of sorts. They changed the name of Azure Synapse Analytics, also called a Dedicated SQL Pool, and previously known as Azure SQL Data Warehouse. Microsoft excels (ha!) at renaming products every 18 months, keeping customers guessing if anyone is leading product marketing. Microsoft Fabric also came with this thing called OneLake, a place for all your company data. Folks with an eye on data security, privacy, and governance thought the idea of OneLake was madness. The idea of combining all your company data into one big bucket seemed like a lot of administrative overhead. But OneLake also offers a way to separate storage and compute, allowing for greater scalability. This is a must-have when you are competing with companies like Databricks and Snowflake, and other cloud service providers such as AWS and Google. After Some Thought... After the dust had settled and time passed, the launch and concept of Fabric started to make more sense. For the past 15+ years, Microsoft has been building the individual pieces of Fabric. 
Here’s a handful of features and services Fabric contains: Data Warehouse/Lakehouse – the storing of large volumes of structured and unstructured data in OneLake, which separates storage and compute Real-time analytics – the ability to stream data into OneLake, or pull data from external sources such as SnowFlake Data Engineering – the ability to extract, load, and transform data including the use of notebooks Data Science – leverage machine learning to gain insights from your data PowerBI – create interactive reports and dashboards Many of these services were built to support traditional data storage, retrieval, and analytical processing. This type of data processing focuses on data at rest, as opposed to streaming event data. This is not to say you couldn’t use these services for streaming, you could try if you wanted. After all, the building blocks for real-time analytics go back to SQL Server 2008, with the release of StreamInsight, a fancy way to build pipelines for refreshing dashboards with up to date data. Streaming event data is where the real data race is taking place today. According to the IDC, by 2025 nearly 30% of data will need real-time processing. This is the market Microsoft, among others, is targeting, which is roughly 54 ZB in size. So, it seems the more data collected, the more likely it is used for real-time processing. Therefore, if you are a cloud company, it is rather important to your bottom line to find a way to make it easy for your customers to store their data in your cloud. The next best thing, of course, is making it easy for your customers to use your tools and services to work with data stored elsewhere. This is part of the brilliance of Fabric, as it allows ease of access to real time data you are already using in places like Databricks, Confluent, and Snowflake. The Bundle Now, if you are Microsoft, with a handful of data services ready to meet the needs of a growing market, you have some choices to make. 
You could continue to do what you have done for 15+ years and keep selling individual products and services and hope you earn some of the market going forward. Or you could bundle the products and services, unifying them into one platform, and make it easy for users to ingest, transform, analyze, and report on their data. Well, if you want to gain market share, bundling makes the most sense. And Microsoft is uniquely positioned to pull this off for two reasons. First, they have a comprehensive data platform which is second to none. Sure, you can point to other companies who might do one of those services better, but there is no company on Earth, or in the Cloud, which offers a complete end-to-end data platform like Fabric. Second, bundling software is something Microsoft has a history of doing, and doing it quite well in some cases. People reading this post in 2024 may not be old enough to recall a time when you purchased individual software products like Excel and Word. But I do recall the time before Microsoft Office existed. Bundling everything into Fabric allows users to work with their data anywhere and, most importantly to Microsoft’s bottom line, the result is more data flowing to Azure servers. I am not here to tell you everything is perfect with Fabric. In the past year I have seen a handful of negative comments about Fabric, most of them nitpicking about things like brand names, data type support, and file formats. There is always going to be a person upset about how Widget X isn’t the Most Perfect Thing For Them at This Moment and They Need to Tell the World. I think most people believe when a product is released, even if it is marked as “Preview”, it should be able to meet the demands of every possible user. It is just not practical. Summary Microsoft Fabric was announced at Build this year to be GA, which also makes users believe it should meet the demands of every possible user. 
The fastest way for Microsoft to grab as much market share as possible is to focus on the customer experience and remove those barriers. You can find roadmap details here, giving you an idea about the effort going on behind the scenes with Fabric today. For example, for everyone who has raised issues with security and governance, you can see the list of what has shipped and what is planned here. It is clear Microsoft is investing in Fabric, much like they invested in Office 30+ years ago. If there is one thing Microsoft knows how to do, it is creating value for shareholders: Since the announcement of Fabric last May, Microsoft is up over 25%. I am not going to say the increase is the direct result of Fabric. What I am saying is Microsoft might have an idea about what they are doing, and why. Microsoft Fabric is the new Office – it is a bundle of data products, meant to boost productivity for data professionals and dominate the data analytics landscape. Much in the same way Office dominates the business world.]]></summary></entry><entry><title type="html">Book Review: The AI Playbook</title><link href="https://sqlrockstar.github.io/2024/02/book-review-the-ai-playbook/" rel="alternate" type="text/html" title="Book Review: The AI Playbook" /><published>2024-02-27T03:38:58-05:00</published><updated>2024-02-27T03:38:58-05:00</updated><id>https://sqlrockstar.github.io/2024/02/book-review-the-ai-playbook</id><content type="html" xml:base="https://sqlrockstar.github.io/2024/02/book-review-the-ai-playbook/"><![CDATA[<p>Imagine you conceive an idea which will save your company millions of dollars, reduce workplace injuries, and increase sales. Now imagine company executives dislike the idea because it seems difficult to implement, and the implementation details are not well understood. Despite the stated benefits of saving money, reducing injuries, and increasing sales your idea hits a brick wall and falls flat.</p>
<p>Welcome to the world of artificial intelligence (AI) and machine learning (ML), where the struggle is real.</p>
<p>At some point in your career, you have experienced a failed project. If not, don’t worry, you will. Projects fail for all sorts of reasons. Unclear objectives. Unrealistic expectations. Poor planning. Lack of resources. Scope creep. Just to name a few of the more common reasons.</p>
<p>When it comes to projects with AI/ML at the core, all those same reasons apply, plus a few new ones. AI/ML is perhaps the most important piece of general-purpose technology today, which means we are bombarded with AI/ML solutions to solve random or ill-defined problems in much the same way we are bombarded by blockchain solutions for tracking fruit trucks or <a href="https://www.digitaltrends.com/computing/dentacoin-bitcoin-for-dentistry-patients/" target="_blank" rel="noreferrer noopener">visiting the dentist</a>.</p>
<p>The overhype of AI/ML has left people skeptical regarding the promises made through project proposals. Even if you manage to get a project funded, the initial results produced by your model may be difficult to explain, leading to apprehension about deploying solutions which cannot be understood. Nobody wants to blindly follow the decisions and predictions produced by machine learning models no one understands.</p>
<p>It is clear the business world needs a way to build, deploy, and maintain AI/ML models in a consistent manner, with a higher rate of success than failure, and completed on time and within budget.</p>
<h2 class="wp-block-heading" id="h-bizml" style="text-transform:none">bizML</h2>
<p>Thankfully, there exists a modern approach to AI/ML projects. It is called <a href="https://bizml.com" target="_blank" rel="noreferrer noopener">bizML</a>, and it is the core subject inside the new book by Dr. Eric Siegel – <a href="https://amzn.to/3uFR632" target="_blank" rel="noreferrer noopener"><em>The AI Playbook</em></a>.</p>
<p>For any project, not just AI/ML projects, to succeed there must be a rigorous and systematic approach for real-world deployments. Every successful project has similar characteristics - measurable goals, stakeholder involvement, risk management, resource allocation, fighting scope creep, effective communication, and monitoring project progress before, during, and after deployment.</p>
<p>The AI Playbook breaks this down into digestible sections for anyone with business experience to understand. It outlines bizML as a six-step process for guiding AI/ML projects from conception to deployment: define, measure, act, learn, iterate, and deploy. Using stories from familiar companies such as UPS, FICO, and various dot-coms, <a href="https://www.linkedin.com/in/predictiveanalytics/" target="_blank" rel="noreferrer noopener">Dr. Siegel</a> leans on his experience to help the reader understand how and why even the best ideas often fail.</p>
<p>I don’t want to give away the surprise ending, so I will just say the real secret behind bizML is starting with the end state in mind. Many projects fail because stakeholders are not aligned on the reality of deployment versus their expectations. bizML attempts to remove this roadblock by getting everyone aligned on what the end state will look like, and then building towards the agreed-upon state.</p>
<p>I read through the book in less than a couple of days, absorbing the material as fast as possible. The use of personal stories made it easier to read than a purely technical book focusing on code and examples. I cannot emphasize enough how this book is <strong>not a technical manual</strong>, but a business guide for business professionals, executives, managers, consultants, and anyone else wanting to learn how to capitalize on AI/ML tech and collaborate with data professionals.</p>
<h2 class="wp-block-heading" id="h-summary">Summary</h2>
<p>As AI/ML solutions continue to gain traction in the market, this book provides the right framework (bizML) for successful AI/ML deployments at the right time. Anyone, or any company, looking to deploy (or has deployed) AI/ML projects should buy copies of this book for all stakeholders.</p>
<p>I’m putting this onto my <a href="https://thomaslarock.com/sqlserverbooks/" target="_blank" rel="noreferrer noopener">bookshelf </a>and 15/10 would recommend.</p>]]></content><author><name></name></author><category term="Book Reviews" /><category term="Data Analytics" /><category term="Professional Development" /><summary type="html"><![CDATA[Imagine you conceive an idea which will save your company millions of dollars, reduce workplace injuries, and increase sales. Now imagine company executives dislike the idea because it seems difficult to implement, and the implementation details are not well understood. Despite the stated benefits of saving money, reducing injuries, and increasing sales your idea hits a brick wall and falls flat. Welcome to the world of artificial intelligence (AI) and machine learning (ML), where the struggle is real. At some point in your career, you have experienced a failed project. If not, don’t worry, you will. Projects fail for all sorts of reasons. Unclear objectives. Unrealistic expectations. Poor planning. Lack of resources. Scope creep. Just to name a few of the more common reasons. When it comes to projects with AI/ML at the core, all those same reasons apply, plus a few new ones. AI/ML is perhaps the most important piece of general-purpose technology today, which means we are bombarded with AI/ML solutions to solve random or ill-defined problems in much the same way we are bombarded by blockchain solutions for tracking fruit trucks or visiting the dentist. The overhype of AI/ML has left people skeptical regarding the promises made through project proposals. Even if you manage to get a project funded, the initial results produced by your model may be difficult to explain, leading to apprehension about deploying solutions which cannot be understood. Nobody wants to blindly follow the decisions and predictions produced by machine learning models no one understands. 
It is clear the business world needs a way to build, deploy, and maintain AI/ML models in a consistent manner, with a higher rate of success than failure, and completed on time and within budget. bizML Thankfully, there exists a modern approach to AI/ML projects. It is called bizML, and it is the core subject inside the new book by Dr. Eric Siegel – The AI Playbook. For any project, not just AI/ML projects, to succeed there must be a rigorous and systematic approach for real-world deployments. Every successful project has similar characteristics - measurable goals, stakeholder involvement, risk management, resource allocation, fighting scope creep, effective communication, and monitoring project progress before, during, and after deployment. The AI Playbook breaks this down into digestible sections for anyone with business experience to understand. It outlines bizML as a six-step process for guiding AI/ML projects from conception to deployment: define, measure, act, learn, iterate, and deploy. Using stories from familiar companies such as UPS, FICO, and various dot-coms, Dr. Siegel leans on his experience to help the reader understand how and why even the best ideas often fail. I don’t want to give away the surprise ending, so I will just say the real secret behind bizML is starting with the end state in mind. Many projects fail due to stakeholders not aligned with the reality of deployment versus expectations. bizML attempts to remove this roadblock by getting everyone aligned with what the end state will look like, and then build towards the agreed upon state. I read through the book in less than a couple of days, absorbing the material as fast as possible. The use of personal stories was easier to read as opposed to a purely technical book focusing on code and examples. 
I cannot emphasize enough how this book is not a technical manual, but a business guide for business professionals, executives, managers, consultants, and anyone else wanting to learn how to capitalize on AI/ML tech and collaborate with data professionals. Summary As AI/ML solutions continue to gain traction in the market, this book provides the right framework (bizML) for successful AI/ML deployments at the right time. Anyone, or any company, looking to deploy (or has deployed) AI/ML projects should buy copies of this book for all stakeholders. I’m putting this onto my bookshelf and 15/10 would recommend.]]></summary></entry><entry><title type="html">Export to CSV in Azure ML Studio</title><link href="https://sqlrockstar.github.io/2024/01/export-to-csv-in-azure-ml-studio/" rel="alternate" type="text/html" title="Export to CSV in Azure ML Studio" /><published>2024-01-17T10:48:02-05:00</published><updated>2024-01-17T10:48:02-05:00</updated><id>https://sqlrockstar.github.io/2024/01/export-to-csv-in-azure-ml-studio</id><content type="html" xml:base="https://sqlrockstar.github.io/2024/01/export-to-csv-in-azure-ml-studio/"><![CDATA[<p>The most popular feature in any application is an easy-to-find button saying "Export to CSV." If this button is not visibly available, a simple right-click of your mouse should present such an option. You really should not be forced to spend any additional time on this Earth looking for a way to export your data to a CSV file. </p>
<p>Well, in Azure ML Studio, exporting to a CSV file should be simple, but is not, unless you already know what you are doing and where to look. I was reminded of this recently, and decided to write a quick post in case a person new to ML Studio was wondering how to export data to a CSV file. </p>
<p>When you are working inside the ML Studio designer, it is likely you will want to export data or outputs from time to time. If you are starting from a blank template, the designer does not make it easy for you to know which module you need (<a href="https://thomaslarock.com/2024/01/azure-ml-studio-sample-data/" target="_blank" rel="noreferrer noopener">similar to my last post on finding sample data</a>). It would be great if Copilot were available!</p>
<p>Now, if you are similar to 99% of data professionals in the world, you will navigate to the section named Data Input and Output, because that’s what you are trying to do: export data from the designer. The description even says “Writes a dataset to…”, making it seem very clear what will happen.</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-data-input-and-output-416x600.png" alt="" class="wp-image-28535" /></figure>
<p>So, using <a href="https://thomaslarock.com/2024/01/azure-ml-studio-sample-data/" target="_blank" rel="noreferrer noopener">the imdb sample data</a>, we add a module to select all columns, then attach the module to the Export Data module. So easy!</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-export-data.png" alt="" class="wp-image-28536" /></figure>
<p>When you attach it, you need to configure some details for the module. Again, so easy!</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-export-data-config-526x600.png" alt="" class="wp-image-28537" /></figure>
<p>We save our configuration options and submit the job to run. When the job is complete, we navigate to view the dataset.</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-export-data-output.png" alt="" class="wp-image-28538" /></figure>
<p>Uh-oh, I was expecting a different set of options here. Viewing the log and various outputs does not reveal any CSV file either. Maybe I need to choose the select columns module:</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-select-columns-output-600x470.png" alt="" class="wp-image-28539" /></figure>
<p>Ah, that’s better. </p>
<p>Except it isn’t. Instead of showing me the location of the expected CSV file, what I find is this:</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-select-columns-output-2.png" alt="" class="wp-image-28540" /></figure>
<p>I can preview the data from the select columns module, but there isn’t a way to access the CSV file I was expecting. I suspect this export module is really meant to pass data between pipelines or services. But the purpose and description of the export module is not clear, and a novice user would be unhappy to head down this path only to be disappointed and frustrated. </p>
<p>What we really want to use here is the Convert to CSV module:</p>
<figure class="wp-block-image aligncenter size-full is-resized"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-convert-csv.png" alt="" class="wp-image-28541" style="width:490px;height:auto" /></figure>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-convert-csv-module-558x600.png" alt="" class="wp-image-28542" /></figure>
<p>Viewing the results will display this:</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-convert-csv-output.png" alt="" class="wp-image-28543" /></figure>
<p>Which has what we are looking for, a download button:</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-convert-csv-download-600x199.png" alt="" class="wp-image-28544" /></figure>
<p>Selecting Download will either use your browser’s default download settings, or you can do a Save As. </p>
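<p>If the designer modules feel like a scavenger hunt, a notebook inside the workspace offers a code-first alternative. This is a minimal sketch, not the designer's own mechanism: the toy DataFrame below stands in for the imdb sample data, and the file name is hypothetical. The only call it relies on is pandas' <code>to_csv</code>.</p>

```python
import pandas as pd

# Toy data standing in for the imdb sample dataset used in this post.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "rating": [7.5, 8.1],
})

# Write the CSV directly; index=False drops the row-number column.
df.to_csv("export.csv", index=False)

# Reading it back confirms the round trip.
roundtrip = pd.read_csv("export.csv")
print(roundtrip.shape)   # (2, 2)
```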
<p>As I wrote at the beginning of this post, exporting to a CSV file from within Azure ML Studio is easy to do, if you already know what you are doing. If you are new to Azure ML Studio, you may find yourself frustrated if you expect the Export Data module to produce a CSV file. You will want to use the Convert to CSV module instead. </p>]]></content><author><name></name></author><category term="Azure" /><category term="Data Analytics" /><summary type="html"><![CDATA[The most popular feature in any application is an easy-to-find button saying "Export to CSV." If this button is not visibly available, a simple right-click of your mouse should present such an option. You really should not be forced to spend any additional time on this Earth looking for a way to export your data to a CSV file. Well, in Azure ML Studio, exporting to a CSV file should be simple, but is not, unless you already know what you are doing and where to look. I was reminded of this recently, and decided to write a quick post in case a person new to ML Studio was wondering how to export data to a CSV file. When you are working inside the ML Studio designer, it is likely you will want to export data or outputs from time to time. If you are starting from a blank template, the designer does not make it easy for you to know what module you need (similar to my last post on finding sample data). Would be great if CoPilot was available! Now, if you are similar to 99% of data professionals in the world, you will navigate to the section named Data Input and Output, because that’s what you are trying to do, export data from the designer. It even says in the description “Writes a dataset to…”, very clear what will happen. So, using the imdb sample data, we add a module to select all columns, then attach the module to the Export Data model. So easy! When you attach you need to configure some details for the module. Again, so easy! We save our configuration options and submit the job to run. 
When the job is complete, we navigate to view the dataset. Uh-oh, I was expecting a different set of options here. Viewing the log and various outputs does not reveal any CSV file either. Maybe I need to choose the select columns module: Ah, that’s better. Except it isn’t. Instead of showing me the location of the expected CSV file, what I find is this: I can preview the data from the select columns module, but there isn’t a way to access the CSV file I was expecting. I suspect this export module is really meant to pass data between pipelines or services. But the purpose and description of the export module is not clear, and a novice user would be unhappy to head down this path only to be disappointed and frustrated. What we really want to use here is the Convert to CSV module: Viewing the results will display this: Which has what we are looking for, a download button: Selecting Download will either default to your browser settings, or you can do a Save As. As I wrote at the beginning of this post, exporting to a CSV file from within Azure ML Studio is easy to do, if you already know what you are doing. If you are new to Azure ML Studio, you may find yourself frustrated if you expect the Export Data module to produce a CSV file. You will want to use the Convert to CSV module instead.]]></summary></entry><entry><title type="html">Azure ML Studio Sample Data</title><link href="https://sqlrockstar.github.io/2024/01/azure-ml-studio-sample-data/" rel="alternate" type="text/html" title="Azure ML Studio Sample Data" /><published>2024-01-08T10:17:48-05:00</published><updated>2024-01-08T10:17:48-05:00</updated><id>https://sqlrockstar.github.io/2024/01/azure-ml-studio-sample-data</id><content type="html" xml:base="https://sqlrockstar.github.io/2024/01/azure-ml-studio-sample-data/"><![CDATA[<p>This is one of those posts you write as a note to "future you", when you'll forget something, do a search, and find your own post. </p>
<p>Recently I was working inside of Azure ML Studio and wanted to browse the sample datasets provided. Except I could not find them. I *knew* they existed, having used them previously, but could not remember if that was in the original ML Studio (classic) or not. </p>
<p>After some trial and error, I found them and decided to write this post in case anyone else is wondering where to find the sample datasets. You're welcome, future Tom!</p>
<p>First, you need to log in to Azure ML Studio: <a href="https://ml.azure.com/" target="_blank" rel="noreferrer noopener">https://ml.azure.com/</a>. Once logged in, you will create a workspace. Once the workspace is ready, open it and you will see a splash screen with a lot of interesting widgets, but alas no sample datasets to select.</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-studio-workspace-600x377.png" alt="" class="wp-image-28480" /></figure>
<p>To locate the sample datasets you must create a Pipeline. You can do this either through the Designer or through the Pipeline menu on the left of the workspace screen; selecting Pipeline | New Pipeline opens the Designer either way.</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-designer-265x600.png" alt="" class="wp-image-28481" /></figure>
<p>Once inside the Designer, create a Pipeline either by selecting the pre-defined samples or by selecting the upper-left tile:</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-new-pipeline.png" alt="" class="wp-image-28482" /></figure>
<p>Now you are in the Authoring screen, and here is where you will find the sample data. However, your default portal experience could have the left-hand menu collapsed. You can expand the menu by clicking on the two brackets (WTH is this really called? A vertical chevron? No idea.) This was not intuitive for me; it took me a bit of time to understand I needed to click on this to view a menu.</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-designer-authoring-menu.png" alt="" class="wp-image-28483" /></figure>
<p>Once opened, you’ll find sample data as well as some other goodies.</p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-sample-data-276x600.png" alt="" class="wp-image-28484" /></figure>
<p>Expand the Sample data option and view the full list of datasets.</p>
<figure class="wp-block-image aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2024/01/azure-ml-sample-data-census.png" alt="" class="wp-image-28485" /></figure>
<p>I don’t know how often the sample data is refreshed, and the answer is “likely never”. So, if you are looking for up to date census data, or iMDB movie data, you should consider a different source than the sample datasets provided through Azure ML Studio.</p>]]></content><author><name></name></author><category term="Azure" /><summary type="html"><![CDATA[This is one of those posts you write as a note to "future you", when you'll forget something, do a search, and find your own post. Recently I was working inside of Azure ML Studio and wanted to browse the sample datasets provided. Except I could not find them. I *knew* they existed, having used them previously, but could not remember if that was in the original ML Studio (classic) or not. After some trial and error, I found them and decided to write this post in case anyone else is wondering where to find the sample datasets. You're welcome, future Tom! First, you need to login to Azure ML Studio: https://ml.azure.com/. Once logged in, you will create a workspace. Once the workspace is ready, open it and you will see a splash screen with a lot of interesting widgets, but alas no sample datasets to select. To locate the sample datasets you must create a Pipeline. You create a Pipeline either through the designer or the Pipeline menu on the left of the workspace screen, as selecting Pipeline | New Pipeline opens the Designer. Once inside the Designer, create a Pipeline either by selecting the pre-defined samples or by selecting the upper-left tile: Now you are in the Authoring screen, and here is where you will find the sample data. However, your default portal experience could have the left-hand menu collapsed. You can expand the menu by clicking on the two brackets (WTH is this really called, a vertical chevron? No idea.) This was not intuitive for me, it took me a bit of time to understand I needed to click on this to view a menu. Once opened, you’ll find sample data as well as some other goodies. 
Expand the Sample data option and view the full list of datasets. I don’t know how often the sample data is refreshed, and the answer is “likely never”. So, if you are looking for up to date census data, or iMDB movie data, you should consider a different source than the sample datasets provided through Azure ML Studio.]]></summary></entry><entry><title type="html">Microsoft Data Platform MVP - Fifteen Years</title><link href="https://sqlrockstar.github.io/2023/08/microsoft-data-platform-mvp-fifteen-years/" rel="alternate" type="text/html" title="Microsoft Data Platform MVP - Fifteen Years" /><published>2023-08-17T10:53:36-04:00</published><updated>2023-08-17T10:53:36-04:00</updated><id>https://sqlrockstar.github.io/2023/08/microsoft-data-platform-mvp-fifteen-years</id><content type="html" xml:base="https://sqlrockstar.github.io/2023/08/microsoft-data-platform-mvp-fifteen-years/"><![CDATA[<p>I am happy, honored, and humbled to receive the Microsoft Data Platform MVP award for the fifteenth (15th) straight year.</p>
<p>Receiving the MVP award during my unforced sabbatical this summer was a bright spot, no question. It reinforced the belief I have in myself - my contributions have value. Microsoft puts this front and center on the award by stating (emphasis mine):</p>
<p class="has-text-align-left"><em>"We <strong>recognize </strong>and <strong>value </strong>your exceptional contributions to technical communities worldwide."</em></p>
<figure class="wp-block-image aligncenter size-large"><img src="https://thomaslarock.com/wp-content/uploads/2023/08/IMG_9272-449x600.jpg" alt="" class="wp-image-27669" /><figcaption class="wp-element-caption">I'm running out of room.</figcaption></figure>
<p>I recall the aftermath of my first award, when I was told I was the "least technical SQL Server MVP ever awarded". Talk about feeling you have no value! And that was certainly the feeling I had two months ago. </p>
<p>It's amazing how something as simple as being recognized by your peers can go so far in making a person feel valued. We should all strive to go out of our way daily to help another human feel valued. </p>
<p>There are plenty of people in the world who are recognized as experts in the Microsoft Data Platform. I'd like to think I am one of them. I also happen to be fortunate enough to know Microsoft recognizes me as one as well. </p>
<p>But MVPs advocate for Microsoft because we want to, not because we want an award. After all these years I’m still crazy for Microsoft, and I am happy to help promote the best data platform on the planet.</p>
<p>For my fellow MVPs renewed this year, I offer this suggestion - say thank you. Then say it again. Email the person on the product team who made the widget you enjoy using over and over and tell them how much you appreciate their effort. Email your MVP lead(s) and thank them for all their hard work as well.</p>
<p>A little kindness goes a long way. You never know how much reaching out could mean to that person at that moment. </p>]]></content><author><name></name></author><category term="Azure" /><category term="SQL MVP" /><summary type="html"><![CDATA[I am happy, honored, and humbled to receive the Microsoft Data Platform MVP award for the fifteenth (15th) straight year. Receiving the MVP award during my unforced sabbatical this summer was a bright spot, no question. It reinforced the belief I have in myself - my contributions have value. Microsoft puts this front and center on the award by stating (emphasis mine): "We recognize and value your exceptional contributions to technical communities worldwide." I'm running out of room. I recall the aftermath of my first award, when I was told I was the "least technical SQL Server MVP ever awarded". Talk about feeling you have no value! And that was certainly the feeling I had two months ago. It's amazing how something as simple as being recognized by your peers can go so far in making a person feel valued. We should all strive to go out of our way daily to help another human feel valued. There are plenty of people in the world who are recognized as experts in the Microsoft Data Platform. I'd like to think I am one of them. I also happen to be fortunate enough to know Microsoft recognizes me as one as well. But MVPs advocate for Microsoft because we want to, not because we want an award. After all these years I’m still crazy for Microsoft, and I am happy to help promote the best data platform on the planet. For my fellow MVPs renewed this year, I offer this suggestion - say thank you. Then say it again. Email the person on the product team who made the widget you enjoy using over and over and tell them how much you appreciate their effort. Email your MVP lead(s) and thank them for all their hard work as well. A little kindness goes a long way. 
You never know how much reaching out could mean to that person at that moment.]]></summary></entry><entry><title type="html">Pro SQL Server 2022 Wait Statistics Book</title><link href="https://sqlrockstar.github.io/2022/10/pro-sql-server-2022-wait-statistics-book/" rel="alternate" type="text/html" title="Pro SQL Server 2022 Wait Statistics Book" /><published>2022-10-10T11:26:54-04:00</published><updated>2022-10-10T11:26:54-04:00</updated><id>https://sqlrockstar.github.io/2022/10/pro-sql-server-2022-wait-statistics-book</id><content type="html" xml:base="https://sqlrockstar.github.io/2022/10/pro-sql-server-2022-wait-statistics-book/"><![CDATA[<p>After many months of editing, revising, and writing, my new book <em>Pro SQL Server 2022 Wait Statistics: A Practical Guide to Analyzing Performance in SQL Server and Azure SQL Database</em> is ready for print!</p>
<figure class="wp-block-image aligncenter size-full"><a href="https://amzn.to/3fQr7hz" target="_blank" rel="noreferrer noopener"><img src="https://thomaslarock.com/wp-content/uploads/2022/10/pro_sql_server_2022_wait_statictics.jpg" alt="Pro SQL Server 2022 Wait Statistics" class="wp-image-24909" /></a></figure>
<p>You can pre-order here: <a href="https://amzn.to/3fQr7hz" target="_blank" rel="noreferrer noopener">https://amzn.to/3fQr7hz</a></p>
<p>I thoroughly enjoyed this project, and I want to thank Apress and Jonathan Gennick for giving me the opportunity to update the previous edition. It felt good to be writing again, something I have not been doing enough of lately. And many thanks to Enrico van de Laar (<a href="https://twitter.com/evdlaar" target="_blank" rel="noreferrer noopener">@evdlaar</a>) for giving me amazing content to start with.</p>
<p>The book is an effort to help explain how, why, and when wait events happen. Of course, I also want to show how to solve issues when they arise. Specific wait events are broken down into parts: definition, remediation, and an example. There are plenty of code examples, allowing the reader to duplicate the scenarios to help understand the wait events better. </p>
<p>It is my understanding we will have a GitHub repository for the sample code. This will make it easy for readers to access the code for their own use. I am hoping to keep the repo up to date and expand upon the examples as I look towards the next version.</p>
<h2 id="h-pro-sql-server-2022-wait-statistics-at-live-360"><em>Pro SQL Server 2022 Wait Statistics</em> at Live 360!</h2>
<p>I will be presenting material from the book at <a href="https://live360events.com/events/orlando-2022/Home.aspx" target="_blank" rel="noreferrer noopener">SQL Server Live!</a> this November where I have the following sessions, panel discussion, and workshop:</p>
<ul><li>Fast Focus: SQL Server Data Types and Performance</li><li>Locking, Blocking, and Deadlocks</li><li>Performance Tuning SQL Server using Wait Statistics</li><li>SQL Server Live! Panel Discussion: Azure Cloud Migration Discussion</li><li>Workshop: Introduction to Azure Data Platform for Data Professionals</li></ul>
<p>The workshop is a full day training session delivered with Karen Lopez (<a href="https://twitter.com/datachick" target="_blank" rel="noreferrer noopener">@DataChick</a>), and you can register for Live 360 here: <a href="https://na.eventscloud.com/ereg/index.php?eventid=666070" target="_blank" rel="noreferrer noopener">Live 360 Orlando 2022 - Choose Registration</a></p>
<p>I am hopeful to have copies of <em>Pro SQL Server 2022 Wait Statistics</em> at SQL Server Live!. At the time of this post, I do not know of a publish date. Amazon shows the book as pre-order <a href="https://amzn.to/3fQr7hz" target="_blank" rel="noreferrer noopener">right now</a>.</p>]]></content><author><name></name></author><category term="Azure" /><category term="SQL Server 2022" /><category term="SQL Server Performance" /><summary type="html"><![CDATA[After many months of editing, revising, and writing, my new book Pro SQL Server 2022 Wait Statistics: A Practical Guide to Analyzing Performance in SQL Server and Azure SQL Database is ready for print! You can pre-order here: https://amzn.to/3fQr7hz I thoroughly enjoyed this project, and I want to thank Apress and Jonathan Gennick for giving me the opportunity to update the previous edition. It felt good to be writing again, something I have not been doing enough of lately. And many thanks to Enrico van de Laar (@evdlaar) for giving me amazing content to start with. The book is an effort to help explain how, why, and when wait events happen. Of course, I also want to show how to solve issues when they arise. Specific wait events are broken down into parts: definition, remediation, and an example. There are plenty of code examples, allowing the reader to duplicate the scenarios to help understand the wait events better. It is my understanding we will have a GitHub repository for the sample code. This will make it easy for a reader to access the code for their use. I am hoping to keep the repo up to date and expand upon the example as I look towards the next version. Pro SQL Server 2022 Wait Statistics at Live 360! I will be presenting material from the book at SQL Server Live! this November where I have the following sessions, panel discussion, and workshop: Fast Focus: SQL Server Data Types and PerformanceLocking, Blocking, and DeadlocksPerformance Tuning SQL Server using Wait StatisticsSQL Server Live! 
Panel Discussion: Azure Cloud Migration DiscussionWorkshop: Introduction to Azure Data Platform for Data Professionals The workshop is a full day training session delivered with Karen Lopez (@DataChick), and you can register for Live 360 here: Live 360 Orlando 2022 - Choose Registration I am hopeful to have copies of Pro SQL Server 2022 Wait Statistics at SQL Server Live!. At the time of this post, I do not know of a publish date. Amazon shows the book as pre-order right now.]]></summary></entry><entry><title type="html">Stop Using Production Data For Development</title><link href="https://sqlrockstar.github.io/2022/01/stop-using-production-refresh-development/" rel="alternate" type="text/html" title="Stop Using Production Data For Development" /><published>2022-01-31T09:33:43-05:00</published><updated>2022-01-31T09:33:43-05:00</updated><id>https://sqlrockstar.github.io/2022/01/stop-using-production-refresh-development</id><content type="html" xml:base="https://sqlrockstar.github.io/2022/01/stop-using-production-refresh-development/"><![CDATA[<p>A common software development practice is to take data from a production system and restore it to a different environment, often called "test", "development", "staging", or even "QA". This allows for support teams to troubleshoot issues without making changes to the true production environment. It also allows for development teams to build new versions and features of existing products in a non-production environment. Using production to refresh development is just one of those things everyone accepts and does, without question.</p>
<p>Of course the idea of testing in a non-production environment isn't anything new. Consider haggis. No way someone thought to themselves "let me just shove everything I can into this sheep's stomach, boil it, and serve it for dinner tonight." You know they first fed it to the neighbor nobody liked. Probably right after they shoved a carton of milk in their face and asked "does this smell bad to you?"</p>
<p>For decades software development has made it a standard practice to create copies of production data and restore it to other non-production environments. It was not without issues, however. For example, as data sizes grew so did the length of time to do a restore. This also clogged network bandwidth, not to mention the costs associated with storage. </p>
<p>And then there is this:</p>
<figure class="wp-block-embed is-type-rich is-provider-twitter wp-block-embed-twitter"><div class="wp-block-embed__wrapper"> https://twitter.com/HengeWitch/status/1483500385180418048 </div></figure>
<p>If you read that tweet and thought "yeah, what's your point?" then you are part of the problem. </p>
<p>As an industry we focus on access to specific environments, but not the assets in the environments. This is wrong. The royal family knows where the Crown Jewels are stored but if they are moved to another location you know the Jewels are heavily guarded at all times. <strong>Access to the jewels is important no matter where the jewels are located</strong>. The same should be true of your production data.</p>
<div class="wp-block-image"><figure class="aligncenter"><img src="https://upload.wikimedia.org/wikipedia/commons/e/eb/Crown_jewels_Poland_8.JPG" alt="Use production to refresh development." /><figcaption><em>Then again, that stick might be pointy enough to fend off any attacker.</em></figcaption></figure></div>
<p>Data is the most critical asset your company owns. <strong>If you make efforts to lock down production but allow production data to flow to less-secure environments, then you haven't locked down production</strong>.</p>
<p>It is ludicrous to think about the billions of dollars spent to lock down physical access to data centers only to allow junior developers to stuff customer data on a laptop they will then leave behind on a bus. Or senior developers leaving S3 buckets open. Or forgetting they pushed credentials to a GitHub repo. </p>
<p>If you are still moving production data between environments, you are a data breach waiting to happen. I don't care what the auditors say, you are at an elevated and unnecessary risk. Like when Obi-Wan decides to protect baby Luke by keeping his name and taking him to Darth Vader's home planet. Nice job, Ben, no way this ends up with you dying, naked, in front of a few dozen onlookers. </p>
<p>I think what frustrates me most is this entire system is unnecessary. You have options when moving production data. You can use data masking, obfuscation, and encryption in order to reduce your risk. But the best method is to <strong>not move your data at all</strong>.</p>
<p>After years of being told "don't test in production" it's time to think about testing in production. <a href="https://www.infoworld.com/article/3271126/what-is-cicd-continuous-integration-and-continuous-delivery-explained.html" target="_blank" rel="noreferrer noopener">Continuous integration and continuous delivery/deployment</a> (CI/CD) allow for you to achieve this miracle. And for those that say "No, you dummy, CI/CD is what you do in test <strong>before</strong> you push to production," I offer the following.</p>
<p>Use dummy data.</p>
<p>You don't need production data, you need data that <strong>looks</strong> like production data. You don't need actual customer names and addresses, you need similar names and addresses. And there are ways to <a href="https://thomaslarock.com/2015/07/how-to-recreate-sql-server-statistics-in-a-different-environment/" target="_blank" rel="noreferrer noopener">simulate the statistics in your database</a>, too, so your query plans have the same shape as production without the actual volume of data.</p>
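<p>The "dummy data" idea can be sketched in a few lines. The following is a hypothetical illustration (the names, streets, and seeded generator are all invented for this post, not taken from any real system): rows that are shaped like customer records but contain nothing sensitive, and that are reproducible across test runs thanks to a fixed seed.</p>

```python
import random

# Hypothetical sample values -- nothing here comes from a real customer
# record, so nothing sensitive can leak into a non-production environment.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin"]
LAST_NAMES = ["Nguyen", "Garcia", "Smith", "Patel", "Kim"]
STREETS = ["Oak St", "Main St", "Elm Ave", "Maple Dr"]

def fake_customers(n, seed=42):
    """Return n customer-shaped rows that merely *look* like production data."""
    rng = random.Random(seed)  # seeded, so test data is reproducible
    rows = []
    for i in range(n):
        rows.append({
            "id": i + 1,
            "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
            "address": f"{rng.randint(1, 999)} {rng.choice(STREETS)}",
        })
    return rows

for customer in fake_customers(3):
    print(customer)
```

<p>The same seed always produces the same rows, so tests built against this data stay stable, and there is no real name or address anywhere in the pipeline.</p>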
<p>It's possible for you to develop software code against simulated production data, as opposed to actual production data. But doing so requires more work, and nobody likes more work.</p>
<p>Until you are breached, of course. Then the extra work won't be optional. </p>]]></content><author><name></name></author><category term="Data Security and Privacy" /><category term="SQL MVP" /><summary type="html"><![CDATA[A common software development practice is to take data from a production system and restore it to a different environment, often called "test", "development", "staging", or even "QA". This allows for support teams to troubleshoot issues without making changes to the true production environment. It also allows for development teams to build new versions and features of existing products in a non-production environment. Using production to refresh development is just one of those things everyone accepts and does, without question. Of course the idea of testing in a non-production environment isn't anything new. Consider Haggis. No way someone thought to themselves "let me just shove everything I can into this sheep's stomach, boil it, and serve it for dinner tonight." You know they first fed it to the neighbor nobody liked. Probably right after they shoved a carton of milk in their face and asked "does this smell bad to you?" For decades software development has made it a standard practice to create copies of production data and restore it to other non-production environments. It was not without issues, however. For example, as data sizes grew so did the length of time to do a restore. This also clogged network bandwidth, not to mention the costs associated with storage. And then there is this: https://twitter.com/HengeWitch/status/1483500385180418048 If you read that tweet and thought "yeah, what's your point?" then you are part of the problem. As an industry we focus on access to specific environments, but not the assets in the environments. This is wrong. The royal family knows where the Crown Jewels are stored but if they are moved to another location you know the Jewels are heavily guarded at all times. 
Access to the jewels is important no matter where the jewels are located. The same should be true of your production data. Then again, that stick might be pointy enough to fend off any attacker. Data is the most critical asset your company owns. If you make efforts to lock down production but allow production data to flow to less-secure environments, then you haven't locked down production. It is ludicrous to think about the billions of dollars spent to lock down physical access to data centers only to allow junior developers to stuff customer data on a laptop they will then leave behind on a bus. Or senior developers leaving S3 buckets open. Or forgetting they pushed credentials to a GitHub repo. If you are still moving production data between environments you are a data breach waiting to happen. I don't care what the auditors say, you are at an elevated and unnecessary risk. Like when Obi-Wan decides to protect baby Luke by keeping his name and taking him to Darth Vader's home planet. Nice job, Ben, no way this ends up with you dying, naked, in front a few dozen onlookers. I think what frustrates me most is this entire system is unnecessary. You have options when moving production data. You can use data masking, obfuscation, and encryption in order to reduce your risk. But the best method is to not move your data at all. After years of being told "don't test in production" it's time to think about testing in production. Continuous integration and continuous delivery/deployment (CI/CD) allow for you to achieve this miracle. And for those that say "No, you dummy, CI/CD is what you do in test before you push to production," I offer the following. Use dummy data. You don't need production data, you need data that looks like production data. You don't need actual customer names and address, you need similar names and address. 
And there are ways to simulate the statistics in your database, too, so your query plans have the same shape as production without the actual volume of data. It's possible for you to develop software code against simulated production data, as opposed to actual production data. But doing so requires more work, and nobody likes more work. Until you are breached, of course. Then the extra work won't be optional.]]></summary></entry><entry><title type="html">Microsoft Data Platform MVP – A Baker’s Dozen</title><link href="https://sqlrockstar.github.io/2021/07/microsoft-data-platform-mvp-a-bakers-dozen/" rel="alternate" type="text/html" title="Microsoft Data Platform MVP – A Baker’s Dozen" /><published>2021-07-29T07:28:28-04:00</published><updated>2021-07-29T07:28:28-04:00</updated><id>https://sqlrockstar.github.io/2021/07/microsoft-data-platform-mvp-a-bakers-dozen</id><content type="html" xml:base="https://sqlrockstar.github.io/2021/07/microsoft-data-platform-mvp-a-bakers-dozen/"><![CDATA[<div class="wp-block-image is-style-default"><figure class="aligncenter size-full"><img src="https://thomaslarock.com/wp-content/uploads/2021/07/MVP2021.jpg" alt="" class="wp-image-21239" /><figcaption>No Satya, thank you. And you're welcome. Let's do lunch next time I'm in town.</figcaption></figure></div>
<p></p>
<p>This past week I received another care package from Satya Nadella. Inside was my Microsoft Data Platform MVP award for 2021-2022. I am happy, honored, and humbled to receive the Microsoft Data Platform MVP award for the thirteenth straight year. I still recall my first MVP award and how it got&nbsp;<a href="https://thomaslarock.com/2009/04/latest-email-scam/" target="_blank" rel="noreferrer noopener">caught in the company spam folder</a>. Good times.</p>
<p>I am not able to explain&nbsp;<a href="https://mvp.microsoft.com/en-us/overview" target="_blank" rel="noreferrer noopener">why I am considered an MVP</a>&nbsp;and others are not. I have no idea what it takes to be an MVP. And neither does anyone else.&nbsp;Well, maybe Microsoft does since they are the ones that bestow the award on others. But there doesn’t seem to be any magical formula to determine if someone is an MVP or not.</p>
<p>I do my best to help others.&nbsp;<a href="https://thomaslarock.com/2017/05/relationships-matter-money/" target="_blank" rel="noreferrer noopener">I value&nbsp;people and relationships over money</a>. And I play around with many Microsoft data tools and applications, and&nbsp;<a href="https://thomaslarock.com/blog" target="_blank" rel="noreferrer noopener">blog</a>&nbsp;about the things I find interesting. Sometimes those blog posts are close to&nbsp;<a href="http://www.urbandictionary.com/define.php?term=fanboi" target="_blank" rel="noreferrer noopener">fanboi</a>&nbsp;level, other times they are not. But I do my best to remember that there are people over in Redmond that work hard on delivering quality. Sometimes they miss, and I do my best to help them stay on target. Maybe that’s why they keep me around.</p>
<p>Looking back to 2009 and my first award, I recognize the activities I was doing 13 years ago that earned me my first MVP award are not the same activities I am doing today. And I think that is to be expected, as I'm not the same person today as I was then. I have a different role, different responsibilities, and different priorities. I've branched out into the world of data security and privacy as well as data science.</p>
<p>I now spend time writing for multiple online publications instead of here on my personal blog. Often those articles are not about SQL Server, but are almost always about data. I remain an advocate for Microsoft technologies, and continue to do my best to influence others to see the value in the products and services coming out of Redmond.</p>
<p>This past year was a bit different, of course, as COVID affected different parts of the world in different ways, at different times. As such, the Microsoft MVP program made the decision to auto-renew MVPs without evaluating our current activity. This year's award is the equivalent of Free Parking in Monopoly. Sure, you're still in the game, but you aren't doing much; things are happening around you, and everything is OK. </p>
<h2 id="h-summary">Summary</h2>
<p>I’m going to enjoy this ride while it lasts. I’m also going to do my part to make certain that the ride lasts as long as possible for everyone. Here’s a #ProTip for those of us renewed this year:</p>
<p><strong>Say thank you. Then say it again</strong>. Be grateful for what we have. Email the person that made the widget that you enjoy using over and over and tell them how much you appreciate their effort. Email your MVP lead and thank them for all their hard work as well.</p>
<p>MVPs do advocate for Microsoft because we want to, not because we want an award. After all these years I’m still crazy for Microsoft, and I am happy to help promote the best data platform on the planet.</p>]]></content><author><name></name></author><category term="Azure" /><category term="SQL MVP" /><category term="Microsoft MVP" /><summary type="html"><![CDATA[No Satya, thank you. And you're welcome. Let's do lunch next time I'm in town. This past week I received another care package from Satya Nadella. Inside was my Microsoft Data Platform MVP award for 2021-2022. I am happy, honored, and humbled to receive the Microsoft Data Platform MVP award for the thirteenth straight year. I still recall my first MVP award and how it got&nbsp;caught in the company spam folder. Good times. I am not able to explain&nbsp;why I am considered an MVP&nbsp;and others are not. I have no idea what it takes to be an MVP. And neither does anyone else.&nbsp;Well, maybe Microsoft does since they are the ones that bestow the award on others. But there doesn’t seem to be any magical formula to determine if someone is an MVP or not. I do my best to help others.&nbsp;I value&nbsp;people and relationships over money. And I play around with many Microsoft data tools and applications, and&nbsp;blog&nbsp;about the things I find interesting. Sometimes those blog posts are close to&nbsp;fanboi&nbsp;level, other times they are not. But I do my best to remember that there are people over in Redmond that work hard on delivering quality. Sometimes they miss, and I do my best to help them stay on target. Maybe that’s why they keep me around. Looking back to 2009 and my first award, I recognize the activities I was doing 13 years ago which earned my my first MVP award are not the same activities I am doing today. And I think that is to be expected, as I'm not the same person today as I was then. I have a different role, different responsibilities, and different priorities. 
I've branched out into the world of data security and privacy as well as data science. I now spend time writing for multiple online publications instead of my here on my personal blog. Often those articles are not about SQL Server, but are almost always about data. I remain an advocate for Microsoft technologies, and continue to do my best to influence others to see the value in the products and services coming out of Redmond. This past year was a bit different, of course, as COVID affected different parts of the world in different ways, at different times. As such, the Microsoft MVP program made the decision to auto renew MVPs without evaluating our current activity. This year's award is the equivalent of Free Parking in Monopoly, Sure, you're still in the game, but you aren't doing much, things are happening around you, and everything is OK. Summary I’m going to enjoy this ride while it lasts. I’m also going to do my part to make certain that the ride lasts as long as possible for everyone. Here’s a #ProTip for those of us renewed this year: Say thank you. Then say it again. Be grateful for what we have. Email the person that made the widget that you enjoy using over and over and tell them how much you appreciate their effort. Email your MVP lead and thank them for all their hard work as well. MVPs do advocate for Microsoft because we want to, not because we want an award. 
After all these years I’m still crazy for Microsoft, and I am happy to help promote the best data platform on the planet.]]></summary></entry><entry><title type="html">Twenty Years</title><link href="https://sqlrockstar.github.io/2021/04/twenty-years/" rel="alternate" type="text/html" title="Twenty Years" /><published>2021-04-13T06:59:35-04:00</published><updated>2021-04-13T06:59:35-04:00</updated><id>https://sqlrockstar.github.io/2021/04/twenty-years</id><content type="html" xml:base="https://sqlrockstar.github.io/2021/04/twenty-years/"><![CDATA[<p>My life changed twenty years ago, this very month.</p>
<p>I was a developer, working for a small software company outside of Boston. Our product was a warehouse management system, built with PowerBuilder on top of Oracle. We had a handful of large customers helping to keep the lights on, but a few went dark at the start of the year, casualties of the dot-com bust.&nbsp;</p>
<p>We were soon a casualty as well, forced to sell to a competitor at the start of the year. And despite their promises and assurances about not making changes, the layoffs started within the first few months.&nbsp;</p>
<p>They hit about a dozen folks the first day, including me. I was stunned.</p>
<p>I still remember the phone call from an executive I spoke infrequently with. "Can you come meet with me in the small conference room we never use for anything?" Yeah, sure. I walked across the office, knowing what was coming.</p>
<p>I've been cut from a team before, but I was just not prepared for this. And no one could explain to me why I was chosen as opposed to someone else; there didn't seem to be any reason. It just...was.</p>
<p>And it really pissed me off.&nbsp;</p>
<p>I did everything I was assigned. Traveled when other developers refused. I asked to take on challenging assignments. Noticing we had no DBA, I asked to attend Oracle certification classes. It seemed any and all efforts were ignored, or discounted.</p>
<p>And thus my first real corporate lesson: <strong>Nobody cares about your effort, they only care about results</strong>.</p>
<p>The remaining employees huddled in the large recreation room to hear about the layoffs. My friends looked around and noticed I wasn't there. That's because I was at my desk, putting my stuff into a box. </p>
<p>Packed up and walking out the door, I said goodbye to a few people, and "piss off" to a few others.&nbsp;</p>
<p>I put the box into my car and I started driving. I'm thinking to myself&nbsp;"Surely someone must need a PowerBuilder developer, right?" I drove up to Lexington, where Suzanne was working. She had stepped out to get a coffee. While she was walking back she saw me walking towards her. She didn't hesitate to stop and yell across the street "what happened?"</p>
<p>She knew.</p>
<p>I told her what happened.</p>
<p>We were scared.</p>
<p>I vowed to never allow this to happen to me again. Never would I be the one to get cut, not in this manner, and not treated so poorly in the process. But I needed to find a job, and fast, since living in West Newton was not cheap.</p>
<p>My only marketable skill was PowerBuilder. Fortunately, it was still in enough demand, and it wasn't long before my phone rang.</p>
<p>Not from job offers, no. It was recruiters calling me with the hope and promise of finding a job. I had worked with a handful of recruiters in the past, but I was still young and inexperienced in how to play the recruiting game. Or, to be more to the point, how they play games with kids like me.</p>
<p>And I was still scared.&nbsp;</p>
<p>So I listened to what they told me to do. One recruiter took my resume, rewrote it, put C++ at the very top (because I knew how to spell it, apparently) and sent me off on interviews where I would sit with someone for 5 minutes and they would say "why are you here if you haven't done any C++ programming before?"</p>
<p>Good question. Talk to the recruiter, I guess.</p>
<p>Another recruiter scheduled me for an interview where I had to sit for an analytical reasoning test, because apparently my MS in Mathematics wasn't a good enough indicator of my analytical reasoning ability. Shockingly, I scored 50 out of 50 and was told no one had ever scored perfect before. I didn't get the job, and I never got those 45 minutes of my life back.</p>
<p>I had no offers, and no prospects.</p>
<p>I was low.</p>
<p>And then, out of nowhere, a recruiter contacted me about a job down in Hartford.</p>
<p>Hartford!</p>
<p>Hartford is 90 miles away from West Newton, but mostly a reverse commute. I could drive to Hartford in about 75-80 minutes as it was all highway driving. Folks commuting into Boston from Nashua, NH along Route 3 take longer than that!</p>
<p>I decided to go for the interview for two reasons. First, the scared thing. Fear is a great motivator. If I had to drive 90 miles each way, I would. Second, I knew getting hired by an investment firm in Hartford was job security. People there had <em>careers</em>, you know? It didn't matter if you had skills or not, you just needed to get past your 90 days. Once you were in, you were in for as long as you wanted.</p>
<p>I interviewed with a wonderful human being named <a href="https://www.linkedin.com/in/craigzinn/">Craig</a>. He needed PowerBuilder help. I needed a job. He asked about the drive and I told him it would not be a problem. He mentioned the company offered some relocation assistance, if I wanted to move. I thanked him, we shook hands, and I drove home.</p>
<p>When I got home, I had an offer. Well, offers. One was from Craig. The other was from a software company in East Boston that specialized in point-of-sale systems for cruise lines. East Boston is only 11 miles away from West Newton. However, between rush hour and the <a href="https://www.mass.gov/info-details/the-big-dig-project-background" target="_blank" rel="noreferrer noopener">Big Dig</a>, it would take me an hour to get to work. They also expected employees to work at least ten hours a day, but in reality they were only paying for eight. I really liked their company, and the idea of being "forced" to work on a cruise ship a few weeks each year.</p>
<p>Suddenly, the fear was back.</p>
<p>I couldn't choose East Boston because I'd be doing the same PowerBuilder job, with nothing to grow into. I needed a job, sure. </p>
<p>But I wanted a career.&nbsp;</p>
<p>And I had vowed to never get cut from the team again.</p>
<p>So I chose Hartford.</p>
<p>And my life was forever altered.</p>
<p>Within a few months 9/11 happened. A few months later, Suzanne and I bought our first home, in Worcester, in order to cut down on my commute (reduced to 60 miles, but not by much time because I was a bit further from the highways). We had our first child there, got pregnant again (<a href="http://www.imdb.com/title/tt0090685/quotes">the best part about babies is making them</a>) and decided to move once more. This time to be closer to both sets of grandparents. Right around the time of the move, "it" happened.</p>
<p>"It" was when the 1.5 DBAs at my company quit within a week or so of each other. One was a contractor who simply gave his 30-day notice. The other was a part-timer who was offered the full-time job, said "yes", then changed his mind a week later. This left no one except the guy who had previously executed the following commands successfully:</p>
<p>BACKUP</p>
<p>and</p>
<p>RESTORE</p>
<p>No bootcamps. All it took for me to be handed this opportunity was the fact I knew how to do a backup, a restore, and change passwords. </p>
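<p>For context, those two commands really are about as simple as they sound. Here is a minimal T-SQL sketch; the database name and file path are placeholders for illustration, not anything from that environment:</p>

```sql
-- Back up a database to a file (names and paths are illustrative)
BACKUP DATABASE SalesDb
TO DISK = N'D:\backups\SalesDb.bak'
WITH INIT, CHECKSUM;

-- Restore that backup over the existing database
RESTORE DATABASE SalesDb
FROM DISK = N'D:\backups\SalesDb.bak'
WITH REPLACE, RECOVERY;
```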
<p>So, the <a href="https://thomaslarock.com/2010/09/where-have-all-the-good-managers-gone/">greatest manager in the world</a> asked Craig (one of the few truly good people on this Earth) if he could "borrow" me, and Craig knew that the best thing for me (not for <strong>HIM</strong>, but for <strong>ME</strong>) was to become a DBA. So Craig agreed, we set a transition period over the next six months or so and that was that.</p>
<p>I was now the DBA.</p>
<p>There was a lot to do, a lot to learn, a lot to accomplish. In time I found and joined PASS. I connected with others. I helped take my company from what was essentially a Wild West show into a more stable environment. Now, I wasn't always nice (I think my communication skills have improved over time), but I was always thinking about the company, and whatever actions I took needed to benefit everyone, not just the one or two people looking to take a shortcut.</p>
<p>As my skills grew, so did my thirst for new challenges. Eventually, I hit the limits of my role. There was nothing more for me to grow into.&nbsp;And what had once seemed an amazing opportunity and role now seemed like a prison sentence.</p>
<p>I could still remember how I was so excited and so proud to have a job with a company in Hartford. It was not just a job, but a career. And here I was, nine years later, burned out and tired. Worse yet, the people around me were tired of having me around. It seemed as if I would say 'black', three people would say 'white' on principle alone. I felt my skills, or the environment I had spent six years building, were called into question every day, if for no other reason than that I was the voice on the other end of the phone.</p>
<p><a href="https://youtu.be/6g2JN2PrHJg">There was nowhere else to go</a>.</p>
<p>On a whim, I decided to drive into Boston for a SQL Saturday event. While there I was walking past the vendor tables and I heard the phrase "I bet you'd look good wrapped in bacon". It was probably the best pickup line anyone could use on me. I handed him a bacon gumball. He tried one, got sick, and still wanted to talk to me. I felt bad about making him ill with the gumball so I did something I had not done in over five years:</p>
<p>I attended a dedicated vendor session.</p>
<p>They talked about their product. They promised me it would monitor my SQL Server and Sybase instances, something I had asked other vendors to give me for years. I agreed to give their product a trial, if for no other reason than that they did some heavy name dropping and they told me they read my blog.</p>
<p>I installed the trial, let it run for a week, and started to fall in love. About a month later they came to visit and walk through my data with me. I liked it even more. We had lunch together. The next month we all met up again at a SQL Saturday. It was me, David Waugh, and Matt Larson (CEO of Confio Software). We agreed to meet and have dinner prior to the SQL Saturday dinner. It was, essentially, my job interview. There was no exchange of resumes, just some conversation over dinner. No talk of salary, just talk about our children. No talk about managers, just talk about the role they felt I could help with.</p>
<p>And there was no job opening. We were creating one. Right there. In a hole in the wall in New York City, on a Friday night in the Spring.&nbsp;</p>
<p>I was a long way from that walk across the office.&nbsp;</p>
<p>I still had fear. Joining a startup was a risk. Matt looked at me and said "Tom, lots of things can happen. But I'm fairly certain in a stack of resumes, yours would float to the top".&nbsp;</p>
<p>Suddenly, I felt like I had a place to go. A place to grow. New challenges. The opportunity to build something, together.&nbsp;</p>
<p>I took the job(s). I was a sales engineer, customer support, marketing, product support, and maybe a few other roles. You know how everyone wears a lot of hats when you are small.</p>
<p>I clashed with others, wanted to quit, and was almost fired. Fortunately, Confio had their version of Craig. His name was <a href="https://www.linkedin.com/in/dbergal/">Don</a>, and Don believed I was a value-add even if others did not. Don saved me, just as Craig saved me before.&nbsp;</p>
<p>I held on. I survived.</p>
<p>Not long after, we were purchased by SolarWinds, where I am today.</p>
<p>This post is really long and doesn't have a point. If you've read this far I thank you and feel I owe you something of value. Something like "focus on the things you can change" or some type of sage advice.&nbsp;</p>
<p>I've got nothing.&nbsp;Well, maybe something.</p>
<p><strong>Always be learning</strong>.</p>
<p>Twenty years ago I had no job. Today I have a career. I went from software developer to database administrator to sales engineer/customer support/product support to technical product marketing. I've sought roles where I have the opportunity to grow and be presented with new challenges. These past few years I've been immersing myself in data science. </p>
<p>If I needed to find work tomorrow, I believe I could. The same should be true for you, too.</p>
<p>Don't settle for what you already know. Don't let your opinions cloud your judgment when it comes to tools, products, or technology. Be open to new possibilities. Be humble enough to admit you don't know everything, and that you can always be learning something new.</p>
<p>If you can do that, then you'll always have a place to go, a place where you will add value. </p>
<p>That's it, that's the post.</p>
<p></p>]]></content><author><name></name></author><category term="Musings" /><category term="Professional Development" /><summary type="html"><![CDATA[My life changed twenty years ago, this very month. I was a developer, working for a small software company outside of Boston. Our product was a warehouse management system, built with PowerBuilder on top of Oracle. We had a handful of large customers helping to keep the lights on, but a few went dark at the start of the year, casualties of the dot-com bust.&nbsp; We were soon a casualty as well, forced to sell to a competitor at the start of the year. And despite their promises and assurances about not making changes, the layoffs started within the first few months.&nbsp; They hit about a dozen folks the first day, including me. I was stunned. I still remember the phone call from an executive I spoke infrequently with. "Can you come meet with me in the small conference room we never use for anything?" Yeah, sure. I walked across the office, knowing what was coming. I've been cut from a team before, but I was just not prepared for this. And no one could explain to me why I was chosen as opposed to someone else, there didn't seem to be any reason. It just...was. And it really pissed me off.&nbsp; I did everything I was assigned. Traveled when other developers refused. I asked to take on challenging assignments. Noticing we had no DBA, I asked to attend Oracle certification classes. It seemed any and all efforts were ignored, or discounted. And thus my first real corporate lesson: Nobody cares about your effort, they only care about results. The remaining employees huddled in the large recreation room to hear about the layoffs. My friends looked around and noticed I wasn't there. That's because I was at my desk, putting my stuff into a box. Packed up and walking out the door I said goodbye to a few people, and piss off to a few others.&nbsp; I put the box into my car and I started driving. 
]]></summary></entry></feed>