Database Design Archives - Thomas LaRock
https://thomaslarock.com/category/database-design/
Thomas LaRock is an author, speaker, data expert, and SQLRockstar. He helps people connect, learn, and share. Along the way he solves data problems, too.

You Can’t Marry Your Database, But You Can Have Relations
https://thomaslarock.com/2021/02/you-cant-marry-your-database-but-you-can-have-relations/ (February 1, 2021)

There’s something you should know about relational databases.

They were designed to store data efficiently, protecting the quality of the data written and stored to disk. I’ve written before about relational engines favoring data quality and integrity, and how relational databases were not designed for the reading of data.

Of course, if you are going through the trouble of writing the data into a relational database, it makes sense that you would want to retrieve the data at some point. Otherwise, why go through the exercise of storing the data inside the database?

The trouble with reading data from a relational database is that the data is not stored in a format friendly for viewing, reading, or retrieving. That’s why we have data professionals, like me, to help you write queries that return the correct data, in the correct format, for you to analyze.

I’m here today to tell you we’ve been doing data wrong the whole damn time.

Let me show you what I mean.

Traditional Data Flow Patterns

Here’s what companies do, every day:

Step 1 – Identify useful data

Step 2 – Import that data into a database

Step 3 – Analyze the data

Step 4 – Export the data into dashboards

Step 5 – Profit (maybe)

The trouble with this process is Step 3, the analyzing of the data. Relational databases were not designed for analytical processing. Relational databases do not store data in a way that is readable, or friendly, for human analysis.

That’s not to say you can’t do analytics inside of a relational database. What I am saying is that it could be better for you not to spin the CPU cycles there, and instead do the analytics somewhere else.

For example, data warehouses help with data storage, retrieval, and analytics. But even a data warehouse can fall short when it comes to the use of unstructured data sources. As a result, we’ve spent decades building ETL processes to curate, collate, consolidate, and consume data.

And we’ve been doing it wrong.

Data Mining for the Data Expert

So, we understand that people find data, store it, and try to use it later. They are engaging in the process of data mining, hoping to find gold in the form of insights, leading to better business decisions.

But as I mentioned before, the data isn’t in a readable format. Let’s look at an example.

Following the common data flow pattern, I found some useful data at NFL Savant https://nflsavant.com/about.php, and imported the data into a SQL Server database:

It looks like any other table in a relational database. In this case, it is a table containing time series data pertaining to play by play records for the 2018 NFL season. Each row represents an entity (a play at a point in time of an NFL game), and the columns represent attributes of the play (down, distance, yards to go, etc.).

Nothing out of place here, this is how data is written to a relational database. In an orderly fashion. As a DBA, I love this type of orderly storage. It’s efficient, and efficient is good.

As a data analyst, I’m not a fan. At least, not yet. I have a bunch of data, but what I want are some answers. So, it’s up to me to ask some questions of the data, find some answers, and use that to help make better business decisions.

For this data, here’s an example of a simple question: What is the average yards to go for each NFL team in 2018? I can get that answer with some simple T-SQL:
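The query looks something like this rough sketch. I’m assuming the CSV landed in a table named dbo.PlayByPlay2018 with OffenseTeam and ToGo columns; those names are assumptions, and yours may differ.

-- Average yards to go per offensive team (table and column names assumed)
SELECT OffenseTeam,
       AVG(CAST(ToGo AS DECIMAL(5, 2))) AS AvgYardsToGo
FROM dbo.PlayByPlay2018
GROUP BY OffenseTeam
ORDER BY OffenseTeam;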

This is great! I was able to take my data, ask a question, and get an answer. What could be better, right?

Of course, now I have more questions about my data. And here’s the first issue you will discover when trying to analyze data stored in a traditional relational database.

T-SQL is excellent at answering one question at a time, but not as great when you need more than one question answered.

So, if we have more questions, we will need to write more queries.

Here’s a good follow-up question that we might want to be answered: Can we examine this data broken down by each quarter?

Fortunately, the answer is yes, because T-SQL comes with a bunch of statements and functions that will help. In this case, I am going to use the PIVOT operator, roughly as follows:
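This sketch uses the same assumed table and column names as before, plus a Quarter column with values 1 through 5 (5 being overtime):

-- Average yards to go per team, pivoted by quarter (names assumed)
SELECT OffenseTeam, [1], [2], [3], [4], [5]
FROM (
    SELECT OffenseTeam, Quarter, CAST(ToGo AS DECIMAL(5, 2)) AS ToGo
    FROM dbo.PlayByPlay2018
) AS src
PIVOT (
    AVG(ToGo) FOR Quarter IN ([1], [2], [3], [4], [5])
) AS pvt
ORDER BY OffenseTeam;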

Easy, right?

No, not easy.

And not readable, either. What’s with that row saying NULL? Why do I not have a result for some teams in that last column?

As it turns out, you need a lot of experience writing T-SQL to get to that query. And you need more experience understanding the result set, too. You don’t start on Day 0 as a data professional writing PIVOT queries against a SQL Server database.

Here’s the good news: You don’t need to write PIVOT queries, ever.

Data Mining for the Masses

The data import from NFL Savant was in the form of a CSV file, which I then imported into my database. Because that’s how ETL is done (see above for the common data flow process).

What if…now hear me out…we skipped step 2? Forget about doing the import process. Instead, let’s open that CSV file in Excel.

Here’s what it would look like:

Back to our football questions. We’ve seen examples in T-SQL, let’s look at how to do this in Excel using a Pivot table.

I click on one cell in Excel, insert a pivot table, drag the offense teams as a row and the yards to go as a value, change it to an average, and we are done. Have a look:

It took but a few seconds to get this magic to happen. Here’s what I want you to know:

1. No T-SQL is necessary. None. Not one line of code.

2. I have the entire table as a pivot table, allowing me to answer more questions WITHOUT needing to write more T-SQL.

3. There is no code. None.

Let’s say that I want to know the yards to go broken down by quarter. With T-SQL, I would need to write a new query. With the pivot table, it’s a simple drag and drop, like this:

Fin.

There is no need to rewrite code to get this result. Because there is no code, it’s drag and drop, and then I have my answer.

And that’s why I believe the inclusion of pivot tables inside Excel is the greatest advancement in the 21st century for data professionals.

Fight me.

Summary

I did not come here to bury relational databases. I came here to help you understand that relational databases may not be the right place to do analytical processing.

When it comes to curating and consuming data, I have three simple rules for you to follow:

Rule #1 – Only collect data that you need. Don’t collect data “just in case you may need it later.” The data you collect must be relevant for your needs right now.

Rule #2 – Understand that all data is dirty. You could build a perfect analytical solution, but it would be based on inaccurate data. Know the risks involved in making business decisions based on dirty data.

Rule #3 – Before you collect any data, consider where the data will be processed. Don’t just assume that your database will do everything you need. Take time to list out all the available tools and systems at your disposal. The result may be a simpler solution than first imagined.

I wrote this post to help you understand Rule #3. Analysis of NFL play by play data is best done in a tool such as Excel, or PowerBI, and not (necessarily) inside of SQL Server.

SQL Server is a robust relational database engine, containing integrations with data science-y stuff such as R and Python. Just because you could do your analysis inside the SQL Server engine doesn’t mean you should.

This post originally appeared on PowerPivotPro and I was reminded about its existence while talking with Rob Collie during our Raw Data podcast. I asked Rob if I could repost here. He said yes. True story.

Tune Workloads, Not Queries
https://thomaslarock.com/2020/08/tune-workloads-not-queries/ (August 31, 2020)

Ask three DBAs about their preferred performance tuning methodology and you will get back seven distinct answers. I bet a pound of bacon one of the answers will be “it depends”.

Of course, it depends! But on what does performance tuning depend?

Context.

Most performance tuning methodologies focus on tuning one or more queries. This is the wrong way of thinking. It is an antiquated way of problem solving.

Let me explain.

The Problem with Traditional Database Monitoring

Traditional database monitoring platforms were built from the point of view of the engine-observer. These tools focus on metrics inside the database engine, and may collect some O/S level metrics. They often assume the database is running on a single server node, and not a collection of nodes. And they are reactive in nature, notifying you after an issue has happened.

But the reality is your database engine is but a process running on top of an operating system, on a server that is likely virtualized and may be running in your data center or in the cloud. In other words, there are many layers between users and their data. And in a world of globally distributed systems, chances are your database is not on a single node.

This means your in-house legacy accounting application requires different monitoring and performance tuning methods than your on-line ordering system. When you focus on one query, or even a top ten list of queries, you have little to no information regarding the entire application stack. And those engine metrics we know and love will not help you understand the overall end user experience.

But when it comes to database performance tuning methods, there is a heavy focus on tuning activity inside the engine. This makes sense, because that’s what DBAs (and developers) know. That’s the silo in which they operate. They need to prove the issue is not inside the database.

Stop focusing on the database engine and open your mind to the world that exists outside of that database.

Once you turn that corner, the mean time to resolution shrinks. The result is a better end user experience.

Tune Workloads, Not Queries

The Heisenberg Uncertainty Principle states that the position and velocity of a particle cannot both be measured exactly at the same time. The more you know about position, the less you know about velocity, and vice-versa.

The same theory applies to database performance tuning methods. The more you know about activity happening inside of a database engine, the less you know about the entire system. Nowhere in an execution plan is there a metric for ‘user happiness’, for example.

Therefore, troubleshooting modern distributed systems requires a different approach. Enter the four golden signals: latency, traffic, errors, and saturation. These signals combine to help provide a measure of overall user experience. From there, if you need to dive into a database, you’ll have context necessary to start tuning at the server, instance, or query level. Over time you can shift to thinking about how to scale out, or up, as necessary.

Put another way, you would not expect your mechanic to tune your Jeep the same way she would tune a Ferrari. Both are vehicles but built for different purposes. The tools and methods are distinct for both. And so are the metrics and dashboards you want for your legacy applications versus a distributed one.

Summary

Slow is the new broke. But things don’t have to be slow to be broke. A poor user experience with your online ordering system will hurt your bottom line. Traditional database monitoring systems are not focused on the user experience. Instead, they focus on the database engine itself. But those engine metrics won’t tell you that Brad in Idaho got frustrated and left his shopping cart with $2,000 worth of potato seeds.

Your performance tuning methodology should include an understanding of the entire system and workload first, before you start looking at any specific query.

SQL Plan Warnings
https://thomaslarock.com/2020/03/sql-plan-warnings/ (March 24, 2020)

There are many methods available for optimizing the performance of SQL Server. One method in particular is examining your plan cache, looking for query plan warnings. Plan warnings include implicit conversions, key or RID lookups, and missing indexes to name a few. Each of these warnings is the optimizer giving you the opportunity to take action and improve performance. Unfortunately, these plan warnings are buried inside the plan cache, and not many people want to spend time mining their plan cache. That sounds like work.
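If you were to mine the cache by hand, the query tends to look something like this rough sketch (a generic example, not any particular tool’s query), searching the plan XML for one warning type, such as an implicit conversion:

-- Find cached plans whose XML contains a plan-affecting convert warning
SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time,
       st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE CAST(qp.query_plan AS NVARCHAR(MAX)) LIKE N'%PlanAffectingConvert%'
ORDER BY qs.total_worker_time DESC;

Repeat that for every warning type you care about, and it gets tedious quickly.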

That’s why last year our company (SolarWinds) launched a free tool called SQL Plan Warnings. Mining the plan cache often involves custom scripts and forces you to work with text-only output. We wanted to make things easier by providing a graphical interface. A GUI allows for basic application functionality, things like connecting to more than one instance at a time, or filtering results with a few clicks.

Let me give a quick tour of SQL Plan Warnings.

Connect to an instance

The first thing noteworthy here is how SQL Plan Warnings supports connecting to a variety of flavors of SQL Server. There’s the Earthed version, Azure SQL Database, Azure SQL Database Managed Instance, and Amazon RDS for SQL Server, as shown here:

From there you fill in your connection details. The login you choose will need either the VIEW SERVER STATE or SELECT permission for the following DMVs: dm_exec_query_stats, dm_exec_sql_text, and dm_exec_text_query_plan. I’ve provided links to the Microsoft docs for each, so you can review the permissions defined there.

Being able to easily connect to instances of SQL Server, no matter where they are located, is a must-have these days.

SQL Plan Warnings Settings

After you connect to your instance, SQL Plan Warnings will return the top 100 plans, with a default sort by CPU time. However, it is possible after connecting you may see no results. This is likely due to the default settings for SQL Plan Warnings. You get to the settings by clicking on the gear icon in the upper-right corner. Here is what the default settings look like:

If you are not seeing any results, change the default settings and refresh plan analysis. For me, I simply made the change to filter by executions, with 1 as the minimum. This returns a lot of noise, so you need to discover what makes the most sense for your particular instance.

Please note these default settings apply to each connected instance. Think of these settings as the highest level filter for all your connected sources. It may be possible you spend time adjusting these settings frequently, depending on the instance, the workload, and your query tuning goals.

Reviewing the SQL Plan Warnings Results

After plan analysis is complete, you will see a list of warnings found. It should look like this:

Note that a plan can have multiple warnings. So this list could be generated by one or more plans found.

From here we are able to filter on a specific warning type with a simple click. This allows us to narrow our focus. Perhaps today we want to focus on Key and RID lookups. We select that filter, then open the plan:

From here we can zoom and scroll, and view the node that has the lookup warning:

If we select the node, a properties dialog opens to the right. We can also see the other warnings included in this plan, if we want or need to investigate those at this time. We also have the ability to download the plan, if desired.

Summary

The SQL Plan Warnings tool is easy to use and allows you to be proactive in optimizing your environment. The GUI allows for quick filtering at the plan cache level as well as on the plan warnings themselves. This lets you focus on the plan warnings with the most impact.

One thing to note is the size of the plan cache you choose to analyze. Instances with a larger plan cache (1GB or greater) will have a larger number of plans to parse for warnings, so analysis may take longer.

You can download the SQL Plan Warnings tool here.

SQL Injection Protection
https://thomaslarock.com/2019/05/sql-injection-protection/ (May 22, 2019)

SQL injection is a common form of data theft. I am hopeful we can make SQL injection protection more common.

The 2018 TrustWave Global Security Report listed SQL Injection as the second most common technique for web attacks, trailing only cross-site scripting (XSS) attacks. This is a 38% increase from the previous year. That same report also shows SQL Injection ranked fifth on a list of vulnerabilities that can be identified through simple penetration testing.

You may look at the increase and think “whoa, attacks are increasing”. But I believe that what we are seeing is a rising awareness in security. No longer the stepchild, security is a first-class citizen in application design and deployment today. As companies focus on security, they deploy tools and systems to help identify exploits, leading to more reporting of attacks.

SQL Injection is preventable. That’s the purpose of this post today, to help you understand what SQL Injection is, how to identify when it is happening, and how to prevent it from being an issue.

 

SQL Injection Explained

SQL injection is the method where an adversary appends a SQL statement to the input field inside a web page or application, thereby sending their own custom request to a database. That request could be to read data, or download the entire database, or even delete all data completely.

The most common examples of SQL injection attacks are found inside username and password input boxes on a web page. This login design is standard for allowing users to access a website. Unfortunately, many websites do not take precautions to block SQL injection on these input fields, leading to SQL injection attacks.

Let’s look at a sample website built for the fictional Contoso Clinic. The source code for this can be found at https://github.com/Microsoft/azure-sql-security-sample.

On the Patients page you will find an input field at the top, next to a ‘Search’ button, and next to that a hyperlink for ‘SQLi Hints’.

 

[Image: Contoso Clinic Patients page used for the SQL injection example]

 

Clicking on the SQLi Hints link will display some sample text to put into the search field.

 

[Image: sample SQLi hint strings to paste into the search field]

 

I’m going to take the first statement and put it into the search field. Here is the result:

 

[Image: conversion error returned to the browser, revealing the SQL Server version]

 

This is a common attack vector, as the adversary can use this method to determine what version of SQL Server is running. This is also a nice reminder to not allow your website to return such error details to the end user. More on that later.

Let’s talk a bit about how SQL injection works under the covers.

 

How SQL Injection works

The vulnerability in my sample website is the result of this piece of code:

return View(db.Patients.SqlQuery(
    "SELECT * FROM dbo.Patients " +
    "WHERE [FirstName] LIKE '%" + search + "%' " +
    "OR [LastName] LIKE '%" + search + "%' " +
    "OR [StreetAddress] LIKE '%" + search + "%' " +
    "OR [City] LIKE '%" + search + "%' " +
    "OR [State] LIKE '%" + search + "%'").ToList());

This is a common piece of code used by many websites. It is building a dynamic SQL statement based upon the input fields on the page. If I were to search the Patients page for ‘Rock’, the SQL statement sent to the database would then become:

SELECT * FROM dbo.Patients
WHERE [FirstName] LIKE '%Rock%'
OR [LastName] LIKE '%Rock%'
OR [StreetAddress] LIKE '%Rock%'
OR [City] LIKE '%Rock%'
OR [State] LIKE '%Rock%'

In the list of SQLi hints on that page you will notice that each example starts with a single quote, followed by a SQL statement, and at the end is a comment block (the two dashes). For the example I chose above, the resulting statement is as follows:

SELECT * FROM dbo.Patients
WHERE [FirstName] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [LastName] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [StreetAddress] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [City] LIKE '%' OR CAST(@@version as int) = 1 --%'
OR [State] LIKE '%' OR CAST(@@version as int) = 1 --%'

This results in the conversion error shown above. This also means that I can do interesting searches to return information about the database. Or I could do malicious things, like drop tables.

Chances are you have code like this, somewhere, right now. Let’s look at how to find out what your current code looks like.

 

SQL Injection Discovery

Discovering SQL injection is not trivial. You must examine your code to determine if it is vulnerable. You must also know if someone is actively trying SQL injection attacks against your website. Trying to roll your own solution can take considerable time and effort.

There are two tools I can recommend you use to help discover SQL injection.

 

Test Websites with sqlmap

One method is to use sqlmap, an open-source penetration testing project that will test websites for SQL injection vulnerabilities. This is a great way to uncover vulnerabilities in your code. However, sqlmap will not tell you if someone is actively using SQL injection against your website. You will need to use something else for alerts.

 

Azure Threat Detection

If you are using Azure SQL Database, then you have the option to enable Azure Threat Detection. This feature will discover code vulnerabilities as well as alert you to attacks. It also checks for anomalous client login, data exfiltration, and if a harmful application is trying to access your database.

(For fairness, I should mention that AWS WAF allows for SQL injection detection, but their process is a bit more manual than Azure’s).

If you try to roll your own discovery, you will want to focus on finding queries that have caused errors. Syntax errors, missing objects, permission errors, and UNION ALL errors are the most common. You can find a list of the common SQL Server error message numbers here.

It warrants mentioning that not all SQL injection attacks are discoverable. But when it comes to security, you will never eliminate all risk; you take steps to lower it. SQL injection discovery is one way to lower your risk.

 

SQL Injection Protection

Detection of SQL Injection vulnerabilities and attacks is only part of the solution. In an ideal world, your application code would not allow for SQL Injection. Here’s a handful of ways you can lower your risk of SQL injection attacks.

 

Parameterize Your Queries

Also known as ‘prepared statements’, this is a good way to prevent SQL injection attacks against the database. For SQL Server, prepared statements are typically done using the sp_executesql() system stored procedure.

Prepared statements should not allow an attacker to change the nature of the SQL statement by injecting additional code into the input field. I said “should”, because it is possible to write prepared statements in a way that would still be vulnerable to SQL injection. You must (1) know what you are doing and (2) learn to sanitize your inputs.
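As a sketch, here is the patient search from the demo app rewritten with sp_executesql; the parameter name is mine, not from the sample code:

DECLARE @sql NVARCHAR(MAX) = N'
    SELECT * FROM dbo.Patients
    WHERE [FirstName] LIKE ''%'' + @search + ''%''
       OR [LastName]  LIKE ''%'' + @search + ''%''
       OR [City]      LIKE ''%'' + @search + ''%'';';

EXEC sys.sp_executesql
     @sql,
     N'@search NVARCHAR(100)',
     @search = N'Rock';

Because the user input arrives as a typed parameter, a value such as ' OR CAST(@@version as int) = 1 -- is treated as data to match against, not as executable SQL.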

Traditionally, one argument against the use of prepared statements centers on performance. It is possible that a prepared statement may not perform as well as the original dynamic SQL statement. However, if you are reading this and believe performance is more important than security, you should reconsider your career in IT before someone does that for you.

 

Use Stored Procedures

Another method available is stored procedures. Stored procedures offer additional layers of security that prepared statements may not allow. While prepared statements require permissions on the underlying tables, stored procedures can execute against objects without the user having similar direct access.

Like prepared statements, stored procedures are not exempt from SQL injection. It is quite possible you could put vulnerable code into a stored procedure. You must take care to compose your stored procedures properly, making use of parameters. You should also consider validating the input parameters being passed to the procedure, either on the client side or in the procedure itself.

 

Use EXECUTE AS

You could use a security method such as EXECUTE AS to switch the context of the user as you make a request to the database. As mentioned above, stored procedures somewhat act in this manner by default. But EXECUTE AS can be used directly for requests such as prepared statements or ad-hoc queries.
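For example (the user name here is hypothetical and would need to exist in the database with limited permissions):

EXECUTE AS USER = 'WebSearchUser';
SELECT [FirstName], [LastName]
FROM dbo.Patients
WHERE [City] = N'Boston';
REVERT;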

 

Remove Extended Stored Procedures

Disabling the use of extended stored procedures is a good way to limit your risk with SQL injection. Not because you won’t be vulnerable, but because you limit the surface area for the attacker. By disabling these system procedures you limit a common way that an attacker can get details about your database system.
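For example, xp_cmdshell is one of the more commonly abused extended stored procedures. It is off by default, and you can make sure it stays that way with sp_configure:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 0;
RECONFIGURE;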

 

Sanitize Error Messages

You should never reveal error messages to the end user. Trap all errors and redirect to a log for review later. The less error information you bubble up, the better.

 

Use Firewalls

Whitelisting of IP addresses is a good way to limit activity from anomalous users. Use of VPNs and VNETs to segment traffic can also reduce your risk.

 

Summary

The #hardtruth here is that every database is susceptible to SQL injection attacks. No one platform is more at risk than any other. The weak link here is the code being written on top of the database. Most code development does not emphasize security enough, leaving applications open to attacks.

When you combine poor database security techniques along with poor code, you get the recipe for SQL Injection.

 

REFERENCES

2018 TrustWave Global Security Report
Contoso Clinic Demo Application
sqlmap: Automatic SQL injection and database takeover tool
Azure SQL Database threat detection
Working with SQL Injection Match Conditions
How to Detect SQL Injection Attacks
sp_executesql (Transact-SQL)
EXECUTE AS (Transact-SQL)
Server Configuration Options (SQL Server)

Use PWDCOMPARE() to Find SQL Logins With Weak Passwords
https://thomaslarock.com/2019/02/use-pwdcompare-to-find-sql-logins-with-weak-passwords/ (February 6, 2019)

Not a day, week, or month goes by without news of yet another data breach.

And the breaches aren’t the result of some type of Mission Impossible heist. No, it’s often an unprotected S3 bucket, maybe some SQL Injection, or files left behind when relocating to a new office. Silly, fundamental mistakes made by people that should know better.

After decades of reviewing data breaches I have arrived at the following conclusion:

Data security is hard because people are dumb.

Don’t just take my word for it though. Do a quick search for “common password list” and you’ll see examples of passwords scraped from breaches. These are passwords often used by default to secure systems and data.

Chances are, these passwords are in your environment, right now.

Here’s what you can do to protect your data.

 

Use PWDCOMPARE() to Find SQL Logins With Weak Passwords

SQL Server ships with an internal system function, PWDCOMPARE(), that we can use to find SQL logins with weak passwords. We can combine this function, along with a list of weak passwords, and some PowerShell to do a quick check.

First, let’s build a list. I’ll store mine as a text file and it looks like this:

 

[Image: password_list.txt containing common weak passwords]

 

I can import that file as an array into PowerShell with one line of code:

$pwdList = Get-Content .\password_list.txt

And with just a few lines of code, we can build a query and execute against our instance of SQL Server:

# Assumes $SQLServer already holds the instance name and Invoke-Sqlcmd is available (SqlServer module)
foreach ($password in $pwdList) {
    $SQLText = "SELECT name FROM sys.sql_logins WHERE PWDCOMPARE('$password', password_hash) = 1;"
    Invoke-Sqlcmd -Query $SQLText -ServerInstance $SQLServer
}

And we find that the ITSupport login has a weak password:

 

[Image: query output showing the ITSupport login matched a weak password]

 

As Dark Helmet once said, “Now you see that evil will always triumph, because good is dumb.”

 

Preventing Weak Passwords for SQL Logins

One of the easiest things you can do is to enable the CHECK_POLICY for SQL logins. By default, enabling the CHECK_POLICY option will also force the password expiration by enabling the CHECK_EXPIRATION flag. In other words, you can have passwords for SQL logins expire as if they were windows logins, and you can enforce complex passwords.
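For example, to turn both checks on for the ITSupport login we found above:

ALTER LOGIN [ITSupport] WITH CHECK_POLICY = ON, CHECK_EXPIRATION = ON;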

However, even with those checks enabled, I would advise you still do a manual check for weak passwords. Do not assume that by enabling the password policy checks that you are secure. In fact, you should do the opposite. You should take a stance of assume compromise. This is a fundamental aspect of modern Cybersecurity practices.

As a side note, I also want to point out that Troy Hunt has collected the passwords from many data breaches, and he has made the passwords searchable. Do yourself a favor and take some of the passwords you’ve used throughout the web and see if they have been exposed at some point.

Summary

SQL Server offers system functions to help you search for weak passwords, as well as policies to enforce complex passwords and password expiration. You should adopt a stance of “assume compromise” and be proactive about checking the passwords in your environment to make certain they are not considered weak.

[Hey there, dear reader, if you liked this post about passwords and data security, then you might also like the full day training session I am delivering with Karen Lopez in two weeks at SQL Konferenz. The title is Advanced Data Protection: Security and Privacy in SQL Server, and you’ll learn more about how to protect your data at rest, in use, and in motion.]

 

No, You Don’t Need a Blockchain
https://thomaslarock.com/2018/11/no-you-dont-need-a-blockchain/ (November 1, 2018)


The hype around blockchain technology is reaching a fever pitch these days. Visit any tech conference and you’ll find more than a handful of vendors offering blockchain. This includes Microsoft, IBM, and AWS. Each of those companies offers a public blockchain as a service.

Blockchain is also the driving force behind cryptocurrencies, allowing Bitcoin owners to purchase drugs on the internet without the hassle of showing their identity. So, if that sounds like you, then yes, you should consider using blockchain. A private one, too.

Or, if you’re running a large logistics company with one or more supply chains made up of many different vendors, and need to identify, track, trace, or source the items in the supply chain, then blockchain may be the solution for you as well.

Not every company has such needs. In fact, there’s a good chance you are being persuaded to use blockchain as a solution to a current logistics problem. It wouldn’t be the first time someone has tried to sell you a piece of technology software you don’t need.

Before we can answer the question of whether you need a blockchain, let’s take a step back and make certain we understand blockchain technology, what it solves, and the issues involved.

What is a blockchain?

The simplest explanation is that a blockchain serves as a ledger. This ledger is a long series of transactions, and it uses cryptography to verify each transaction in the chain. Put another way, think of a very long sequence of small files. Each file is based upon a hash value of the previous file, combined with new bits of data and the answer to a math problem.
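Here’s a toy illustration of the chaining, in T-SQL only because it is handy; it has nothing to do with how any real blockchain is implemented. The point is that each block’s hash depends on the previous block’s hash plus the new data:

-- Hash of the previous "block"
DECLARE @prevHash  VARBINARY(32) = HASHBYTES('SHA2_256', N'genesis block');
-- New transaction data to record
DECLARE @data      NVARCHAR(100) = N'Alice pays Bob 5 coins';
-- The new block hash chains the new data to everything that came before it
DECLARE @blockHash VARBINARY(32) =
    HASHBYTES('SHA2_256', @prevHash + CAST(@data AS VARBINARY(200)));
SELECT @blockHash AS BlockHash;

Tamper with any earlier block and every hash after it changes, which is what makes the ledger verifiable.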

Put another way, blockchain is a database—one that is never backed up, grows forever, and takes minutes or hours to update a record. Sounds amazing!

What does blockchain solve?

Proponents of blockchain believe it solves the issue of data validation and trust. For systems needing to verify transactions between two parties, you would consider blockchain. Supply chain logistics is one problem people believe is solved by blockchain technology. Food sourcing and traceability are good examples.

Other examples include Walmart requiring food suppliers to use a blockchain provided by IBM starting in 2019. Another is Albert Heijn using blockchain technology along with the use of QR codes to solve issues with orange juice. Don’t get me started on the use of QR codes; we can save it for a future post.

The problem with blockchain

Blockchain should make your system more trustworthy, but it does the opposite.

Blockchain pushes the burden of trust onto individuals adding transactions to the blockchain. This is how all distributed systems work. The burden of trust goes from a central entity to all participants. And this is the inherent problem with blockchain.

[Warrants mentioning – many cryptocurrencies rely on trusted third parties to handle payouts. So, they use blockchain to generate coins, but don’t use blockchain to handle payouts. Because of the issues involved around trust. Let that sink in for a moment.]

Here’s another issue with blockchain: data entry. In 2006, Walmart launched a system to help track bananas and mangoes from field to store, only to abandon the system a few years later. The reason? Because it was difficult to get everyone to enter their data. Even when data is entered, blockchain will not do anything to validate that the data is correct. Blockchain will validate the transaction took place but does nothing to validate the actions of the entities involved. For example, a farmer could spray pesticides on oranges but still call it organic. It’s no different than how I refuse to put my correct cell phone number into any form on the internet.

In other words, blockchain, like any other database, is only as good as the data entered. Each point in the ledger is a point of failure. Your orange, or your ground beef, may be locally sourced, but that doesn’t mean it’s safe. Blockchain could show the point of contamination, but it won’t stop it from happening.

Do you need a blockchain?

Maybe. All we need to do is ask ourselves a few questions.

Do you need a [new] database? If you need a new database, then you might need a blockchain. If an existing database or database technology would solve your issue, then no, you don’t.

Let’s assume you need a database. The next question: Do you have multiple entities needing to update the database? If no, then you don’t need a blockchain.

OK, let’s assume we need a database and we have many entities needing to write to the database. Are all the entities involved known, and do they trust each other? If the answer is yes, then you don’t need a blockchain. If the entities have a third party everyone can trust, then you also don’t need a blockchain. A blockchain should remove the need for a trusted third party.

OK, let’s assume we know we need a database, with multiple entities updating it, who don’t all trust each other and have no third party they can rely on. The final question: Do you need this database distributed in a peer-to-peer network? If the answer is no, then you don’t need a blockchain.

If you have different answers, then a private or public blockchain may be the right solution for you.

Summary

No, you don’t need a blockchain.

Unless you do need one, but that’s not likely.

And it won’t solve basic issues of data validation and trust between entities. If we can trust each other, then we would be able to trust a central clearinghouse, too.

Don’t buy a blockchain solution unless you know for certain you need one.

[This article first appeared on Orange Matter. Head over there and check out the great content.]

Life is dirty. So is your data. Get used to it.
https://thomaslarock.com/2018/10/life-is-dirty-so-is-your-data-get-used-to-it/ (October 10, 2018)

The internet provides everyone the ability to access data at any time, for any need. Unfortunately, it does not help guarantee that the data is valid, or clean.

In the past year I have earned certifications in three areas: Data Science, Big Data, and Artificial Intelligence. Those studies have provided me the opportunity to explore the world of data that exists. For example, Kaggle is a great source of data. They also offer competitions, if you are the type of person that enjoys money.

Today I want to spend time showing you an example of dirty data. Years ago, Rob Collie showed us that UFO pilots are attracted to LSD, but prefer ecstasy. As a follow up, let’s look at the National UFO Reporting Center Online Database. Or, as I like to call it, “what happens when you allow stoners to perform data entry”.

Let’s get started.

Dirty Data Example

The National UFO Reporting Center Online Database can be found at: http://www.nuforc.org/webreports.html

Navigate to that page, and once there click on the ‘Index by STATE’ link. Now notice at the top of the page there is a link for ‘UNSPECIFIED/INTERNATIONAL’.

OK, so let’s pause here for a minute. Judging by the word STATE and the list of US states and Canadian provinces, I assume this database has a North American focus. But there are more than 8,000 sightings listed as ‘UNSPECIFIED/INTERNATIONAL’. This doesn’t seem right to me, and I am now curious to know where the majority of these sightings are taking place. So, let’s download some data and get it into a map inside of PowerBI.

First, let’s examine that data by clicking the link. I want to see what the data looks like, and here is what I find:

[Image: sample rows from the UNSPECIFIED/INTERNATIONAL sightings report]

Then, using Excel we import the data, using the ‘From Web’ option in the Data tab:

[Image: importing the report into Excel using the ‘From Web’ option]

This downloads the data into Excel, and I will save the data as a CSV file. I will then import the CSV file into PowerBI. After the data is loaded I will create a map:

[Image: PowerBI map of the imported sightings]

So far this has taken me less than 10 minutes to download roughly 8,000 rows, import those rows into PowerBI, generate this map, and see…I see…

Look at those bubbles inside the USA. I suppose those are “international” to someone not in the USA. But it is clear there is a disconnect in how this website is expecting data to be entered, and how the users are entering data. Alcohol is likely a factor, I’m certain. But it’s clear to me that we have dirty data.

Just look at the first row. The entry says ‘Kiev (Ukraine)’. That’s two different labels (City, Country) in one field (i.e., column). This could explain why the database classifies this entry as unspecified.

The PowerBI map made it easy for me to visualize the dirty data. But you won’t be working with location data on every project. You’ll need to find different ways to determine if your data is dirty.

[SPOILER ALERT]: All data is dirty.

Data Has Always Been Dirty

There’s a series of videos from Dr. Richard Hamming, taken from lectures in 1995. I believe these videos should be required viewing for every data professional. There’s one video, in particular, that I’d suggest you watch, as it is related to the topic today. The video title is “Unreliable Data“, and it is Dr. Hamming delivering a no-nonsense lecture recalling his experiences with data over many decades.

[Side note – this is a wonderful example of how you can deliver a great presentation without needing fancy slides, pictures of cats, or code demos.]

Dr. Hamming has some wonderful insights to share in the video. Have a look:

Watch and listen to the wisdom Dr. Hamming shares with the class. Here’s the TL;DR summary for you:

There is never time to do it right, but somehow you think there will be time to fix it later.

OK, so where does all this dirty data come from?

The Origins of Dirty Data

The title of this post is a quote from my friend Oz du Soleil (website | @ozexcel). It came up in conversation one night at the Microsoft MVP Summit many years ago. For a data professional like myself, it’s one of those phrases that just sticks in your head and never leaves. Mostly because I spend lots of time cleaning data for various projects.

Your data gets dirty through a variety of ways. Here’s but a few examples:

Duplicate data – A single event is recorded and entered twice into your dataset.
Missing data – Fields that should contain values, do not.
Invalid data – Information not entered correctly, or not maintained.
Bad data – Typos, transpositions, variations in spelling, or formatting (say hello to Unicode!)
Inappropriate data – Data entered in the wrong field.

As Dr. Hamming suggests, it’s difficult to determine if data is ever clean. Even scientific constants are only known to a certain degree of accuracy. They are “good enough”, but not perfect.

Data’s ultimate purpose is to drive decisions. Bad data means bad decisions.

As a data professional it is up to us to help keep data “good enough” for use by others. We have to think of ourselves as data janitors.

But nobody goes to school to become a data janitor. Let’s talk about options for cleaning dirty data.

Data Cleaning Techniques

Here’s a handful of techniques that you should consider when working with data. Remember, all data is dirty, you won’t be able to make it perfect. Your focus should be making it “good enough” to pass along to the next person.

The first thing you should do when working with a dataset is to examine the data. Ask yourself “does this data make sense“? That’s what we did in the example above. We looked at the first few rows of data and found that both the city and country were listed inside of one column.

Then, before you do anything else, make a copy, or backup, of your data before you begin to make the smallest change. I cannot stress this enough.

OK, so we’ve examined the data to see if it makes sense, and we have a copy. Here are a few data cleaning techniques, with a rough T-SQL sketch of some of them after the list.

Identify and remove duplicate data – Tools such as Excel and PowerBI make this easy. Of course, you’ll need to know if the data is duplicated, or two independent observations. For relational databases we often use primary keys as a way to enforce this uniqueness of the records. But such constraints aren’t available for every system that is logging data.

Remove data that doesn’t fit – Data entered that doesn’t help you answer the question you are asking. In our example, if I want North America sightings, I would remove all entries logged as outside North America.

Identify and fix issues with spelling, etc. – There’s lots of ways to manipulate strings to help get your data formatted and looking pretty. For example, you could use the TRIM function to remove spaces from the text in a column, then sort the data and look for things like capitalization and spelling. There’s also regional terms, like calling a sugary beverage “pop” or “soda”.

Normalize data – Set a standard for the data. If the data is a number, make sure it is a number, not text. If it is categorical, make sure it has entries that apply for that category. Spelling, capitalization, etc., are all ways to set standards and normalize data to some degree.

Remove outliers – But only when it makes sense to do so! If the outlier was due to poor collection, then it could be safe to remove. Dr. Hamming suggested that “for 90% of the time, the next independent measurement will be outside the 90% confidence interval”. I trust his judgement here, so be mindful that outliers are innocent until proven guilty.

Fix missing data – This gets…tricky. You have two options here. Either you remove the record, or you update the missing value. Yes, this is how we get faux null values. For categorical data I suggest you set the data to the word ‘missing’. For numerical data, set the value to 0, or to the average of the field. I avoid using faux nulls for any data, unless it makes sense to note the absence of information collected. Your mileage may vary.
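To make a few of these concrete, here is a rough T-SQL sketch against a hypothetical staging table (dbo.UFOSightings, with made-up column names):

-- Remove exact duplicates, keeping one copy of each sighting
WITH dupes AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY SightingDate, City, Country, Shape
               ORDER BY (SELECT NULL)) AS rn
    FROM dbo.UFOSightings
)
DELETE FROM dupes WHERE rn > 1;

-- Trim stray whitespace as a first pass at fixing formatting
UPDATE dbo.UFOSightings
SET City = LTRIM(RTRIM(City));

-- Mark missing categorical values explicitly instead of leaving faux NULLs
UPDATE dbo.UFOSightings
SET Shape = 'missing'
WHERE Shape IS NULL OR LTRIM(RTRIM(Shape)) = '';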

Summary

Life is dirty. So’s your data. Get used to it.

I encourage you to work with datasets, like the ones at Kaggle. Then walk through the techniques I discussed here. Ask yourself if the data makes sense. Think about what you might do to make the data cleaner, if necessary.

Get familiar with tools like Python, Excel, and PowerBI and how they can help you with data cleaning.

And remember that no matter how much you scrub that data, it will never be clean, but it will be good enough.
