This post was originally published on DataOps Zone
In this post, I want you to join me on a mission at a fictitious company called Hackme Corporation. Your mission, should you choose to accept it, is to send Hackme’s year-end financial reports to third-party authorities while making sure no one can change the documents along the way. I can assure you that this blog post won’t self-destruct in 10 seconds. However, it will discuss what data integrity is and how to use it, and highlight the key differences between data integrity and data security. Then I’ll share a common myth about encryption and answer the question, can you use encryption for data integrity?
Let’s get started.
Your Mission Needs a Good Plan
How would you approach Hackme’s mission from a data security point of view? Every mission needs a good plan, and yours starts with the CIA principles. Not the Central Intelligence Agency, but rather the CIA principles of data security: confidentiality, integrity, and availability. Let’s go through them next.
The C and the A: Confidentiality and Availability Principles
The data confidentiality principle is all about keeping data private and secret. These days, data is the new digital currency, and nefarious hackers, state-sponsored actors, disgruntled employees, and occasional or recreational hackers are keen to get their hands on it. That data could include the organization’s intellectual property, personal data, health data, payment data, and the list goes on. Getting the confidentiality principle right is not only a business priority but is also mandated by laws and regulations. Data encryption, authentication, and authorization technologies will help you here, so make sure you use them wisely.
Remember, in our mission we need to send financial reports, and these records are available to the public in our case. This means we don’t worry about confidentiality that much. Let’s now turn our attention to availability.
The data availability principle means that data is always available and accessible to the organization. Availability faces its own threats, such as data center outages, hardware failures, DDoS attacks, and crypto-malware attacks. To be successful with data availability, your team needs battle-tested disaster recovery (DR) plans, data backup and restore procedures, redundant hardware, and iron-clad contracts with service providers.
Data availability also isn’t critical in our mission because we only need to send the reports. Let’s now turn our attention to the most critical principle for our mission: the data integrity principle.
The I: Data Integrity Principle
The data integrity principle focuses on the validity, accuracy, and consistency of the data. It’s a set of rules and mechanisms to record and receive data accurately over its whole life cycle. Data integrity is like when you send a parcel of fragile wine glasses to your grandma. To make sure grandma gets wine glasses and not broken glass, you wrap the glasses with paper or some other wrapping material. You can think of the wrapping material as the data integrity principle.
Sounds simple, doesn’t it? Well, not so fast—let’s explore data integrity a bit further.
Let me start by clarifying one thing first. Data “accuracy” in the context of data integrity is not accuracy in the traditional sense. Let me explain this with our financial report as an example. Data integrity does not focus on whether the report itself is correct. In other words, when the income statement isn’t accurate and doesn’t represent the financial truth of the organization, that’s a data quality issue. Data integrity’s job is to preserve the data, at whatever quality it was recorded, unchanged throughout the data life cycle. Does that make sense?
To preserve data quality and accuracy, we need to talk about physical and logical integrity. When we store and retrieve data from any digital storage, we need physical integrity. This is all about error-detection algorithms, checksums, and various mechanisms working in the background transparently.
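To make the checksum idea a little more concrete, here is a minimal sketch of the kind of check that storage and network layers run behind the scenes. It uses CRC-32 from Python’s standard zlib module; the function names and the sample payload are made up for illustration, and real storage systems use their own (often stronger) error-detection codes.

```python
import zlib

def with_checksum(payload: bytes) -> tuple[bytes, int]:
    """Return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)

def is_intact(payload: bytes, checksum: int) -> bool:
    """Recompute the CRC-32 and compare it with the stored value."""
    return zlib.crc32(payload) == checksum

stored, crc = with_checksum(b"Hackme Corporation income statement")

# Simulate a single flipped bit, as a failing disk or noisy link might cause.
corrupted = bytes([stored[0] ^ 0x01]) + stored[1:]

print(is_intact(stored, crc))     # True
print(is_intact(corrupted, crc))  # False
```

Keep in mind that a checksum like this only catches accidental corruption. It offers no protection against someone who changes the data on purpose and recomputes the checksum.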
Logical integrity defines logical rules, constraints, and structures for your data. Why do we need logical integrity rules in the first place? Without them, we couldn’t make digital models or define complex relationships between things and data structures. For that reason, logical integrity comes in four flavors: entity integrity, referential integrity, domain integrity, and user-defined integrity rules. Let’s discuss these with a classic bank account example.
Logical Integrity Deep Dive
Entity integrity means that each entity is identifiable with a unique key. In other words, you’re a bank customer or an entity. The bank has to identify you in its system with a unique key so it won’t mistake you for someone else.
Referential integrity is another form of logical integrity. It ensures that the relationships between entities are clearly defined and stay valid. In our banking example, both you and your account are uniquely identified, but you also belong together. Referential integrity records which bank account belongs to you, and it guarantees that an account can never point to a customer who doesn’t exist. Hopefully your account has a balance with lots of zeroes in it!
Domain integrity encompasses constraints and rules that define which values the properties of a logical entity may hold, such as their type, format, and whether they are required. In other words, you can’t open a bank account without your name, your address, and so on.
User-defined integrity rules are additional constraints, limits, and rules defined on the basis of business requirements.
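The four kinds of logical integrity rules can be sketched with a tiny in-memory SQLite database. The table layout, the sample customer, and the overdraft limit of 500 are all hypothetical and only there to show each rule rejecting bad data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,                    -- entity integrity: unique key
    name TEXT NOT NULL CHECK (length(name) > 0)  -- domain integrity: name is required
);
CREATE TABLE account (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),  -- referential integrity
    balance     INTEGER NOT NULL CHECK (balance >= -500)   -- user-defined rule (hypothetical limit)
);
""")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO account (id, customer_id, balance) VALUES (10, 1, 2500)")

def violates(sql: str) -> bool:
    """Return True if the database rejects the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

bad_statements = [
    "INSERT INTO customer (id, name) VALUES (1, 'Bob')",                      # duplicate key
    "INSERT INTO account (id, customer_id, balance) VALUES (11, 99, 0)",      # unknown customer
    "INSERT INTO customer (id, name) VALUES (2, NULL)",                       # missing name
    "INSERT INTO account (id, customer_id, balance) VALUES (12, 1, -9000)",   # beyond the limit
]
results = [violates(sql) for sql in bad_statements]
print(results)  # [True, True, True, True]
```

Each rejected statement maps to one rule from this section, in the same order: entity, referential, domain, and user-defined integrity.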
How Is Data Integrity Different From Data Security?
Before I answer this question, let me clarify one thing first. People usually mean data confidentiality when they talk about data security. Both confidentiality and integrity play a key part in data security. They look similar, but they have different purposes. When you apply the data confidentiality principle, you want to keep the report’s contents secret and confidential. When you apply the integrity principle, you don’t want anyone to modify the report without your knowledge. To understand how these two principles differ, let’s take a look at two technologies used to support each principle: data hashing for integrity and encryption for confidentiality.
Get Cracking With Hashing
Hashing algorithms are one of the most fundamental tools in the data integrity toolset. They’re mathematical functions (e.g., MD5, SHA-1, SHA-2, BLAKE2) that you apply to data to generate hash values, also called hash digests. Keep in mind that MD5 and SHA-1 are no longer considered collision-resistant, so prefer SHA-2 or BLAKE2 for anything security-related. Just think of these hash digests as digital fingerprints of that data. The nature of a hash algorithm is that even the slightest change in the data produces a completely different fingerprint. How could this help with your mission?
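Here is a small sketch of the fingerprint idea, using SHA-256 (a member of the SHA-2 family) from Python’s standard hashlib module. The two report lines are invented sample data that differ by a single character.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"Net income: 1,000,000 USD"
tampered = b"Net income: 9,000,000 USD"  # one character changed

a, b = fingerprint(original), fingerprint(tampered)

# Count how many of the hex characters differ between the two digests.
differing = sum(x != y for x, y in zip(a, b))

print(a)
print(b)
print(f"{differing} of {len(a)} hex characters differ")
```

If you run it, you should see two digests that look unrelated, even though the inputs are almost identical.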
You could use hashing and generate the hash digest of the financial report. After that, all you need to do is send both the report and the digest to the third party. The third party would repeat the same process that you did. They’d generate their hash digest of the report and compare theirs with yours. If the digests match, they have proof that no one modified the report and the integrity is intact.
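The exchange described above could look roughly like this. It is a sketch only: the function names and the report contents are hypothetical, and SHA-256 is simply one reasonable choice of algorithm.

```python
import hashlib
import hmac

def make_digest(report: bytes) -> str:
    """Sender side: compute the SHA-256 digest that travels with the report."""
    return hashlib.sha256(report).hexdigest()

def verify_report(report: bytes, claimed_digest: str) -> bool:
    """Receiver side: recompute the digest and compare it with the one received."""
    actual = hashlib.sha256(report).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(actual, claimed_digest)

report = b"Hackme year-end report: net income 1,000,000 USD"
digest = make_digest(report)

print(verify_report(report, digest))                # True
print(verify_report(report + b" (edited)", digest)) # False
```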
However, when we use hashing algorithms alone, our mission could be in jeopardy. An attacker could still modify the report, generate a matching hash of the modified version, and send both to the third party. The third party wouldn’t know which version is trustworthy because now they’d have two documents and two hashes, each of which checks out. An attacker could also modify your hash, which would make your genuine report fail verification at the third party. It looks like our mission is difficult but not impossible; let’s move on to data encryption and see how it could help us.
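To see why a plain hash isn’t enough, here is a sketch of the substitution attack. It assumes the attacker controls the channel and can replace whatever is sent; the report contents are invented.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# What the sender puts on the wire: the report plus its digest.
genuine_report = b"Hackme year-end report: net income 1,000,000 USD"
message = (genuine_report, sha256_hex(genuine_report))

# The attacker swaps in a forged report and, because SHA-256 needs no
# secret, a freshly computed digest that matches the forgery.
forged_report = b"Hackme year-end report: net income 9,000,000 USD"
intercepted = (forged_report, sha256_hex(forged_report))

# The receiver's check passes for the forgery.
received_report, received_digest = intercepted
print(sha256_hex(received_report) == received_digest)  # True
```

The check succeeds because anyone can compute a hash. Nothing in the digest proves who produced it.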
Could Encryption Save the Day?
Data encryption is one of the most important tools in your data confidentiality toolkit. This topic is complex, challenging, and not for the faint of heart. Books and encyclopedias go into great detail about various encryption algorithms, technologies, and methods, but for now, let’s keep things simple. First, you generate a key and encrypt the document. Then, you send the encrypted document to the third party. The third party decrypts the document with the decryption key and reads it. Does this mean we managed to maintain Hackme’s report integrity? If you think data encryption is the answer, please read on.
Can Data Encryption Guarantee Data Integrity?
Previously, data encryption looked like a great solution, but unfortunately, you can’t rely only on encryption for data integrity. Why? Because an attacker could still modify the encrypted document. Remember, in the digital world everything is zeroes and ones, and the same is true for our encrypted report. The attacker could still inject or overwrite the zeroes and ones in the document in certain cases. This means the third party could still decrypt the modified document. In that case, the third party wouldn’t know that the document was modified by an attacker.
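Here is a sketch of one such case. It uses a toy stream cipher built from SHA-256, written only for this illustration and not suitable for real use, and it assumes the attacker knows or can guess where a particular digit sits in the document. Other encryption modes react to tampering differently, so treat this as one example rather than a general rule.

```python
import hashlib
import secrets

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over the key plus a counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypts and decrypts: XOR with the keystream is its own inverse."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))

key = secrets.token_bytes(32)
plaintext = b"Net income: 1,000,000 USD"
ciphertext = xor_cipher(key, plaintext)

# The attacker never sees the key. Knowing only the position of the first
# digit, they XOR that ciphertext byte with ('1' XOR '9').
position = plaintext.index(b"1")
tampered = bytearray(ciphertext)
tampered[position] ^= ord("1") ^ ord("9")

print(xor_cipher(key, bytes(tampered)))  # b'Net income: 9,000,000 USD'
```

The third party decrypts without any error and reads a clean-looking report with a very different number in it.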
Using encryption alone could help you with confidentiality, but you can’t rely on it for data integrity. That’s why most modern cryptographic solutions combine encryption with hashing, often in the form of a hash that is keyed with a secret so an attacker can’t simply recompute it. The same applies to our mission: to be successful, we have to use both encryption and hashing.
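One common way to tie a secret key to a hash is HMAC, available in Python’s standard hmac module. This post doesn’t prescribe a specific construction, so take the sketch below as an assumption about how the combination might look. It also assumes both parties already share a key, and the message stands in for the (ideally encrypted) report.

```python
import hashlib
import hmac
import secrets

def tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message using the shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, received_tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(tag(key, message), received_tag)

shared_key = secrets.token_bytes(32)  # assumed to be exchanged securely beforehand
report = b"Hackme year-end report: net income 1,000,000 USD"
report_tag = tag(shared_key, report)

# The attacker can change the report but has to guess the key for the tag.
forged = b"Hackme year-end report: net income 9,000,000 USD"
forged_tag = tag(secrets.token_bytes(32), forged)

print(verify(shared_key, report, report_tag))  # True
print(verify(shared_key, forged, report_tag))  # False
print(verify(shared_key, forged, forged_tag))  # False
```

Unlike the plain hash, the attacker can no longer produce a matching tag for a modified report without knowing the shared key.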
Data Integrity Mission Epilogue
In summary, we discussed the CIA principles, with a focus on the data integrity principle. Now you understand what the key differences are between data integrity and data confidentiality. I’ve also busted a common myth and answered the question, “Can data encryption guarantee data integrity?” Your key takeaway is that you can’t rely on encryption or hashing algorithms alone for data integrity, and to be successful in our mission, we’d have to use both to send the financial report.
Congratulations! You completed this mission with flying colors. In closing, I’ll say that our mission could still be improved with digital signatures and certificates—but that’s a story for another time. Until then, keep safe and secure.