Versity’s most recent VSM 1.6.0 release introduces several new types of checksum algorithms, which are used to ensure data correctness. It inspired us to step back and think about what checksums are, why people use different checksum algorithms, and the role they play in long-term data archiving.
A checksum is a small piece of computed information about a larger piece of digital data, usually a file. Checksums can be thought of as data fingerprints because the checksum output changes if the data changes. The checksum is used to check data after it has been transferred or stored, to verify that the information is still the same as it was before. The computation that produces the checksum is referred to as the checksum algorithm. There are several different types of algorithms, which all perform the same task (computing a checksum) but differ considerably in their implementation details.
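To make the fingerprint idea concrete, here is a minimal sketch using Python's standard `hashlib` module. SHA-256 is chosen only for illustration, and the `checksum` helper and the sample strings are hypothetical, not anything taken from VSM.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy cog"  # one character changed

original_sum = checksum(original)
altered_sum = checksum(altered)

print(original_sum)
print(altered_sum)
# Recomputing over unchanged data reproduces the checksum; any change alters it.
print("match" if original_sum == altered_sum else "data changed")
```

Changing a single character yields a completely different checksum, which is what makes the comparison useful for detecting changes.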
The main characteristics of checksums are performance, collision likelihood, and size.
Performance in this case is measured by a combination of the compute resources needed and how long the computation takes to complete. If an algorithm is too fast, it may not be very strong, which can lead to security issues. If it is too slow, users or admins may choose to turn checksumming off completely rather than wait for checksums to be calculated. More compute power can help computations complete more quickly, but for some algorithms even more or faster processors do not significantly improve the performance of a single checksum.
Collisions occur when two distinct inputs (two different files) produce the same checksum output. When checksums are used to verify data correctness this is a problem, because the checksum delivers a false positive verification. Interestingly, all checksum algorithms produce collisions due to something called the pigeonhole principle: there are more possible inputs than possible checksums, so some inputs must share a checksum. The goal is to select an algorithm that is highly unlikely to produce collisions but is still computationally efficient. In February 2017, Marc Stevens et al. published a paper that demonstrated a practical SHA-1 collision, computed with roughly 2^63 SHA-1 operations, the equivalent of about 6,500 years of single-CPU computation plus 110 years of single-GPU computation. This work demonstrated that SHA-1 was not only theoretically a weak algorithm, but also vulnerable in a practical sense to a well-resourced attacker. SHA-256 by comparison has no known collision attack better than a generic birthday search of about 2^128 computations; a supercomputer capable of processing 15 trillion SHA-256 computations per second would need roughly 7 × 10^17 years to generate a single collision that way. It is interesting to note that collisions are also a problem because they can be used to overwrite data, derive data, or even create load balancing issues that can lead to denial of service, all scenarios to be aware of from a security point of view. For archiving, the biggest threat is that collisions could be used to maliciously corrupt data without detection by checksum verification.
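The pigeonhole principle is easy to observe with a deliberately tiny checksum. The sketch below is an illustration only: `tiny_checksum` is a hypothetical helper that keeps just the first 16 bits of SHA-256, so there are only 65,536 possible outputs and a collision is guaranteed after at most 65,537 distinct inputs.

```python
import hashlib


def tiny_checksum(data: bytes) -> bytes:
    """Deliberately weak checksum: only the first 2 bytes (16 bits) of SHA-256."""
    return hashlib.sha256(data).digest()[:2]


def find_collision():
    """Hash distinct inputs until two of them share the same tiny checksum."""
    seen = {}  # tiny checksum -> first input that produced it
    counter = 0
    while True:
        data = str(counter).encode()
        digest = tiny_checksum(data)
        if digest in seen:
            return seen[digest], data
        seen[digest] = data
        counter += 1


first, second = find_collision()
print(first, second, tiny_checksum(first).hex())
```

With a full-size checksum the same search is not feasible, which is why the size of the hash space matters so much.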
The size of a checksum is the amount of space the checksum takes on the storage device, usually reported in bits. In general, the more advanced algorithms, as discussed above, tend to produce bigger and therefore less collision-prone checksums. The lower likelihood of collisions comes from an increase in the total hash space. The quality of the algorithm determines how evenly the hashes are spread across the available space. For some well-done visualizations of the hash space and how the algorithms spread over it, see the top answer on this Stack Exchange thread. The size can be indicated in the name of the hash; for example, SHA-256 produces a checksum that is 256 bits.
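As a quick check of checksum sizes, Python's `hashlib` reports the digest size of each algorithm in bytes. This sketch covers only algorithms available in the standard library; `checksum_bits` is a hypothetical helper name.

```python
import hashlib

ALGORITHMS = ["md5", "sha1", "sha256", "sha512"]


def checksum_bits(name: str) -> int:
    """Return the checksum size in bits for a hashlib algorithm name."""
    return hashlib.new(name).digest_size * 8


for name in ALGORITHMS:
    bits = checksum_bits(name)
    # The hash space holds 2**bits possible checksums.
    print(f"{name}: {bits} bits, 2**{bits} possible checksums")
```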
The checksum algorithm is really a special kind of hash function. A hash function is a function, or process, that can be used to map data of arbitrary size to data of a fixed size. The types of hashes used for data integrity are distinguished by the presence or absence of keys and cryptographic properties. Keyed hashes require an additional piece of data, a key, to compute the checksum. This makes them inconvenient for data integrity checks, as an additional piece is required for every operation. Non-cryptographic hashes tend to be more performant than cryptographic hashes, which are designed to deliver more guarantees when it comes to the output.
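The difference between keyless and keyed hashes can be sketched with Python's standard `hashlib` and `hmac` modules. HMAC-SHA-256 is used here as one common keyed construction; the keys, the data, and the `keyed_checksum` helper are made up for illustration.

```python
import hashlib
import hmac

data = b"archived file contents"

# Keyless: anyone holding the data can recompute the same checksum.
keyless = hashlib.sha256(data).hexdigest()


def keyed_checksum(key: bytes, payload: bytes) -> str:
    """Keyed hash (HMAC-SHA-256): the key is required to compute or verify it."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


tag_a = keyed_checksum(b"key-a", data)
tag_b = keyed_checksum(b"key-b", data)

print(keyless)
print(tag_a)
print(tag_b)  # same data, different key, different result
```

The keyed variant has to keep the key available for every later verification, which is the extra operational burden described above.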
Cryptographic checksum algorithms are designed with several different security characteristics in mind. Most importantly they are built to be one way, which means you cannot derive the data from the checksum. They are also built to avoid collisions, to ensure that the hash changes significantly if the data changes, and to be resistant to brute force attacks such as pre-image attacks. As a result of these design decisions, cryptographic algorithms are often compute intensive and tend to be slower than non-cryptographic algorithms. For a nice overview of how cryptographic hashes work, see this illustrated guide. If data is sensitive or protected by regulations, cryptographic functions are likely worth the performance tradeoff. If performance is of paramount importance and the goal is simply to look for unintentional errors, like bit flips that occurred during data transfer, non-cryptographic checksums are probably a good choice.
Checksums and archiving
In archiving, checksums are used to make sure that transferred or stored copies of data match the data that was originally created. The application or system where the data was created computes a checksum which is stored, and then used later for comparison. The place where the data is stored or transferred need only apply the same checksum algorithm which should produce the same checksum. If differences are detected after comparing checksums, it can be concluded that the data was somehow corrupted, either accidentally or maliciously. Corrective action can then be taken to repair the data, or replace it with a correct copy.
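The store-then-compare workflow can be sketched in a few lines of Python. This is a generic illustration and not how VSM implements it: the helper names, the choice of SHA-256, and the temporary file are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Compute a file's checksum by reading it in chunks, so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str, algorithm: str = "sha256") -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return file_checksum(path, algorithm) == expected


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "archive.dat")
        with open(path, "wb") as handle:
            handle.write(b"data to archive" * 1000)
        stored = file_checksum(path)   # computed where the data was created
        intact = verify(path, stored)  # later check of the unchanged copy
        # Simulate corruption by flipping a single bit in place.
        with open(path, "r+b") as handle:
            handle.seek(10)
            byte = handle.read(1)
            handle.seek(10)
            handle.write(bytes([byte[0] ^ 0x01]))
        corrupted_passes = verify(path, stored)
        return intact, corrupted_passes


intact, corrupted_passes = demo()
print(intact, corrupted_passes)
```

The unchanged copy verifies and the copy with a flipped bit does not, which is the signal that corrective action is needed.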
In VSM, a user or admin can choose to generate a checksum to be stored in the filesystem inode, which relieves the user from the responsibility of storing the checksum in a separate file or database. VSM also optionally automates the check, not returning success until the entire file is retrieved from archive and the checksum is verified.
Types of checksums and which one to pick
As mentioned previously, checksum algorithm implementation details are really what distinguish different types of checksums. Which one to pick depends on requirements for the size of the checksums, the time to compute them, tolerance for collisions, and the need for security. The complete list of checksums is beyond the scope of this article, so we will stick to the ones available in VSM. These were chosen for their ability to detect data corruption, for their performance, since they must keep up with tape drives that reach up to 300 MB/s in the current generation, and to some degree for their overall popularity. There are both cryptographic and non-cryptographic offerings, but all are keyless, meaning that no additional information other than the original data is needed to compute the checksum.
The first supported checksum in VSM is MD5. MD5 stands for Message-Digest algorithm 5 and produces a 128-bit checksum. It is a cryptographic algorithm. MD5 is fast but not considered to be as secure as the SHA series. While the method to craft MD5 collisions is known and the algorithm in general has been declared not cryptographically secure, collisions are extremely unlikely to occur accidentally, so the algorithm may still be used with a reasonable degree of certainty for many data integrity applications. In addition to the existing MD5 algorithm currently offered by VSM, the VSM 1.6.0 release includes a vastly improved MD5 implementation that performs the calculation much more quickly, delivering a little over a factor of 4 improvement in our internal testing. This new and improved checksum is a great choice for VSM customers who were using the previous version of MD5.
SHA is a family of cryptographic checksum algorithms that are published by NIST. There are 14 variants, 3 of which are supported in VSM: SHA-1, SHA-256, and SHA-512. SHA-1 produces a 160-bit checksum and is the highest performing checksum in this family, followed by the 256 and then 512 versions. This family of checksums is a great choice for VSM users who have requirements from their organization for cryptographic capabilities.
Murmur3 is a non-cryptographic hash offered in VSM. The supported version is MurmurHash3_x64_128, which is optimized for 64-bit architectures and produces a 128-bit checksum. It typically performs more quickly than cryptographic functions, but should not be used for security applications. This is a great choice for VSM users who were previously using the “simple” algorithm, as this has shown almost a factor of 3 improvement in performance over “simple” and is the fastest checksum algorithm offered by VSM. This might be of interest to customers using object or disk archiving with high performance requirements.
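For a rough feel of how checksum throughput compares on your own hardware, the sketch below times Python's standard `hashlib` implementations over an in-memory buffer. It says nothing about VSM's implementations, and it omits Murmur3 because that algorithm is not in the Python standard library. Results vary with the CPU and library build, so the ordering on a given machine may differ from the one described above.

```python
import hashlib
import time


def throughput_mb_per_s(name: str, payload: bytes) -> float:
    """Hash payload once and return the observed throughput in MiB per second."""
    start = time.perf_counter()
    hashlib.new(name, payload).digest()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return len(payload) / (1024 * 1024) / elapsed


payload = bytes(16 * 1024 * 1024)  # 16 MiB of zero bytes as sample data

for name in ("md5", "sha1", "sha256", "sha512"):
    print(f"{name}: {throughput_mb_per_s(name, payload):.0f} MiB/s")
```

A single timed run like this is noisy; repeating the measurement and using real data gives a more trustworthy comparison against a target such as tape drive speed.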
Checksums are an important piece of the overall data protection strategy. There are many different types, and which one to use depends on requirements from the organization and users. At Versity we deliver rock-solid data protection for large data collections, and as part of that delivery we offer several different kinds of checksums to meet our customers’ needs.
Learn more about VSM and how it protects data.
What is checksum in data integrity? ›
A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.
Which cryptographic techniques can be used to verify file integrity? ›
Checksums are used to check files and other data for errors or manipulation that might have occurred during data transmission or storage. Cryptographic checksums are the basis for digital signatures and other use cases of cryptography and encryption.
Is checksum used to ensure integrity of data portions for data transmission or storage? ›
Checksums are used to ensure the integrity of data portions for data transmission or storage. A checksum is basically a calculated summary of such a data portion. Network data transmissions often produce errors, such as toggled, missing or duplicated bits.
Is checksum validation or verification? ›
Checksum is used to verify the integrity of the downloaded file. In this case, the Patch Manager Plus upgrade file's integrity is being verified.
What is checksum validation? ›
Checksum validation is used to ensure a downloaded file has not been corrupted. Checksum validation employs algorithms that compute a fixed-size datum (or checksum) from an arbitrary block of digital data. If two parties compute a checksum of a particular file using the same algorithm, the results will be identical.
How to use checksum to verify file integrity? ›
- Open the Windows command line. Press Windows + R, type cmd and press Enter. ...
- Go to the folder that contains the file whose MD5 checksum you want to check and verify. Command: Type cd followed by the path to the folder. ...
- Type the command below. certutil -hashfile <file> MD5. ...
- Press Enter.
Error checking and validation, for example, are common methods for ensuring data integrity as part of a process.
What is the difference between checksum and CRC? ›
Difference between Checksum and CRC:
| No. | Checksum | CRC |
|---|---|---|
| 1. | It is not a thorough concept for detection and reporting of errors. | It is a thorough concept for detection and reporting of errors. |
| 4. | Errors can be easily detected. | It follows a complex computation method for error detection. |
A mathematical value created using a cryptographic algorithm that is assigned to data and later used to test the data to verify that the data has not changed.
A checksum is intended to verify (check) the integrity of data and identify data-transmission errors, while a hash is designed to create a unique digital fingerprint of the data. A checksum protects against accidental changes. A cryptographic hash protects against a very motivated attacker.
What are the four 4 most secure encryption techniques? ›
- AES. The Advanced Encryption Standard (AES) is the trusted standard algorithm used by the United States government, as well as other organizations. ...
- Triple DES. ...
- RSA. ...
- Blowfish. ...
- Twofish. ...
Advantage: a simple checksum detects all single-bit errors and most multi-bit errors, although errors that cancel each other out in the sum can go undetected.
How reliable is checksum? ›
There is no absolute reliability of data. A checksum is a relatively small and fast way of providing some assurance that the data are correct, but it is up to the application to ensure that. TCP does use a checksum, but it does not guarantee the data are correct.
Should I enable data checksum for advanced data integrity? ›
When creating a shared folder, the "enable data checksum for advanced data integrity" option must be enabled. If this option isn't enabled, the data scrubbing process will not work for the specified shared folder.
When to use verify vs. validate? ›
Verification takes place while the product is still under development, but validation is performed upon the completion of a module, or the entire application. Validation focuses on ensuring that a stakeholder gets the product they desired.
Can checksum detect all errors? ›
It does not take into account the number of error bits and the type of error. For error detection, the sender needs to send some additional bits along with the data bits. The receiver performs necessary checks based upon the additional redundant bits.
What are three types of validation? ›
- A) Prospective validation (or premarket validation)
- B) Retrospective validation.
- C) Concurrent validation.
- D) Revalidation.
A CMOS Checksum error occurs when the computer isn't able to read that information. One of the more common causes of a checksum error is also the simplest to solve. The battery that powers the CMOS is a watch battery, and it can run out of power. When the battery is dead, the CMOS can't store information anymore.
What causes incorrect checksum? ›
The checksum might be incorrect when the firewall received the packets or after NAT was performed on the firewall. The packets of both the inside and outside interfaces of the firewall were analyzed. The result indicated that the TCP checksum was incorrect when the firewall received the packets from the server.
What does checksum validation failed mean? ›
If the vendor has changed or upgraded the package, for example to a new version, there will be a mismatch when the checksum value from the DB is compared with the actual checksum value of the package. This is reported with the error message: CHECKSUM validation failed.
How is checksum done? ›
The data being sent is divided into 16-bit chunks. These chunks are then added, and any generated carry is added back to the sum. Then the 1's complement of the sum is taken and put in the checksum field of the UDP segment.
What is MD5 checksum verification? ›
MD5 Checksum is used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change. Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious modification.
What is the checksum process? ›
A checksum is a technique used to determine the integrity of received data, i.e., to detect whether there was an error in transmission. Along with the data that needs to be sent, the sender uses an algorithm to calculate the checksum of the data and sends it along.
Why is checksum used? ›
A checksum is a string of numbers and letters that acts as a fingerprint for a file, against which later comparisons can be made to detect errors in the data. Checksums are important because we use them to check files for integrity.
What is the purpose of checksum in SQL? ›
CHECKSUM computes a hash value, called the checksum, over its argument list. Use this hash value to build hash indexes. A hash index will result if the CHECKSUM function has column arguments, and an index is built over the computed CHECKSUM value. This can be used for equality searches over the columns.
How is checksum data calculated? ›
To calculate the checksum of an API frame:
- Add all bytes of the packet, except the start delimiter 0x7E and the length (the second and third bytes).
- Keep only the lowest 8 bits from the result.
- Subtract this quantity from 0xFF.
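The three steps above can be sketched in Python as follows. The frame layout is an assumption made for the example: byte 0 is the start delimiter 0x7E, bytes 1 and 2 are the length, and the rest is frame data. The sample bytes and function names are illustrative only.

```python
def api_frame_checksum(frame_without_checksum: bytes) -> int:
    """Compute the checksum byte for an API frame that lacks its final checksum."""
    # Step 1: add all bytes except the start delimiter and the two length bytes.
    total = sum(frame_without_checksum[3:])
    # Step 2: keep only the lowest 8 bits. Step 3: subtract from 0xFF.
    return 0xFF - (total & 0xFF)


def frame_is_valid(frame: bytes) -> bool:
    """A complete frame is valid when frame data plus checksum sums to 0xFF in its low byte."""
    return (sum(frame[3:]) & 0xFF) == 0xFF


# Illustrative frame: delimiter, 2-byte length (8), then 8 bytes of frame data.
frame = bytes([0x7E, 0x00, 0x08, 0x08, 0x01, 0x4E, 0x49, 0x58, 0x42, 0x45, 0x45])
checksum = api_frame_checksum(frame)
complete = frame + bytes([checksum])

print(hex(checksum))
print(frame_is_valid(complete))
```

The validity check follows directly from the arithmetic: because the checksum is 0xFF minus the low byte of the sum, adding it back always gives a low byte of 0xFF for an undamaged frame.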