This project is read-only.
Project Description
Sample library that can be used to allow the client encrypt columns in Azure SQL Database or SQL Server.

In this project, we include the source code for a SQL column encryption library and some sample code on how it can be used for some of the most common tasks.

It is important to secure sensitive data in case there is a breach and a malicious party gets unauthorized access to your data. Such scenarios include storing sensitive data such as credit card numbers, social security numbers and/or personal identifiable information (PII) that must be protected and is potentially subject to regulations.

If you are in a scenario where you need to encrypt data, it is important to understand your options and prepare both your schema and your application to handle encryption, as well as plan for key management, and strategies for data lookup to minimize performance.

In most cases there are two levels of data encryption that can be considered:
  • At I/O level – Protects data at rest at the lowest level and is useful to protect against a malicious party that got unauthorized access to physical media (disk, files, etc.). Technologies such as SQL Server Transparent Data Encryption (TDE) serve to this purposes. It does not protect against online threats.
  • At column level – Protect relational data in such a way that data is protected at rest, even when the database is online. Depending on the key management solution, it can also provide protection against authorized database users who are able to read the column storing the sensitive data, but have no access to the key (which may be protected via access control or preferably stored outside the database).

As mentioned above, the I/O level encryption is a protection mechanism against physical attacks to the data (i.e. disk or DB file being compromised). Because it is mainly an infrastructure threat, its implementation would be something that Azure SQL Database would take care.

On the other hand column level encryption is an application-level responsibility. The application must define what sensitive columns need to be protected, define key management, and make the appropriate schema and application changes to handle the encrypted data.

Both levels of data encryption can be used in tandem as a mechanism to protect against both physical threats (i.e. media loss) and unauthorized data exfiltration while the database is on-line (assuming the adversary has no access to the key).

Schema Changes

The following is a list of consideration when defining the schema, which means it may require some changes to existing schemas in order to support encryption. In some cases there may be a way to work-around such changes, but as expected, it would involve a cost.
  • Ciphertext data length is typically larger than the original plaintext
  • Data type must be binary in order to avoid data loss.
    • If using a string data type, encode (i.e. Base64) the ciphertext before storing, storing binary data in a string column will likely result in data loss.
  • Some functionality may be limited since the service will not be able to operate on the data
  • When encrypting data, you are destroying order, so you can no longer index encrypted data as you would do for plaintext data. Consider that, when encrypting data, typically you may use encryption modes that will be non-deterministic because of the usage of a randomly created initialization vector (IV).

When the encrypted data was not used for any type of lookup, this may be sufficient; but in many cases, sensitive data is used for lookups and even as primary keys. Because of performance reasons, a data scan, by decrypting every row until a match is found is not acceptable, and an alternative may be needed. In some cases, identity based queries or partial match queries may be allowed, but at the cost of exposing some of the data (weakening the encryption) or/and some other limitations.

Here are a few strategies that would help to create some identity index or perform lookups based on partial information that is exposed. For the following examples we will assume the column we are protecting is a social security number (SSN):

These possible strategies are not all-inclusive, and there may be other alternatives that are omitted from this list.
  • Expose partial information on a separate column.
Split the SSN data into two columns: SSNlast4 (exposing the last four digits of the SSN) & SSNciphertext (which stores the encrypted SSN). This would allow to reduce the scan to a small subset of data (the ones matching the last 4 digits), at the cost of exposing some portion of the data that is deemed as acceptable.
  • Create an HMAC column.
SSN data into two columns: SSNhmac (SSN HMAC using a separate hmac key) & SSNciphertext (encrypted SSN using an encryption key). This would allow somebody with access to the HMAC key to do an equality lookup on any given SSN by calculating the HMAC with the appropriate key.
  • Use ECB mode.
When the plaintext is small enough to fit in 1 block of ciphertext, an appropriate alternative would be to use the ECB mode, which doesn’t use a random IV and would allow for equality lookups given that AES is a deterministic algorithm.

As described on the HMAC indexing solution, it may be possible to have multiple keys in the system. Although we will not cover key management in much detail, but here are a few things to consider when choosing a strategy:

Where to store the keys?
  • Hardware Security Module – HSM. This would be the preferred option.
  • Access to the key is limited (the key is not exposed)
  • Network enabled HSMs typically support Granular access control to individual keys.
  • Performant
  • Best option for a complete key management solution.
  • Cost
  • Software based Solutions
A few possibilities: keys stored in HBI-enabled Sharepoint, 3rd party key management software.
Consider protection of keys at rest, disaster recovery, Key identity model (i.e. identify the keys by a unique value that is not the key material), auditability
  • Less expensive
  • Ad-hoc
  • Ad-hoc
  • Many management tasks are left up to the user

Key rotation
  • Scenario & regulation specific
  • Age-based key rotation
    • Two possible strategies (not inclusive):
  • On key rotation, re-encrypt all data
    • Better suitable for small number of rows
    • Only one key needs to be maintained at any given time
  • On key rotation, old keys are only valid for decryption
    • Better suitable for large amounts of data
    • Possible to re-encrypt old values on “next use” in order to phase out old keys
    • Multiple keys need to be managed for extended period of time
  • How many keys do we need to manage?
    • It depends on the scenario.

Depending on the scenario, having a key management solution that allows multiple keys and that separates the key management from the data allows multiple scenarios:
  • Separation of roles
Have a key for lookup-roles (i.e. CSS), where they can access the row only if they know the plaintext (i.e. SELECT * FROM customer WHERE SSN_hmac = HMAC(‘<current customer SSN>’); ) and full-data access roles who can decrypt & recover sensitive data from all the rows (i.e. SELECT decrypt(SSN_ciphertext) FROM customer;).
  • Key rotation
    • Define specific lifetimes and manage encryption keys in such a way that old keys are no longer used for encrypting new data, but are still accessible to decrypt old/archived data.
    • Have a disaster recovery strategy that allows you to do a key rotation in case of a security breach.
  • Isolation/multi-tenancy

Since the key management is going to be handled by the application. There is an opportunity for multi-tenancy scenarios where the key is always under control of each tenant and not under the control of the middle-tier application or the Cloud provider, thus increasing isolation between tenants and providing stronger guarantees in case of a security breach.

You must consider crypto agility. You need to remember that cryptographic algorithms fall in disfavor over time (think of MD5 & DES). Design your cryptosystem to allow changes on the algorithms being used without having major disruptions. One recommendation is to have metadata describing the cryptosystem:
  • Could be using an existing standard, or an ad-hoc solution
  • Version
  • Algorithms being used (could be tied to version)
  • Additional parameter information (IV, AAD, etc.)
  • Key Identifier

Other security consideration you must have is that encryption protects against disclosure, but not against tampering. When you define your schema, it is recommended that you choose an algorithm or cryptosystem that includes some authentication/data integrity verification mechanism.

For the same reasons, it is important to consider the context. Since we are using relational data, one type of attack that we must consider is payload re-use, in case of our scenario in particular, the most common form of this type of attack is row-swapping; for example: The attacker may not be able to decrypt the "salary" column since it is encrypted, but he may be able to copy someone else’s salary value. A tip for preventing this attack is to include the primary key (or other non-volatile, but row-specific value) as part of the cryptosystem integrity verification mechanism.

NOTE: This project is being released only in source code form.

This library sample code is being released as sample code under a Microsoft Limited Permissive License (Ms-LPL)

Last edited Sep 24, 2014 at 12:30 AM by raulga, version 9