Written by César Segura, Subject Matter Expert at SDG Group.
Data Governance is a mandatory concern in any modern data platform. It has to be considered together with the other pieces you choose for your architecture, such as ingestion, transformation and delivery.
An important part of Data Governance is the security and privacy of your data. In my other series of security articles I expand on the accessibility layer and its different levels. In this one I will dive into one specific feature: a privacy policy that Snowflake provides.
The Differential Privacy Policy (DPP) adds noise to queries on a table or view that contains sensitive PII, so that users with restricted access can keep querying the data without being able to retrieve sensitive information. They can only obtain aggregated data, subject to restrictions.
In this article we will cover three topics:
1. What is the process to create a DPP?
When you create a DPP, you define the conditions under which the policy applies when a role tries to access the table. When the policy applies, Snowflake's services layer charges each query against a specific budget. That budget is generated automatically inside the DPP and limits how much a given role can use the protected table. You can extend the budget capacity in terms of the number of aggregates executed (Nbr_Agg), units (each query consumes more or fewer units depending on how much noise it requires) or time window (weekly, monthly…). The budget is created with default values, but they can be modified at any time.
An example is the following Snowflake script:
CREATE OR REPLACE PRIVACY POLICY
security_db.policies_schema.customers_policy AS () RETURNS privacy_budget ->
CASE
WHEN CURRENT_ROLE() = 'ACCOUNTADMIN' THEN no_privacy_policy()
WHEN CURRENT_ROLE() IN ('ROLE_1')
THEN privacy_budget(budget_name => 'Budget_1')
WHEN CURRENT_ROLE() IN ('ROLE_2')
THEN privacy_budget(budget_name => 'Budget_2')
ELSE privacy_budget(budget_name => 'Budget_Rest')
END;
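The budget capacity described above can, as far as I understand Snowflake's documentation, be set through optional arguments of privacy_budget. The sketch below assumes the argument names budget_limit, max_budget_per_aggregate and budget_window. The policy name and all values are hypothetical, so check the current CREATE PRIVACY POLICY reference before relying on it.

```sql
-- Sketch only: the argument names are assumed from Snowflake's PRIVACY_BUDGET
-- reference; the policy name and the values are illustrative.
CREATE OR REPLACE PRIVACY POLICY
  security_db.policies_schema.customers_policy_custom_budget
  AS () RETURNS privacy_budget ->
  CASE
    -- Administrators are exempt and see exact results.
    WHEN CURRENT_ROLE() = 'ACCOUNTADMIN' THEN no_privacy_policy()
    -- Everyone else shares one budget with explicit limits.
    ELSE privacy_budget(
      budget_name              => 'Budget_Rest',
      budget_limit             => 100,       -- total units available per window
      max_budget_per_aggregate => 0.5,       -- units charged for each aggregate
      budget_window            => 'Monthly'  -- how often the budget is refreshed
    )
  END;
```

With these illustrative values, a role could run roughly 200 aggregates per month before the budget is exhausted (100 / 0.5).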
When you apply the policy to a table, you must define which field or fields identify an entity within the table. An example is the following Snowflake script:
-- Assign the privacy policy to the CUSTOMERS_TABLE table.
ALTER TABLE dp_db.dp_schema.customers_table
ADD PRIVACY POLICY security_db.policies_schema.customers_policy
ENTITY KEY (id);
After that, you specify privacy domains for the fields that hold sensitive information, that is, the fields that can reveal an identity or sensitive information either individually or in combination. These privacy domains allow Snowflake to inject meaningful noise into every query run against the data later.
An example is the following Snowflake script:
-- Define privacy domains on CUSTOMERS_TABLE table fields
ALTER TABLE dp_db.dp_schema.customers_table ALTER (
COLUMN gender
SET PRIVACY DOMAIN IN ('Female', 'Male'),
COLUMN Address
SET PRIVACY DOMAIN IN ('BCN', 'SAB', 'MAD', 'TAR'),
COLUMN Birth_date
SET PRIVACY DOMAIN
BETWEEN (to_date('01/01/1954'), to_date('12/31/2007'))
);
2. How does it work with examples?
Now imagine a scenario where different roles (ADMIN and ROLE_1) want to execute the same query, Q1, against the table protected by a DPP.
Case 3: ROLE_1 executes the Q1 query again later:
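A minimal sketch of this scenario, assuming a hypothetical Q1 that counts customers per gender (the query itself is an illustration of mine, not necessarily the one used for the article). The final call to ESTIMATE_REMAINING_DP_AGGREGATES is how I understand Snowflake lets a role check its remaining budget; treat its exact signature as an assumption to verify.

```sql
-- Hypothetical Q1, run by a role that the policy exempts: exact results.
USE ROLE ACCOUNTADMIN;
SELECT gender, COUNT(*) AS customers
FROM dp_db.dp_schema.customers_table
GROUP BY gender;

-- Same Q1 run by ROLE_1: the policy applies, so the counts carry noise
-- and the query is charged against 'Budget_1'.
USE ROLE ROLE_1;
SELECT gender, COUNT(*) AS customers
FROM dp_db.dp_schema.customers_table
GROUP BY gender;

-- Q1 again later as ROLE_1: the result is not guaranteed to match the
-- previous run, because noise is applied to the output.
SELECT gender, COUNT(*) AS customers
FROM dp_db.dp_schema.customers_table
GROUP BY gender;

-- Assumed helper: estimate how many aggregates ROLE_1 can still run.
SELECT *
FROM TABLE(SNOWFLAKE.DATA_PRIVACY.ESTIMATE_REMAINING_DP_AGGREGATES(
  'dp_db.dp_schema.customers_table'));
```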
3. Why is it named a Differential Privacy Policy?
The name comes from the fact that the output of your query is not deterministic when a DPP applies to you. In that case noise is applied to your data, which affects the results every time. The difference between the noisy result and the correct (deterministic) value is what gives this security policy its name.
You can use the functions DP_INTERVAL_LOW and DP_INTERVAL_HIGH to determine the margins of that difference around the output result, with 95% confidence.
An example of use with a query:
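As a sketch, assuming ROLE_1 is subject to the policy and reusing the customers table from the earlier scripts (the column aliases are illustrative), both functions take the alias of the aggregated column:

```sql
USE ROLE ROLE_1;

-- The noisy count plus the lower and upper bounds of its noise interval.
SELECT
  COUNT(*)                    AS customers,
  DP_INTERVAL_LOW(customers)  AS customers_low,
  DP_INTERVAL_HIGH(customers) AS customers_high
FROM dp_db.dp_schema.customers_table
WHERE gender = 'Female';
```

The true count is expected to fall between customers_low and customers_high in about 95% of cases, so a wide interval signals that the noisy value should be read with caution.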
Conclusions
The DPP is one of the most complex privacy policies to understand. Many people have asked me about it, so I thought it would be a good idea to write this article and give Snowflake users an easy and quick overview.
Keep in mind that this type of policy is very useful in environments where you share sensitive information and want to tightly restrict the use of that data. Snowflake offers a full list of other policies, so make sure you understand your use case before applying this or any other feature.
Original article published on Medium here.