Nikhil Chandna, Senior Solutions Architect
Kapil Budhraja, Solutions Architect
Recently, a customer presented us with an intriguing challenge: rather than restricting access to Personally Identifiable Information (PII) within their Databricks environment, they wanted to monitor any instances where PII is downloaded to users’ local machines. This nuanced requirement reflects a balance between flexibility and oversight — allowing data exploration while ensuring compliance.
However, before we look at how this was achieved, let’s quickly refresh ourselves on what PII is and how to identify and safeguard it in an Enterprise Data platform.
Safeguarding Personally Identifiable Information (PII) is more critical than ever. PII encompasses any data that can be used to identify an individual, such as names, addresses, social security numbers, and even digital identifiers like IP addresses. With organizations increasingly leveraging cloud platforms like Databricks for large-scale analytics, ensuring robust monitoring and protection of PII is essential to maintaining compliance, building customer trust, and preventing data breaches.
Identify, secure, and manage PII in Databricks –
Andrew Weaver wrote a detailed blog about how to manage PII in databricks here.
In the blog post “Identifying and Tagging PII Data with Unity Catalog,” Andrew discusses leveraging Databricks’ Unity Catalog to effectively identify and manage Personally Identifiable Information (PII) within an organisation’s data ecosystem. Unity Catalog provides a centralised governance solution that enables data teams to classify and tag datasets containing PII, facilitating compliance with data protection regulations and enhancing data security. By implementing these tagging strategies, organisations can efficiently search their entire metastore for datasets tagged as PII or even specific types of PII using the entity_type tag, thereby streamlining data management processes and mitigating risks associated with sensitive information.
Since then, databricks has introduced a preview of Auto PII classification, which automatically detects and tags PII across your datasets. Additionally, an early preview of ABAC is now available.
If you’re interested in exploring these features, please reach out to your Databricks account team to request access.
With that primer out of the way, let’s jump directly into how we can identify any unauthorised downloads of PII in our data estate.
In Databricks, users can manually download the data through the UI in two primary ways:
- SQL Query Editor: From the query output in the SQL workspace.
- Notebook Results: Via the preview results set in a notebook cell.
Notebook Results Download SQL Query Editor download
It is important to call out that workspace or account administrators can disable these download capabilities entirely, preventing any data exports. Here’s an example of a query result in a workspace where downloads are blocked:
In some cases, platform teams intentionally preserve download functionality. For instance, users might need to analyse data locally using tools like Tabular Editor on their virtual machines or company laptops. This flexibility, however, heightens the need for robust monitoring, especially for sensitive PII. The goal shifts from proactive restriction to reactive auditing: allowing downloads but tracking them to ensure only authorised users access PII.
Consider a scenario where the marketing team is permitted to download PII for local analysis, but no one outside that team should have the same privilege. This blog explores how to identify all download events in Databricks and verify that only authorized users are involved.
Databricks system tables are the key to unlocking this capability. These pre-built tables store operational metadata about your account, offering visibility into resource usage, user activities, and system performance. Enabled through Unity Catalog — Databricks’ governance layer for data and AI — they span schemas like billing, compute, and access. For monitoring PII downloads, the audit table within the access schema is particularly valuable, as it logs user actions, including data exports.
With system tables, you can:
- Track usage patterns (e.g., DBUs or cluster metrics).
- Audit user activities for security and compliance.
- Query detailed logs to pinpoint specific events, like downloads.
(Note: Unity Catalog must be enabled to access these tables.)
To monitor PII downloads, we can craft SQL queries against the audit table. Below are examples targeting downloads from the SQL Editor and notebook cells over the last 30 days:
WITH user_downloads AS (
WITH audit AS (
SELECT
request_params.queryResultId AS queryResultId,
request_params.queryId AS queryId
FROM
system.access.audit
WHERE
action_name = 'downloadQueryResult'
AND event_date >= current_date() - INTERVAL 30 DAYS
AND request_params.queryId IS NOT NULL
AND request_params.fileType != 'arrows'
),
More AS (
SELECT
history.*,
audit.*
FROM
system.query.history AS history
JOIN
audit
ON
history.session_id = audit.queryResultId
)
SELECT
lineage.source_table_full_name,
created_by
FROM
system.access.table_lineage lineage
JOIN
More
ON
lineage.entity_run_id = More.statement_id
),
notebook_downloads AS (
SELECT
source_table_full_name,
FIRST(created_by) AS created_by
FROM
system.access.table_lineage
WHERE
entity_type = 'NOTEBOOK'
AND source_type = 'TABLE'
AND entity_id IN (
SELECT DISTINCT request_params.notebookId
FROM
system.access.audit
WHERE
action_name = 'downloadPreviewResults'
AND event_date >= current_date() - INTERVAL 30 DAYS
AND request_params.notebookId IS NOT NULL
)
GROUP BY
source_table_full_name
),
combined_downloads AS (
SELECT * FROM user_downloads
UNION ALL
SELECT * FROM notebook_downloads
)
SELECT distinct source_table_full_name, created_by FROM combined_downloads
You can enhance the above query to match with tables with “PII” tags as well –
WITH user_downloads AS (
WITH audit AS (
SELECT
request_params.queryResultId AS queryResultId,
request_params.queryId AS queryId
FROM
system.access.audit
WHERE
action_name = 'downloadQueryResult'
AND event_date >= current_date() - INTERVAL 30 DAYS
AND request_params.queryId IS NOT NULL
AND request_params.fileType != 'arrows'
),
More AS (
SELECT
history.*,
audit.*
FROM
system.query.history AS history
JOIN
audit
ON
history.session_id = audit.queryResultId
)
SELECT
lineage.source_table_full_name,
created_by
FROM
system.access.table_lineage lineage
JOIN
More
ON
lineage.entity_run_id = More.statement_id
),
notebook_downloads AS (
SELECT
source_table_full_name,
FIRST(created_by) AS created_by
FROM
system.access.table_lineage
WHERE
entity_type = 'NOTEBOOK'
AND source_type = 'TABLE'
AND entity_id IN (
SELECT DISTINCT request_params.notebookId
FROM
system.access.audit
WHERE
action_name = 'downloadPreviewResults'
AND event_date >= current_date() - INTERVAL 30 DAYS
AND request_params.notebookId IS NOT NULL
)
GROUP BY
source_table_full_name
),
combined_downloads AS (
SELECT * FROM user_downloads
UNION ALL
SELECT * FROM notebook_downloads
),
pii_tables AS (
SELECT catalog_name, schema_name, table_name
FROM system.information_schema.table_tags
WHERE tag_name ILIKE '%PII%' OR tag_value ILIKE '%PII%'
)
SELECT
distinct cd.source_table_full_name,
cd.created_by,
pt.catalog_name,
pt.schema_name,
pt.table_name
FROM
combined_downloads cd
JOIN
pii_tables pt
ON
cd.source_table_full_name = CONCAT(pt.catalog_name, '.', pt.schema_name, '.', pt.table_name)
While blocking downloads is one way to control PII access, monitoring provides a more nuanced and scalable approach. With the right setup in Databricks, you can support user needs and satisfy compliance requirements at the same time.
Disclaimer 1: Personal Views
The views and opinions expressed in this blog are solely those of the authors and do not represent those of Databricks or its affiliates.
Disclaimer 2: Evolving Platform Notice
Databricks is an evolving platform with new products and features being added regularly. The approaches discussed here are effective today for capturing data from notebooks and warehouses. However, without future updates, they may not automatically extend to new products introduced later.
Originally published at https://medium.com on April 24, 2025.
Learn more Monitoring PII Downloads in Databricks: Balancing Flexibility with Compliance