Governing AI Agents
Let’s start with the why
While interviewing for an AI engineer position, I was surprised that many questions were about data governance and how to make sure agents don't cause damage.
Until then, my only concern had been building a performant agent that could understand its task. It had never crossed my mind that an agent might cause damage when it doesn't work as expected (which is to be expected for a non-deterministic system).
The reality is that if you want your agent to access any database containing PII, you need to address the question of governance.
The solution
TL;DR: This whole topic is addressed in this free course on Andrew Ng’s deep learning platform, which explains and solves exactly this problem using the Databricks infrastructure.
1. Create a technical user (service principal) for your agent
The course uses Databricks as its platform, but creating a technical user is a best practice whenever you delegate work to an automated system. From now on, I'll call the technical user a service principal.
Whichever platform you choose and wherever you create the service principal, make sure you grant it only the permissions it needs, following the principle of least privilege.
2. Create tags for sensitive data
Tagging the data is again a best practice.
A very naive approach would be to add an extra column, such as confidential, to the existing table and alter the original table structure. Tags instead attach this classification as metadata, without touching the data itself.
Let’s use a simple employees table as an example:
| employee_id | name | department | ssn | salary |
|---|---|---|---|---|
| 101 | Jane Doe | Engineering | 123-45-678 | 120000 |
| 102 | John Smith | Sales | 987-65-432 | 95000 |
| 103 | Alice Brown | HR | 456-78-901 | 80000 |
In this case, they used the following tags:
- Public: Anyone can access (company policies)
- Internal: Employees only (e.g., employee_id, name, department)
- Confidential: Limited access (employee records, reviews)
- Restricted: Highly sensitive (e.g., ssn, salary)
Based on this, we would tag the ssn and salary columns as ‘Restricted’.
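On Databricks, column tags can be applied with ALTER TABLE ... SET TAGS. A sketch of what this could look like for our table (the tag key sensitivity is my own naming, not necessarily what the course uses):

```sql
-- Tag the two 'Restricted' columns on the employees table.
-- Assumes Unity Catalog; the tag key 'sensitivity' is a hypothetical choice.
ALTER TABLE employees ALTER COLUMN ssn SET TAGS ('sensitivity' = 'Restricted');
ALTER TABLE employees ALTER COLUMN salary SET TAGS ('sensitivity' = 'Restricted');
```

Tags like these can later be searched and used to drive access policies, which is exactly why they beat an ad-hoc extra column.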
3. Create an SQL View to filter out data
A SQL view exposes only pre-selected columns of the underlying table. Usually, neither a data analyst nor an agent should see all the data.
For our table, we could create a v_employee_directory view:
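The course's exact definition isn't reproduced here, but a minimal version that exposes only the 'Internal' columns could look like:

```sql
-- A filtered view over employees: the 'Restricted' columns
-- (ssn, salary) are simply not selected, so they are invisible
-- to anyone querying the view.
CREATE OR REPLACE VIEW v_employee_directory AS
SELECT employee_id, name, department
FROM employees;
```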
Now, the agent can query this view without ever seeing the ‘Restricted’ data.
4. Configure SQL Group permissions
Use granular permissions for the database.
First, the agent needs permissions on the Catalog (a Databricks concept) and SQL Schema containers before it can work with the underlying data, such as a SQL table.
Object permissions are inheritable, and they cover MODEL (a Databricks MLflow model), SELECT (reading data from tables), EXECUTE (running SQL functions) and CREATE TABLE (creating new tables).
For our agent, we would grant SELECT only on the v_employee_directory view, not the base employees table.
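In Databricks SQL, such grants could be sketched as follows (the catalog, schema, and service principal names here are hypothetical placeholders, not taken from the course):

```sql
-- `agent_sp` stands in for the agent's service principal;
-- `main.hr` is a hypothetical catalog.schema pair.
GRANT USE CATALOG ON CATALOG main TO `agent_sp`;
GRANT USE SCHEMA  ON SCHEMA  main.hr TO `agent_sp`;

-- SELECT only on the filtered view, never on the base table.
GRANT SELECT ON VIEW main.hr.v_employee_directory TO `agent_sp`;
```

Note what is absent: no SELECT on the employees table itself, so the agent's reach is bounded by the view definition.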
5. Column masking
Masking is needed because the view can be bypassed (e.g., if an agent has broader permissions or finds another way to query the base table).
With masking, even if the agent could query the employees table, a rule would show a masked value for the ssn column, like XXX-XX-678.
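Databricks supports this via a masking function attached to the column. A sketch under the assumption of an hr_admins group (a hypothetical name) that is allowed to see the real value:

```sql
-- Masking function: privileged group members see the real SSN,
-- everyone else (including the agent) sees 'XXX-XX-' plus the
-- last three digits.
CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURN CASE
  WHEN is_account_group_member('hr_admins') THEN ssn
  ELSE CONCAT('XXX-XX-', substr(ssn, -3))
END;

-- Attach the mask to the column; it now applies to every query
-- against employees, regardless of how the table is reached.
ALTER TABLE employees ALTER COLUMN ssn SET MASK ssn_mask;
```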
6. Build tools for the agents
Instead of building Python functions to query the data, we define the operations we need as SQL functions.
They inherit the caller’s permissions, can validate input, and every function call is logged.
For example, we could build a safe function get_department(employee_name) that only returns the department, which is much safer than letting the agent run SELECT * ....
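A sketch of such a function (the exact definition in the course may differ; note it reads from the filtered view, so even the function body never touches restricted columns):

```sql
-- A narrow, single-purpose tool the agent can call:
-- given a name, return only that employee's department.
CREATE OR REPLACE FUNCTION get_department(employee_name STRING)
RETURNS STRING
RETURN SELECT department
       FROM v_employee_directory
       WHERE name = employee_name
       LIMIT 1;
```

Because the agent can only invoke get_department, its queries are constrained by design rather than by hoping it writes safe SQL.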
Why Databricks?
I have to be honest and say that this was my first time using Databricks, and I just love it.
The second part of the course is where I see the benefits of using an integrated platform like it, and by that I mean:
- Functions: we can directly reference the SQL functions created previously.
- Governance: direct connection to our service principal.
- Deployment: the agent is deployed automatically after we test it.
For all the implementation details, make sure to check out the repository and finish the course.