
PROD-2486 Add Dynamic Erasure Email Integration #5226

Merged · 28 commits merged into main on Sep 10, 2024

Conversation

@erosselli (Contributor) commented Aug 22, 2024

Closes PROD-2486

Description Of Changes

Main idea

This introduces a new integration type called the Dynamic Erasure Email integration. It is very similar to our existing Generic Erasure Email integration, with the key difference being that the recipient email address is not set as part of the connection config, but rather computed at runtime from the privacy request's custom request fields. The connection config has a recipient_email_address field that is a reference to the dataset field containing the email address.

For example, let's suppose we have the following tenants table:

tenant_id | email
----------|-------------------------------
123       | tenant@example.com
456       | a-different-tenant@example.com

The idea behind this integration is that we set the recipient_email_address field to point to the email field of the tenants collection, e.g. tenants_dataset.tenants.email, and provide a tenant_id as part of the erasure request, which is then used to look up the corresponding email address.

The third_party_vendor_name is also a dataset reference field, since the vendor name will depend on the email address.
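
Conceptually, the runtime lookup behaves like this (a minimal sketch; the function name, vendor values, and in-memory table are illustrative stand-ins, not the actual connector code):

```python
from typing import Optional, Tuple

# Illustrative in-memory stand-in for the `tenants` collection above.
TENANTS = {
    "123": ("tenant@example.com", "Vendor A"),
    "456": ("a-different-tenant@example.com", "Vendor B"),
}

def look_up_recipient(tenant_id: str) -> Optional[Tuple[str, str]]:
    """Resolve (email, vendor) at runtime from the custom request field."""
    return TENANTS.get(tenant_id)

print(look_up_recipient("123"))  # ('tenant@example.com', 'Vendor A')
```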

Relevant changes

  • We added a new field, custom_request_field, to the field-level fides_meta. This depends on fideslang#13 (PROD-2486: Add custom_request_field to FidesMeta).
  • We relaxed the graph traversability requirement so that unreachable nodes that have at least one field with the custom_request_field attribute are ignored by the reachability check.
  • Created the new connector type.

Next Steps

Ideally we want to fully support custom request fields as part of the DSR graph. Once we implement that, a lot of the hacky workarounds in this PR won't be necessary. The goal here was to release this within one sprint for a specific customer request, but we do plan to fully support this feature in the future.

Code Changes

  • Add a new connection type dynamic_email_erasure and create a new DynamicErasureEmailConnector
  • Move some logic from GenericErasureEmailConnector into a new base class, BaseErasureEmailConnector, which holds logic common to both GenericErasureEmailConnector and DynamicErasureEmailConnector; both now inherit from it.
  • Make some changes in the graph building logic to ignore unreachable nodes if they have custom request fields.
  • Update SQLConnector and SnowflakeConnector with a new execute_standalone_retrieval_query method that executes a query without regard to the node's incoming/outgoing edges. This is used to run the query that retrieves the email address from the provided custom fields.
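
The idea of a standalone retrieval query can be sketched with stdlib sqlite3 (an assumption-laden illustration; the real method is built on the SQLConnector machinery, and the table/column names below just mirror the sample project):

```python
import sqlite3

def execute_standalone_retrieval_query(conn, collection, fields, filters):
    """Run a one-off SELECT against a single collection, ignoring graph edges.

    `collection`, `fields`, and `filters` keys are trusted identifiers here;
    the real connector derives them from the dataset config, not user input.
    """
    where = " AND ".join(f"{k} = ?" for k in filters)
    sql = f"SELECT {', '.join(fields)} FROM {collection} WHERE {where}"
    return conn.execute(sql, tuple(filters.values())).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dynamic_email_address_config "
    "(tenant_id TEXT, email_address TEXT, vendor_name TEXT)"
)
conn.execute(
    "INSERT INTO dynamic_email_address_config "
    "VALUES ('site-id-2', 'tenant@example.com', 'Vendor A')"
)
rows = execute_standalone_retrieval_query(
    conn,
    "dynamic_email_address_config",
    ["email_address", "vendor_name"],
    {"tenant_id": "site-id-2"},
)
print(rows)  # [('tenant@example.com', 'Vendor A')]
```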

Related Docs PR: https://github.com/ethyca/fidesdocs/pull/413/files

Steps to Confirm

Environment setup

  1. Make sure the following is set in the sample_project fides.toml file in the [execution] section:
allow_custom_privacy_request_field_collection = true
allow_custom_privacy_request_fields_in_request_execution = true
  2. Edit the initiate_scheduled_batch_email_send function in src/fides/api/service/privacy_request/email_batch_service.py so the job runs more often than once a week. With the cron trigger below, it fires once a minute (at second 30):
scheduler.add_job(
    func=send_email_batch,
    kwargs={},
    id=BATCH_EMAIL_SEND,
    coalesce=False,
    replace_existing=True,
    trigger="cron",
    second="30",
)
  3. Run the sample project: nox -s "fides_env(test)"
  4. Open the postgres_example database in the SQL client of your choice (or open a psql shell in the container) and edit the second row in the dynamic_email_address_config table so that the email_address column contains your email address. This is so you'll actually receive the emails.
  5. Open the Swagger API (or your API Postman collection if you prefer), and call the PUT /api/v1/messaging/default endpoint with the following payload:
{
  "service_type": "mailgun",
  "details": {
    "is_eu_domain": false,
    "api_version": "v3",
    "domain": <the-domain>
  }
}

You can get the domain from our 1Password Mailgun test credentials.
  6. From the Swagger API (or Postman), call the PUT /api/v1/messaging/default/{service_type}/secret endpoint with service_type set to mailgun. You can get the API key from the same 1Password credentials as before.

Testing the integration

  1. Create an erasure privacy request through the Swagger API (or Postman) using the POST /api/v1/privacy-request endpoint with the following payload:
[
  {
    "identity": {
      "email": "jane@example.com"
    },
    "custom_privacy_request_fields": {
      "tenant_id": {
        "label": "Tenant Id",
        "value": "site-id-2"
      }
    },
    "policy_key": "default_erasure_policy"
  }
]
  2. Go to the Admin UI and approve the privacy request. Within about 1-2 minutes, you should receive the erasure request email from Mailgun in your inbox.
    2.1 The setup for this to work is done in the sample project environment, but you can go to the Systems list in the Admin UI and see for yourself. There's a system for the new Dynamic Erasure Email integration, which has the dataset reference pointing to the Postgres integration's dynamic_email_address_config collection.

Testing error cases

Creating the integration from scratch

  1. Create a new system and go to its Integrations tab. Click Dynamic Erasure Email.
  2. When filling out the integration config, try to write invalid dataset references. Some cases to test, and their expected outcomes:
    2.1 Reference is not dot-delimited, e.g. asingleword => the frontend form shows an error specifying the correct format
    2.2 Reference is dot-delimited but only has 2 parts, e.g. collection.field => the frontend form shows an error
    2.3 Reference points to a nonexistent dataset, e.g. fake_dataset.collection.field => the backend returns an error, which is displayed on the frontend
    2.4 recipient_email_address and third_party_vendor_name reference different datasets => the backend returns an error, which is displayed on the frontend
    2.5 recipient_email_address and third_party_vendor_name reference different collections of the same dataset => the backend returns an error, which is displayed on the frontend
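
The validation rules above can be sketched as follows (a hypothetical illustration of the logic, not the actual backend code; function names are invented):

```python
def parse_dataset_reference(ref: str):
    """Split a reference into (dataset, collection, field).

    The reference must be dot-delimited with exactly three non-empty parts,
    mirroring cases 2.1-2.3 above.
    """
    parts = ref.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Invalid reference '{ref}': expected 'dataset.collection.field'"
        )
    return tuple(parts)

def validate_reference_pair(email_ref: str, vendor_ref: str) -> None:
    """Both fields must point at the same collection of the same dataset
    (cases 2.4 and 2.5 above)."""
    e_ds, e_coll, _ = parse_dataset_reference(email_ref)
    v_ds, v_coll, _ = parse_dataset_reference(vendor_ref)
    if (e_ds, e_coll) != (v_ds, v_coll):
        raise ValueError(
            "recipient_email_address and third_party_vendor_name must "
            "reference the same collection of the same dataset"
        )
```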

Running a misconfigured integration

  1. To simulate a misconfigured integration, open the datasetconfig table and edit the fides_key of the postgres_example_custom_request_field_dataset row, e.g. change it to test.
  2. Create the privacy request as before and approve it from the Admin UI.
  3. Wait for the task to run; you should see the privacy request with an error status, as well as an ExecutionLog in the Activity Timeline section that shows the error details.

Privacy request with no email lookup results
Creating and approving a privacy request with the following payload

[
  {
    "identity": {
      "email": "jane@example.com"
    },
    "custom_privacy_request_fields": {
      "tenant_id": {
        "label": "Tenant Id",
        "value": "site-id-5"
      }
    },
    "policy_key": "default_erasure_policy"
  }
]

should cause the privacy request to error, with an ExecutionLog containing the error details.

Privacy request with multiple results for email lookup
Creating and approving a privacy request with the following payload

[
  {
    "identity": {
      "email": "jane@example.com"
    },
    "custom_privacy_request_fields": {
      "tenant_id": {
        "label": "Tenant Id",
        "value": "site-id-multiple-emails"
      }
    },
    "policy_key": "default_erasure_policy"
  }
]

should cause the privacy request to error, with an ExecutionLog containing the error details.
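
The two error cases above boil down to enforcing exactly-one-result semantics on the email lookup. A minimal sketch (the exception class and function name are hypothetical, not the connector's actual API):

```python
class DynamicErasureEmailConnectorException(Exception):
    """Hypothetical error type for lookup failures."""

def resolve_recipient_email(rows):
    """Return the single lookup result, or raise on 0 or >1 matches."""
    if not rows:
        raise DynamicErasureEmailConnectorException(
            "No email address found for the provided custom request field"
        )
    if len(rows) > 1:
        raise DynamicErasureEmailConnectorException(
            "Multiple email addresses found for the provided custom request field"
        )
    return rows[0]
```

Either exception would be surfaced to the user as a failed privacy request plus an ExecutionLog entry.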

Pre-Merge Checklist

  • All CI Pipelines Succeeded
  • Documentation:
    • documentation complete, PR opened in fidesdocs
    • documentation issue created in fidesdocs
    • if there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
  • Issue Requirements are Met
  • Relevant Follow-Up Issues Created
  • Update CHANGELOG.md
  • For API changes, the Postman collection has been updated
  • If there are any database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!

vercel bot commented Aug 22, 2024

The latest updates on your projects:

1 Skipped Deployment
fides-plus-nightly — ⬜️ Ignored — Updated Sep 10, 2024 3:50pm (UTC)

cypress bot commented Aug 22, 2024

fides — Run #9864

Run status: Passed · Duration: 00m 37s
Commit: 25ae3fee73 (merge of d5db540557e2bd39abb76c686e1b56aaf7024df8 into 77dabdd4202b35247be5cb3d6afd...)
Committer: erosselli
Branch: refs/pull/5226/merge
Test results: 0 failed, 0 flaky, 0 pending, 0 skipped, 4 passing

.fides/fides.toml (resolved)
@@ -627,6 +658,48 @@ class QueryStringWithoutTuplesOverrideQueryConfig(SQLQueryConfig):
Generates SQL valid for connectors that require the query string to be built without tuples.
"""

# Overrides SQLQueryConfig.generate_raw_query
def generate_raw_query(
Contributor Author:

TODO: reuse code from generate_query_without_tuples

Contributor:

is there still some code here that could be consolidated?

src/fides/api/service/connectors/sql_connector.py (resolved)
src/fides/api/task/create_request_tasks.py (resolved)
src/fides/data/sample_project/fides.toml (resolved)
@erosselli erosselli requested a review from pattisdr August 22, 2024 20:47
@pattisdr (Contributor) left a comment:

You did a nice job tracking down all the individual pieces to get something working end-to-end @erosselli - and your self-review of this PR is great. But big picture, I have several concerns with how we're slotting in this new connector type and would consider refactoring so this doesn't require so much special handling. This might have to be a follow-up due to time constraints. Generally I think the dynamic erasure email connector is doing too much.

My primary issues:

  • Separation of concerns: the Dynamic Email Connector is doing all the heavy lifting, including querying the Postgres Custom Field Collection, even though each node is typically responsible for querying itself (that query would normally be handled by the Postgres Custom Field Collection node upstream).
  • Lots of special casing, given that the Postgres Custom Field Collection is not reachable via custom field.
  • The Dynamic Email Connector is being treated as a bulk email connector. Bulk emails are sent once weekly, combining the last week's users that need their info deleted into a single email. However, with dynamic emails, I don't think we get much benefit out of postponing these to fire once weekly. Have we considered firing these as part of the erasure graph itself?

Here's how I might approach this connector instead:

  • Make collections like the Postgres Custom Field Collection reachable by custom field, not just identity data.  That way the Postgres Custom Field Collection could be a proper part of the graph, with upstream nodes (the root collection). I definitely get that this is easier said than done, but avoids a lot of the special casing.
  • Also add the Dynamic Email Collection to the graph. We typically need a DatasetConfig/Dataset for this.  One way to achieve this would be when setting up the integration, a dataset is automatically created with just a single recipient_email_address field that adds fides_meta.references to point to the postgres collection with the custom field. - Then our graph would look like: Root -> Postgres Custom Field Collection -> Dynamic Email Collection -> Terminator. 
  • Unlike the other email connectors, the dynamic email connector could look more similar to a SaaS connector/database connector and have access/erasure methods. The erasure method itself would use the access data already retrieved by the upstream Postgres Custom Field Collection to fire off a single erasure email, instead of being run at the end in the "bulk email" section. I know I'm hand-waving over some of the complexity here, since there is no "access" step here.
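
The auto-generated dataset suggested above might look roughly like this (purely a sketch of the reviewer's proposal under assumed conventions; the fides_meta.references structure and key naming are illustrative, not a committed design):

```python
def build_dynamic_email_dataset(system_key: str, email_ref: str) -> dict:
    """Sketch: auto-generate a one-field dataset whose single
    recipient_email_address field references the custom-field collection.

    `email_ref` is a 'dataset.collection.field' reference, as used
    elsewhere in this PR.
    """
    # Split off the dataset key; the rest is the collection.field path.
    dataset, collection_field = email_ref.split(".", 1)
    return {
        "fides_key": f"{system_key}_dynamic_email",
        "collections": [
            {
                "name": "dynamic_email",
                "fields": [
                    {
                        "name": "recipient_email_address",
                        "fides_meta": {
                            "references": [
                                {
                                    "dataset": dataset,
                                    "field": collection_field,
                                    "direction": "from",
                                }
                            ]
                        },
                    }
                ],
            }
        ],
    }
```

With such a dataset in place, the graph would read Root -> Postgres Custom Field Collection -> Dynamic Email Collection -> Terminator, as described above.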

requirements.txt (resolved)
Comment on lines 13 to 14
revision = '9de4bb76307a'
down_revision = 'ffee79245c9a'
Contributor:

Looks good, just a note to make sure to double check your down_revision before you merge in case someone else has added a new migration in the interim - in that case you'd adjust your down_revision to their new migration

src/fides/api/task/create_request_tasks.py Show resolved Hide resolved
@pattisdr (Contributor):

@erosselli thanks for discussing next steps - as part of this increment, can you add code comments to the special-cased locations, where it is temporary until we get custom fields treated in more of a first class way?


codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 76.74419% with 60 lines in your changes missing coverage. Please review.

Project coverage is 86.29%. Comparing base (6a882bf) to head (d119ff6).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
src/fides/api/service/connectors/query_config.py 34.78% 28 Missing and 2 partials ⚠️
...vice/connectors/dynamic_erasure_email_connector.py 76.76% 15 Missing and 8 partials ⚠️
src/fides/api/service/connectors/sql_connector.py 61.53% 3 Missing and 2 partials ⚠️
src/fides/api/service/connectors/base_connector.py 50.00% 1 Missing ⚠️
...service/connectors/base_erasure_email_connector.py 97.82% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5226      +/-   ##
==========================================
- Coverage   86.41%   86.29%   -0.12%     
==========================================
  Files         362      365       +3     
  Lines       22792    22992     +200     
  Branches     3060     3090      +30     
==========================================
+ Hits        19695    19842     +147     
- Misses       2538     2583      +45     
- Partials      559      567       +8     


Comment on lines 160 to 166
if not dataset_config:
logger.error(
"DatasetConfig with key '{}' not found. Skipping erasure email send for connector: '{}'.",
dataset_key,
self.configuration.key,
)
return
Contributor Author:

@pattisdr @galvana In these cases where the connector is misconfigured, do we want to mark all the privacy requests as error or just skip them? I think we shouldn't really hit this case since we validate the config when creating the integration, but wanted to check what the expected behavior would be here. Right now I'm doing nothing which is probably not the right solution

Contributor:

I would raise an exception here just in case so the customer sees the failed privacy requests and we can work on addressing the connector instead of having this issue go undetected

Contributor Author:

I don't think raising the exception is enough for the error to be visible; I think I also need to add an ExecutionLog for each privacy request. I'll do that as well to ensure the privacy request shows as failed. Also, if I raise the exception it seems like the task keeps retrying, which is probably not what we want?

Contributor:

ah right we're outside of that typical flow

]


class BaseErasureEmailConnector(BaseEmailConnector):
Contributor Author:

Most of the logic here was taken directly from GenericErasureEmailConnector; I just made the typing a bit more flexible so we can inherit from it in both cases.

@erosselli erosselli marked this pull request as ready for review August 27, 2024 21:15
@erosselli erosselli force-pushed the PROD-2486 branch 2 times, most recently from c866b4a to 708b055 Compare August 29, 2024 15:26
Comment on lines 40 to 57
class ProcessedConfig(NamedTuple):
graph_dataset: GraphDataset
connector: BaseConnector
collection_address: CollectionAddress
collection_data: Any
email_field: str
vendor_field: str
dsr_field_to_collection_field: Dict[str, str]


class BatchedIdentitiesData(NamedTuple):
email_address: str
Contributor Author:

these are just some helper types I added to make the code easier to read

@pattisdr (Contributor) left a comment:

Thorough error handling here @erosselli! Good job working through the tradeoffs of this very nonstandard connector. My comments are fairly minor. First thing: I'd get your migration updated with main so tests can run on this branch.

@erosselli erosselli changed the title [WIP] PROD-2486 Add Dynamic Erasure Email Integration PROD-2486 Add Dynamic Erasure Email Integration Sep 4, 2024
@pattisdr (Contributor) left a comment:

Looks good @erosselli, thanks for all the extra docstrings and clarification with this nonstandard connector added on this iteration- for a PR this size, I'd do one more skim through to make sure there's no code you're accidentally committing, check your downrev one last time, etc.

@erosselli erosselli added the run unsafe ci checks Runs fides-related CI checks that require sensitive credentials label Sep 10, 2024
@erosselli erosselli merged commit 2e03b9b into main Sep 10, 2024
34 of 40 checks passed
@erosselli erosselli deleted the PROD-2486 branch September 10, 2024 17:02
@erosselli erosselli mentioned this pull request Sep 10, 2024
11 tasks

cypress bot commented Sep 10, 2024

fides — Run #9865

Run status: Passed · Duration: 00m 39s
Commit: 2e03b9b9c3 — PROD-2486 Add Dynamic Erasure Email Integration (#5226)
Committer: erosselli
Branch: main
Test results: 0 failed, 0 flaky, 0 pending, 0 skipped, 4 passing
