Add a utility to drop unknown references (and enforce referential integrity) #1792
Labels: data:multi-table (Related to multi-table, relational datasets), feature request (Request for a new feature)
Problem Description
For multi-table datasets, SDV currently expects every foreign key value to be present in the corresponding primary key (aka referential integrity). For various reasons* I may be in possession of a dataset that does not have referential integrity. This prevents SDV from modeling my data and gives me an error instead.
*Reasons may include some messiness in the data source, or having random (incomplete) data from various data sources
Expected behavior
Add a utility function called utils.drop_unknown_references.
Parameters:
- metadata: A MultiTableMetadata object
- data: A dictionary that maps each table name (string) to the data for that table (pandas.DataFrame)
- drop_missing_values: A boolean describing whether or not to also drop foreign keys with missing values
  - True: Drop a row if a foreign key has missing values
  - False: Allow rows to contain missing values as foreign keys
Output: A dictionary that maps each of the original table names (string) to cleaned data for that table (pandas.DataFrame). The cleaned data should have referential integrity.
Note that if a table has multiple foreign keys, then the script should only keep rows where all foreign keys have references. If any one foreign key has an unknown reference, the entire row should be dropped.
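To make the requested behavior concrete, here is a rough sketch of the logic using plain pandas. It is only an illustration, not a proposed implementation: instead of a MultiTableMetadata object it takes a hypothetical `relationships` list of dicts (with assumed keys `parent_table_name`, `parent_primary_key`, `child_table_name`, `child_foreign_key`), and the table and column names are made up. It also assumes that dropping rows should cascade, so that rows pointing at an already-dropped parent row are dropped too.

```python
import pandas as pd


def drop_unknown_references(relationships, data, drop_missing_values=True):
    """Sketch: return copies of the tables with orphaned child rows removed.

    `relationships` is a hypothetical stand-in for the metadata: a list of
    dicts describing each parent/child foreign key link.
    """
    cleaned = {name: table.copy() for name, table in data.items()}

    # Repeat until nothing changes, because dropping rows from a parent
    # table can leave new unknown references in its child tables.
    changed = True
    while changed:
        changed = False
        for rel in relationships:
            parent = cleaned[rel['parent_table_name']]
            child = cleaned[rel['child_table_name']]
            foreign_key = child[rel['child_foreign_key']]

            is_missing = foreign_key.isna()
            is_known = foreign_key.isin(parent[rel['parent_primary_key']])
            if drop_missing_values:
                keep = is_known & ~is_missing
            else:
                keep = is_known | is_missing

            if not keep.all():
                cleaned[rel['child_table_name']] = child[keep]
                changed = True

    return cleaned


# Made-up example data: sessions reference users, transactions reference sessions.
relationships = [
    {'parent_table_name': 'users', 'parent_primary_key': 'id',
     'child_table_name': 'sessions', 'child_foreign_key': 'user_id'},
    {'parent_table_name': 'sessions', 'parent_primary_key': 'id',
     'child_table_name': 'transactions', 'child_foreign_key': 'session_id'},
]
data = {
    'users': pd.DataFrame({'id': [1, 2, 3]}),
    'sessions': pd.DataFrame({'id': [10, 11, 12, 13],
                              'user_id': [1, 4, None, 2]}),
    'transactions': pd.DataFrame({'id': [100, 101, 102],
                                  'session_id': [10, 11, 12]}),
}
cleaned = drop_unknown_references(relationships, data, drop_missing_values=True)
```

Because every relationship is applied as its own filter on the child table, a row in a table with multiple foreign keys only survives if all of its foreign keys pass, which matches the note above.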
Update the error message: If the passed-in data does not have referential integrity, update the error message to point the user towards the drop_unknown_references method. This error check happens in metadata.validate_data.
Additional context