Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor BigQuery Destination Integration tests #20851

Merged
merged 6 commits into from
Dec 28, 2022

Conversation

grishick
Copy link
Contributor

@grishick grishick commented Dec 23, 2022

This refactoring reduces code duplication and moves configuration logic from code into config files.
The purpose of this change is to make it easier to add more test cases for configuration variations such as
data location, accounts with various permission combinations, and account impersonation. This change was inspired by the impersonation PR. In order to run all the tests with impersonated credentials, we would have had to write a bunch more code. Instead, after this refactoring, we can just add two more credential files (1 - positive test, 2 - negative test) and about ~10 lines of code.

Part of the change is removing BigQueryGcsDestinationTest, by replacing it with a gcs-enabled config file (I added new config files to GSM before submitting this PR).

@github-actions
Copy link
Contributor

github-actions bot commented Dec 23, 2022

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

✅ Sources (0)

Connector Version Changelog Publish
  • See "Actionable Items" below for how to resolve warnings and errors.

✅ Destinations (2)

Connector Version Changelog Publish
destination-bigquery 1.2.9
destination-bigquery-denormalized 1.2.9
  • See "Actionable Items" below for how to resolve warnings and errors.

👀 Other Modules (1)

  • base-normalization

Actionable Items

(click to expand)

Category Status Actionable Item
Version
mismatch
The version of the connector is different from its normal variant. Please bump the version of the connector.

doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog
doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.

changelog missing
There is no chnagelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish
not in seed
The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug.

diff seed version
The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version.

@grishick

This comment was marked as outdated.

@grishick

This comment was marked as outdated.

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking my understanding:

  • BigQueryDAT - runs the full DAT suite against a standard inserts config
  • BigQueryGcsDAT - runs the full DAT suite against a GCS staging config. Extends BigQueryDAT for convenience
    • Maybe we should have a root-level BigQueryDAT and then inherit BigQueryStandardInsertsDAT + BigQueryGcsDAT from it? It might reduce some code duplication between them in the setup method
  • BigQueryDestinationTest - verifies a couple different expected success/failure configs with basic behaviors (check, write) but doesn't run the full DAT process against all of them

?


left a couple small comments+questions but overall this makes sense.

final String tmpConfigAsString = Files.readString(configFile);
final JsonNode tmpConfigJson = Jsons.deserialize(tmpConfigAsString);
final JsonNode tmpCredentialsJson = tmpConfigJson.get(BigQueryConsts.BIGQUERY_BASIC_CONFIG);
Builder<Object, Object> mapBuilder = ImmutableMap.builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: ObjectNode finalConfig = (ObjectNode) Jsons.emptyObject() and build it directly


public class BigQueryDestinationTestUtils {

public static JsonNode createConfig(Path configFile, String datasetId) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand - it looks like this is reading a json blob and shuffling some stuff around. Can we just store the final json blob in GSM?

Copy link
Contributor Author

@grishick grishick Dec 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. We should make the secrets stored in GSM have the structure that the tests expect and avoid all this extra shuffling

/**
* Remove all the GCS output from the tests.
*/
public static boolean tearDownGcs(AmazonS3 s3Client, JsonNode config, Logger LOGGER) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (nonblocking): would be cool to have this live in base-java-s3, but idk if our build.gradle are set up to support this easily


final JsonNode properties = config.get(BigQueryConsts.LOADING_METHOD);
final String gcsBucketName = properties.get(BigQueryConsts.GCS_BUCKET_NAME).asText();
final String gcs_bucket_path = properties.get(BigQueryConsts.GCS_BUCKET_PATH).asText();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means that if we ever want to test the gcs_bucket_path = "" scenario, we'll want to set up a new bucket for it - which is probably correct, just want to call it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is gcs_bucket_path = "" a valid config for staging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so? Not aware of any reason why not

try {
dataset = BigQueryDestinationTestUtils.initDataSet(config, bigquery, datasetId);
} catch(Exception ex) {
//ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we fail if we can't create the dataset? Or is this to handle expected errors like permissions problems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. I got lost in my own refactoring and left this try-catch behind

}
}));
protected void addShutdownHook() {
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd mildly prefer to have this in an @AfterEach method, but this is fine if that would be complicated (also looks like you inherited this from the existing test structure, so feel free to ignore)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. I thought it was already called after each test, because it overwrites tearDown. I'll add @AfterEach annotation. The shutdown hooks were there before refactoring and AFAIK, the reason is to execute cleanup even if the test crashes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure junit takes care of running AfterEach methods, even if tests blow up (except maybe in cases where the entire runtime crashes?)

but yeah, don't worry about this if it looks super messy. Sounds like this has been working correctly as a shutdown hook anyway

throw new IllegalStateException("""
Json config not found. Must provide path to a big query credentials file,
please add file with creds to
../destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
../destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");
<...>/destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");

since .. might get misinterpreted

@grishick
Copy link
Contributor Author

Looks like we broke DATs for destination-bigquery-denormalized back in November. I opened a separate PR to fix it: #20871

@grishick

This comment was marked as resolved.

@grishick

This comment was marked as resolved.

@grishick

This comment was marked as resolved.

@grishick
Copy link
Contributor Author

grishick commented Dec 27, 2022

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3790371368
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3790371368
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@grishick
Copy link
Contributor Author

grishick commented Dec 28, 2022

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/3791285212
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/3791285212
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@@ -70,7 +70,7 @@ public class BigQueryUtils {
DateTimeFormatter.ofPattern("[yyyy][yy]['-']['/']['.'][' '][MMM][MM][M]['-']['/']['.'][' '][dd][d]" +
"[[' ']['T']HH:mm[':'ss[.][SSSSSS][SSSSS][SSSS][SSS][' '][z][zzz][Z][O][x][XXX][XX][X]]]");
private static final String USER_AGENT_FORMAT = "%s (GPN: Airbyte)";
private static final String CHECK_TEST_DATASET_SUFFIX = "_airbyte_check_stage_tmp_" + System.currentTimeMillis();
Copy link
Contributor

@etsybaev etsybaev Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, if this timestamp removing may cause issues during simultaneous tests execution? Ex. same tests for different PRs, local run, etc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't affect anything since it's implemented back below in checkHashCreateAndDeleteDatasetRole and CHECK_TEST_DATASET_SUFFIX is only called within that method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I moved the timestamp from class instantiation time to test execution time is that it was causing name collision when tests run too fast.

}

protected void tarDownGcs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably should be tearDownGcs

grishick added 3 commits December 28, 2022 10:38
to reduce code duplication and move configuration
logic from code into config files.
This refactoring will make it easier to add more
test cases for configuration variations such as
data location, accounts with various permission combinations,
and account impersonation
Copy link
Contributor

@ryankfu ryankfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good and the usage of config files cleans up a lot of the overhead from before too. Thanks for taking on this change

if(s3Client == null) {
return;
}
if(BigQueryUtils.getLoadingMethod(config) != UploadingMethod.GCS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if check can probably be grouped together with the preceding if statement. Also just to make sure this is clear, is the reason for these empty return basically to say we shouldn't be logging any warnings? From what I can understand, it seems that if these values don't exist then it will log a warning in the try-catch block

throw new IllegalStateException("""
Json config not found. Must provide path to a big query credentials file,
please add file with creds to
<...>/destination-bigquery/secrets/credentials-with-missed-dataset-creation-role.json.""");
Copy link
Contributor

@ryankfu ryankfu Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a good place to use the constants in a string format instead of having the values be re-type. Something like

throw new IllegalStateException(String.format("""
                                      Json config not found. Must provide path to a big query credentials file,
                                       please add file with creds to
                                      <...>/destination-bigquery/%s""", 
                                      CREDENTIALS_WITH_MISSED_CREATE_DATASET_ROLE_PATH);

That said, it appears this hardcoded strings already exist in tests so it's a minor nit and it's not functionally changing anything. The reason for the nit is that it explicitly gives the reader an understanding of where the file name came from

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I should actually remove all of this copy pasta and just iterate over the array of paths.

@grishick
Copy link
Contributor Author

grishick commented Dec 28, 2022

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3796081975

@grishick
Copy link
Contributor Author

grishick commented Dec 28, 2022

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/3796083138
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/3796083138
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@grishick
Copy link
Contributor Author

grishick commented Dec 28, 2022

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3796307446
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/3796307446
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 14      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    14      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     166      8    95%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         189     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         61     38    38%
normalization/transform_catalog/stream_processor.py                 595    400    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1441    629    56%

Build Passed

Test summary info:

All Passed

@grishick
Copy link
Contributor Author

/approve-and-merge reason=”to unblock other PRs that modify BigQuery tests"

@octavia-approvington
Copy link
Contributor

This looks fine!
Merged!
Imagine it being fine

@octavia-approvington octavia-approvington merged commit 99335da into master Dec 28, 2022
@octavia-approvington octavia-approvington deleted the greg/refactor-bq-tests branch December 28, 2022 21:37
public static void beforeAll() throws IOException {
for(Path path : ALL_PATHS) {
if (!Files.exists(path)) {
throw new IllegalStateException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing - I believe this exception will be thrown on the first file to not exist, then exit the loop. It might be more convenient if all missing files are shown in the error message.

final List<Path> missingPaths = ALL_PATHS.stream().filter(path -> !Files.exist(path)).collect(Collectors.toList());
if (!missingPaths.isEmpty()) {
  throw new IllegalStateException(
  // ...
  )
}


tornDown = false;
addShutdownHook();
configs = new HashMap<String, JsonNode>() {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious if there was motivation to remove the ImmutableMap.builder() flow here?

try {
dataset = BigQueryDestinationTestUtils.initDataSet(config, bigquery, datasetId);
} catch(Exception ex) {
//ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least log a warning here, otherwise if an exception occurs at this point it would be difficult to debug.

@@ -374,11 +322,17 @@ void testWriteFailure(final DatasetIdResetter resetDatasetId) throws Exception {
}

private Set<String> fetchNamesOfTablesInDb() throws InterruptedException {
if(dataset == null || bigquery == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the two conditionals on lines 325-327 and 333-335 can be combined. The QueryJobConfiguration AFAIK is just building a query rather than executing it. Still a little rusty on my Java but I think an Optional would be good here

Optional<Dataset> potentialDataset = Optional.ofNullable(dataset);
if (potentialDataset.filter(d -> !d.exists()).orElse(false) || bigquery == null) { 
  return Collections.emptySet();
}
  

Copy link
Contributor

@jbfbell jbfbell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments but mostly LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants