Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add anonymized query plan in json format to QueryCompletedEvent #12968

Merged
merged 7 commits into from
Aug 26, 2022
Merged

Add anonymized query plan in json format to QueryCompletedEvent #12968

merged 7 commits into from
Aug 26, 2022

Conversation

gaurav8297
Copy link
Member

@gaurav8297 gaurav8297 commented Jun 24, 2022

Description

Add anonymized query plan in json format to QueryCompletedEvent
Implement EXPLAIN (TYPE DISTRIBUTED, FORMAT JSON)

Is this change a fix, improvement, new feature, refactoring, or other?

New feature

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Event listener SPI, EXPLAIN

How would you describe this change to a non-technical end user or system administrator?

Provides anonymised query plan in json format to event listener to enable offline analysis without leaking sensitive info.
Implements EXPLAIN (TYPE DISTRIBUTED, FORMAT JSON)

Documentation

( ) No documentation is needed.
(x) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# General
* Implement EXPLAIN (TYPE DISTRIBUTED, FORMAT JSON). ({issue}`12968`)
* Add anonymized query plan in json format to QueryCompletedEvent. ({issue}`12968`)

@cla-bot cla-bot bot added the cla-signed label Jun 24, 2022
@github-actions github-actions bot added docs ui Web UI labels Jun 24, 2022
@gaurav8297 gaurav8297 requested a review from raunaqmorarka June 27, 2022 08:24
@gaurav8297 gaurav8297 removed docs ui Web UI labels Jun 27, 2022
@gaurav8297 gaurav8297 changed the title [WIP] Anonymize Plan Expose anonymized json plan in QueryCompletedEvent Jun 29, 2022
@gaurav8297 gaurav8297 marked this pull request as ready for review June 29, 2022 09:25
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add examples of anonymised plans to the PR

@raunaqmorarka
Copy link
Member

Please add examples of anonymised plans to the PR

ping

@raunaqmorarka
Copy link
Member

Test failures are related, PTAL

@gaurav8297
Copy link
Member Author

Please add examples of anonymised plans to the PR

Query: SELECT quantity FROM lineitem LIMIT 10

{
  "0" : {
    "id" : "6",
    "name" : "Output",
    "descriptor" : {
      "columnNames" : "[column_717400177]"
    },
    "outputs" : [ {
      "symbol" : "symbol_717400177",
      "type" : "double"
    } ],
    "details" : [ ],
    "estimates" : [ ],
    "children" : [ {
      "id" : "98",
      "name" : "Limit",
      "descriptor" : {
        "count" : "10",
        "withTies" : "",
        "inputPreSortedBy" : ""
      },
      "outputs" : [ {
        "symbol" : "symbol_717400177",
        "type" : "double"
      } ],
      "details" : [ ],
      "estimates" : [ ],
      "children" : [ {
        "id" : "171",
        "name" : "LocalExchange",
        "descriptor" : {
          "connectorHandleType" : "SystemPartitioningHandle",
          "partitioning" : "SINGLE",
          "function" : "SINGLE",
          "isReplicateNullsAndAny" : "",
          "hashColumn" : "",
          "arguments" : ""
        },
        "outputs" : [ {
          "symbol" : "symbol_717400177",
          "type" : "double"
        } ],
        "details" : [ ],
        "estimates" : [ ],
        "children" : [ {
          "id" : "138",
          "name" : "RemoteSource",
          "descriptor" : {
            "sourceFragmentIds" : "[1]"
          },
          "outputs" : [ {
            "symbol" : "symbol_717400177",
            "type" : "double"
          } ],
          "details" : [ ],
          "estimates" : [ ],
          "children" : [ ]
        } ]
      } ]
    } ]
  },
  "1" : {
    "id" : "137",
    "name" : "LimitPartial",
    "descriptor" : {
      "count" : "10",
      "withTies" : "",
      "inputPreSortedBy" : ""
    },
    "outputs" : [ {
      "symbol" : "symbol_717400177",
      "type" : "double"
    } ],
    "details" : [ ],
    "estimates" : [ ],
    "children" : [ {
      "id" : "0",
      "name" : "TableScan",
      "descriptor" : {
        "connector" : "tpch",
        "table" : "catalog_225088199.schema_225081830.table_3191690477"
      },
      "outputs" : [ {
        "symbol" : "symbol_717400177",
        "type" : "double"
      } ],
      "details" : [ "symbol_717400177 := column_2086283230" ],
      "estimates" : [ ],
      "children" : [ ]
    } ]
  }
}

@github-actions github-actions bot added the docs label Jul 5, 2022
@raunaqmorarka raunaqmorarka changed the title Expose anonymized json plan in QueryCompletedEvent Add anonymized query plan in json format to QueryCompletedEvent Jul 6, 2022
Optional<String> connectorName = metadata.listCatalogs(session).stream()
.filter(catalogInfo -> catalogInfo.getCatalogName().equals(tableSchema.getCatalogName()))
.map(CatalogInfo::getConnectorName)
.findFirst();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should always exists, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally yes, but I don't know if we should assume that. In future if catalogs can be removed on the fly, I'm not sure if that assumption would continue to hold.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future if catalogs can be removed on the fly, I'm not sure if that assumption would continue to hold.

I don't think they can be removed for active query

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should always exists, right?

I tried making it not optional, but there were many tests which started failing. IIRC these errors were coming in the case of an information schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Could you tell why for information schema connector name is missing?

import static io.trino.sql.planner.plan.StatisticsWriterNode.WriteStatisticsTarget;
import static io.trino.sql.planner.plan.TableWriterNode.WriterTarget;

public interface Anonymizer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an interface? There should only be one implementation of the anonymizer, so an interface is overkill.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One implementation performs anonymisation, the other doesn't. Could you take a look at its usage in PlanPrinter and suggest a different way if this can be improved ?

@@ -56,6 +57,7 @@ public QueryMetadata(
List<RoutineInfo> routines,
URI uri,
Optional<String> plan,
Optional<String> anonymizedJsonPlan,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for including both the regular plan and the anonymized plan? Most event listeners will just forward and store the event as is, which defeats the purpose of having an anonymized plan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to have different use cases for regular plan and anonymised plan. E.g. Regular plan can be used to display query plan for past queries to a user in a UI. This would be user specific data that might not be made easily accessible to everyone, it's also formatted in a way that is not ideal for offline analysis.
Anonymized plan could be sent to a different downstream system which is more widely accessible and easier to work with for offline analysis.
Currently we send same event to all event listeners, so we also can't use different event listeners to get different versions of plan.

@@ -48,6 +48,7 @@
import static io.trino.sql.analyzer.QueryType.EXPLAIN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should it be separate PR?

@gaurav8297
Copy link
Member Author

@raunaqmorarka @sopel39 PTAL

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
will wait to see if @martint or @sopel39 have further comments

Optional<String> connectorName = metadata.listCatalogs(session).stream()
.filter(catalogInfo -> catalogInfo.getCatalogName().equals(tableSchema.getCatalogName()))
.map(CatalogInfo::getConnectorName)
.findFirst();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Could you tell why for information schema connector name is missing?

@gaurav8297
Copy link
Member Author

Could you tell why for information schema connector name is missing?

For some reason, We don't register the system or information schema's catalog name to CatalogManager, which is used during listCatalogs. That's why internal catalog names are missing from listCatalogs results.

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/connector/ConnectorManager.java#L255

PS: Github is not letting me add a comment in the open thread 😞

cc @sopel39

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

if (replicateNullsAndAny) {
builder.append(format("Output partitioning: %s (replicate nulls and any) [%s]%s\n",
partitioningScheme.getPartitioning().getHandle(),
Joiner.on(", ").join(arguments),
formatHash(partitioningScheme.getHashColumn())));
Copy link
Member

@sopel39 sopel39 Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not keep using formatHash method here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've made formatHash to be a non-static method

}

public static String formatAggregation(Aggregation aggregation)
public static String formatAggregation(Anonymizer anonymizer, Aggregation aggregation)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider making it non static so that you don't have to pass anonymizer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be redundant to make formatAggregation static because it is being used in GraphVizPlanPrinter too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyways in PlanPrinter, formatAggregation is only used at 2-3 places.

@sopel39
Copy link
Member

sopel39 commented Jul 15, 2022

Is failure related?

@gaurav8297
Copy link
Member Author

@martint @raunaqmorarka @sopel39 Implemented counter-based anonymization

@sopel39 sopel39 requested a review from martint August 10, 2022 19:06
};
}

private static class TestLiteral
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this?

The AST is meant to be a closed hierarchy and not to be extended by third-parties. There's a lot of infrastructure that depends on knowing exactly what classes are part of the AST. Eventually, we'll update it to use Java 17's sealed types and this won't be possible to do at all and enforced at compile time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to test the scenario that a new sub-class of Literal is added but CounterBasedAnonymizer#anonymizeExpression isn't updated to handle that new class. In this scenario, we should still anonymise the literal.
It's not strictly necessary though, let me know if you want this removed or handled differently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let’s remove it. It’s not a correct usage of the AST classes and will break in the future.

As long as the anonymizer fails if a new class is added (very unlikely), we’ll be able to catch that very quickly and add the relevant code, so we don’t even have to handle anonymization for the general case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this now and modified CounterBasedAnonymizer#anonymizeExpression to throw UnsupportedOperationException in case of un-handled Literal sub-class.

@raunaqmorarka raunaqmorarka requested a review from martint August 17, 2022 11:55
return anonymizeExpression(expression).toString();
}

private Expression anonymizeExpression(Expression expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have concerns about this, which I mentioned before in a related context. Anonymized expressions are not valid SQL, so we should not be trying to construct an AST out of them (effectively what this method does)

Copy link
Member

@raunaqmorarka raunaqmorarka Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API of Anonymizer is String anonymize(Expression expression), so we don't really want an Anonymized expression, an anonymised string representation of the Expression will do.
This method is creating an Expression because it was convenient to use an ExpressionRewriter to anonymise literals and then use Expression#toString on the result.
I think an alternative could be that we write something similar to ExpressionFormatter#Formatter that delegates to existing formatter for all methods except the ones we want to anonymise (visitXXXLiteral).
Would that be better or is there another way that you would recommend instead ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better. Alternatively, we should consider and explore making the expression formatter itself anonymizer-aware.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the expression formatter such that we could use that directly instead of creating anonymized AST. PTAL @martint

@gaurav8297 gaurav8297 requested a review from martint August 23, 2022 20:26
This will be used in PlanPrinter to print
connector name as part of table scan node
in case of anonymization.
@martint martint dismissed their stale review August 24, 2022 23:43

The general approach looks good now. I'll leave it up to @sopel39 to do the final review

Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

ImmutableList.of("symbol_1 := column_2"),
ImmutableList.of(),
ImmutableList.of()))));
assertThat(event.getMetadata().getJsonPlan())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should anonymize both jsonPlan and plan. Could you create an issue + add TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are anonymizing both plan and jsonPlan. It's just there's no test for plan in TestEventListenerBasic. In general, there's no good testing of the text plan.

@gaurav8297
Copy link
Member Author

@sopel39 I've addressed comments

@sopel39 sopel39 merged commit d0563dc into trinodb:master Aug 26, 2022
@sopel39 sopel39 mentioned this pull request Aug 26, 2022
@@ -26,4 +26,12 @@ default void queryCompleted(QueryCompletedEvent queryCompletedEvent)
default void splitCompleted(SplitCompletedEvent splitCompletedEvent)
{
}

/**
* Specify whether the plan included in QueryCompletedEvent should be anonymized or not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should mention that both plan and jsonPlan are anonymized

@github-actions github-actions bot added this to the 394 milestone Aug 26, 2022
@gaurav8297 gaurav8297 deleted the gaurav8297/anonymized_plan_new branch August 26, 2022 11:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants