Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Presto Type Parser based on Flex and Bison #7568

Closed
wants to merge 5 commits into from

Conversation

majetideepak
Copy link
Collaborator

@majetideepak majetideepak commented Nov 14, 2023

A Presto Type Parser is needed to support PrestoQueryRunner.
This will eventually replace the Antlr implementation in Prestissimo and remove the dependency on Antlr.

Copy link

netlify bot commented Nov 14, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 26ecd6b
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65595d677baaa5000896ac68

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 14, 2023
@majetideepak majetideepak force-pushed the typeparser branch 2 times, most recently from 78c2e6e to 6f11b6b Compare November 15, 2023 05:27
Copy link
Collaborator

@duanmeng duanmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM, Nice!

@@ -16,6 +16,7 @@ if(${VELOX_BUILD_TESTING})
endif()

add_subdirectory(tz)
add_subdirectory(type_parser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: doesn't only "parser" here already implies this is related to types? velox/type/parser vs /velox/type/type_parser`?

add_subdirectory(tests)
endif()

bison_target(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean we now have bison and flex as compile time dependencies? Does our CMake automatically install them if they are not found?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the dependency management was done when we added the type calculation for Decimal Types
https://github.com/facebookincubator/velox/blob/main/velox/expression/type_calculation/CMakeLists.txt

std::istream& arg_yyin,
std::ostream& arg_yyout,
TypePtr& outputType,
const std::string& input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if this should operate over std::string_view to avoid all the copies below


private:
TypePtr& outputType_;
std::string input_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

#include "velox/type/Type.h"

namespace facebook::velox {
TypePtr parseType(const std::string& typeText);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some documentation on this method? This file is problem where developers will look for more information about this API

#include <string>
#include "velox/type/Type.h"

namespace facebook::velox {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: blank line after namespace definition


%option c++ noyywrap noyylineno nodefault caseless

A [A|a]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? To upper case characters? Maybe add a comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this to support handling the input in a case-insensitive manner.
This also enables use to define custom types such as TYPES_WITH_SPACES.
I followed the Antlr grammar for the most part to reduce risk.
https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/types/TypeSignature.g4#L86

NUMBER ([[:digit:]]+)
ROW (ROW|STRUCT)
VARIABLE (VARCHAR|VARBINARY)
TYPE_WITH_SPACES ((DOUBLE[ ]PRECISION)|(TIME[ ]WITH[ ]TIME[ ]ZONE)|(TIMESTAMP[ ]WITH[ ]TIME[ ]ZONE)|(INTERVAL[ ]YEAR[ ]TO[ ]MONTH)|(INTERVAL[ ]DAY[ ]TO[ ]SECOND))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tricky part here is that these types can be extended by users (some of these are Presto specific, for example). How can we make the parser extensible, based on the types that are registered?

also, for example, if "timestamp with timezone" was not registered, we probably want to fail the parsing.

Copy link
Collaborator Author

@majetideepak majetideepak Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation follows the antlr semantics. https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/types/TypeParser.cpp#L58
The Presto types are manually handled and not based on the registration.
All other types are handled based on the registration. See TypePtr typeFromString(const std::string& type) inside TypeParser.yy

New types as long as they are scalar will be handled automatically. Parameterized types and types with spaces have to be handled manually since they have custom syntax.

endif()

bison_target(
TypeParserParser TypeParser.yy ${CMAKE_CURRENT_BINARY_DIR}/TypeParser.yy.cc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we call the target only "TypeParser"?

std::istream& arg_yyin,
std::ostream& arg_yyout,
TypePtr& outputType,
const std::string& input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to take a string_view here as well or you will end up copying to a temporary and acquiring a string_view to the temporary.

/// otherwise.
/// Uses the Type::hasType and Type::getType APIs to convert a string to Velox
/// type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move blank line to between the function and namespace end

(DECIMAL) return Parser::token::DECIMAL;
{ROW} return Parser::token::ROW;
{VARIABLE} yylval->build<std::string>(YYText()); return Parser::token::VARIABLE;
{NUMBER} yylval->build<long long>(strtoll(YYText(), nullptr, 10)); return Parser::token::NUMBER;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use folly::to here?

using namespace facebook::velox;
TypePtr typeFromString(const std::string& type) {
auto upper = type;
std::transform(upper.begin(), upper.end(), upper.begin(), ::toupper);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think folly has an in-place upper case method as well

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find one. Velox uses boost in other places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually only support lower case:

https://github.com/facebook/folly/blob/main/folly/String.h#L742

never mind

TypePtr typeFromString(const std::string& type) {
auto upper = type;
std::transform(upper.begin(), upper.end(), upper.begin(), ::toupper);
if (upper == "UNKNOWN") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this already covered in getType()? Can't we just rely on getType() since that already honors types that are and are not registered?

Y [Y|y]
Z [Z|z]

WORD ([[:alpha:][:alnum:]_]*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like the WORD, since it's used for the type names, will need to support whitespaces as well - even though this will make it ambiguous in some cases, but I remember there was a way to specify how these should be handled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QUOTED_ID will support names with spaces. There is a test as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my point is that since WORD cannot have spaces, things like TIMESTAMP WITH TIMEZONE will not be parsed as a single WORD, and thus will need special flex/grammar support to understand that these three WORDs together form a type name. And I guess we can't update the grammar as more types are registered/unregistered. For example, if I register a new type named "PEDROs DATA TYPE", the parser won't be able to understand it.

This is ok for now, but we might need to think through it eventually.

Copy link
Collaborator Author

@majetideepak majetideepak Nov 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even Postgres has tokens for types with spaces. https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y#L14336
The issue with your proposal is how do we decide if TIMESTAMP WITH TIME ZONE VELOX is a new type vs. an error parsing TIMESTAMP WITH TIME ZONE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other ambiguity is if ROW(PEDROs DATA TYPE) is Row with field name PEDROs with type name DATA TYPE or a type PEDROs DATA TYPE without a field name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Presto Java does is to first check if there's a type that matches the full name ("PEDROs DATA TYPE"), and if not, assume the first token is the name and proceed to try to match the remaining as the type ("DATA TYPE" as the type name and "PEDROs" the field name). Could we mimic that logic here somehow?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes we can. We can give precedence between two rules and achieve this. I will implement this in a follow up PR.

ASSERT_EQ(*parseType("row(col iNt)"), *ROW({"col"}, {INTEGER()}));
}

TEST_F(TestTypeSignature, typesWithSpaces) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what is not covered today is if you add a new custom type which has blankspaces in the name.

Copy link
Collaborator Author

@majetideepak majetideepak Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the implementation to remove the manual Prestosql types inference.
JSON and Timestamp with time zone are now registered in the test. Hyperloglog is not registered in the test.
These take the custom type path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. Ideally we would like to add a custom test which is not explicitly defined in the flex file (so it's extensible).


class TestTypeSignature : public ::testing::Test {};

TEST_F(TestTypeSignature, booleanType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love the extensive test suite :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied most from the Prestissimo parser test suite to be fair.

@majetideepak majetideepak changed the title Add TypeParser based on Flex and Bison Add Presto TypeParser based on Flex and Bison Nov 16, 2023
@majetideepak majetideepak changed the title Add Presto TypeParser based on Flex and Bison Add Presto Type Parser based on Flex and Bison Nov 16, 2023
@duanmeng duanmeng mentioned this pull request Nov 17, 2023

#include "velox/common/base/tests/GTestUtils.h"
#include "velox/functions/prestosql/types/HyperLogLogType.h"
#include "velox/functions/prestosql/types/JsonType.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem right to have a dependency on prestosql here. CC: @pedroerp

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova this is a test file. Can you share why it would be not right to depend on prestosql?

Copy link
Collaborator Author

@majetideepak majetideepak Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that even a dependency across modules for tests is not ideal. If we decide to keep the implementation here, I can define these types here again and remove the dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Seems like what we really want to test is whether the parser understands user defined types. Could we rather define a super tiny and simple type in this file, register, then test that we can parse it? Ideally, one without and one with blank spaces in the name

Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes. Let's get this one merged, but discuss offline is there's a way to make it more general to user defined type names in the future.

Thanks!

@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pedroerp merged this pull request in d7e8348.

Copy link

Conbench analyzed the 1 benchmark run on commit d7e83486.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@majetideepak majetideepak deleted the typeparser branch January 3, 2024 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants