-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Presto Type Parser based on Flex and Bison #7568
Conversation
✅ Deploy Preview for meta-velox canceled.
|
78c2e6e
to
6f11b6b
Compare
6f11b6b
to
260f66a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LTGM, Nice!
velox/type/CMakeLists.txt
Outdated
@@ -16,6 +16,7 @@ if(${VELOX_BUILD_TESTING}) | |||
endif() | |||
|
|||
add_subdirectory(tz) | |||
add_subdirectory(type_parser) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: doesn't only "parser" here already implies this is related to types? velox/type/parser
vs /velox/type/type_parser`?
add_subdirectory(tests) | ||
endif() | ||
|
||
bison_target( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean we now have bison and flex as compile time dependencies? Does our CMake automatically install them if they are not found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the dependency management was done when we added the type calculation for Decimal Types
https://github.com/facebookincubator/velox/blob/main/velox/expression/type_calculation/CMakeLists.txt
velox/type/type_parser/Scanner.h
Outdated
std::istream& arg_yyin, | ||
std::ostream& arg_yyout, | ||
TypePtr& outputType, | ||
const std::string& input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if this should operate over std::string_view to avoid all the copies below
velox/type/type_parser/Scanner.h
Outdated
|
||
private: | ||
TypePtr& outputType_; | ||
std::string input_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
velox/type/type_parser/TypeParser.h
Outdated
#include "velox/type/Type.h" | ||
|
||
namespace facebook::velox { | ||
TypePtr parseType(const std::string& typeText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add some documentation on this method? This file is problem where developers will look for more information about this API
velox/type/type_parser/TypeParser.h
Outdated
#include <string> | ||
#include "velox/type/Type.h" | ||
|
||
namespace facebook::velox { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: blank line after namespace definition
velox/type/type_parser/TypeParser.ll
Outdated
|
||
%option c++ noyywrap noyylineno nodefault caseless | ||
|
||
A [A|a] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this? To upper case characters? Maybe add a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this to support handling the input in a case-insensitive manner.
This also enables use to define custom types such as TYPES_WITH_SPACES
.
I followed the Antlr grammar for the most part to reduce risk.
https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/types/TypeSignature.g4#L86
velox/type/type_parser/TypeParser.ll
Outdated
NUMBER ([[:digit:]]+) | ||
ROW (ROW|STRUCT) | ||
VARIABLE (VARCHAR|VARBINARY) | ||
TYPE_WITH_SPACES ((DOUBLE[ ]PRECISION)|(TIME[ ]WITH[ ]TIME[ ]ZONE)|(TIMESTAMP[ ]WITH[ ]TIME[ ]ZONE)|(INTERVAL[ ]YEAR[ ]TO[ ]MONTH)|(INTERVAL[ ]DAY[ ]TO[ ]SECOND)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tricky part here is that these types can be extended by users (some of these are Presto specific, for example). How can we make the parser extensible, based on the types that are registered?
also, for example, if "timestamp with timezone" was not registered, we probably want to fail the parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation follows the antlr semantics. https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/types/TypeParser.cpp#L58
The Presto types are manually handled and not based on the registration.
All other types are handled based on the registration. See TypePtr typeFromString(const std::string& type)
inside TypeParser.yy
New types as long as they are scalar will be handled automatically. Parameterized types and types with spaces have to be handled manually since they have custom syntax.
9768c44
to
674aab8
Compare
674aab8
to
6c348dd
Compare
velox/type/parser/CMakeLists.txt
Outdated
endif() | ||
|
||
bison_target( | ||
TypeParserParser TypeParser.yy ${CMAKE_CURRENT_BINARY_DIR}/TypeParser.yy.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we call the target only "TypeParser"?
velox/type/parser/Scanner.h
Outdated
std::istream& arg_yyin, | ||
std::ostream& arg_yyout, | ||
TypePtr& outputType, | ||
const std::string& input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to take a string_view here as well or you will end up copying to a temporary and acquiring a string_view to the temporary.
velox/type/parser/TypeParser.h
Outdated
/// otherwise. | ||
/// Uses the Type::hasType and Type::getType APIs to convert a string to Velox | ||
/// type. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move blank line to between the function and namespace end
velox/type/parser/TypeParser.ll
Outdated
(DECIMAL) return Parser::token::DECIMAL; | ||
{ROW} return Parser::token::ROW; | ||
{VARIABLE} yylval->build<std::string>(YYText()); return Parser::token::VARIABLE; | ||
{NUMBER} yylval->build<long long>(strtoll(YYText(), nullptr, 10)); return Parser::token::NUMBER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use folly::to here?
using namespace facebook::velox; | ||
TypePtr typeFromString(const std::string& type) { | ||
auto upper = type; | ||
std::transform(upper.begin(), upper.end(), upper.begin(), ::toupper); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think folly has an in-place upper case method as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find one. Velox uses boost in other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually only support lower case:
https://github.com/facebook/folly/blob/main/folly/String.h#L742
never mind
velox/type/parser/TypeParser.yy
Outdated
TypePtr typeFromString(const std::string& type) { | ||
auto upper = type; | ||
std::transform(upper.begin(), upper.end(), upper.begin(), ::toupper); | ||
if (upper == "UNKNOWN") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this already covered in getType()? Can't we just rely on getType() since that already honors types that are and are not registered?
Y [Y|y] | ||
Z [Z|z] | ||
|
||
WORD ([[:alpha:][:alnum:]_]*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like the WORD, since it's used for the type names, will need to support whitespaces as well - even though this will make it ambiguous in some cases, but I remember there was a way to specify how these should be handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QUOTED_ID will support names with spaces. There is a test as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my point is that since WORD cannot have spaces, things like TIMESTAMP WITH TIMEZONE will not be parsed as a single WORD, and thus will need special flex/grammar support to understand that these three WORDs together form a type name. And I guess we can't update the grammar as more types are registered/unregistered. For example, if I register a new type named "PEDROs DATA TYPE", the parser won't be able to understand it.
This is ok for now, but we might need to think through it eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even Postgres has tokens for types with spaces. https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y#L14336
The issue with your proposal is how do we decide if TIMESTAMP WITH TIME ZONE VELOX
is a new type vs. an error parsing TIMESTAMP WITH TIME ZONE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other ambiguity is if ROW(PEDROs DATA TYPE)
is Row with field name PEDROs with type name DATA TYPE or a type PEDROs DATA TYPE
without a field name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What Presto Java does is to first check if there's a type that matches the full name ("PEDROs DATA TYPE"), and if not, assume the first token is the name and proceed to try to match the remaining as the type ("DATA TYPE" as the type name and "PEDROs" the field name). Could we mimic that logic here somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Yes we can. We can give precedence between two rules and achieve this. I will implement this in a follow up PR.
ASSERT_EQ(*parseType("row(col iNt)"), *ROW({"col"}, {INTEGER()})); | ||
} | ||
|
||
TEST_F(TestTypeSignature, typesWithSpaces) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what is not covered today is if you add a new custom type which has blankspaces in the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the implementation to remove the manual Prestosql types inference.
JSON
and Timestamp with time zone
are now registered in the test. Hyperloglog
is not registered in the test.
These take the custom type path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above. Ideally we would like to add a custom test which is not explicitly defined in the flex file (so it's extensible).
|
||
class TestTypeSignature : public ::testing::Test {}; | ||
|
||
TEST_F(TestTypeSignature, booleanType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love the extensive test suite :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied most from the Prestissimo parser test suite to be fair.
e5bd144
to
c3910b4
Compare
c3910b4
to
9612498
Compare
|
||
#include "velox/common/base/tests/GTestUtils.h" | ||
#include "velox/functions/prestosql/types/HyperLogLogType.h" | ||
#include "velox/functions/prestosql/types/JsonType.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem right to have a dependency on prestosql here. CC: @pedroerp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova this is a test file. Can you share why it would be not right to depend on prestosql?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that even a dependency across modules for tests is not ideal. If we decide to keep the implementation here, I can define these types here again and remove the dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Seems like what we really want to test is whether the parser understands user defined types. Could we rather define a super tiny and simple type in this file, register, then test that we can parse it? Ideally, one without and one with blank spaces in the name
3de4ccb
to
26ecd6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes. Let's get this one merged, but discuss offline is there's a way to make it more general to user defined type names in the future.
Thanks!
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
A Presto Type Parser is needed to support PrestoQueryRunner.
This will eventually replace the Antlr implementation in Prestissimo and remove the dependency on Antlr.