-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery: Add support to Dataset for project_ids with org prefix. #8877
BigQuery: Add support to Dataset for project_ids with org prefix. #8877
Conversation
added support to Dataset for project_ids with org prefix
updated tests to check dataset chgs
@emar-kar, you should run black reformat on |
if with_prefix is None: | ||
parts = dataset_id.split(".") | ||
else: | ||
parts = with_prefix.group("ref").split(".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. What is this doing? I think the prefix needs to be part of the project ID, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the issue's Stack trace
:
The error occurs due to the prefix google.com:
. Previously the passed string was separated only by .
, what led to ValueError
raising because of the len(parts) > 2
at google.com:[project].ryan_dataset
. As I see here prefix is not the part of the Project ID
itself. I was trying to find out, how could I parse the string and fulfill both the previous and new format. That is why I decided to use regular expressions. Now, with template's help, it will separate prefix and solve several situations:
string-project.string_dataset
- will pass the template successfully as it was before;prefix:string-project.string_dataset
- will group the part without prefix and then will divide it;string-project:string_dataset
- if thedefault_project
was not defined raisesValueError
;google.com:project:dataset_id
- same as above.
ValueError: Too many parts in dataset_id. Expected a fully-qualified dataset ID in standard SQL format. e.g. "project.dataset_id", got google.com:[project].ryan_dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but it appears to me that we're discarding the prefix? Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is what I was thinking before this conversation. So, now I'm a bit confused. I thought the prefix is an extra part and should be just removed. But if it is actually the part of the Project ID
, I'll need to reconfigure the pattern.
Applying requested chgs. // Removed description for 'single prefix'.
@@ -26,6 +27,14 @@ | |||
from google.cloud.bigquery.table import TableReference | |||
|
|||
|
|||
_W_PREFIX = re.compile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pick a better name for this? Maybe _PROJECT_PREFIX_PATTERN
?
@@ -26,6 +27,14 @@ | |||
from google.cloud.bigquery.table import TableReference | |||
|
|||
|
|||
_W_PREFIX = re.compile( | |||
r""" | |||
(\S*)\:(?P<ref>\S*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since at least one character is required, this should probably be \S+
, right?
Also, ref
isn't all that meaningful to me. How about remaining
, since it's everything after the :
character?
if with_prefix is None: | ||
parts = dataset_id.split(".") | ||
else: | ||
parts = with_prefix.group("ref").split(".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but it appears to me that we're discarding the prefix? Is that correct?
bigquery/tests/unit/test_dataset.py
Outdated
def test_from_string_w_prefix(self): | ||
cls = self._get_target_class() | ||
got = cls.from_string("prefix:string-project.string_dataset") | ||
self.assertEqual(got.project, "string-project") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be prefix:string-project
, since the prefix is actually part of the project ID?
Complete template change.
@@ -26,6 +27,14 @@ | |||
from google.cloud.bigquery.table import TableReference | |||
|
|||
|
|||
_PROJECT_PREFIX_PATTERN = re.compile( | |||
r""" | |||
(?P<prefix>\S+\:\S+)\.+(?P<remaining>\S*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we're matching this way, prefix isn't the right term. Should be project_id
. Likewise, remaining
should be renamed to dataset_id
.
Also, instead of \S
, we should be matching for characters other than .
, that is [^.]+
.
We want to match the whole string, so we should probably end this pattern with $
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed parts of the pattern, but the second comment about [^.]
seems inappropriate to me. As we know the string could be google.com:project.dataset
, that means that dot could be a part of the prefix. I checked couple of variants and as I see \S
fits more.
@@ -26,6 +27,14 @@ | |||
from google.cloud.bigquery.table import TableReference | |||
|
|||
|
|||
_PROJECT_PREFIX_PATTERN = re.compile( | |||
r""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using a multi-line string here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the readability. I think I’ll switch this to the single line, after correcting the pattern implementation.
minor corrections
def test_from_string_legacy_string(self): | ||
cls = self._get_target_class() | ||
with self.assertRaises(ValueError): | ||
cls.from_string("string-project:string_dataset") | ||
|
||
def test_from_string_w_incorrect_prefix(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an additional test where the project ID / dataset ID contains an illegal .
character. Another way to say that, is the string contains too many "parts". e.g. google.com:project-id.dataset_id.table_id
. This should also fail with ValueError
.
@@ -26,6 +27,9 @@ | |||
from google.cloud.bigquery.table import TableReference | |||
|
|||
|
|||
_PROJECT_PREFIX_PATTERN = re.compile(r"(?P<project_id>\S+\:\S+)\.+(?P<dataset_id>\S+)$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will match patterns with too many .
characters. Let's try something like:
(?P<project_id>\S+\:[^.]+)\.(?P<dataset_id>[^.]+)$
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean, sorry for misunderstanding. Appreciate your help.
pattern rewrote with the '[^.]' and .VERBOSE (due to blacken session) added test to check extra parts within the string with the prefix reconf prefix in an existed test
Closes: #8646