Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve docs performance (#2480) #2481

Merged
merged 8 commits into from
May 22, 2020
4 changes: 2 additions & 2 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ version: 2.1
jobs:
unit:
docker: &test_only
- image: fishtownanalytics/test-container:5
- image: fishtownanalytics/test-container:7
environment:
DBT_INVOCATION_ENV: circle
steps:
Expand Down Expand Up @@ -30,7 +30,7 @@ jobs:
destination: dist
integration-postgres-py36:
docker: &test_and_postgres
- image: fishtownanalytics/test-container:5
- image: fishtownanalytics/test-container:7
environment:
DBT_INVOCATION_ENV: circle
- image: postgres
Expand Down
4 changes: 3 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,12 @@
- Fixed a number of issues with globally-scoped vars ([#2473](https://github.com/fishtown-analytics/dbt/issues/2473), [#2472](https://github.com/fishtown-analytics/dbt/issues/2472), [#2469](https://github.com/fishtown-analytics/dbt/issues/2469), [#2477](https://github.com/fishtown-analytics/dbt/pull/2477))
- Fixed DBT Docker entrypoint ([#2470](https://github.com/fishtown-analytics/dbt/issues/2470), [#2475](https://github.com/fishtown-analytics/dbt/pull/2475))
- Fixed a performance regression that occurred even when a user was not using the relevant feature ([#2474](https://github.com/fishtown-analytics/dbt/issues/2474), [#2478](https://github.com/fishtown-analytics/dbt/pull/2478))
- Substantial performance improvements for parsing on large projects, especially projects with many docs definition. ([#2480](https://github.com/fishtown-analytics/dbt/issues/2480), [#2481](https://github.com/fishtown-analytics/dbt/pull/2481))
- Expose Snowflake query id in case of an exception raised by connector ([#2201](https://github.com/fishtown-analytics/dbt/issues/2201), [#2358](https://github.com/fishtown-analytics/dbt/pull/2358))

Contributors:
- [@dmateusp](https://github.com/dmateusp) ([#2475](https://github.com/fishtown-analytics/dbt/pull/2475))
- [@ChristianKohlberg](https://github.com/ChristianKohlberg) (#2358](https://github.com/fishtown-analytics/dbt/pull/2358))

## dbt 0.17.0rc1 (May 12, 2020)

Expand Down Expand Up @@ -67,7 +70,6 @@ Contributors:
- Add a 'depends_on' attribute to the log record extra field ([#2316](https://github.com/fishtown-analytics/dbt/issues/2316), [#2341](https://github.com/fishtown-analytics/dbt/pull/2341))
- Added a '--no-browser' argument to "dbt docs serve" so you can serve docs in an environment that only has a CLI browser which would otherwise deadlock dbt ([#2004](https://github.com/fishtown-analytics/dbt/issues/2004), [#2364](https://github.com/fishtown-analytics/dbt/pull/2364))
- Snowflake now uses "describe table" to get the columns in a relation ([#2260](https://github.com/fishtown-analytics/dbt/issues/2260), [#2324](https://github.com/fishtown-analytics/dbt/pull/2324))
- Expose Snowflake query id in case of an exception raised by connector ([#2201](https://github.com/fishtown-analytics/dbt/issues/2201), [#2358](https://github.com/fishtown-analytics/dbt/pull/2358))
- Sources (and therefore freshness tests) can be enabled and disabled via dbt_project.yml ([#2283](https://github.com/fishtown-analytics/dbt/issues/2283), [#2312](https://github.com/fishtown-analytics/dbt/pull/2312), [#2357](https://github.com/fishtown-analytics/dbt/pull/2357))
- schema.yml files are now fully rendered in a context that is aware of vars declared in from dbt_project.yml files ([#2269](https://github.com/fishtown-analytics/dbt/issues/2269), [#2357](https://github.com/fishtown-analytics/dbt/pull/2357))
- Sources from dependencies can be overridden in schema.yml files ([#2287](https://github.com/fishtown-analytics/dbt/issues/2287), [#2357](https://github.com/fishtown-analytics/dbt/pull/2357))
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ RUN apt-get update && \
apt-get install -y --no-install-recommends \
netcat postgresql curl git ssh software-properties-common \
make build-essential ca-certificates libpq-dev \
libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit \
libsasl2-dev libsasl2-2 libsasl2-modules-gssapi-mit libyaml-dev \
&& \
add-apt-repository ppa:deadsnakes/ppa && \
apt-get install -y \
Expand Down
20 changes: 20 additions & 0 deletions core/dbt/clients/jinja.py
Original file line number Diff line number Diff line change
Expand Up @@ -493,13 +493,33 @@ def _requote_result(raw_value: str, rendered: str) -> str:
return f'{quote_char}{rendered}{quote_char}'


# performance note: Local benmcharking (so take it with a big grain of salt!)
# on this indicates that it is is on average slightly slower than
# checking two separate patterns, but the standard deviation is smaller with
# one pattern. The time difference between the two was ~2 std deviations, which
# is small enough that I've just chosen the more readable option.
_HAS_RENDER_CHARS_PAT = re.compile(r'({[{%]|[}%]})')


def get_rendered(
string: str,
ctx: Dict[str, Any],
node=None,
capture_macros: bool = False,
native: bool = False,
) -> str:
# performance optimization: if there are no jinja control characters in the
# string, we can just return the input. Fall back to jinja if the type is
# not a string or if native rendering is enabled (so '1' -> 1, etc...)
# If this is desirable in the native env as well, we could handle the
# native=True case by passing the input string to ast.literal_eval, like
# the native renderer does.
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the one implication of this is that invalid jinja will return plaintext instead of some sort of compiler error, eg:

{ { config(....) }}

I think that's ok, but we should keep this in mind if anyone write in with a support question about something like this :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, one thing we could do there is also search for [%}]}. That's a pretty minor fix that might actually capture all cases that jinja would error about - I think jinja would just silently pass { { config(...) } } as text, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep - i'd be in favor of also searching for [%}]}. It's a pretty minor failure mode, but I do think it's a good and positive change to make :)

I'm not at all concerned about

{ { config() } }

that is plaintext as far as i'm concerned

not native and
isinstance(string, str) and
_HAS_RENDER_CHARS_PAT.search(string) is None
):
return string
template = get_template(
string,
ctx,
Expand Down
15 changes: 14 additions & 1 deletion core/dbt/clients/yaml_helper.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,17 @@
from typing import Any

import dbt.exceptions

import yaml
import yaml.scanner

# the C version is faster, but it doesn't always exist
YamlLoader: Any
try:
from yaml import CSafeLoader as YamlLoader
except ImportError:
from yaml import SafeLoader as YamlLoader


YAML_ERROR_MESSAGE = """
Syntax error near line {line_number}
Expand Down Expand Up @@ -44,9 +53,13 @@ def contextualized_yaml_error(raw_contents, error):
raw_error=error)


def safe_load(contents):
return yaml.load(contents, Loader=YamlLoader)


def load_yaml_text(contents):
try:
return yaml.safe_load(contents)
return safe_load(contents)
except (yaml.scanner.ScannerError, yaml.YAMLError) as e:
if hasattr(e, 'problem_mark'):
error = contextualized_yaml_error(contents, e)
Expand Down
3 changes: 2 additions & 1 deletion core/dbt/context/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
from dbt import flags
from dbt import tracking
from dbt.clients.jinja import undefined_error, get_rendered
from dbt.clients import yaml_helper
from dbt.contracts.graph.compiled import CompiledResource
from dbt.exceptions import raise_compiler_error, MacroReturn
from dbt.logger import GLOBAL_LOGGER as logger
Expand Down Expand Up @@ -383,7 +384,7 @@ def fromyaml(value: str, default: Any = None) -> Any:
-- ["good"]
"""
try:
return yaml.safe_load(value)
return yaml_helper.safe_load(value)
except (AttributeError, ValueError, yaml.YAMLError):
return default

Expand Down
28 changes: 28 additions & 0 deletions core/dbt/context/providers.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
ParsedMacro, ParsedSourceDefinition, ParsedSeedNode
)
from dbt.exceptions import (
CompilationException,
InternalException,
ValidationException,
RuntimeException,
Expand Down Expand Up @@ -130,6 +131,19 @@ def _repack_args(
else:
return [package, name]

def validate_args(self, name: str, package: Optional[str]):
if not isinstance(name, str):
raise CompilationException(
f'The name argument to ref() must be a string, got '
f'{type(name)}'
)

if package is not None and not isinstance(package, str):
raise CompilationException(
f'The package argument to ref() must be a string or None, got '
f'{type(package)}'
)

def __call__(self, *args: str) -> RelationProxy:
name: str
package: Optional[str] = None
Expand All @@ -140,6 +154,7 @@ def __call__(self, *args: str) -> RelationProxy:
package, name = args
else:
ref_invalid_args(self.model, args)
self.validate_args(name, package)
return self.resolve(name, package)


Expand All @@ -148,12 +163,25 @@ class BaseSourceResolver(BaseResolver):
def resolve(self, source_name: str, table_name: str):
pass

def validate_args(self, source_name: str, table_name: str):
if not isinstance(source_name, str):
raise CompilationException(
f'The source name (first) argument to source() must be a '
f'string, got {type(source_name)}'
)
if not isinstance(table_name, str):
raise CompilationException(
f'The table name (second) argument to source() must be a '
f'string, got {type(table_name)}'
)

def __call__(self, *args: str) -> RelationProxy:
if len(args) != 2:
raise_compiler_error(
f"source() takes exactly two arguments ({len(args)} given)",
self.model
)
self.validate_args(args[0], args[1])
return self.resolve(args[0], args[1])


Expand Down
Loading