Guard rails around collect/parallelize #301
Comments
Another one -- if you call to output a node within a

import hamilton.ad_hoc_utils
from hamilton.htypes import Parallelizable, Collect

def url() -> Parallelizable[str]:
    for url_ in ['url_a', 'url_b']:
        print(url_)
        yield url_

def url_loaded(url: str) -> str:
    print(url)
    return url

def counts(url_loaded: str) -> int:
    print(url_loaded)
    print(type(url_loaded))  # url_loaded seems to be a list not a string???
    return len(url_loaded.split("_"))

def total_words(counts: Collect[int]) -> int:
    return sum(counts)

my_hamilton_nodes = hamilton.ad_hoc_utils.create_temporary_module(
    url, url_loaded, counts, total_words
)

if __name__ == '__main__':
    from hamilton import driver, base, telemetry
    from hamilton.execution import executors
    telemetry.disable_telemetry()
    config = {}
    dr = (
        driver.Builder()
        .with_modules(my_hamilton_nodes)
        .enable_dynamic_execution(allow_experimental_mode=True)
        .with_local_executor(executors.SynchronousLocalTaskExecutor())
        .with_config(config)
        .build()
    )
    # 'counts' sits inside the Parallelizable/Collect block
    output_columns = [
        'counts'
    ]
    out = dr.execute(output_columns)
    print(out)
One other thing to get a better error for: using the regular driver to execute a graph that contains Parallelizable.
Documented here: #745
@elijahbenizzy I ran into #301 (comment) too, in the context of a dataloader under Parallelizable. This is quite a blocker for me.
Is your feature request related to a problem? Please describe.
There are a few edge cases that aren't handled in the code:
- Parallelizable[]/Collect (see Parallel Execution: KeyError: 'key df not found in cache' #1029)
- Parallelizable directly preceding a Collect
- Parallelizable with no Collect coming afterwards
- Parallelizable with multiple Collects -- this should be allowed eventually but it's not feasible now. See Parallelizable cannot aggregate or return multiple Collects #742.
- Collect in a node
- Parallelizable and Collect - Parallel Execution: Collect node returns list[list[pd.DataFrame]] instead of list[pd.DataFrame] #1030
And probably a few more.
Describe the solution you'd like
Clean errors as early as possible. Currently this does nothing.
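To make "clean errors as early as possible" concrete, here is a rough sketch of a validation pass that could run before execution and reject a few of the edge cases listed above (a Parallelizable directly preceding a Collect, no Collect afterwards, multiple Collects). It is not Hamilton code: `NodeSpec`, `ParallelGraphError`, and `validate_parallel_blocks` are hypothetical names, the graph is a plain dict rather than Hamilton's internal graph, and nested blocks are not handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


class ParallelGraphError(ValueError):
    """Hypothetical error type for malformed Parallelizable/Collect blocks."""


@dataclass
class NodeSpec:
    """Hypothetical stand-in for a graph node; not a Hamilton class."""
    kind: str  # "parallelizable", "collect", or "plain"
    deps: List[str] = field(default_factory=list)


def validate_parallel_blocks(nodes: Dict[str, NodeSpec]) -> None:
    """Raise ParallelGraphError for some of the edge cases in this issue.

    Assumes every name in `deps` is also a key of `nodes`, and does not
    handle nested Parallelizable blocks.
    """
    # Invert the dependency edges so we can walk downstream from each node.
    children: Dict[str, List[str]] = {name: [] for name in nodes}
    for name, spec in nodes.items():
        for dep in spec.deps:
            children[dep].append(name)

    for name, spec in nodes.items():
        if spec.kind != "parallelizable":
            continue
        collects: Set[str] = set()
        seen: Set[str] = set()
        # Each stack entry is (node name, whether it directly consumes `name`).
        stack: List[Tuple[str, bool]] = [(c, True) for c in children[name]]
        while stack:
            current, direct = stack.pop()
            if nodes[current].kind == "collect":
                if direct:
                    raise ParallelGraphError(
                        f"Parallelizable '{name}' directly precedes Collect '{current}'"
                    )
                collects.add(current)
                continue  # the block ends at its Collect
            if current in seen:
                continue
            seen.add(current)
            stack.extend((c, False) for c in children[current])
        if not collects:
            raise ParallelGraphError(
                f"Parallelizable '{name}' has no Collect coming afterwards"
            )
        if len(collects) > 1:
            raise ParallelGraphError(
                f"Parallelizable '{name}' has multiple Collects: {sorted(collects)}"
            )
```

For example, a graph shaped like the `url` / `url_loaded` / `counts` / `total_words` reproduction in the comments would pass this check, while dropping `total_words` would raise the "no Collect coming afterwards" error at build time instead of failing later during execution.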
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Follow up from this: #299.