Tour of Beam markdown touchups #32536

Merged Sep 26, 2024 (3 commits)

@hjtran hjtran commented Sep 23, 2024

Just some formatting and clarification changes to a few Tour of Beam pages. (I've been referring many people to Tour of Beam!)

  • Broke up the first page of the Tour of Beam which was just a huge blob of unbroken text.
    Before: [image]
    After (local markdown rendering only): [image]
  • Adjusted the PTransform definition, which used to say that PTransforms take one or more PCollections when they can actually take zero or more.

  • Small capitalization / language clarifications


Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).


hjtran commented Sep 26, 2024

assign to next reviewer

@lostluck lostluck self-assigned this Sep 26, 2024
@lostluck

Sorry about the delay. The tail end of Google internal release tasks made me miss this earlier this week. Looking now.


@lostluck lostluck left a comment


Approved! Nothing blocking so I'll merge, but we can always discuss and add another PR later.

Sorry again for the delay.

@@ -22,7 +22,7 @@ The Beam SDKs provide several abstractions that simplify the mechanics of large-

→ `PCollection`: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.

- → `PTransform`: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as the input, performs a processing function that you provide on the elements of that PCollection, and then produces zero or more output PCollection objects.
+ → `PTransform`: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes zero or more PCollection objects as the input, performs a processing function that you provide on the elements of that PCollection, and then produces zero or more output PCollection objects.

I'm on the fence between "yes, this is accurate and technically correct" and "no, this doesn't help users learn the model, as it's easier to automatically follow best practices by treating the 0 input cases as special/exceptional".

But I don't feel strongly enough for the latter to force further rewrites.

@@ -61,9 +61,9 @@ In java, you need to set runner to `args` when you start the program.
{{end}}

{{if (eq .Sdk "python")}}
- In the Python SDK , the default is runner **DirectRunner**.
+ In the Python SDK , the **DirectRunner** is the default runner and is used if no runner is specified.

No action required.

Obligatory complaint that we never explain anywhere that the Direct Runner isn't a monolith and has very different behaviors between SDKs. I can't finish Prism soon enough...

Contributor Author

That relates a little bit to your other comment. Technically true that the DirectRunners are not a monolith but I imagine most people are single SDK users so the different SDK DirectRunner behaviors are unlikely to bite them in practice (but I may be overgeneralizing my experience :)

@lostluck lostluck merged commit 1eddbdc into apache:master Sep 26, 2024
3 checks passed