Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation for the R Client #4610

Merged
merged 24 commits into from
Oct 20, 2023
Merged

Documentation for the R Client #4610

merged 24 commits into from
Oct 20, 2023

Conversation

alexpeters1208
Copy link
Contributor

@alexpeters1208 alexpeters1208 commented Oct 9, 2023

This client introduces documentation pages generated from existing docstrings that will be accessible from within RStudio. All documentation will have working examples, though these are not needed to assess language or formatting of the docs. RStudio is needed to get the full perspective on these docs.

Here are formatted doc previews, saved directly from R studio, as of 10/11:
Introduction to the package and important concepts: rdeephaven.Rd
Reference doc on the Client class: Client.Rd
Reference doc on the TableHandle class: TableHandle.Rd
Intro to AggBy: AggBy.Rd
Intro to UpdateBy: UpdateBy.Rd

For any who are using RStudio to view these files (Cristian and maybe others):

  1. I'm interested to know whether ?thing that you would expect to work actually work. The operator will work with the thing names I've selected, but I want to know if those names are actually what you expect them to be, so that you can run ?thing by just guessing what thing should be, and get the results you want.
  2. ?rdeephaven is my intended "portal" to the rest of the documentation. Start here.
  3. Do the links work, do they take you to pages you expect?
  4. Is it easy to find information you're looking for?

@alexpeters1208 alexpeters1208 added documentation Improvements or additions to documentation NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. r-client labels Oct 9, 2023
@alexpeters1208 alexpeters1208 added this to the October 2023 milestone Oct 9, 2023
@alexpeters1208 alexpeters1208 self-assigned this Oct 9, 2023
#'
#' The argument `big_value_context` can take the following values:
#'
#' - `'decimal128'`: IEEE 754R Decimal128 format. 34 digits and rounding is half-even.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of decimal digits are not exact; I would say "Approx. XX digits" or something of the sort.

#' @param cols String or list of strings denoting the column(s) to operate on. Can be renaming expressions, i.e. “new_col = col”.
#' Default is to compute the cumulative sum for all non-grouping columns.
#' @return UpdateByOp to be used in `update_by()`.
#'
#' @examples
#' print("hello!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print("hello!") ? Is that intentional? :-)

Copy link
Member

@jcferretti jcferretti Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some amount of repetition in the doc strings below for other functions like this one, so two comments:
One is to try to think about some way to try to reduce that repetition. You could point to a common part of the documentation with a note, instead of copy&paste. That is not necessarily "obviously better", but I think it is worth to think about doing it; see if you feel one way or the other is better. The issue with duplication is maintainability of us. But is not only that, users coming to the documentation for a specific check/answer will see that comment multiple times; so it may also make sense to give them the option to not having to see the whole thing every time.
The second comment is that if you decide to keep the repetition, then any fixes will need to be consistently done on all of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the constraints of this documentation; on the doc site, I'd recommend one note at the top that applies to all to avoid duplicating the information for each operation.

Copy link
Member

@jcferretti jcferretti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a bunch of comments, most of them are suggestions and things to consider rather than hard black and white mandates, so ponder those and make your own calls.

I wanted to say, I think this is excellent and the level of detail and effort you put here is really above and beyond. Thanks for caring this much!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if checking in Rd files is a mistake. These are all generated, and we have actively moved away from checking in generated files. With these files in git, there is a problem. If some some .R file gets deleted or renamed, the old, generated Rd file still sits in git, unless the dev remembers to deleted it. This could inadvertently yield incorrect Rd documentation.

@niloc132 should comment, since he has been a destroyer of generated code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm strongly inclined to say that it's not a mistake. The devtools::document() function that generates the docs from Roxygen docstrings does not get run automatically on package installation. As an R user, I never knew about this function before this project, and if I got a package that did not appear to have any documentation, I would not have run devtools::document() to see if that fixed things.
Also arrow does it, dplyr does it, ggplot2 does it, caret does it, etc etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a subtlety in your statement that I'm missing. Are you suggesting it would be better to have CI generate and check in these files? That might be a good solution, since there's no worry about the issues mentioned in the latter half of your comment.

Copy link
Member

@jcferretti jcferretti Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a complicated topic and there are schools of thought about this in the "abstract". I was "raised" in the "never check in generated files" school. However, I believe we in DH are strongly of the "check in all generated files and when there are new versions have a check that fails and forces the PR to re-generate and add to git the generated changes" school.

So a general discussion here is possible, but if I understand correctly how we treat generated files (eg, protobuf compiler output, java expression parser in the engine, chunked types classes, and a lot of others), we consistently check in / add to git everything necessary for a build that we ourselves make. Meaning all generated files.

Your point btw is very valid, files that were generated in the past that won't be generated again because their source was intentionally deleted pose a problem for how we do things. I don't have a good answer to that, other than to try to have a check for it like we do for the need to regenerate. But I am not familiar enough with the code we use for checks (and I suspect is highly ad-hoc for each case, eg, the protobuf generating code checks are completely re-done and independent of the java generated checks, etc, so this is a potentially recurring issue).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary:

  • In Java, we try not to check in generated sources, since usually we can get Gradle to do it for us. There are exceptions, where we either don't have the workflow finished, or the code that is generated needs some tweaking, or where we just can't be bothered for legacy reasons.
  • In Go, in order for the ecosystem's packaging tools to work, all source code must be committed to a git repo, so generated sources are checked in. No option here if we want to be part of the ecosystem.
  • Our python client code uses checked in sources because it was decided that we wouldn't follow gRPC/protobuf conventions in naming, and while we have a process that lets a user invoke Gradle to generate the appropriate python output and then mangle names with the corresponding sed-in-groovy magic, the thinking was that a Python developer shouldn't need to use Gradle to make small changes to our code.
  • Our c++ client code uses checked in sources for reasons I don't recall, but at least we just run protoc without extra magic. Likely the thinking was the same - while a c++ developer would need to follow another 18 steps to get a working build, that 19th of using Gradle was a doozy.

As such I don't strongly object, but I would love to revisit some of these past decisions. If we're going to use "check in the sources, and validate that they are correct by running Gradle", it seems like we could abbreviate this to "run Gradle to have the right sources", but perhaps verifying correctness only in CI is okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Go statement isn't true... but that isn't relevant here.

R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/agg_ops_wrapper.R Outdated Show resolved Hide resolved
#' The Deephaven Community R Client provides an R interface to Deephaven's powerful real-time data engine, [_Deephaven Core_](https://deephaven.io/community/).
#' To use this package, you must have a Deephaven server running and be able to connect to it. For more information on
#' how to set up a Deephaven server, see the documentation [here](https://deephaven.io/core/docs/tutorials/quickstart/).
#' For a brief overview of the Deephaven R Client, see the [blog post](https://deephaven.io/blog/2023/10/04/r-client-updates/)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't any thing fundamentally wrong with linking to a blog post, but it might be worthwhile to take some of the info from the blog post and this intro and throw it into a concept guide as well, or what other sites would call a whitepaper.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would lean away from linking to blog posts in ref docs.

R/rdeephaven/R/exports.R Outdated Show resolved Hide resolved
R/rdeephaven/R/operation_control.R Outdated Show resolved Hide resolved
R/rdeephaven/R/update_by_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/update_by_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/R/update_by_ops_wrapper.R Outdated Show resolved Hide resolved
R/rdeephaven/man/AggBy.Rd Outdated Show resolved Hide resolved
R/rdeephaven/man/AggBy.Rd Outdated Show resolved Hide resolved
@alexpeters1208 alexpeters1208 marked this pull request as ready for review October 20, 2023 18:28
Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving under the assumption that the open comments will be addressed in a follow-on PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Go statement isn't true... but that isn't relevant here.

Comment on lines +35 to +37
#' The `agg_by()` and `agg_all_by()` methods themselves do not know anything about the columns on which you want to
#' perform aggregations. Rather, the desired columns are passed to individual `agg` functions, enabling you to apply
#' various kinds of aggregations to different columns or groups of columns as needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if agg_all_by operators accept column args. I think they don't, but this wording sounds like they do.

#' @description
#' An AggOp represents an aggregation operation that can be passed to `agg_by()` or `agg_all_by()`.
#' Note that AggOps should not be instantiated directly by user code, but rather by provided agg_* functions.
#' An `AggOp` is the return type of one of Deephaven's [`agg`][AggBy] functions. It is a function that performs the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should indicate that these are "deephaven" functions and not general R functions.

Comment on lines +90 to +92
#' Much of the power of Deephaven's suite of table operations is achieved through the use of the [`update_by()`][UpdateBy]
#' and [`agg_by()`][AggBy] methods. These table methods are important enough to warrant their own documentation pages, accessible
#' by clicking on their names, or by running `?UpdateBy` or `?AggBy`. These methods come with their own suites of functions,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird that UpdateBy and AggBy are links here as well as ? operator targets.

If this is as problem, I would search the rest of the docs for similar problems.

#' @section
#' Getting help:
#' While we've done our best to provide good documentation for this package, you may find you need more help than what
#' this documentation has to offer. Please visit the official Deephaven Community Core [documentation](https://deephaven.io/core/docs/tutorials/quickstart/)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' this documentation has to offer. Please visit the official Deephaven Community Core [documentation](https://deephaven.io/core/docs/tutorials/quickstart/)
#' this documentation has to offer. Please visit the official Deephaven Community Core [documentation](https://deephaven.io/core/docs/)

Comment on lines +123 to +125
#' An `UpdateByOp` is the return type of one of Deephaven's [`uby`][UpdateBy] functions. It is a function that performs
#' the computation specified by the `uby` function. These are intended to be passed directly to `update_by()`,
#' and should never be instantiated directly be user code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this make clear tht it is not a general purpose R function?

Comment on lines +235 to +241
#' function called an [`UpdateByOp`][UpdateByOp] intended to be used in a call to `update_by()`. This detail is typically
#' hidden from the user. However, it is important to understand this detail for debugging purposes, as the output of
#' a `uby` function can otherwise seem unexpected.
#'
#' @param cols String or list of strings denoting the column(s) to operate on. Can be renaming expressions, i.e. “new_col = col”.
#' Default is to compute the cumulative product for all non-grouping columns.
#' @return UpdateByOp to be used in `update_by()`.
#' @return `UpdateByOp` to be used in a call to `update_by()`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_by() is listed multiple times in here without a link. Seems like it should have a link. Change probably needed in other places as well.

#'
#' Where:
#' - \eqn{dt_i} is the difference between time \eqn{t_i} and \eqn{t_{i-1}} in nanoseconds.
#' - \eqn{\tau} is `decay_time` in nanoseconds, an input parameter to the method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find "in nanoseconds" to be confusing here since users are specifying the duration string, which could be specified in hours or maybe days.

#'
#' Where:
#' - \eqn{dt_i} is the difference between time \eqn{t_i} and \eqn{t_{i-1}} in nanoseconds.
#' - \eqn{\tau} is `decay_time` in nanoseconds, an input parameter to the method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. I'm guessing that there are other cases as well. Could probably also remove "in nanoseconds" from the prior line.

#' This uses ISO-8601 time strings as the reverse and forward window parameters.
#'
#' @details
#' This uses [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) time strings as the reverse and forward window parameters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It supports these and other formats. I think the examples below fall into the other formats.

@jcferretti jcferretti merged commit 04ac064 into deephaven:main Oct 20, 2023
10 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Oct 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Improvements or additions to documentation NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. r-client
Projects
None yet
Development

Successfully merging this pull request may close these issues.

R Client Latex Formatting for UpdateBy Formulas R Client Documentation
5 participants