Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace S3 dispatch by C implementation #400

Merged
merged 24 commits into from
Nov 2, 2023
Merged

Conversation

mgirlich
Copy link
Contributor

@mgirlich mgirlich commented Aug 23, 2023

Closes #386.

library(xml2)

data <- read_xml("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/reed.xml")
many_elts <- data |> 
  xml_contents() |> 
  xml_contents() |>
  xml_contents()

path <- xml2_example("cd_catalog.xml")
xml <- read_xml(path)
cds <- xml |> xml_children()
cd_contents <- cds |> xml_contents()

bench::mark(
  cds = xml_name(cds),
  cd_contents = xml_name(cd_contents),
  many_elts = xml_name(many_elts),
  check = FALSE
)

# DEV
#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           78.6µs   82.5µs   11139.     6.95KB     23.7
#> 2 cd_contents  447.2µs  480.9µs    1940.     1.27KB     21.2
#> 3 many_elts     28.2ms   30.1ms      33.3   65.72KB     42.8

# This PR
#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           3.52µs   3.87µs   214805.    2.66KB    21.5 
#> 2 cd_contents  13.03µs   14.1µs    61534.    1.27KB    18.5 
#> 3 many_elts     1.23ms   1.71ms      572.   65.72KB     8.36

Created on 2023-08-23 with reprex v2.0.2

More benchmarks
library(xml2)

data <- read_xml("http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/reed.xml")
many_elts <- data |> 
  xml_contents() |> 
  xml_contents() |>
  xml_contents()

path <- xml2_example("cd_catalog.xml")
xml <- read_xml(path)
cds <- xml |> xml_children()
cd_contents <- cds |> xml_contents()

bench::mark(
  cds = xml_name(cds),
  cd_contents = xml_name(cd_contents),
  many_elts = xml_name(many_elts),
  check = FALSE
)
#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds             66µs   78.6µs   11863.     6.95KB     23.8
#> 2 cd_contents  378.9µs  449.6µs    2069.     1.27KB     23.5
#> 3 many_elts     24.3ms   27.9ms      36.3   65.72KB     40.8

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           2.85µs   3.07µs   277778.    2.66KB     0   
#> 2 cd_contents  13.17µs  14.23µs    62456.    1.27KB    18.7 
#> 3 many_elts      1.1ms   1.63ms      616.   65.72KB     8.29

bench::mark(
  cds = xml_attr(cds, "test"),
  cd_contents = xml_attr(cd_contents, "test"),
  many_elts = xml_attr(many_elts, "test"),
  check = FALSE
)

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           80.7µs   95.9µs    9750.     8.66KB     25.5
#> 2 cd_contents  464.3µs  538.8µs    1779.     1.27KB     23.6
#> 3 many_elts     29.2ms   30.4ms      32.9   65.72KB    132.

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           2.77µs   3.01µs   292658.    3.44KB     29.3
#> 2 cd_contents   8.88µs   9.66µs    87308.    1.27KB     26.2
#> 3 many_elts   814.71µs   1.02ms      889.   65.72KB     10.4

bench::mark(
  cds = xml_attrs(cds),
  cd_contents = xml_attrs(cd_contents),
  many_elts = xml_attrs(many_elts),
  check = FALSE
)

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           71.6µs   83.8µs   11029.     6.48KB     25.6
#> 2 cd_contents  401.2µs  475.8µs    1997.     1.27KB     26.8
#> 3 many_elts     24.5ms   25.9ms      39.1   65.72KB    102.

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           3.54µs   4.39µs   194171.    3.02KB     38.8
#> 2 cd_contents  15.28µs  18.69µs    44680.    1.27KB     44.7
#> 3 many_elts     1.07ms   1.42ms      671.   65.72KB     29.9

bench::mark(
  cds = xml_text(cds),
  cd_contents = xml_text(cd_contents),
  many_elts = xml_text(many_elts),
  check = FALSE
)

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           94.8µs  110.1µs    7962.    11.24KB     22.1
#> 2 cd_contents  526.7µs  653.4µs    1241.     1.27KB     19.2
#> 3 many_elts     43.7ms   48.6ms      21.3   65.72KB     56.7

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           11.8µs  12.22µs    74143.    5.14KB     7.42
#> 2 cd_contents  37.14µs  42.75µs    22727.    1.27KB     9.09
#> 3 many_elts     1.94ms   2.36ms      418.   65.72KB     6.30

bench::mark(
  cds = xml_length(cds),
  cd_contents = xml_length(cd_contents),
  many_elts = xml_length(many_elts),
  check = FALSE
)

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           75.1µs   85.1µs    8996.     8.69KB     18.9
#> 2 cd_contents  417.2µs  475.9µs    1739.       672B     21.2
#> 3 many_elts     23.6ms   27.2ms      35.7   32.88KB     45.9

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds            1.9µs   2.06µs   418749.    2.41KB      0  
#> 2 cd_contents   6.13µs   6.71µs   121050.      672B     36.3
#> 3 many_elts   712.37µs 852.89µs     1075.   32.88KB     14.7

bench::mark(
  cds = xml_path(cds),
  cd_contents = xml_path(cd_contents),
  many_elts = xml_path(many_elts),
  check = FALSE
)

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           80.9µs   96.3µs    8871.     6.12KB     18.9
#> 2 cd_contents  499.1µs  614.8µs    1412.     1.27KB     14.6
#> 3 many_elts     56.2ms     67ms      15.6   65.72KB     26.0

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           15.7µs   18.5µs   52116.     2.46KB     5.21
#> 2 cd_contents  117.7µs  134.9µs    7305.     1.27KB     2.01
#> 3 many_elts     25.6ms     29ms      33.5   65.72KB     0

bench::mark(
  cds = xml_type(cds),
  cd_contents = xml_type(cd_contents),
  many_elts = xml_type(many_elts),
  check = FALSE
)

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           45.1µs   53.1µs   15496.      8.1KB     21.1
#> 2 cd_contents  232.4µs  281.6µs    2984.     1.92KB     21.3
#> 3 many_elts     16.1ms   17.8ms      52.9    98.6KB     26.5

#> # A tibble: 3 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 cds           1.85µs   2.22µs   360872.    5.73KB     36.1
#> 2 cd_contents   6.04µs    8.2µs   104980.    1.92KB     31.5
#> 3 many_elts   702.33µs    870µs     1076.    98.6KB     15.0

Created on 2023-08-30 with reprex v2.0.2

@mgirlich mgirlich marked this pull request as draft August 23, 2023 07:49
@mgirlich
Copy link
Contributor Author

@hadley It would be great to get your opinion of whether this approach is fine for you. It kind of works against the dispatch system. But I'm also not quite sure whether the dispatch is really needed as there are only three classes (<xml_missing>, <xml_node> and <xml_nodeset>) and I don't think the user really needs to extend them.

The performance gains are quite big (around a factor of 20), which would be nice for e.g. the {paws} package (a package to work with AWS).

Overall, the code is pretty redundant so we could also use a macro or some other CPP techniques but I'm not so familiar with C/CPP to know about the potential downsides.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think switching from S3 dispatch to something custom at the C-level is fine, although as I noted I think providing a helper that makes it possible to use switch will make the implementation cleaner (and a little easier to extend in the unlikely event that is necessary).

I'm not too worried about the duplication, although the idiosyncratic mix of C and C++ is going to make maintenance generally harder. Possibly might be better to use cpp11 and some of the vector helpers that it provides.

R/xml_text.R Show resolved Hide resolved
@@ -1,3 +1,13 @@
test_that("xml_name() returns the name", {
x <- sample_nodeset()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would be better to copy and paste the contents of sample_nodeset() here; otherwise it's hard to verify that the test is correct.

src/xml2_node.cpp Outdated Show resolved Hide resolved
@mgirlich mgirlich changed the title Implement xml_name.xml_nodeset() in C Implement xml_name(), xml_text() and xml_type() in C Aug 29, 2023
@mgirlich mgirlich changed the title Implement xml_name(), xml_text() and xml_type() in C Replace S3 dispatch by C implementation Aug 29, 2023
@mgirlich mgirlich marked this pull request as ready for review August 30, 2023 06:27
inst/include/xml2_types.h Outdated Show resolved Hide resolved
Comment on lines +41 to +44
XPtrNode node(node_sxp);

std::string name = nodeName(node.checked_get(), nsMap);
out = Rf_mkCharLenCE(name.c_str(), name.size(), CE_UTF8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this has always been the case, but if we have an R error anywhere in here, then it will skip over any C++ destructors (like for XPtrNode, if it has one, and std::string) on the way out

I don't think there is a reason to try and improve this in this PR, but I will keep an eye out for potentially problematic cases as I continue my review

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we have an odd mix of C and C++ in this package which we should try and fix at some point.

src/xml2_node.cpp Outdated Show resolved Hide resolved
src/xml2_node.cpp Outdated Show resolved Hide resolved
src/xml2_node.cpp Outdated Show resolved Hide resolved
src/xml2_node.cpp Outdated Show resolved Hide resolved
Comment on lines 660 to 662
if (n == 0) {
return Rf_ScalarInteger(0);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything is fine if you remove this. Generally i try and reserve early exits for cases where they are absolutely necessary (like if your algorithm requires >=2 elements or something, like most sorting algos)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, I see what you are doing here. This seems odd. If the nodeset is length 0, shouldn't the output be length 0? i.e. integer()? In which case you could definitely just remove this then it would do the right thing.

Maybe this is some old behavior I don't know about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed this and found it quite weird which is why I opened an issue #404. I wanted to tackle that in a separate PR, forgot to mention this here.

tests/testthat/test-xml_type.R Outdated Show resolved Hide resolved
@mgirlich mgirlich mentioned this pull request Sep 4, 2023
@hadley hadley merged commit 1c4453f into r-lib:main Nov 2, 2023
@hadley
Copy link
Member

hadley commented Nov 30, 2023

This breaks fhircracker

setClass(
	Class = "fhir_bundle_xml",
	contains = c("fhir_resource_xml", "fhir_bundle"),
	slots = c(next_link = "fhir_url", self_link = "fhir_url"),
	prototype = prototype(xml2::read_xml(x = "<Bundle></Bundle>"))
)

@hadley
Copy link
Member

hadley commented Nov 30, 2023

I'm guessing this is because while inherits(x, "xml_node") is still true, Rf_inherits() doesn't handle S4 classes.

Yeah — looks like it https://github.com/wch/r-source/blob/d6dad605b05810cb43f991d292169a2cd436a818/src/include/Rinlinedfuns.h#L774.

hadley added a commit that referenced this pull request Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trim all xml_text results from a nodeset at the same time
3 participants