dbWriteTable vs dbAppendTable google cloud Postgres #241

Closed
ghost opened this issue Apr 10, 2020 · 6 comments · Fixed by #322

ghost commented Apr 10, 2020

Hello,
I'm trying to write a table to a Google Cloud Postgres database. See below for a reprex.

dbWriteTable() takes 0.23 s, while dbCreateTable() + dbAppendTable() takes 10.6 s.

Why does the documentation (?dbWriteTable) say "New code should prefer dbCreateTable() and dbAppendTable()" when it is so much slower?

library(DBI)
library(lubridate) # for the %--% interval operator used below
library(magrittr)  # for %>%

# Make database connection
db <-
  dbConnect(
    RPostgres::Postgres(),
    host = "xx.xxx.x.xxx",
    user = "xxxxx",
    password = "xxxxx",
    dbname = "xxxxx"
  )

# Insert data with dbCreateTable + dbAppendTable
# Output: Time spent dbAppendTable 10.63571
start_time <- Sys.time()
db %>% dbCreateTable(name = "iris", iris)
db %>% dbAppendTable(name = "iris", iris)
end_time <- Sys.time()
cat("Time spent dbAppendTable", (start_time %--% end_time))

# Insert data with dbWriteTable
# (drop the table first so the fresh write succeeds)
# Output: Time spent dbWriteTable 0.2309279
db %>% dbRemoveTable("iris")
start_time <- Sys.time()
db %>% dbWriteTable(name = "iris", iris)
end_time <- Sys.time()
cat("Time spent dbWriteTable", (start_time %--% end_time))
jakejh commented Oct 2, 2020

I've observed the same behavior on Mac and on EC2 running Ubuntu, explicitly comparing dbAppendTable with dbWriteTable(..., append = TRUE). The speed difference is enormous.
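For reference, a minimal sketch of the two call patterns being compared, assuming an open RPostgres connection con and a data frame df (both placeholders):

library(DBI)

con <- dbConnect(RPostgres::Postgres())  # connection details omitted
df <- iris

dbCreateTable(con, "mytable", df)  # create the (empty) target table

# Variant 1 (slow here): row-wise append
dbAppendTable(con, "mytable", df)

# Variant 2 (fast here): bulk append -- run instead of, not after, variant 1
dbWriteTable(con, "mytable", df, append = TRUE)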

krlmlr (Member) commented Oct 18, 2020

Thanks. I see that dbWriteTable() defaults to copy = TRUE, which emits a COPY ... FROM STDIN query. With dbWriteTable(copy = FALSE), a single SQL string is generated. dbAppendTable() uses a third approach: parametrized queries.

I'm observing a substantial slowdown even on a local server when writing 1000 copies of the data. The slowdown is reduced substantially, but still amounts to a factor of ~4-5, when dbAppendTable() is surrounded by dbBegin() and dbCommit().

I think we should use the COPY ... FROM STDIN approach in dbAppendTable(), and the parametrized query with begin+commit for RRedshift. Happy to review a pull request.
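A rough sketch of the SQL each of these three code paths emits, as described above (column lists abbreviated; the exact statements RPostgres generates may differ):

# 1. dbWriteTable(con, "iris", iris)               # copy = TRUE (default)
#      COPY "iris" FROM STDIN                      -- one bulk data transfer

# 2. dbWriteTable(con, "iris", iris, copy = FALSE)
#      INSERT INTO "iris" VALUES (5.1, 3.5, ...), (4.9, 3.0, ...), ...
#      -- one large SQL string

# 3. dbAppendTable(con, "iris", iris)
#      INSERT INTO "iris" VALUES ($1, $2, $3, $4, $5)
#      -- prepared once, then executed with bound parameters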

library(DBI)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(magrittr)

# Make database connection
db <-
  dbConnect(
    RPostgres::Postgres()
  )

# Insert data with dbAppendTable
start_time <- Sys.time()
try(db %>% dbRemoveTable("iris"))
db %>% dbCreateTable(name = "iris", iris[0, ])
db %>% dbBegin()
db %>% dbAppendTable(name = "iris", iris[rep(1:150, each = 1000), ])
#> Warning: Factors converted to character
#> [1] 150000
db %>% dbCommit()
end_time <- Sys.time()
cat("Time spent dbAppendTable", (start_time %--% end_time))
#> Time spent dbAppendTable 4.068846

# Insert data with dbWriteTable
start_time <- Sys.time()
try(db %>% dbRemoveTable("iris"))
db %>% dbWriteTable(name = "iris", iris[rep(1:150, each = 1000), ])
end_time <- Sys.time()
cat("Time spent dbWriteTable", (start_time %--% end_time))
#> Time spent dbWriteTable 0.9158208

Created on 2020-10-18 by the reprex package (v0.3.0)
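As a side note, the explicit dbBegin()/dbCommit() pair in the reprex above can also be written with DBI's dbWithTransaction(), which additionally rolls back on error (sketch, using the same connection and data as above):

dbWithTransaction(db, {
  dbAppendTable(db, name = "iris", iris[rep(1:150, each = 1000), ])
})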

Gutschlhofer commented Dec 14, 2020

In my case, this problem is even more pronounced: I have to insert around 2 million rows (multiple times), and with dbAppendTable() one dataset took over 7 hours. Thanks to the discussion here I tried dbWriteTable(..., append = TRUE) and reduced the time to less than 3 minutes. I suggest that, until this is fixed, the dbAppendTable() documentation should mention the (potentially) huge performance gains from using dbWriteTable(..., append = TRUE). A sketch of the workaround follows below.
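A sketch of that workaround for a multi-million-row insert, assuming an open connection con, a data frame big_df, and an existing target table target_table with a matching schema (all names hypothetical); chunking is optional but keeps memory use bounded:

library(DBI)

chunk_size <- 500000L
row_groups <- split(
  seq_len(nrow(big_df)),
  ceiling(seq_len(nrow(big_df)) / chunk_size)
)
for (idx in row_groups) {
  dbWriteTable(con, "target_table", big_df[idx, ], append = TRUE)
}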

JSchoenbachler (Contributor) commented

I created a pull request (#286) to address this. @krlmlr, could you review it and let me know if you'd like any changes?

eweisbrod commented, quoting @Gutschlhofer's suggestion above:

Thank you, I had the same issue and this was very helpful for me. I can confirm the huge performance gains.

krlmlr added a commit that referenced this issue Sep 11, 2021
- `dbAppendTable()` gains `copy` argument, `TRUE` by default. If set, data is imported via `COPY name FROM STDIN` (#241, @hugheylab).
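
After that change, existing dbAppendTable() calls pick up the fast COPY path automatically; presumably the previous parametrized-query behavior remains available by opting out (sketch, assuming an open connection db as in the reprexes above):

dbAppendTable(db, "iris", iris)                # now uses COPY ... FROM STDIN
dbAppendTable(db, "iris", iris, copy = FALSE)  # previous behavior, per the NEWS entry above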