
Add a high-level function for table upload to database #172

Closed

Conversation

@lungben commented Apr 6, 2020

It would be good to have an easy-to-use high-level function for efficiently uploading tabular data into a Postgres table, analogous to Pandas' to_sql().
The upload! function defined in this PR uses the COPY FROM STDIN functionality (which is much faster than SQL INSERT statements) and is essentially a thin wrapper around LibPQ.CopyIn (following the example provided in the documentation).
It takes care of column ordering and missing data, and escapes CSV delimiters in strings.
As far as I can see, this functionality fits best into LibPQ.jl (not e.g. DataFrames.jl) because it is specific to Postgres, but it should work for any structure implementing the Tables.jl interface.

Usage example:

using LibPQ, DataFrames, Dates

conn = LibPQ.Connection("dbname=postgres user=$DATABASE_USER password=$DATABASE_PASSWORD")
execute(conn, """CREATE TEMPORARY TABLE libpqjl_test (
                id int PRIMARY KEY,
                start_date timestamp,
                price numeric,
                comment varchar)""")
df = DataFrame(id=[1, 2, 3],
               start_date=[DateTime("2020-04-06T12:54:23"), DateTime("2016-04-06T02:54:13"), DateTime("2030-07-25T14:54:23")],
               price=[124.23, 17, -0.532],
               comment=["string with , in between", "string with \"quotes\"", "for, more; \$, and \"fun\""])
result = upload!(df, conn, "libpqjl_test")
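
For reference, the approach boils down to something like the following (a simplified sketch, not the exact code in this PR; the name upload_sketch! and the quoting rule are only illustrative):

# Build one CSV line per row of a Tables.jl source and stream it
# through LibPQ.CopyIn via COPY FROM STDIN.
using LibPQ, Tables

function upload_sketch!(table, conn::LibPQ.Connection, tablename::AbstractString)
    cols = Tables.columnnames(Tables.columns(table))
    colstr = join(cols, ", ")
    # Quote every non-missing field and double embedded quotes;
    # an unquoted empty field is read as NULL by COPY ... (FORMAT CSV).
    field(x) = x === missing ? "" : "\"" * replace(string(x), "\"" => "\"\"") * "\""
    lines = (join((field(Tables.getcolumn(row, c)) for c in cols), ",") * "\n"
             for row in Tables.rows(table))
    copyin = LibPQ.CopyIn("COPY $tablename ($colstr) FROM STDIN (FORMAT CSV);", lines)
    execute(conn, copyin)
end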

Note that in order to run my tests I included #171 as the first commit of this PR.

Comments are welcome and please let me know if you encounter any issues!

@iamed2 (Collaborator) commented Apr 6, 2020

Most of this function is CSV writing. I think that functionality (given a Tables.jl table, return an iterable of CSV-formatted strings, one per line) should be added to CSV.jl. At that point, this becomes the combination of a simple query and a CSV.jl function call, which can just be added to the documentation.

@lungben (Author) commented Apr 6, 2020

I could not figure out how to re-use CSV.jl for this purpose.
It would be great if you could add an example of how to do this to the documentation.

Edit: reverted the changes related to #171.

@iamed2 (Collaborator) commented Apr 9, 2020

> I could not figure out how to re-use CSV.jl for this purpose.

This functionality does not currently exist in CSV.jl; I suggested adding it.

Currently the functionality for writing a CSV file exists as CSV.write. I am suggesting adding another method that works the same way, but as an iterator. Instead of writing a row to a file, it would return the row as a string on each iteration.
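
Roughly, I imagine something like this (a sketch, shown here with the CSV.RowWriter name the feature later landed under, as used further down in this thread):

using CSV, DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Each iteration yields one CSV-formatted line as a String,
# starting with the header line.
for line in CSV.RowWriter(df)
    print(line)   # "a,b\n", then "1,x\n", then "2,y\n"
end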

cc @quinnj in case he has already thought of adding this

@lungben (Author) commented Apr 14, 2020

Being able to use CSV.jl for this purpose would be great; there are probably many edge cases which my simple function cannot handle.
Would it be helpful to open an issue for this in CSV.jl?

@lungben (Author) commented Jun 17, 2020

The iterator has been implemented in CSV.jl; I'll take a look and update this PR.

@lungben (Author) commented Jun 19, 2020

The following function uses the new CSV.jl row iterator for writing tables to Postgres:

function load_by_copy!(table, con::LibPQ.Connection, tablename::AbstractString)
    it = CSV.RowWriter(table)  # yields CSV-formatted strings, header line first
    row_names = first(it)      # header line containing the column names
    copyin = LibPQ.CopyIn("COPY $tablename ($row_names) FROM STDIN (FORMAT CSV);", Iterators.drop(it, 1))
    execute(con, copyin)
end
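
For example, reusing conn, df, and the libpqjl_test table from my first comment (df_roundtrip is just an illustrative name):

load_by_copy!(df, conn, "libpqjl_test")
df_roundtrip = DataFrame(execute(conn, "SELECT * FROM libpqjl_test;"))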

It is much faster (and probably much more robust) than my initial version.
However, adding this to LibPQ would require CSV.jl as a dependency.
An alternative would be to add it to the LibPQ documentation, or to a separate package (probably overkill for 5 LOC).
What do you think?

@iamed2 (Collaborator) commented Jun 23, 2020

I suggest adding it to the docs in place of the existing COPY section:

iter = CSV.RowWriter(df)
column_names = first(iter)
copyin = LibPQ.CopyIn("COPY my_table ($column_names) FROM STDIN (FORMAT CSV, HEADER);", iter)
execute(conn, copyin)

You can also avoid Iterators.drop by using the HEADER option.

This would also make a good integration test, and CSV would be fine as a test dependency IMO.
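
A rough sketch of such a test (assuming a working conn and that Test, CSV, DataFrames, and LibPQ are loaded; names are illustrative):

@testset "COPY FROM STDIN via CSV.RowWriter" begin
    df = DataFrame(id = [1, 2], name = ["a", "b"])
    execute(conn, "CREATE TEMPORARY TABLE copy_test (id int, name varchar);")
    iter = CSV.RowWriter(df)
    column_names = first(iter)
    execute(conn, LibPQ.CopyIn("COPY copy_test ($column_names) FROM STDIN (FORMAT CSV, HEADER);", iter))
    roundtrip = DataFrame(execute(conn, "SELECT * FROM copy_test ORDER BY id;"))
    @test roundtrip.id == df.id
    @test roundtrip.name == df.name
end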
