
Add a high-level function for table upload to database #172

Closed

Conversation

@lungben commented Apr 6, 2020

It would be good to have an easy-to-use high-level function for efficiently uploading tabular data into a Postgres table, analogous to Pandas' to_sql().
The upload! function defined in this PR uses the COPY FROM STDIN functionality (which is much faster than SQL INSERT statements) and is essentially a thin wrapper around LibPQ.CopyIn (following the example provided in the documentation).
It takes care of column ordering and missing data, and escapes CSV delimiters in strings.
As far as I can see, this functionality fits best into LibPQ.jl (not e.g. DataFrames.jl) because it is specific to Postgres, but it should work for any structure implementing the Tables.jl interface.

Usage example:

using LibPQ, DataFrames, Dates

conn = LibPQ.Connection("dbname=postgres user=$DATABASE_USER password=$DATABASE_PASSWORD")
execute(conn, """CREATE TEMPORARY TABLE libpqjl_test (
                id int PRIMARY KEY,
                start_date timestamp,
                price numeric,
                comment varchar)""")
df = DataFrame(id=[1, 2, 3],
               start_date=[DateTime("2020-04-06T12:54:23"), DateTime("2016-04-06T02:54:13"), DateTime("2030-07-25T14:54:23")],
               price=[124.23, 17, -0.532],
               comment=["string with , in between", "string with \"quotes\"", "for, more; \$, and \"fun\""])
result = upload!(df, conn, "libpqjl_test")
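
For reference, the approach boils down to something like the following (a simplified sketch, not the exact code in this PR; the name upload_sketch! and the quoting rule are only illustrative):

# Build one CSV line per row of a Tables.jl source and stream it
# through LibPQ.CopyIn via COPY FROM STDIN.
using LibPQ, Tables

function upload_sketch!(table, conn::LibPQ.Connection, tablename::AbstractString)
    cols = Tables.columnnames(Tables.columns(table))
    colstr = join(cols, ", ")
    # Quote every non-missing field and double embedded quotes;
    # an unquoted empty field is read as NULL by COPY ... (FORMAT CSV).
    field(x) = x === missing ? "" : "\"" * replace(string(x), "\"" => "\"\"") * "\""
    lines = (join((field(Tables.getcolumn(row, c)) for c in cols), ",") * "\n"
             for row in Tables.rows(table))
    copyin = LibPQ.CopyIn("COPY $tablename ($colstr) FROM STDIN (FORMAT CSV);", lines)
    execute(conn, copyin)
end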

Note that in order to run my tests I included #171 as the first commit of this PR.

Comments are welcome and please let me know if you encounter any issues!

@iamed2 (Collaborator) commented Apr 6, 2020

Most of this function is CSV writing. I think that functionality (given a Tables.jl table, return an iterable of CSV-formatted strings, one per line) should be added to CSV.jl. At that point, this becomes the combination of a simple query and a CSV.jl function call, which can just be added to the documentation.

@lungben (Author) commented Apr 6, 2020

I could not figure out how to re-use CSV.jl for this purpose.
It would be great if you could add an example of how to do this to the documentation.

Edit: reverted the changes related to #171.

@iamed2 (Collaborator) commented Apr 9, 2020

> I could not figure out how to re-use CSV.jl for this purpose.

This functionality does not currently exist in CSV.jl; I suggested adding it.

Currently the functionality for writing a CSV file exists as CSV.write. I am suggesting adding another method that works the same way, but as an iterator. Instead of writing a row to a file, it would return the row as a string on each iteration.
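
Roughly, I imagine something like this (a sketch, shown here with the CSV.RowWriter name the feature later landed under, as used further down in this thread):

using CSV, DataFrames

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Each iteration yields one CSV-formatted line as a String,
# starting with the header line.
for line in CSV.RowWriter(df)
    print(line)   # "a,b\n", then "1,x\n", then "2,y\n"
end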

cc @quinnj in case he has already thought of adding this

@lungben (Author) commented Apr 14, 2020

Being able to use CSV.jl for this purpose would be great; there are probably many edge cases which my simple function cannot handle.
Would it be helpful to open an issue for this in CSV.jl?

@lungben (Author) commented Jun 17, 2020

The iterator has been implemented in CSV.jl; I'll take a look and update this PR.

@lungben (Author) commented Jun 19, 2020

The following function uses the new CSV.jl row iterator for writing tables to Postgres:

function load_by_copy!(table, con::LibPQ.Connection, tablename::AbstractString)
    it = CSV.RowWriter(table)  # yields CSV-formatted strings, header line first
    row_names = first(it)      # header line containing the column names
    copyin = LibPQ.CopyIn("COPY $tablename ($row_names) FROM STDIN (FORMAT CSV);", Iterators.drop(it, 1))
    execute(con, copyin)
end
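
For example, reusing conn, df, and the libpqjl_test table from my first comment (df_roundtrip is just an illustrative name):

load_by_copy!(df, conn, "libpqjl_test")
df_roundtrip = DataFrame(execute(conn, "SELECT * FROM libpqjl_test;"))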

It is much faster (and probably much more robust) than my initial version.
However, adding this to LibPQ would require CSV.jl as a dependency.
An alternative would be to add it to the LibPQ documentation, or to a separate package (probably overkill for 5 LOC).
What do you think?

@iamed2 (Collaborator) commented Jun 23, 2020

I suggest adding it to the docs in place of the existing COPY section:

iter = CSV.RowWriter(df)
column_names = first(iter)
copyin = LibPQ.CopyIn("COPY my_table ($column_names) FROM STDIN (FORMAT CSV, HEADER);", iter)
execute(conn, copyin)

You can also avoid Iterators.drop by using the HEADER option.

This would also make a good integration test, and CSV would be fine as a test dependency IMO.
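
A rough sketch of such a test (assuming a working conn and that Test, CSV, DataFrames, and LibPQ are loaded; names are illustrative):

@testset "COPY FROM STDIN via CSV.RowWriter" begin
    df = DataFrame(id = [1, 2], name = ["a", "b"])
    execute(conn, "CREATE TEMPORARY TABLE copy_test (id int, name varchar);")
    iter = CSV.RowWriter(df)
    column_names = first(iter)
    execute(conn, LibPQ.CopyIn("COPY copy_test ($column_names) FROM STDIN (FORMAT CSV, HEADER);", iter))
    roundtrip = DataFrame(execute(conn, "SELECT * FROM copy_test ORDER BY id;"))
    @test roundtrip.id == df.id
    @test roundtrip.name == df.name
end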
