
feat(postgres sink): Add postgres sink #21248

Open · jorgehermo9 wants to merge 25 commits into master
Conversation


@jorgehermo9 jorgehermo9 commented Sep 9, 2024

Closes #15765

This PR is not 100% ready on my side and there are likely a few things wrong, but I had a few questions and wanted to know whether the direction seems right, so I would like an initial round of review if possible.

I tested the sink and it seems to be working, but I lack a lot of knowledge about Vector's internals and I'm not sure whether the implementation is okay.

I took a lot of inspiration from the databend and clickhouse sinks, but left a few questions as TODOs in the source. I found this sink a bit different from the others: they use the request_builder machinery and encode the payload into bytes (as most sinks are HTTP-based), but I didn't think that fitted well here, since with the sqlx API I wrap the events in the sqlx::types::Json type, which does all the encoding with serde internally.
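For illustration, a minimal sketch of that approach (hypothetical function and parameter names, not the PR's actual code), assuming a sqlx PgPool and one serde_json::Value per event:

use serde_json::Value;
use sqlx::{types::Json, PgPool};

// Sketch: insert a batch of events as a single JSONB array.
// jsonb_populate_recordset expands the array into one row per element,
// matching JSON keys to the target table's columns.
async fn insert_batch(pool: &PgPool, table: &str, events: Vec<Value>) -> Result<(), sqlx::Error> {
    sqlx::query(&format!(
        "INSERT INTO {table} SELECT * FROM jsonb_populate_recordset(NULL::{table}, $1)"
    ))
    // Json<T> serializes with serde internally, so no manual byte encoding is needed.
    .bind(Json(events))
    .execute(pool)
    .await?;
    Ok(())
}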

If someone wants to test it manually, I used this Vector config:

[sources.demo_logs]
type = "demo_logs"
format = "apache_common"

[transforms.payload]
type = "remap"
inputs = ["demo_logs"]
source = """
.payload = .
"""

[sinks.postgres]
type = "postgres"
inputs = ["payload"]
endpoint = "postgres://postgres:postgres@localhost/test"
table = "test"

Run a Postgres server with:

podman run -e POSTGRES_PASSWORD=postgres -p 5432:5432 docker.io/postgres

Then connect with psql -h localhost -U postgres and execute:

CREATE DATABASE test;

Switch to the new database with \c test and finally create the table:

CREATE TABLE test (message TEXT, timestamp TIMESTAMP WITH TIME ZONE, payload JSONB);

After that, you should see the demo logs as rows in that table.

@jorgehermo9 jorgehermo9 requested a review from a team as a code owner September 9, 2024 22:33
@github-actions github-actions bot added the domain: sinks and domain: ci labels Sep 9, 2024
}

/// Configuration for the `postgres` sink.
#[configurable_component(sink("postgres", "Deliver log data to a PostgreSQL database."))]
@jorgehermo9 (Contributor, Author):
should I call this sink postgres or postgres_logs?

@pront (Member):

Hm good question, could this evolve to handle both logs and metrics in the future?

@jorgehermo9 (Contributor, Author) commented Oct 15, 2024:

I'm wondering whether this could evolve to integrate with other Postgres flavours such as TimescaleDB, which is oriented towards time series.

My thoughts on this: #21308 (comment)

Timescaledb tracking issue: #939

@jorgehermo9 (Contributor, Author):

I think it would be interesting to change the input of this sink (currently Input::log()) to allow metrics and traces too. I'll give it a try.
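A minimal sketch of that change in the sink's SinkConfig implementation (assuming Vector's Input::all() covers logs, metrics, and traces):

fn input(&self) -> Input {
    // Sketch: accept every event type instead of logs only.
    Input::all()
}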

@jorgehermo9 jorgehermo9 requested review from a team as code owners September 9, 2024 22:38
@github-actions github-actions bot added the domain: external docs label Sep 9, 2024
// TODO: If a single item of the batch fails, the whole batch will fail its insert.
// Is this intended behaviour?
sqlx::query(&format!(
"INSERT INTO {table} SELECT * FROM jsonb_populate_recordset(NULL::{table}, $1)"
@jorgehermo9 (Contributor, Author):

The table configuration is vulnerable to SQL injection, but in my opinion we shouldn't try to prevent that kind of attack at this level; the user should be responsible for ensuring that there is no SQL injection in the config. The databend sink works like this.

@pront (Member):

Hm, I suppose sqlx does not support parameterized table names? Does the query builder help here? If none of the above works, then we can leave as is.

@jorgehermo9 (Contributor, Author):

I don't think that could help in this case. See this statement about sqlx's query builder.

And we cannot use a variable bind ($ syntax) in Postgres for table names, as prepared statements are bound to a query plan, which cannot change if the target table changes.

I think this is the best way to do it; sqlx does not check for SQL injection.
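One possible mitigation at the config-validation level, purely as a hypothetical sketch and not part of this PR, is a sanity check on the configured table name before interpolating it into the query:

// Hypothetical guard: accept only simple (optionally schema-qualified)
// identifiers, rejecting anything that could smuggle SQL into the query.
fn is_safe_identifier(table: &str) -> bool {
    !table.is_empty()
        && table.split('.').all(|part| {
            !part.is_empty()
                && part.chars().next().is_some_and(|c| c.is_ascii_alphabetic() || c == '_')
                && part.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
        })
}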

pub endpoint: String,

/// The table that data is inserted into.
pub table: String,
@jorgehermo9 (Contributor, Author):

Should we make the table templatable, like in the clickhouse sink? That would complicate the code a little (with KeyPartitioner and so on). If yes, I would like some guidance about it if possible.

@pront (Member):

It is a nice feature but not a must-have; we can do this incrementally. Once we have finalized the rest of the comments, we can come back to this if you are motivated to add the feature.

@jorgehermo9 (Contributor, Author):

Okay!

}

#[tokio::test]
async fn test_postgres_sink() {
@jorgehermo9 (Contributor, Author) commented Sep 9, 2024:

I think just a single test is too little, but I couldn't figure out anything else to test. This test is very similar to the integration tests of the databend sink.

@pront (Member):

There are some more interesting things we can do here; at the very least, send more than one event. Also, we could test failure cases such as sending a badly formatted payload.

@jorgehermo9 (Contributor, Author):

Okay, going to include more tests!
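A rough shape for one such test (a sketch assuming a reachable integration-test database; the connection string and table name are placeholders):

#[tokio::test]
async fn insert_multiple_events() {
    // Sketch: connect to the integration-test database.
    let pool = sqlx::PgPool::connect("postgres://postgres:postgres@localhost/test")
        .await
        .expect("failed to connect");

    // ... run the sink against a batch of several events here ...

    // Assert that every event in the batch was inserted.
    let count: i64 = sqlx::query_scalar("SELECT COUNT(*) FROM test")
        .fetch_one(&pool)
        .await
        .expect("failed to count rows");
    assert!(count > 1);
}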

@aliciascott (Contributor) left a comment:

good for docs

use std::future::ready;
use vector_lib::event::{BatchNotifier, BatchStatus, BatchStatusReceiver, Event, LogEvent};

fn pg_host() -> String {
@jorgehermo9 (Contributor, Author) commented Sep 24, 2024:

I copied these utility functions from the postgres_metrics source integration tests.

@pront (Member):

Let's do some small refactoring here.

  • The folder src/test_util seems like a good location to put these utils.
  • You can use #[cfg(any(feature = "postgresql_metrics-integration-tests", feature = "postgres_sink-integration-tests"))] in src/test_util/mod.rs, as sketched below.
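That gating could look roughly like this (a sketch of the suggestion; the module name postgres is hypothetical):

// src/test_util/mod.rs
#[cfg(any(
    feature = "postgresql_metrics-integration-tests",
    feature = "postgres_sink-integration-tests"
))]
pub mod postgres; // shared helpers such as pg_host()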

@pront pront self-assigned this Oct 15, 2024
@pront (Member) left a comment:

Hi @jorgehermo9, thank you for this sizable contribution! On a high level, it looks great. I did a first review and left some comments. Don't hesitate to follow up, happy to discuss details.

scripts/integration/postgres/test.yaml (thread outdated; resolved)
/// The table that data is inserted into.
pub table: String,

/// The postgres connection pool size.
@pront (Member):

Here it would be useful to explain what this pool is used for. Maybe a link to relevant docs would suffice.

@jorgehermo9 (Contributor, Author):

Done in 21f3ad3. Do you think it is enough?

I also have doubts about using a connection pool. Can the event batches be executed in parallel for this sink? I don't know the specifics of Vector's internals...

.batched(self.batch_settings.as_byte_size_config())

If batches of events can be processed in parallel, then a connection pool is beneficial. If batches are processed sequentially, we should use a single Postgres connection instead, as a pool would not make sense.
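For reference, the pool this option would control is built along these lines with sqlx (a sketch; the function and parameter names are assumptions):

use sqlx::postgres::PgPoolOptions;

// Sketch: build the connection pool from the sink configuration.
// With max_connections(1) batches are effectively serialized; a larger
// pool only pays off if batches are executed concurrently.
async fn build_pool(endpoint: &str, pool_size: u32) -> Result<sqlx::PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .max_connections(pool_size)
        .connect(endpoint)
        .await
}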

@jorgehermo9 (Contributor, Author):

Thank you very much for the review @pront! I'm kinda busy these days but I will revisit this as soon as I can :)

@pront (Member) commented Nov 25, 2024:

There are a few failing checks. Also, let's add a new postgres semantic scope in https://github.com/vectordotdev/vector/blob/master/.github/workflows/semantic.yml. I will review once these are addressed. Thank you!

@pront pront changed the title feat(sink): Add postgres sink feat(postgres sink): Add postgres sink Nov 25, 2024
@jorgehermo9 (Contributor, Author) commented Nov 25, 2024:

I will work on this PR over the next few days; I'll ping you whenever it is ready for another round. Thank you so much @pront!

@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Dec 6, 2024
Labels
  • domain: ci (Anything related to Vector's CI environment)
  • domain: external docs (Anything related to Vector's external, public documentation)
  • domain: sinks (Anything related to Vector's sinks)
  • domain: sources (Anything related to Vector's sources)
Development

Successfully merging this pull request may close these issues.

New sink: postgres
3 participants