http_client source corrupts binary data before passing it to the decoder #16814

Closed
Dnnd opened this issue Mar 16, 2023 · 1 comment · Fixed by #17655
Labels: source: http_client, type: bug

Comments

Dnnd commented Mar 16, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I think I found a bug in the http_client source implementation.

Consider the following pipeline:

[sources.native_messages]
type = "http_client"
endpoint = "http://127.0.0.1:8080/native"
decoding.codec = "native"

[sinks.print]
type = "console"
inputs = ["native_messages"]
encoding.codec = "json"

Expectation

Data in Vector's native binary format, served from http://127.0.0.1:8080/native, is decoded, re-encoded as JSON, and printed to stdout.

Reality

2023-03-16T03:14:28.853215Z ERROR source{component_kind="source" component_id=native_messages component_type=http_client component_name=native_messages}: vector::internal_events::codecs: Failed deserializing frame. error=failed to decode Protobuf message: invalid tag value: 0 error_type="parser_failed" stage="processing" internal_log_rate_limit=true

This error isn't triggered with other source types (everything works as expected with the "stdin" source and decoding.codec = "native").

I traced the error down. I think the root cause is the on_response callback in http_client::HttpClientContext:

fn on_response(&mut self, _url: &Uri, _header: &Parts, body: &Bytes) -> Option<Vec<Event>> {
    // get the body into a byte array
    let mut buf = BytesMut::new();
    let body = String::from_utf8_lossy(body);
    buf.extend_from_slice(body.as_bytes());
    // decode and enrich
    let mut events = self.decode_events(&mut buf);
    self.enrich_events(&mut events);
    Some(events)
}

This callback unconditionally invokes String::from_utf8_lossy on the response body before passing it to the decoder, which replaces every invalid UTF-8 sequence with U+FFFD. As a result, any binary payload received by the "http_client" source that is not valid UTF-8 gets corrupted, which is relevant for the "native" and "bytes" codecs.
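To illustrate the corruption, here is a minimal std-only sketch of the same bytes-to-lossy-string-to-bytes round trip. The helper name `lossy_round_trip` and the sample payload are made up for illustration; the payload is arbitrary bytes, not a real Vector native frame.

```rust
/// Hypothetical helper that mimics the conversion done in `on_response`:
/// raw bytes -> lossy UTF-8 string -> bytes again.
fn lossy_round_trip(body: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(body).as_bytes().to_vec()
}

fn main() {
    // Arbitrary sample payload (an assumption, not real native-codec data).
    // 0xff and 0x80 are not valid UTF-8 on their own.
    let body: &[u8] = &[0x0a, 0x02, 0xff, 0x80];
    let round_tripped = lossy_round_trip(body);

    println!("original:      {:02x?}", body);
    println!("round-tripped: {:02x?}", round_tripped);

    // Each invalid byte was replaced by U+FFFD (ef bf bd in UTF-8),
    // so the buffer handed to the decoder no longer matches the body.
    assert_ne!(body, round_tripped.as_slice());
}
```

Valid UTF-8 input survives the round trip unchanged, which is probably why the problem does not show up with text-based codecs.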

Configuration

[sources.native_messages]
type = "http_client"
endpoint = "http://127.0.0.1:8080/native"
decoding.codec = "native"

[sinks.print]
type = "console"
inputs = ["native_messages"]
encoding.codec = "json"

Version

vector 0.28.1 (x86_64-unknown-linux-gnu ff15924 2023-03-06)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

Dnnd added the "type: bug" label on Mar 16, 2023
jszwedko (Member) commented

Thanks @Dnnd ! Good find. I agree we should not unconditionally be treating the bytes as UTF-8.

This is related to #16406
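A minimal sketch of what not treating the bytes as UTF-8 could look like: copy the body into the buffer untouched and let the configured decoder interpret it. This is only an illustration under stated assumptions, not necessarily the change that was merged; `fill_buffer` is a hypothetical stand-in for the buffer-filling step of `on_response`, and a plain `Vec<u8>` stands in for `BytesMut`.

```rust
/// Hypothetical stand-in for the buffer-filling step of `on_response`.
/// The body is copied byte for byte, with no UTF-8 conversion in between.
fn fill_buffer(body: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(body.len());
    buf.extend_from_slice(body);
    buf
}

fn main() {
    // Arbitrary sample payload containing bytes that are not valid UTF-8.
    let body: &[u8] = &[0x0a, 0x02, 0xff, 0x80];
    let buf = fill_buffer(body);

    // The decoder would now see exactly the bytes the server sent.
    assert_eq!(buf.as_slice(), body);
    println!("buffer: {:02x?}", buf);
}
```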

jszwedko added the "source: http_server" label on Mar 16, 2023
neuronull added the "source: http_client" label and removed the "source: http_server" label on Apr 5, 2023
dsmith3197 self-assigned this on Jun 8, 2023