TraceQL: support mixed-type attribute querying (int/float) #4391
Conversation
I apologize for taking so long to get to this. Your analysis is correct! We do generate predicates per column and, since we store integers and floats independently, we only scan one of the columns. Given how small int and float columns tend to be (compared to string columns) I think the performance hit of doing this is likely acceptable in exchange for the nicer behavior. What is the behavior in this case? I'm pretty sure this will work b/c the engine will request all values for the two attributes and do the work itself. I believe the engine layer will compare ints and floats correctly but I'm not 100% sure.
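For illustration only, a standalone sketch of what that engine-level work would look like (my assumption about the behavior, not Tempo's actual code): cross-type numeric equality amounts to promoting ints to floats before comparing.

package main

import "fmt"

// numericEqual promotes both operands to float64 so that an int-stored
// attribute compares equal to a float operand with the same numeric
// value (ignoring precision loss above 2^53 for very large ints).
func numericEqual(a, b any) bool {
	toFloat := func(v any) (float64, bool) {
		switch x := v.(type) {
		case int64:
			return float64(x), true
		case float64:
			return x, true
		}
		return 0, false
	}
	fa, okA := toFloat(a)
	fb, okB := toFloat(b)
	return okA && okB && fa == fb
}

func main() {
	fmt.Println(numericEqual(int64(2), 2.0)) // true
	fmt.Println(numericEqual(int64(2), 2.5)) // false
}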
Tests should also be added here for the new behavior. These tests build a block and then search for a known trace using a large range of TraceQL queries. If you add tests here and they pass, it means that your changes work from the Parquet file all the way up through the engine. This will also break the "allConditions" optimization if the user types any query with a number comparison. I would like to preserve the allConditions behavior in this case b/c it's such a nice optimization and number queries are common. I'm not quite sure why the
Thank you for confirming the approach and pointing out the
I verified that
Done. Let me know if I missed something.
Regarding the
Given my limited exposure to Tempo’s internals, I’d appreciate any guidance on whether these routes are viable or if there’s a simpler approach to preserve
P.S. Do we care about comparisons with negative values? Should they also be covered?
This is a really cool change. Ran benchmarks and found no major regressions. Nice tests added in ./tempodb. We try to keep those as comprehensive as possible given the complexity of the language.
This case is covered in the ./pkg/traceql tests so I wouldn't worry about it. It occurred to me that this case sends two "OpNone" conditions to the fetch layer and the condition itself is evaluated in the engine, so your changes will not impact it.
Nice improvements here. I like falling back to integer comparison (or nothing) based on whether the float has a fractional part.
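A minimal standalone sketch of that fallback for equality comparisons (assumed logic, not the PR's actual code):

package main

import (
	"fmt"
	"math"
)

// intEqualityForFloat derives an int-column equality predicate from a
// float operand; this is only possible when the float has no
// fractional part, otherwise equality can never hold against an int.
func intEqualityForFloat(f float64) (int64, bool) {
	if f != math.Trunc(f) {
		return 0, false // e.g. `= 2.5` matches nothing in the int column
	}
	return int64(f), true // e.g. `= 2.0` matches int 2
}

func main() {
	fmt.Println(intEqualityForFloat(2.0)) // 2 true
	fmt.Println(intEqualityForFloat(2.5)) // 0 false
}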
The right choice would be a
Also, if you're interested, plug your queries into this test and run it. It will dump the iterator structure and you can see how your changes have impacted the hierarchy.
Yes, are they not already? Reviewing your code I think they would work fine. I think my primary ask at this point would be to keep the int and float switch cases symmetrical. Even though it's trivial, can you create a
I'm a bit impressed you're taking this on. I wouldn't have guessed someone outside of Grafana would have had the time and patience to find this.
I'm not sure if it's worth it. I'd rather rely on your opinion here.
Actually, it turned out they didn't work correctly with negative values. I've updated the shifting logic to fix this. Also, another edge case raises questions: what happens if a float hits MaxInt/MinInt? In some cases, it might cause jumps between MaxInt and MinInt.
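As a standalone illustration of that edge case (my assumption about the failure mode, not the PR's code): float64 cannot represent MaxInt64 exactly, and converting an out-of-range float to int64 is implementation-defined in Go, which is exactly where jumps between MaxInt and MinInt can come from. Clamping avoids it:

package main

import (
	"fmt"
	"math"
)

// clampToInt64 converts a float to int64, pinning out-of-range values
// to the int64 limits instead of relying on Go's implementation-defined
// out-of-range conversion behavior.
func clampToInt64(f float64) int64 {
	if f >= math.MaxInt64 { // the constant rounds up to 2^63 here
		return math.MaxInt64
	}
	if f <= math.MinInt64 {
		return math.MinInt64
	}
	return int64(f)
}

func main() {
	fmt.Println(clampToInt64(float64(math.MaxInt64))) // 9223372036854775807
	fmt.Println(clampToInt64(-1e300))                 // -9223372036854775808
}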
Done! Let me know if this aligns with what you had in mind. Plus, I've added some tests in a separate commit. Feel free to let me know if they look odd or need adjustments.
Haha, thanks! Honestly, it's just curiosity. Tempo is a fascinating system, and I've wanted to dive into something challenging like this. It's fun to learn from real-world systems and see how they tackle performance and scalability. :)
We could try to get tricky here. Like if you do
Yup, I think this communicates better to a future reader what's going on. Thanks for the change. Ok, I was running your branch on Friday to test and we do have one final thing to figure out. This query does not work:
The reason is b/c we handle this special column here: tempo/tempodb/encoding/vparquet4/block_traceql.go, lines 1969 to 1986 (at 14efba0).
All well known and dedicated columns are strings ... except this one unfortunately. To do this correctly we have to scan both the well known column as well as the general float attribute column if the static value being compared against http status code is a float. To do this performantly I think we will need to build a
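To make the stated requirement concrete, a toy sketch under hypothetical names and a hypothetical data layout (the real work happens in the vparquet4 iterators):

package main

import "fmt"

// Hypothetical layout, not Tempo's real storage: the dedicated
// well-known column holds int status codes, while float-typed writes
// land in the generic float attribute column.
var (
	dedicatedInts = map[string]int64{"spanA": 500, "spanB": 200}
	genericFloats = map[string]float64{"spanC": 500.0}
)

// matchStatusCode unions matches from both columns, which is what a
// correct scan for `span.http.status_code = 500.0` has to do.
func matchStatusCode(want float64) []string {
	var out []string
	for id, v := range dedicatedInts {
		if float64(v) == want {
			out = append(out, id)
		}
	}
	for id, v := range genericFloats {
		if v == want {
			out = append(out, id)
		}
	}
	return out
}

func main() { fmt.Println(matchStatusCode(500)) } // spanA and spanC, in some order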
Sounds like a plan. Will do it later. :)
Oh, that's a nice catch! But before rushing into handling this case, I want to address one quick concern. If a user specifies
Anyway, if you see real value in covering this edge case, I'm happy to implement it. Let me know what you think!
P.S. I found out that I should convert
yeah, that does kind of feel like a typo, but there's nothing special in the language about
it will find float status codes and return spans appropriately. i don't know why but for some reason we have float status codes all over the place in our internal Tempo installation.
It's funny because I really want this PR in, but the only thing blocking it is handling http status code correctly. However, I'd really like to cut a vparquet5 that removes all well known columns (and other cleanup) which would unblock this PR.
I believe I've finally learned how to use
I wish we had time to work on this. It's an undefined cleanup pass on vParquet with a focus on reducing complexity, number of columns and footer size. One of the things I'd like accomplished is removing the well known columns and instead relying on dedicated columns.
Tested and works! But there's definitely some cleanup to do.
but we don't need to scan the generic attribute column for an int. Int values are guaranteed to be stored in the dedicated column for this attribute name so we only need to scan the generic column for a float. This should simplify the iterators to something like:
unsure why you're seeing nils. I can dig into that a bit. we shouldn't need the filter nil thing.
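A toy rendering of the simplification described above, with hypothetical column names (the elided iterator structure in the comment is the authoritative version):

package main

import "fmt"

// columnsFor picks which columns to scan for an attribute that has a
// dedicated int column: an int operand never needs the generic column,
// while a float operand needs both.
func columnsFor(operandIsFloat bool) []string {
	cols := []string{"dedicated-int"} // ints are guaranteed to live here
	if operandIsFloat {
		cols = append(cols, "generic-float")
	}
	return cols
}

func main() {
	fmt.Println(columnsFor(false)) // [dedicated-int]
	fmt.Println(columnsFor(true))  // [dedicated-int generic-float]
}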
Oh my gosh! This is what happens when a review lasts too long. I started forgetting what I've been doing. :D Fixed.
If it looks good, there's one more step remaining: I need to update
Awesome! It looks like we were able to get rid of those nil filter shenanigans. I think this is very, very close. All functionality is accounted for. I did run some benchmarks and found a regression we should spend some time understanding. I do expect a bit of overhead due to this change, but one particular query is showing a 20% increase in CPU.
I can help dig into this.
These are the queries used in the benches. The regression occurred on traceOrMatch, which you can see below. As you can tell they are crafted for internal data, but they can be rewritten for any block where they get some matches.
statuscode: { span.http.status_code = 200 }
traceOrMatch: { rootServiceName = `tempo-gateway` && (status = error || span.http.status_code = 500)}
complex: {resource.cluster=~"prod.*" && resource.namespace = "tempo-prod" && resource.container="query-frontend" && name = "HTTP GET - tempo_api_v2_search_tags" && span.http.status_code = 200 && duration > 1s}
benches
> benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/tempodb/encoding/vparquet4
cpu: Apple M3 Pro
│ before.txt │ after.txt │
│ sec/op │ sec/op vs base │
BackendBlockTraceQL/statuscode-11 64.28m ± 2% 64.81m ± 1% +0.83% (p=0.043 n=10)
BackendBlockTraceQL/traceOrMatch-11 249.8m ± 8% 303.7m ± 9% +21.59% (p=0.000 n=10)
BackendBlockTraceQL/complex-11 4.944m ± 5% 4.880m ± 1% ~ (p=0.190 n=10)
geomean 42.98m 45.80m +6.56%
│ before.txt │ after.txt │
│ B/s │ B/s vs base │
BackendBlockTraceQL/statuscode-11 341.4Mi ± 2% 338.8Mi ± 1% ~ (p=0.052 n=10)
BackendBlockTraceQL/traceOrMatch-11 6.695Mi ± 8% 5.541Mi ± 9% -17.24% (p=0.000 n=10)
BackendBlockTraceQL/complex-11 182.0Mi ± 5% 184.4Mi ± 1% ~ (p=0.190 n=10)
geomean 74.66Mi 70.22Mi -5.94%
│ before.txt │ after.txt │
│ MB_io/op │ MB_io/op vs base │
BackendBlockTraceQL/statuscode-11 23.01 ± 0% 23.02 ± 0% +0.04% (p=0.000 n=10)
BackendBlockTraceQL/traceOrMatch-11 1.753 ± 0% 1.766 ± 0% +0.74% (p=0.000 n=10)
BackendBlockTraceQL/complex-11 943.7m ± 0% 943.7m ± 0% ~ (p=1.000 n=10) ¹
geomean 3.364 3.373 +0.26%
¹ all samples are equal
│ before.txt │ after.txt │
│ B/op │ B/op vs base │
BackendBlockTraceQL/statuscode-11 31.19Mi ± 1% 31.29Mi ± 1% ~ (p=0.436 n=10)
BackendBlockTraceQL/traceOrMatch-11 10.597Mi ± 17% 9.867Mi ± 35% ~ (p=0.579 n=10)
BackendBlockTraceQL/complex-11 5.387Mi ± 4% 5.413Mi ± 2% ~ (p=0.631 n=10)
geomean 12.12Mi 11.87Mi -2.09%
│ before.txt │ after.txt │
│ allocs/op │ allocs/op vs base │
BackendBlockTraceQL/statuscode-11 378.4k ± 0% 378.6k ± 0% +0.04% (p=0.000 n=10)
BackendBlockTraceQL/traceOrMatch-11 86.49k ± 1% 86.57k ± 1% ~ (p=0.218 n=10)
BackendBlockTraceQL/complex-11 79.81k ± 0% 79.83k ± 0% +0.02% (p=0.000 n=10)
geomean 137.7k 137.8k +0.05%
I cannot reproduce it. Could you tell me how you generated traces? I scribbled together such a Frankenstein monster:
package vparquet4
import (
"bytes"
"context"
"io"
"math/rand"
"os"
"sort"
"testing"
"time"
"github.com/google/uuid"
"github.com/stretchr/testify/require"
"github.com/grafana/tempo/pkg/tempopb"
"github.com/grafana/tempo/pkg/traceql"
"github.com/grafana/tempo/pkg/util/test"
"github.com/grafana/tempo/tempodb/backend"
"github.com/grafana/tempo/tempodb/backend/local"
"github.com/grafana/tempo/tempodb/encoding/common"
v1_common "github.com/grafana/tempo/pkg/tempopb/common/v1"
v1_resource "github.com/grafana/tempo/pkg/tempopb/resource/v1"
v1_trace "github.com/grafana/tempo/pkg/tempopb/trace/v1"
)
type testTrace struct {
traceID common.ID
trace *tempopb.Trace
}
type testIterator2 struct {
traces []testTrace
}
func (i *testIterator2) Next(context.Context) (common.ID, *tempopb.Trace, error) {
if len(i.traces) == 0 {
return nil, nil, io.EOF
}
tr := i.traces[0]
i.traces = i.traces[1:]
return tr.traceID, tr.trace, nil
}
func (i *testIterator2) Close() {
}
func newTestTraces(traceCount int) []testTrace {
traces := make([]testTrace, 0, traceCount)
for i := 0; i < traceCount; i++ {
traceID := test.ValidTraceID(nil)
if i%2 == 0 {
trace := MakeTraceWithCustomTags(traceID, "tempo-gateway", int64(i), true, true)
traces = append(traces, testTrace{traceID: traceID, trace: trace})
} else {
trace := MakeTraceWithCustomTags(traceID, "megaservice", int64(i), false, false)
traces = append(traces, testTrace{traceID: traceID, trace: trace})
}
}
sort.Slice(traces, func(i, j int) bool {
return bytes.Compare(traces[i].traceID, traces[j].traceID) == -1
})
return traces
}
var (
blockID = uuid.MustParse("6757b4d9-8d6b-4984-a2d7-8ef6294ca503")
)
func TestGenerateBlocks(t *testing.T) {
const (
traceCount = 10000
)
blockDir, ok := os.LookupEnv("TRACEQL_BLOCKDIR")
require.True(t, ok, "TRACEQL_BLOCKDIR env var must be set")
rawR, rawW, _, err := local.New(&local.Config{
Path: blockDir,
})
require.NoError(t, err)
r := backend.NewReader(rawR)
w := backend.NewWriter(rawW)
ctx := context.Background()
cfg := &common.BlockConfig{
BloomFP: 0.01,
BloomShardSizeBytes: 100 * 1024,
}
traces := newTestTraces(traceCount)
iter := &testIterator2{traces: traces}
meta := backend.NewBlockMeta(tenantID, blockID, VersionString, backend.EncNone, "")
meta.TotalObjects = int64(len(iter.traces))
_, err = CreateBlock(ctx, cfg, meta, iter, r, w)
require.NoError(t, err)
}
func MakeTraceWithCustomTags(traceID []byte, service string, intValue int64, isError bool, setHTTP500 bool) *tempopb.Trace {
now := time.Now()
traceID = test.ValidTraceID(traceID)
trace := &tempopb.Trace{
ResourceSpans: make([]*v1_trace.ResourceSpans, 0),
}
var attributes []*v1_common.KeyValue
attributes = append(attributes,
&v1_common.KeyValue{
Key: "stringTag",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_StringValue{StringValue: "value1"},
},
},
&v1_common.KeyValue{
Key: "intTag",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_IntValue{IntValue: intValue},
},
},
)
if setHTTP500 {
attributes = append(attributes,
&v1_common.KeyValue{
Key: "http.status_code",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_IntValue{IntValue: 500},
},
},
)
}
statusCode := v1_trace.Status_STATUS_CODE_OK
statusMsg := "OK"
if isError {
statusCode = v1_trace.Status_STATUS_CODE_ERROR
statusMsg = "Internal Error"
}
trace.ResourceSpans = append(trace.ResourceSpans, &v1_trace.ResourceSpans{
Resource: &v1_resource.Resource{
Attributes: []*v1_common.KeyValue{
{
Key: "service.name",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_StringValue{
StringValue: service,
},
},
},
{
Key: "other",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_StringValue{
StringValue: "other-value",
},
},
},
},
},
ScopeSpans: []*v1_trace.ScopeSpans{
{
Spans: []*v1_trace.Span{
{
Name: "test",
TraceId: traceID,
SpanId: make([]byte, 8),
ParentSpanId: make([]byte, 8),
Kind: v1_trace.Span_SPAN_KIND_CLIENT,
Status: &v1_trace.Status{
Code: statusCode,
Message: statusMsg,
},
StartTimeUnixNano: uint64(now.UnixNano()),
EndTimeUnixNano: uint64(now.Add(time.Second).UnixNano()),
Attributes: attributes,
DroppedLinksCount: rand.Uint32(),
DroppedAttributesCount: rand.Uint32(),
},
},
},
},
})
return trace
}
func BenchmarkMixTraceQL(b *testing.B) {
const query = "{ rootServiceName = `tempo-gateway` && (status = error || span.http.status_code = 500)}"
blockDir, ok := os.LookupEnv("TRACEQL_BLOCKDIR")
require.True(b, ok, "TRACEQL_BLOCKDIR env var must be set")
ctx := context.TODO()
r, _, _, err := local.New(&local.Config{Path: blockDir})
require.NoError(b, err)
rr := backend.NewReader(r)
meta, err := rr.BlockMeta(ctx, blockID, tenantID)
require.NoError(b, err)
opts := common.DefaultSearchOptions()
opts.StartPage = 3
opts.TotalPages = 2
block := newBackendBlock(meta, rr)
_, _, err = block.openForSearch(ctx, opts)
require.NoError(b, err)
b.ResetTimer()
bytesRead := 0
for i := 0; i < b.N; i++ {
e := traceql.NewEngine()
resp, err := e.ExecuteSearch(ctx, &tempopb.SearchRequest{Query: query}, traceql.NewSpansetFetcherWrapper(func(ctx context.Context, req traceql.FetchSpansRequest) (traceql.FetchSpansResponse, error) {
return block.Fetch(ctx, req, opts)
}))
require.NoError(b, err)
require.NotNil(b, resp)
// accumulate inspected bytes across iterations to report throughput metrics
bytesRead += int(resp.Metrics.InspectedBytes)
}
b.SetBytes(int64(bytesRead) / int64(b.N))
b.ReportMetric(float64(bytesRead)/float64(b.N)/1000.0/1000.0, "MB_io/op")
}
UPD: I'm wondering how to check if a block has dedicated columns at all.
we generally pull a block generated from internal tracing data which is why the benchmarks contain references to loki and tempo. these blocks generally cover a large range of organically created trace data. nice work generating a large block. likely some pattern of data internally at Grafana is causing the regression. maybe you should write some float value http status codes and see what happens?
the meta.json will list all dedicated columns in a block. i do think there's a bug with the current implementation. fixing it may also resolve the regression. not sure. the query
I believe it should be this:
I'm looking into the regression now.
Yeah, I have the same hypothesis. I've just been looking into how to correctly attach filtering by key.
UPD: Done.
Yep, I also think so. Roaming around the code base, I got the impression that dedicated columns aren't present by default. I feel I'm missing something.
What this PR does:
Below is my understanding of the current limitations. Please feel free to correct me if I’ve misunderstood or overlooked something.
Attributes of the same type are stored in the same column. For example, integers are stored in one column and floats in another.
Querying operates in two stages: predicates are first pushed down to the Parquet fetch layer to filter column values, and the engine then evaluates the remaining conditions on the values it receives.
The issue arises because predicates are generated based on the operand type. If an attribute is stored as a float but the operand is an integer, the predicate evaluates against the integers column instead of the floats column. This results in incorrect behavior.
Proposed Solution
The idea is to generate predicates for both integers and floats, allowing both columns to be scanned for the queried attribute.
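As a standalone illustration of the proposal (simplified types, not the PR's actual predicate code), the fix amounts to emitting predicates for both numeric columns instead of only the one matching the operand's type:

package main

import "fmt"

type column string

const (
	intColumn   column = "int"
	floatColumn column = "float"
)

// columnsBefore mirrors the current behavior described above: the
// operand's static type selects a single column, so values stored in
// the other numeric column are never scanned.
func columnsBefore(operandIsInt bool) []column {
	if operandIsInt {
		return []column{intColumn}
	}
	return []column{floatColumn}
}

// columnsAfter is the proposed behavior: scan both numeric columns
// for the queried attribute.
func columnsAfter() []column {
	return []column{intColumn, floatColumn}
}

func main() {
	fmt.Println(columnsBefore(true)) // [int]  -- misses float-stored values
	fmt.Println(columnsAfter())      // [int float]
}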
In this PR, I've created a proof-of-concept by copying the existing createAttributeIterator function to createAttributeIterator2. This duplication is intentional, as the original function is used in multiple places, and I want to avoid introducing unintended side effects until the approach is validated.
WDYT? :)
Which issue(s) this PR fixes:
Fixes #4332
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]