Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http_parser: missing the obs-text #938

Closed
kamyuentse opened this issue Mar 6, 2018 · 3 comments
Closed

http_parser: missing the obs-text #938

kamyuentse opened this issue Mar 6, 2018 · 3 comments
Assignees
Labels

Comments

@kamyuentse
Copy link

This is an awesome work, thanks a lot! I am playing with the code of this repo. But I am confused by the HTTP parser here. tempesta use SIMD to boost up the parser, the most important primitive here is the __tfw_strspn_simd, but I find a problem I am trying to run the following tests, according to the comments in the code, the tfw_match_ctext_vchar SHOULD match 0xff(ctext = .... | obs-text), while it doesn't. The reason here might the usage of _mm256_shuffle_epi8 when the most significant bit = 1, the shuffled result would be 0x00, so when the input char = 0b1xxxxxxx, we not accept it, which means we lost the obs-text?

void test0() {
    char buffer[256];
    int i;
    for (i=0; i< 256; i++)
        buffer[i] = 'A';

    size_t res = tfw_match_ctext_vchar(buffer, 256);

    printf("%d\n", res);
}

void test1() {
    char buffer[256];
    int i;
    for (i=0; i< 256; i++)
        buffer[i] = 0xff;

    size_t res = tfw_match_ctext_vchar(buffer, 256);

    printf("%d\n", res);
}

int main() {
    tfw_str_init_const();

    test0();
    test1();
}

Output:

256
0
@krizhanovsky krizhanovsky self-assigned this Mar 7, 2018
@krizhanovsky krizhanovsky added this to the 0.5 alpha milestone Mar 7, 2018
@krizhanovsky
Copy link
Contributor

Hi @kamyuentse ,

thanks for the report! I can confirm that there is a bug: the test aa34dc9 fails with:

[ 1260.905205] tfw_test: TEST_RUN(cstr, simd_match_ctext_vchar)
[ 1260.907304] tfw_test: /root/tempesta/tempesta_fw/t/unit/test_tfw_str.c:119: error: 
[ 1260.908574] tfw_test: FAIL:
[ 1260.909115] tfw_test:   test__cstr__simd_match_ctext_vchar()
[ 1260.909960] tfw_test:   EXPECT_TRUE((tfw_match_ctext_vchar(buf, 300) == 300)) => 0x0
[ 1260.911779] tfw_test: /root/tempesta/tempesta_fw/t/unit/test_tfw_str.c:123: error: 
[ 1260.912913] tfw_test: FAIL:
[ 1260.913624] tfw_test:   test__cstr__simd_match_ctext_vchar()
[ 1260.914610] tfw_test:   EXPECT_TRUE((tfw_match_ctext_vchar(buf, 300) == 23 + 0x7f)) => 0x0
[ 1260.917873] tfw_test: /root/tempesta/tempesta_fw/t/unit/test_tfw_str.c:125: error: 
[ 1260.919521] tfw_test: FAIL:
[ 1260.920570] tfw_test:   test__cstr__simd_match_ctext_vchar()
[ 1260.921969] tfw_test:   EXPECT_TRUE((tfw_match_ctext_vchar(buf, 300) == 23 + 0x1f)) => 0x0
[ 1260.925262] tfw_test: /root/tempesta/tempesta_fw/t/unit/test_tfw_str.c:127: error: 
[ 1260.927406] tfw_test: FAIL:
[ 1260.928565] tfw_test:   test__cstr__simd_match_ctext_vchar()
[ 1260.930113] tfw_test:   EXPECT_TRUE((tfw_match_ctext_vchar(buf, 300) == 23 + 0xa)) => 0x0
[ 1260.933639] tfw_test: /root/tempesta/tempesta_fw/t/unit/test_tfw_str.c:129: error: 
[ 1260.935244] tfw_test: FAIL:
[ 1260.936384] tfw_test:   test__cstr__simd_match_ctext_vchar()
[ 1260.937764] tfw_test:   EXPECT_TRUE((tfw_match_ctext_vchar(buf, 300) == 23 + 8)) => 0x0
[ 1260.939958] tfw_test: /root/tempesta/tempesta_fw/t/unit/test_tfw_str.c:131: error: 
[ 1260.941319] tfw_test: FAIL:
[ 1260.942042] tfw_test:   test__cstr__simd_match_ctext_vchar()
[ 1260.942966] tfw_test:   EXPECT_TRUE((tfw_match_ctext_vchar(buf, 300) == 23)) => 0x0

@kamyuentse
Copy link
Author

@krizhanovsky Thanks for your response, I am looking for a solution for this. Extend the original approach to match 0x00 - 0xFF seems impossible, but specified for HTTP parsing, because obs-text is only one range, we can combine _mm256_cmpgt_epi8 and _mm256_and_si256 to match it, finally, we use or to combine the result.

But, come up with a generalized strspn_simd which support 0x00 - 0xFF is best and valuable.

@krizhanovsky
Copy link
Contributor

@kamyuentse , thank you for working on the problem.

Yes this is fundamental problem of the algorithm. It was designed to solve URI processing issues and during the design I forgot about obs-text matching filelds. Actually there is only one matching function which must care about obs-text, tfw_match_ctext_vchar(), - all other matchers work fine. However, the function is used relatively frequently, e.g. for matching User-Agent headers (which can be kilobytes in size) or all other, unknown for Tempesta FW, headers.

Of course, we can just add a range check, e.g. compare all the vector bytes for signed < 0 or unsigned > 0x7f, but it adds additional vector instructions which hurts other matching functions. So we have implement tfw_match_ctext_vchar() separately from generic __tfw_strspn_simd().

Next, tfw_match_ctext_vchar() matches just 3 ranges x9, x20-x7e and x80-xff. Maybe CloudFlare's approach can be optimized to provide more speed closer to current tfw_match_ctext_vchar().

Moreover, we want to provide custom characters set for URI and HTTP headers, so which seems also require two function: the current __tfw_strspn_simd() accepting custom allowed alphabet with all characters less than 0xFF and the second one must work with full 0-256 range and typically just do double work of current __tfw_strspn_simd(). Unfortunately, I don't see any optimization opportunities for the last one since we have no true 32-byte shuffle in AVX2 and even AVX512...

@krizhanovsky krizhanovsky modified the milestones: 0.5 alpha, 0.6 KTLS Mar 10, 2018
@krizhanovsky krizhanovsky changed the title A question about the parser: missing the obs-text ? http_parser: missing the obs-text Mar 10, 2018
krizhanovsky added a commit to tempesta-tech/blog that referenced this issue Apr 1, 2018
krizhanovsky added a commit that referenced this issue Apr 4, 2018
krizhanovsky added a commit that referenced this issue Apr 4, 2018
krizhanovsky added a commit that referenced this issue Apr 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants