Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fuzzer for href setter and getter #657

Merged
merged 13 commits into from
May 27, 2024

Conversation

CarlosEduR
Copy link
Member

@CarlosEduR CarlosEduR commented May 15, 2024

In order to improve the line coverage for fuzz testing, add fuzzer for set_href and get_href

@anonrig
Copy link
Member

anonrig commented May 15, 2024

cc @lemire i have bad news and good news

@lemire lemire mentioned this pull request May 16, 2024
fuzz/href.cc Outdated
#include <memory>
#include <string>

#include "ada.cpp"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this these includes, we could add....

#define ADA_DEVELOPMENT_CHECKS 1

This would add additional checks.

@lemire
Copy link
Member

lemire commented May 16, 2024

@anonrig

Indeed, the fuzzer found a bug, here it is...

#659

It is easy to narrow it down, just follow this recipe:

https://gist.github.com/lemire/f501b3b2bb8c33673de4f0a0674a6112

Takes a minute!!!

Fuzzers are great.

@lemire
Copy link
Member

lemire commented May 16, 2024

See #662

@lemire
Copy link
Member

lemire commented May 16, 2024

I expect that I have a fix for the bug in question at #659

@lemire
Copy link
Member

lemire commented May 16, 2024

@CarlosEduR Please rebase. I have merged a bug fix.

@CarlosEduR
Copy link
Member Author

thanks @lemire!
I was initially thinking it was something with the .to_string(), but now I see it was with the .set_pathname().

@CarlosEduR CarlosEduR marked this pull request as ready for review May 16, 2024 22:48
fuzz/href.cc Outdated Show resolved Hide resolved
@CarlosEduR
Copy link
Member Author

See #662

@lemire about this comparative fuzzer, are you thinking about comparing the results of the get_href between the ada::url and the ada::url_aggregator in the fuzzer?

@lemire
Copy link
Member

lemire commented May 17, 2024

I am!!!!!!!!!

@CarlosEduR
Copy link
Member Author

I am!!!!!!!!!

@lemire assertion added!

fuzz/parse.cc Show resolved Hide resolved
fuzz/parse.cc Outdated
@@ -17,12 +17,23 @@ extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
auto parse_url = ada::parse<ada::url>(source);
auto parse_url_aggregator = ada::parse<ada::url_aggregator>(source);

if ((parse_url && !parse_url_aggregator) || (!parse_url && parse_url_aggregator)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using xor would make this a lot simpler:

parse_url ^ parse_url_aggregator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting... will update it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated a little bit more: parse_url.has_value() ^ parse_url_aggregator.has_value()
but seems it raised the abort function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you log?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

================== Job 0 exited with exit code 77 ============
/github/workspace/build-out/parse -dict=/github/workspace/build-out/url.dict -max_len=1024 -timeout=25 -rss_limit_mb=2560 -len_control=0 -seed=1337 -artifact_prefix=/tmp/tmplfz2au7i/ -max_total_time=300 -print_final_stats=1 /github/workspace/cifuzz-corpus/parse >fuzz-2.log 2>&1
Dictionary: 7 entries
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 1337
INFO: Loaded 1 modules   (12847 inline 8-bit counters): 12847 [0x5562cca46348, 0x5562cca49577), 
INFO: Loaded 1 PC tables (12847 PCs): 12847 [0x5562cca49578,0x5562cca7b868), 
INFO:    13333 files found in /github/workspace/cifuzz-corpus/parse
INFO: seed corpus: files: 13333 min: 1b max: 4394568b total: 16829628b rss: 43Mb
AddressSanitizer:DEADLYSIGNAL
=================================================================
==32==ERROR: AddressSanitizer: ABRT on unknown address 0x000000000020 (pc 0x7fd3f3cc[80](https://github.com/ada-url/ada/actions/runs/9230862431/job/25399824163#step:5:81)0b bp 0x7ffec9050880 sp 0x7ffec90504d0 T0)
SCARINESS: 10 (signal)
    #0 0x7fd3f3cc800b in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: 87b331c034a6458c64ce09c03939e947212e18ce)
    #1 0x7fd3f3ca7858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858) (BuildId: 87b331c034a6458c64ce09c03939e947212e18ce)
    #2 0x5562cc938973 in LLVMFuzzerTestOneInput /src/ada-url/fuzz/parse.cc:21:5
    #3 0x5562cc7468b0 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:614:13
    #4 0x5562cc7460d5 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:516:7
    #5 0x5562cc748062 in fuzzer::Fuzzer::ReadAndExecuteSeedCorpora(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile>>&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:[82](https://github.com/ada-url/ada/actions/runs/9230862431/job/25399824163#step:5:83)9:7
    #6 0x5562cc74[83](https://github.com/ada-url/ada/actions/runs/9230862431/job/25399824163#step:5:84)97 in fuzzer::Fuzzer::Loop(std::__Fuzzer::vector<fuzzer::SizedFile, std::__Fuzzer::allocator<fuzzer::SizedFile>>&) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:867:3
    #7 0x5562cc7369a6 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:914:6
    #8 0x5562cc762ed2 in main /src/llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #9 0x7fd3f3ca9082 in __libc_start_main (/lib/x[86](https://github.com/ada-url/ada/actions/runs/9230862431/job/25399824163#step:5:87)_64-linux-gnu/libc.so.6+0x24082) (BuildId: 87b331c034a6458c64ce09c03939e947212e18ce)
    #10 0x5562cc727b1d in _start (build-out/parse+0x93b1d)

DEDUP_TOKEN: raise--abort--LLVMFuzzerTestOneInput
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: ABRT (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) (BuildId: [87](https://github.com/ada-url/ada/actions/runs/9230862431/job/25399824163#step:5:88)b331c034a6458c64ce09c03939e947212e18ce) in raise
==32==ABORTING

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is relatively simple. Look at what you are feeding the parser:

"Ws:X\xff"

That's not a valid UTF-8 string. Our parsers assume valid UTF-8 inputs.

This breaks the contract: ada::parse<ada::url>(...) will accept it, while ada::parse<ada::url_aggregator>(...) will reject it.

It would be trivial to front-load the parsing with UTF-8 validation, but we specifically do not do it. So the result is effectively undefined when passing an invalid string (because we assume it does not happen).

You can check that the input is valid UTF-8 with a function like so:

bool validate(const char *buf, size_t len) {
  const uint8_t *data = reinterpret_cast<const uint8_t *>(buf);
  uint64_t pos = 0;
  uint32_t code_point = 0;
  while (pos < len) {
    uint64_t next_pos = pos + 16;
    if (next_pos <= len) { // if it is safe to read 16 more bytes, check that they are ascii
      uint64_t v1;
      std::memcpy(&v1, data + pos, sizeof(uint64_t));
      uint64_t v2;
      std::memcpy(&v2, data + pos + sizeof(uint64_t), sizeof(uint64_t));
      uint64_t v{v1 | v2};
      if ((v & 0x8080808080808080) == 0) {
        pos = next_pos;
        continue;
      }
    }
    unsigned char byte = data[pos];
    while (byte < 0b10000000) {
      if (++pos == len) { return true; }
      byte = data[pos];
    }

    if ((byte & 0b11100000) == 0b11000000) {
      next_pos = pos + 2;
      if (next_pos > len) { return false; }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) { return false; }
      code_point = (byte & 0b00011111) << 6 | (data[pos + 1] & 0b00111111);
      if ((code_point < 0x80) || (0x7ff < code_point)) { return false; }
    } else if ((byte & 0b11110000) == 0b11100000) {
      next_pos = pos + 3;
      if (next_pos > len) { return false; }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) { return false; }
      if ((data[pos + 2] & 0b11000000) != 0b10000000) { return false; }
      code_point = (byte & 0b00001111) << 12 |
                   (data[pos + 1] & 0b00111111) << 6 |
                   (data[pos + 2] & 0b00111111);
      if ((code_point < 0x800) || (0xffff < code_point) ||
          (0xd7ff < code_point && code_point < 0xe000)) {
        return false;
      }
    } else if ((byte & 0b11111000) == 0b11110000) { // 0b11110000
      next_pos = pos + 4;
      if (next_pos > len) { return false; }
      if ((data[pos + 1] & 0b11000000) != 0b10000000) { return false; }
      if ((data[pos + 2] & 0b11000000) != 0b10000000) { return false; }
      if ((data[pos + 3] & 0b11000000) != 0b10000000) { return false; }
      code_point =
          (byte & 0b00000111) << 18 | (data[pos + 1] & 0b00111111) << 12 |
          (data[pos + 2] & 0b00111111) << 6 | (data[pos + 3] & 0b00111111);
      if (code_point <= 0xffff || 0x10ffff < code_point) { return false; }
    } else {
      return false;
    }
    pos = next_pos;
  }
  return true;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, validating that we have UTF-8 is kind of a separate issue. So the fuzzer could just require the inputs to be valid UTF-8 by checking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, got it, thanks for the investigation and explanation @lemire.
Should I update the fuzzer by adding this function to validate the inputs and only compare the two variables once we have a valid UTF-8 string?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might give it a try!!!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, looks by adding the function to validate the string, the fuzzer is now passing. thanks @lemire @anonrig

@anonrig anonrig merged commit a404fd8 into ada-url:main May 27, 2024
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants