Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ambiguity between null strings and zero-length strings #264

Merged
merged 16 commits into from
Oct 6, 2022

Conversation

Gei0r
Copy link
Contributor

@Gei0r Gei0r commented Jun 10, 2022

fixes #263

test/test_empty_scalar.cpp Outdated Show resolved Hide resolved
@Gei0r
Copy link
Contributor Author

Gei0r commented Jun 16, 2022

I implemented option no. 2:

  • len == 0 && str == nullptr → This is a "null" string
  • len == 0 && str != nullptr → This is a zero-length, but non-null string

However you will need to take a thorough look, because I basically only looked at the points where tests failed. There may be some forgotten points.

Unfortunately, when we only store str == nullptr for null scalars, we lose the ability to get the scalars' locations.

@codecov
Copy link

codecov bot commented Jun 16, 2022

Codecov Report

Merging #264 (6009769) into master (d53b444) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   96.35%   96.38%   +0.03%     
==========================================
  Files          22       22              
  Lines        7991     8060      +69     
==========================================
+ Hits         7700     7769      +69     
  Misses        291      291              
Impacted Files Coverage Δ
src/c4/yml/emit.def.hpp 92.26% <100.00%> (ø)
src/c4/yml/node.hpp 97.85% <100.00%> (+0.06%) ⬆️
src/c4/yml/parse.cpp 95.75% <100.00%> (+0.05%) ⬆️
src/c4/yml/parse.hpp 100.00% <100.00%> (ø)
src/c4/yml/tree.hpp 98.97% <100.00%> (+0.06%) ⬆️
src/c4/yml/common.hpp 100.00% <0.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Gei0r Gei0r changed the title WIP: Fix 263 Fix 263 Jul 3, 2022
@Gei0r Gei0r marked this pull request as ready for review July 3, 2022 15:36
@Gei0r
Copy link
Contributor Author

Gei0r commented Jul 3, 2022

@biojppm Can you look into this?
I'm not sure, but since we would lose location information for null scalars, it might be a better option to invent a new flag for "key is null"/"value is null".

@biojppm
Copy link
Owner

biojppm commented Jul 4, 2022

I've finally had some occasion to look at this. I've refined the tests and the code for it to pass those tests. While doing this I've found a parser bug that was also fixed in this PR. I pushed those changes to your branch.

Basically an important point is that for any given string, (whether null or otherwise), the result of operator<< and operator= should be semantically equivalent, with the only difference being where that resulting val is located.

There is a thorny ambiguity with std::string because to_csubstr() will return a null csubstr if the input is empty. So this results in an ambiguity. I am inclined to rely in the small-string-optimization and return a nonnull pointing at the storage, but only if the SSO is mandated by the standard.

So for now I've left std::string out of it.

As for the location issue, I have yet to look at it. It is possible that we have to reconsider the whole approach, and adding a flag as you suggest is something that could help.

@biojppm biojppm changed the title Fix 263 Fix ambiguity between null strings and zero-length strings Jul 7, 2022
@biojppm
Copy link
Owner

biojppm commented Sep 27, 2022

@Gei0r sorry for the long delay in looking at this. I hope to be able to do it in the coming week or so.

@biojppm
Copy link
Owner

biojppm commented Sep 27, 2022

@Gei0r I've rebased on master (which has breaking changes). Let me know if I can push to the PR (ie to your remote branch).

Gei0r and others added 8 commits September 28, 2022 19:11
Re biojppm#263

To differentiate between "null" zero-length scalars, these are now stored in
different ways:
- null values are stored with str == nullptr. Unfortunately we lose the
  location information for these.
- zero-length strings are stored with (str != nullptr), either their location
  in the source buffer or the arena.
  If the arena is empty (nullptr) and a zero-length scalar is stored, some
  space is reserved for this string to have a non-nullptr str for this scalar.
  (However, the reserved space will not actually be used up by the scalar.)
@biojppm
Copy link
Owner

biojppm commented Sep 28, 2022

@Gei0r I went ahead and pushed to the PR. The handling of locations was improved such that when the original node is null, in most cases we get a location "close enough" to the original node. Only in some pathological cases does this yield to a null val.

It's not perfect but it's an improvement.

@Gei0r
Copy link
Contributor Author

Gei0r commented Sep 29, 2022

Yeah, likewise sorry for not being more engaged with this issue.
If you want, I can take a look today. Nice that you found a way to store the location for null values, that's definitely an improvement.

@biojppm
Copy link
Owner

biojppm commented Sep 29, 2022

The locations still need some more work; there is still a case where it goes null and didn't need to.

But right now I'm really bothered by what must be a compiler error in gcc's Release builds. Given this version of Tree::to_arena():

#define ______(id) \
    printf("here --- %s ---  a.len==%zu  a.str=%p(%zu)  (a.str==nullptr)==%d  (a.str!=nullptr)==%d\n", \
           #id, a.len, a.str, (intptr_t)a.str,                  \
           (a.str == nullptr), (a.str != nullptr))

csubstr Tree::to_arena(csubstr a)
{
    ______(0);
    substr rem(m_arena.sub(m_arena_pos));
    size_t num = to_chars(rem, a);
    ______(1);
    if(num > rem.len)
    {
        ______(2);
        rem = _grow_arena(num);
        num = to_chars(rem, a);
        RYML_ASSERT(num <= rem.len);
    }
    else if(num == 0u)
    {
        ______(3);
        if(a.str == nullptr)  // ??????  a null string must enter this branch!
        {
            ______(3.1);
            return csubstr{};
        }
        else if(m_arena.str == nullptr)
        {
            ______(3.2);
            // Arena is empty and we want to store a non-null
            // zero-length string.
            // Even though the string has zero length, we need
            // some "memory" to store a non-nullptr string
            rem = _grow_arena(1);
        }
    }
    ______(4);
    rem = _request_span(num);
    return rem;
}

... I'm getting wrong behavior in gcc Release builds (Debug builds are ok). If I pass an empty csubstr (where a.str == nullptr), gcc will not take the branch marked ______(3.1):

TEST(empty_scalar, gcc_error)
{
    Tree tree;
    csubstr nullstr = {};
    ASSERT_EQ(nullstr.str, nullptr);
    ASSERT_EQ(nullstr.len, 0);

    std::cout << "\nserializing with empty arena...\n";
    csubstr result = tree.to_arena(nullstr);
    EXPECT_EQ(result.str, nullptr); // fails!
    EXPECT_EQ(result.len, 0);

    std::cout << "\nserializing with nonempty arena...\n";
    result = tree.to_arena(nullstr);
    EXPECT_EQ(result.str, nullptr); // fails!
    EXPECT_EQ(result.len, 0);
}

This is the output I'm getting:

[----------] 1 test from empty_scalar
[ RUN      ] empty_scalar.gcc_error

serializing with empty arena...
here --- 0 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 1 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 3 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 3.2 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 4 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
/home/usr/proj/rapidyaml/test/test_empty_scalar.cpp:87: Failure
Expected equality of these values:
  result.str
    Which is: 0x555555685b70 pointing to "u<="
  nullptr
    Which is: (nullptr)

serializing with nonempty arena...
here --- 0 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 1 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 3 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
here --- 4 ---  a.len==0  a.str=(nil)(0)  (a.str==nullptr)==1  (a.str!=nullptr)==0
/home/usr/proj/rapidyaml/test/test_empty_scalar.cpp:91: Failure
Expected equality of these values:
  result.str
    Which is: 0x555555685b70 pointing to "u<="
  nullptr
    Which is: (nullptr)
[  FAILED  ] empty_scalar.gcc_error (0 ms)
[----------] 1 test from empty_scalar (0 ms total)

So despite the fact that the 3.1 branch condition a.str == nullptr is met (as the prints show), GCC refuses to take that branch (!). This occurs across several different gcc versions, as can be seen from the CI failures above.

The problem goes away in Debug builds, and everywhere with clang and msvc. So I think this is a GCC optimizer error.

@biojppm
Copy link
Owner

biojppm commented Sep 29, 2022

I reshuffled the branches, and it now works:

modified   src/c4/yml/tree.hpp
@@ -997,15 +997,19 @@ public:
      * @see alloc_arena() */
     csubstr to_arena(csubstr a)
     {
-        substr rem(m_arena.sub(m_arena_pos));
-        size_t num = to_chars(rem, a);
-        if(num > rem.len)
+        if(a.len > 0)
         {
-            rem = _grow_arena(num);
-            num = to_chars(rem, a);
-            RYML_ASSERT(num <= rem.len);
+            substr rem(m_arena.sub(m_arena_pos));
+            size_t num = to_chars(rem, a);
+            if(num > rem.len)
+            {
+                rem = _grow_arena(num);
+                num = to_chars(rem, a);
+                RYML_ASSERT(num <= rem.len);
+            }
+            return _request_span(num);
         }
-        else if(num == 0u)
+        else
         {
             if(a.str == nullptr)  // ??????  should enter this branch!
             {
@@ -1017,11 +1021,10 @@ public:
                 // zero-length string.
                 // Even though the string has zero length, we need
                 // some "memory" to store a non-nullptr string
-                rem = _grow_arena(1);
+                _grow_arena(1);
             }
+            return _request_span(0);
         }
-        rem = _request_span(num);
-        return rem;
     }

@biojppm
Copy link
Owner

biojppm commented Sep 29, 2022

Right, that's one out of the way. This one sucked.

@biojppm
Copy link
Owner

biojppm commented Sep 29, 2022

Three issues remain to be addressed:

  1. trim some more null cases from locations
  2. fix coverage misses
  3. what to do about to_csubstr(std::string const&) and to_substr(std::string &) when the input string is empty, and add tests to verify

I hope to do these in the coming days.

Regarding the std::string question, do you have any thoughts? Should it return a null or an empty csubstr when the input std::string is empty? (Remember that std::string uses SSO, and the standard mandates SSO; so even though the .size() of a std::string is 0, there is still a buffer inside the std::string)

I am inclined to make the resulting csubstr be null instead of empty:

  • it is consistent with most other cases
  • less chance of pointing at a dangling buffer if the std::string goes out of scope.

@Gei0r
Copy link
Contributor Author

Gei0r commented Sep 29, 2022

But right now I'm really bothered by what must be a compiler error in gcc's Release builds.

This is a little over my head, but I suspect this is because in in to_chars(), you pass a.str to memcpy() here.
Because calling memcpy() with an invalid pointer is undefined behavior (I think?), the compiler is allowed to assume a.str != nullptr and elide the branch.

Reminds me of this blog post by Raymond Chen.

@Gei0r
Copy link
Contributor Author

Gei0r commented Sep 29, 2022

Here's a short example of what I mean:

https://godbolt.org/z/KaYxa1qKG

As you can see, with -O3 gcc does not put the if expression, including the printf() call into the output. clang does it.

@Gei0r
Copy link
Contributor Author

Gei0r commented Sep 29, 2022

I am inclined to make the resulting csubstr be null instead of empty

What's important to me is that an empty (e.g. default-constructed) std::string is stored and serialized as an empty string scalar ('') and not a null scalar.

Semantically, I'd prefer to make a csubstr from such a std::string also non-null. That means there are no "null" std::strings, only empty scalars.

The thing with the dangling pointer if std::string goes out of scope is always there... Maybe leave it all behind and switch to Rust 😜


// See also:
// https://github.com/biojppm/rapidyaml/issues/263
// https://github.com/biojppm/rapidyaml/pulls/264
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

URL is wrong (pull instead of pulls)

@biojppm
Copy link
Owner

biojppm commented Sep 29, 2022

I suspect this is because in in to_chars(), you pass a.str to memcpy() here.

Kudos and thanks, that is exactly the problem. Eg, looking at cppreference, it does say that:

If either dest or src is an invalid or null pointer, the behavior is undefined, even if count is zero.

... which is exactly the case that I had: count was zero, so I assumed it was ok to still do the call to avoid the branch, while not aware I was incurring UB by doing so. I will need to check the library for this problem, as there are quite a few calls to memcpy() which may trigger this behavior.

@biojppm
Copy link
Owner

biojppm commented Sep 30, 2022

What's important to me is that an empty (e.g. default-constructed) std::string is stored and serialized as an empty string scalar ('') and not a null scalar.

I'm curious. Why is it important, what is the use-case warranting that distinction?

Semantically, I'd prefer to make a csubstr from such a std::string also non-null. That means there are no "null" std::strings, only empty scalars.

Wanting a default-constructed std::string to be empty effectively requires that to_csubstr() returns non-null for that string. But again, that begs the question above.

Consider this:

TEST(empty_scalar, std_string)
{
    std::string stdstr;
    csubstr stdss = to_csubstr(stdstr);
    csubstr nullss;
    ASSERT_EQ(stdss, nullptr);
    ASSERT_EQ(stdss.str, nullptr);
    ASSERT_EQ(stdss.len, 0u);
    ASSERT_EQ(nullss, nullptr);
    ASSERT_EQ(nullss.str, nullptr);
    ASSERT_EQ(nullss.len, 0u);
    Tree tree = parse_in_arena("{ser: {}, eq: {}}");
    tree["ser"]["stdstr"] << stdss;
    tree["ser"]["nullss"] << nullss;
    tree["eq"]["stdstr"] = stdss;
    tree["eq"]["nullss"] = nullss;
    EXPECT_EQ(emitrs_yaml<std::string>(tree),
             "ser:\n"
             "  stdstr: \n"
             "  nullss: \n"
             "eq:\n"
             "  stdstr: \n"
             "  nullss: \n"
              );
}

This is the current situation. Everything is null, including a default-constructed std::string. This is consistent and makes sense. However, if it goes your way it would be like this:

    EXPECT_EQ(emitrs_yaml<std::string>(tree),
             "ser:\n"
             "  stdstr: ''\n"
             "  nullss: \n"
             "eq:\n"
             "  stdstr: ''\n"
             "  nullss: \n"
              );

@Gei0r
Copy link
Contributor Author

Gei0r commented Sep 30, 2022

I think what you called "my way" is the correct way.

Everything is null, including a default-constructed std::string. This is consistent and makes sense.

But the yaml standard says:

Note that a null is different from an empty string.

So these asserts are wrong imo:

    ASSERT_EQ(stdss, nullptr);
    ASSERT_EQ(stdss.str, nullptr);

My use case:

The yaml I emit is read by a different program, which checks the datatype. Some fields must be strings. However, sometimes these strings are empty.
As I noted in my original issue, this was fine prior to 2707b25, because empty strings were serialized to ''. But I can't upgrade rapidyaml because of the change.

The way I see it, currently there is no way to emit an empty string using rapidyaml?

@biojppm
Copy link
Owner

biojppm commented Oct 1, 2022

The yaml I emit is read by a different program, which checks the datatype. Some fields must be strings. However, sometimes these strings are empty.

Understood. Let me think about it. I do agree with the rust note, but I want to give some thought to the several issues.

But whatever happens, whatever defaults end up on rapidyaml, you should be aware that you need not be constrained by those choices.

Eg, even if rapidyaml would have the incoming std::string be null after to_csubstr(), you can ensure that its serialization is not, contrary to what rapidyaml would choose.

For example, you could do something like this in your code (assuming you're using <<):

node << nonnull(str);

where nonnull could look like this:

struct nonnull {  std::string const& subject; };
NodeRef& operator<< (NodeRef node, nonnull nn)
{
    if(nn.subject.empty())
      node << csubstr(""); // use a non-empty string
    else
      node << nn.subject;
    return node;    
}

Of course, then you would need to remember to ensure use of nonnull everywhere in your serialization code where a std::string is used. If that is too much of a downside (if for example there are a lot of places where you would need to remember to use it), instead of using something like nonnull you could wrap << inside a function and specialize it for your types:

template<class T>
C4_ALWAYS_INLINE void myserialize(NodeRef &n, T const& var) { n << var; }
C4_ALWAYS_INLINE void myserialize(NodeRef &n, std::string const& var) { ... }

// then
myserialize(n, str);
myserialize(n, intvar);

Or you could choose to not use any of operator<< and do the serialization yourself and just set the result. tree.alloc_arena() may be helpful, or you may choose to have your own serialization buffer instead.

Or you could even ensure that ryml/std/string.hpp is not included, and provide instead your own to_csubstr(std::string const&) implementation.

I'm highlighting this because it is a conscious design decision for rapidyaml that although it may provide the basic facilities, it is not a framework forcing you to use its approach. It is not a pact with the devil; you are not selling your soul if you choose to use it. The point of rapidyaml is that it should be very easy (and fast) to do what you need, but if it isn't, you still have the freedom to do what's best for you.

And if there is something constraining you for which there is no recourse, I would consider it a design bug of rapidyaml.

@biojppm
Copy link
Owner

biojppm commented Oct 1, 2022

I checked and apparently since c++11 it is no longer undefined behavior to call std::string()[0]:

If pos == size(), a reference to the character with value CharT() (the null character) is returned.

For the first (non-const) version, the behavior is undefined if this character is modified to any value other than CharT() .

So I will change to_substr() and to_csubstr() to reflect this. That means that the behavior will be what you are looking for.

@biojppm biojppm merged commit a758fa7 into biojppm:master Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty strings are parsed as null
2 participants