Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VAL_ parsing error #61

Closed
seobilee opened this issue Nov 6, 2023 · 24 comments
Closed

VAL_ parsing error #61

seobilee opened this issue Nov 6, 2023 · 24 comments
Labels
bug Something isn't working enhancement New feature or request
Milestone

Comments

@seobilee
Copy link

seobilee commented Nov 6, 2023

Hello,

I wanted to bring to your attention an issue I recently discovered with the signal value table in some dbc files.
It appears that the contents of the dbc file are not parsing properly, and they seem somewhat weird.
The syntax of the dbc file does not necessarily consist of just one line, which seems to be causing problems with line-level parsing with Regex.

Here's an example:

VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used  
Default value: 0x0";

I noticed that this dbc file is written over two lines, and it's actually being used. Unfortunately, this syntax is also parsed correctly in Vector CANdb++.

Could you please consider including improvements in this area in your plans? If not, should we provide guidance to writers on writing the dbc file accurately, line by line?

Your assistance in addressing this matter would be greatly appreciated.

Thank you for your time and attention.

@Adhara3 Adhara3 added bug Something isn't working enhancement New feature or request labels Nov 6, 2023
@Adhara3
Copy link
Collaborator

Adhara3 commented Nov 6, 2023

Hello @seobilee ,

you are right, unfortunately at the moment the library relies on the file being escaped in a specific way, we do expect a command to start at the beginning of the line. Multiline support is present, but only for some commands (e.g. comments)

Could you please consider including improvements in this area in your plans?

Unfortunately, this would not be an improvement, this would be a complete rewriting of the parsers. We do contribute to this project in our spare time, so I cannot promise this rewriting will happen anytime soon. To be fair, that would be right thing to do but it would take a lot of time. In the vast majority of cases line by line works fine.

Bottom line: thanks for spotting this, we'll keep you posted but realistically nothing will happen in the short term

Regards
A

@seobilee
Copy link
Author

seobilee commented Nov 7, 2023

Hello @Adhara3,

I want to express my gratitude for your prompt response.

In light of your guidance, I will make the necessary adjustments by writing the dbc file in one line to address the issue. Additionally, your information regarding INextLineProvider has been invaluable in this regard.

I also wanted to inform you that, due to the circumstances mentioned, we are presently utilizing tag version 1.2.0.

I am eager to contribute to a resolution code-wise and will explore opportunities to do so.

Once again, thank you for your assistance.

Warm regards
S

@Adhara3
Copy link
Collaborator

Adhara3 commented Nov 9, 2023

Hello @seobilee ,

sorry, could you please elaborate on why you are using version 1.2.0? Is it related to the issue above?
Or is there another reason?

Regards
A

@seobilee
Copy link
Author

Hello @Adhara3

Yes, the observed issue is indeed connected.

There has been a modification in the parser's behavior, starting from version 1.3.0 or later. The change primarily involves parsing using regular expressions. In version 1.2.0, a line is parsed even if it does not end with ";".

        private const string ValueTableLinkParsingRegex = @"VAL_\s+(\d+)\s+([a-zA-Z_][\w]*)\s+([a-zA-Z_][\w]*)\s*;";
        private const string ValueTableParsingRegex = @"VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_][\w]*))|([a-zA-Z_][\w]*))\s+((?:(?:-?\d+)\s+(?:""[^""]*"")\s+)*)\s*;";

To illustrate, consider the following example:

VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used  
Default value: 0x0";

In version 1.2.0, only the first line is parsed as a VALUE TABLE, with the second line being omitted. However, later versions cannot parse the above example, and there are some additional minor issues, such as quotes(") being included in the value of the Dictionary of the Value Table.

Regards
S

@Whitehouse112
Copy link
Contributor

Hi @seobilee,

you are right, since version v1.3.0 regex has been introduced in all parsers to have a more robust and efficient parsing, hence the problem you are facing. However several bugs have been fixed and new features have been added in the latest versions (quotes are no longer included in the dictionary value), if you have the opportunity to modify your dbc I suggest you update to the latest version.

Regards

@Uight
Copy link
Contributor

Uight commented Aug 11, 2024

I just did a quick check if you could add general multi line support for the parser (in this case firstly the val parser) without completly changing it up and with minimal effort yet probably not and amzing solution too.

i came up with the following (one a proof of concept) :

        private static readonly IReadOnlyCollection<string> keywords = new List<string> { "SG_", "CM_", "BO_", "BA_", "VAL_" }; //No complete list just a mock

        private static string TryparseNextLines(string currentLine, IParseFailureObserver observer, INextLineProvider nextLineProvider)
        {
            var stringBuilder = new StringBuilder();
            stringBuilder.AppendLine(currentLine);

            while (nextLineProvider.TryGetLine(out var nextLine))
            {
                var cleanLine = nextLine.Trim(' ');

                if (cleanLine == string.Empty)
                {
                    observer.CurrentLine++;
                    break;
                }

                if (keywords.Any(keyword => cleanLine.StartsWith(keyword)))
                {
                    break;
                }

                observer.CurrentLine++;
                stringBuilder.Append(" " + cleanLine);

                if (cleanLine.EndsWith(";"))
                {
                    break;
                }
            }
            return stringBuilder.ToString().Replace("\r\n", string.Empty) + "\r\n";
        }

this would read lines until either it sees an empty line or the line ends with ";" or the line is the next value (marked by a keyword).

if you the use

            if (!cleanLine.EndsWith(";"))
            {
                cleanLine = TryparseNextLines(cleanLine, m_observer, nextLineProvider);
            }

before the actual regex match in the ValueTableLineParser it combines the two rows into one for parsing.

this worked for me in some test however i notices that:

  1. Looking at the dbc docu file from vector present in the solution im not sure if a VAL_ even must end on ";". This is not explicitly marked as required.
    If it woudl be requried you could just use teh same logic as in the comment for reading until the nextline ends with ;
  2. Parsing the exact string that is given in this issue doesnt even work if in a single line. The problem seems to be that for some reason the line must end with at least one " " (space) before the ";"
    This can be checked by using the "ExplicitValTableIsAppliedTest" Testcase and removing the space there.
    Im not sure why this is because the regex actually specifies zero or more spaces before the ";" but zero doesnt work
  3. Even if i add the space so it gets parsed the Default value is then just added to the string like this:
image Im not sure if thats the intended way. Is the default value here just a comment or is it something that needs to be parsed seperatly?

@Whitehouse112 could you check on 1 and 2 when you find time?

@Uight
Copy link
Contributor

Uight commented Aug 27, 2024

so i looked into this further specially in the part that
VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used Default value: 0x0";
(notably in a single line) wouldn't get parsed.

Actually there is an exact testcase currently present that even checks that this doesnt get parserd in ValueTableLineParserTests.cs:
[TestCase("VAL_ 869 qGearboxOil 0 "Running" 1 "Idle";")]
public void ValueTableSyntaxErrorIsObserved(string line)

I dont know why this testcase is there as the syntax is valid and just missing a final " " which isnt required in my opinion.

this is the parsing string:
@"VAL_\s+(?:(?:(\d+)\s+([a-zA-Z_][\w]))|([a-zA-Z_][\w]))\s+((?:(?:-?\d+)\s+(?:""[^""]"")_____\s+____))\s*;";
I marked the problem by putting it in between lot of underscores. This would only match the final value table value if atleast one space is behind it.

You can fix that by just replacing it with \s* and that would only fail the one UnitTest listed above which in my opinion is wrong. All other tests are successfull.

For the test case i would change it to something like this as this in my opinion is a bit clearer than using the mock behavior strict.
But doing this for all test would be time consuming.

        public void ValueTableSyntaxErrorIsObserved(string line)
        {
            var observerMock = m_repository.Create<IParseFailureObserver>();
            var dbcBuilderMock = m_repository.Create<IDbcBuilder>();
            var nextLineProviderMock = m_repository.Create<INextLineProvider>();

            int counter = 0;
            observerMock.Setup(o => o.ValueTableSyntaxError())
            .Callback(() => counter++);

            var lineParser = new ValueTableLineParser(observerMock.Object);
            lineParser.TryParse(line, dbcBuilderMock.Object, nextLineProviderMock.Object);
            Assert.That(counter, Is.EqualTo(1));
        }

This part of the issure could be solved. The other part with the multiline support is just a matter of if the ";" is actually required and if it is then use the same code as in the comment multiline stuff or if it isnt using the code i posted before

@Adhara3, @Whitehouse112 or @EFeru could one of you check the thing i just posted about with the testcase actually being valid. In that case i could create a fix for that and change the testcase (move it to the valid testcases and use the \s* instead.
I could also do the multiline stuff in the same PR if someone tells me if the ";" is required

@Adhara3
Copy link
Collaborator

Adhara3 commented Aug 28, 2024

Ok, let's stop a second here.

VAL_ parsing without space before ";"

This is a bug in the regex due to the fact that VAL_ contains a list of items. @Whitehouse112 will provide a fix in a dedicated PR

Multiline support

History

In the past we allowed multiline support in an akward way but for a very specific reason and in a very specific scenario:

CM_ EV_ envVarData "We would like to format this comment, with a sort of bulletlist:
* Item 1
* Item 2
* Item 3

Ok, done!";

This comment is somehow formatted, the spacing and the newline have a meaning for the user to help visualize.
Moreover, a comment text is delimited by "" and what is inside is considered free text. Both CANdb++ and Kvaser db editor do preserve the formatting. This has a value we wanted to preserve.
Implementation wise, the solution was akward and we realized that the line by line parsing was weak.

Other examples

Current issue is about a TAG (VAL_) spreading multiple lines. It is not text spreading, it's the fact that a definition is splitted over multiple lines

VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used" 3 "Not Used  
Default value: 0x0";

Note that VAL_ does end with ; so a similar solution to comments could be implemented, but what about this?

BO_ 1160 DAS_steeringControl: 4 NEO
 SG_ DAS_steeringControlType : 23|2@0+ 
 (1,0) [0|0] ""  EPAS
 SG_ DAS_steeringControlChecksum : 31|8@0+ (1,0) [0|0] ""  EPAS, OTHER
 SG_ DAS_steeringControlCounter : 19|4@0+ (1,0) [0|0] ""  EPAS

The first SG_ definition spans multiple lines, but in this case no ; is required. So?

NB: Also note that both the above examples

  • are not explicitly mentioned as allowed or forbidden by the doc
  • are tolerated by CANdb++
  • are not tolerated by Kvaser that fails parsing

Even more

What about this, a single line containing 2 separate definitions?

VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

We currently skip the second one, both CANdb++ and Kvaser parse both correctly.
Having two items in a line is the flip side of the multiline support coin.

Bottom line

DBCParserLib is currently relying on some sort of line based formatting that is respected in most of the examples we found on the net, including the ones we use for unit tests. Moreover the spacing/new line is also applied by CANdb++ when it saves a dbc file. We are basically assuming that there is a value in the file formatting. The doc is not clear on this and even CANdb++ and Kvaser do behave differently.
If we want to provide a full multiline support we then need to change the way we parse the file, stop reading line by line, start reading by block, convert the parser in a sort of state machine, look for starting tags in any place, not only at the beginning of the line.

So current issue spots a potential parsing issue that is nevertheless coherent with the current library approach that requires a dbc file to be gracefully formatted. Good or bad, this is currently a requirement for us.
If we want to fix this, then we should fix it in a comprehensive way, I would avoid fixing each parser separately. The problem is not the single parser, the problem here is the overall parsing approach.

So to me this issue currently is a won't fix. If we want to support unformatted dbc files, then we need to (hopefully partially) rewrite the parsing approach not considering line as something valuable, meaningful or complete/single item.

Cheers
Andrea

Whitehouse112 added a commit to Whitehouse112/DbcParser that referenced this issue Aug 28, 2024
@Whitehouse112
Copy link
Contributor

Whitehouse112 commented Aug 28, 2024

Hi @Uight,
thanks for spotting the regex parsing issue, my focus was on the multiline issue and didn't spot the bug beside that. I've already submitted a PR:

  • VAL_ parsing regex has been modified to correctly read the last key-value pair as a separate matching group. The proposed change (replace \s+ with \s*) would cause the following line to be correctly matched, even if it's wrong:

    "VAL_ 134 TEST_BuckleSwitch 0 "Buckled " 1 " Unbuckle " 2 "Not Used"3 "Not Used_"

  • Added a test that address the above use case

  • Added a test to check the parsing of a single key-value pair value table since the last pair is now treated as a separate regex match group

  • Removed the failing test you spotted above: you are right about that, the final " "(space) is not required

Regarding ";": it's a required final char as stated in vector dbc file format documentation

image

Adhara3 added a commit that referenced this issue Aug 28, 2024
#61 updated VAL_ parsing regex to correctly handle last chars. Added …
@Uight
Copy link
Contributor

Uight commented Aug 28, 2024

@Adhara3 i would think that vector mostly scans for the keywords. A bit like what i tried with the comment from 2 weeks ago but not line based but char based;

@Uight
Copy link
Contributor

Uight commented Aug 28, 2024

@Adhara3 and @Whitehouse112

this is a 13 year old code lying around in one of our old testsoftwares in my company. I adjusted it for .net and to read the two dbc files taht are in this repo.
The read is pretty much done like this:

string fileContent = File.ReadAllText(filePath);
string[] keywords = new[] {
    "VERSION",
    "FILTER",

    "NS_DESC_",
    "NS_",

    "CM_",

    "BA_DEF_DEF_REL_",
    "BA_DEF_REL_",
    "BA_REL_",
    "BA_DEF_SGTYPE_",
    "BA_SGTYPE_",
    "BA_DEF_DEF_",
    "BA_DEF_",
    "BA_",

    "CAT_DEF_",
    "CAT_",

    "SGTYPE_VAL_",
    "SGTYPE_",

    "SIGTYPE_VALTYPE_",

    "VAL_TABLE_",
    "VAL_",

    "SIG_GROUP_",
    "SIG_VALTYPE_",
    "SIG_TYPE_REF_",

    "EV_DATA_",
    "ENVVAR_DATA_",

    "BO_TX_BU_",
    "BO_",

    "BU_SG_REL_",
    "BU_EV_REL_",
    "BU_BO_REL_",
    "BU_",

    "SG_MUL_VAL_",
    "SG_",

    "BS_",
};

string pattern = $@"(?={string.Join("|", keywords.Select(Regex.Escape))})";

string[] parts = Regex.Split(fileContent, pattern, RegexOptions.IgnoreCase);

List<string> resultLines = new List<string>();

foreach (string part in parts)
{
    resultLines.Add(part.Trim());
}

i dont like that it reads the whole file in one swoop but with the pcs that run modern .net i would expect an issue even with big files. in fact i tried it with an 2.5MB file that im not allowed to share and it takes 1,8seconds for just the line readings. Its seems it takes around 1s for 1MB but its not fully linear. For the tesla dbc it takes 50ms.

Another thing i dont really like is that you have to order the keywords.

If this could be done in with a file stream instead i think this would be a good option. You can then parse the "lines" with the current parsers.

Another thing is stuff like this:
CM_ SG_ 961 COUNTER_ALT "only increments on change";
which would parse as empty comment and then as signal which is total bullshit ;)
There must be a logik combining keywords with the ";" end of line parsing in vectors solution

@Adhara3
Copy link
Collaborator

Adhara3 commented Aug 28, 2024

There are tons of subtle problems with this approach, like the CM_ SG_ example you provided: basically the parser it's not just a matter of finding strings but find strings in a context, which is something Regen are not designed for. The SG_ part in your example is in the scope (context) of a CM_ so there is value in the order. Basically a state machine.

Loading the whole file should not be a problem, but you would lose the ability to work over a network. We switched to Stream so that you could parse a DBC while downloading it. We may say "who cares", anyway.

A

@Adhara3 Adhara3 added the wontfix This will not be worked on label Aug 29, 2024
@Adhara3
Copy link
Collaborator

Adhara3 commented Aug 29, 2024

Closing as won't fix now

@Uight
Copy link
Contributor

Uight commented Aug 29, 2024

@Adhara3 i would still consider doing some improvments to the code for this.

I did some stuff in: https://github.com/Uight/DbcParser/tree/TestBranchMultiLineSupport

In this case i moved the m_obersver.CurrentLine++ logic to the Nextline Provider an generally added some stuff in there.
Im not to sure about the virtual line stuff i did in there because that would still require that ";" alway is used as a termination but i could image it can appear in a comment but im not sure if EBNF allows that at all.

However moving the trim and m_obersver.CurrentLine++ logic to the nextline provider is a cleaner solution than now and i would still want to see if we could adjust the NextLine provider in a similar way to what i did in my branch (which was based partly on this: https://stackoverflow.com/questions/842465/reading-a-line-from-a-streamreader-without-consuming
I think some parsers could definitly profit from peaking the nextLine.
E.g if nextLine is empty the definition is always ending there. and if the NextLine starts with another definition you also know that the current definition is finished.
It doesnt mean that we would want to implement multiline support; or multi definitons per line support right away but i think it could be a preparation;

If using virtual lines like in my code this would actually be 1 line containing 2 virtual lines;
VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

This could be parsed by peaking nextline and checking if its starts with an definition identifier. This is based on that in EBNF you can not have an definition identifier at line start if its not a definition (at least thats what i think)
BO_ 1160 DAS_steeringControl: 4 NEO
SG_ DAS_steeringControlType : 23|2@0+
(1,0) [0|0] "" EPAS
SG_ DAS_steeringControlChecksum : 31|8@0+ (1,0) [0|0] "" EPAS, OTHER
SG_ DAS_steeringControlCounter : 19|4@0+ (1,0) [0|0] "" EPAS

The case below shows two things im not sure about:
Would the comment end at *Item2; in this case?
Would this comment fail parsing as second line of comment is starting with an identifier (if not this would mean comment is actually special case were you just read until ";"

CM_ EV_ envVarData "We would like to format this comment, with a sort of bulletlist:
SG_ Signal is a Signal definition

  • Item 2;
  • Item 3

Ok, done!";

@Adhara3
Copy link
Collaborator

Adhara3 commented Aug 29, 2024

I personally would not mess the code for a partial solution.
But, do whatever you like, after all I'm not the owner.

A

@Uight
Copy link
Contributor

Uight commented Aug 31, 2024

@Adhara3 i believe i could make it work as a general solution. At least i like a challenge ;)

https://github.com/Uight/DbcParser/tree/TestBranchMultiLineSupport

i did some stuff in this test branch moving multiline support fully into the NextLineProvider. I removed it from the comment. Also it supports multiple definitions per line.

Some tests fail atm:
image
These fail because they use a ArrayBasedNextline provider which are only defined in test and dont have the changes a made to the NextLineProvider.

image
This fails because its a strange testcase. Could only happen if CM_ is the very last entry. Easy enough to fix by just checking line end to be ";". However returning false would still be strange as the line is a comment after all just ill formated.

And then theres this:
image
fails because it expexts line breaks atm which i remove when concatinatin multilines.

Overall it would be easy to fix these problems (2nd i would change the testcase). As i also moved the string.Trim() options to the NextLineProvider you could remove it from all LineParsers or better replace it with something that replaces "\r\n" with " " in that way all parsers would have multiline support. In the comments if you want to keep "\r\n" in there just dont remove them.

@Whitehouse112 and @Adhara3,
ill add some of the cases Whitehouse112 mentioned in the comments above at some time. Are there other testcases you would like?

@Adhara3 Adhara3 removed the wontfix This will not be worked on label Aug 31, 2024
@Adhara3
Copy link
Collaborator

Adhara3 commented Aug 31, 2024

@Uight I have no problem with you fixing, my concern is timeline, milestone, small increment releases and above all, have the time to manage/review/organize changes.

We now have: ext mux, immutability (with API changes), advanced reading, several fixes, multiline support and probably even more, I do not remember. That is huge

All of that is great stuff but I can't handle all this burden alone being this a spare time collaboration.

So please, slow down.
Let's first release what we have now then we merge your multiline support if you will and release just that finally we create a branch for milesto API 2.0 in which I merge immutability, you put ext mux and then we see.

Can we agree on this?
Thanks
A

@Adhara3 Adhara3 added this to the 1.8 milestone Aug 31, 2024
@Uight
Copy link
Contributor

Uight commented Aug 31, 2024

@Adhara3 its all fine for me. You dont have to review changes right away i would also not have a problem having to merge my stuff a few times. My timeline for needing the message stuff is probably end of this year early next year anyway. And with the other stuff i just do it to get more into programming again and its alway interesting to build stuff on codebases your not used to.

Anyway:
i kind of finished my multiline branch. It successfully runs through all test including the new ones with the testcases you mentioned:

VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

BO_ 1160 DAS_steeringControl: 4 NEO
SG_ DAS_steeringControlType : 23|2@0+
(1,0) [0|0] "" EPAS
SG_ DAS_steeringControlChecksum : 31|8@0+ (1,0) [0|0] "" EPAS, OTHER
SG_ DAS_steeringControlCounter : 19|4@0+ (1,0) [0|0] "" EPAS

VAL_ 1160 DAS_steeringControlType 1 "ANGLE_CONTROL" 3 "DISABLED" 0 "NONE" 2 "RESERVED" ; VAL_ 1160 DAS_steeringAngleRequest 16384 "ZERO_ANGLE" ;

All of that gets parsed correctly. The string for the value table code is currently checked against:
Assert.That(dbc.Messages.First().Signals.First().ValueTableMap.Last().Value, Is.EqualTo("Not Used\r\nDefault value: 0x0"));

notice that the "\r\n" is still in there.
I can easily fix that by replacing var cleanLine = line.Trim(' '); in every parse with something to replace "\r\n" with " ". My line trimming is done in the NextLineProvider anyway.
However this causes problems with tests where there are leading " " which can not happen in the real code as the nextline provider allready cleans it up. I think i should adust these cases to not have leading " ";

But then theres are some really strange testcases (here in MessageLineParserTests.cs):

        [Test]
        public void OnlyPrefixIsIgnored()
        {
            var dbcBuilderMock = m_repository.Create<IDbcBuilder>();
            var messageLineParser = CreateParser();
            var nextLineProviderMock = m_repository.Create<INextLineProvider>();

            Assert.That(messageLineParser.TryParse("BO_ ", dbcBuilderMock.Object, nextLineProviderMock.Object), Is.False);
        }

i have no idea why this is not failing atm. The signal clearly starts with BO_ why is it not detected?
When i remove the trim and replace it with the "\r\n" replace to " " it fails which is completly logical to me.

As for the comment parsing which i did first i allready changed it to this: which would be right in my opinion

        [Test]
        // Should parse as it is a comment but should be observed as error
        // This however would be catched previously by the IgnoreLineParser
        public void OnlyPrefixIsIgnored()
        {
            var dbcBuilderMock = m_repository.Create<IDbcBuilder>();

            var counter = 0;
            var failureObserverMock = new Mock<IParseFailureObserver>();
            failureObserverMock
                .Setup(observer => observer.CommentSyntaxError())
                .Callback(() => counter++);

            var commentLineParser = new CommentLineParser(failureObserverMock.Object);

            var nextLineProviderMock = m_repository.Create<INextLineProvider>();

            Assert.That(commentLineParser.TryParse("CM_ ", dbcBuilderMock.Object, nextLineProviderMock.Object), Is.True);
            Assert.That(counter, Is.EqualTo(1));
        }

Edit: theres a cheeky space at the end of MessageLineStarter = "BO_ "; which causes the timed string to missmatch as the final " " is missing. However the test is still flawed a bit i think

Also more testcases might be appropriate and if you find time you can do a code review whenever you like.

@Adhara3
Copy link
Collaborator

Adhara3 commented Aug 31, 2024

Just a note: do not trust \r\n if you want to replace it, a file may have been saved with other newline (e.g Unix)

Cheers
A

@Uight
Copy link
Contributor

Uight commented Aug 31, 2024

@Adhara3 i was wanting to use ReplacelineEndings() but i cant as its .not available in .net standard and .net462.
but i would handle pretty much all according ro the remarks:

The list of recognized newline sequences is CR (U+000D), LF (U+000A), CRLF (U+000D U+000A), NEL (U+0085), LS (U+2028), FF (U+000C), and PS (U+2029). This list is given by the Unicode Standard, Sec. 5.8, Recommendation R4 and Table 5-2.

This method is guaranteed O(n * r) complexity, where n is the length of the input string, and where r is the length of replacementText.

Then i wanted to use a System.Environment.Newline but that would still work if the file was created on the same system. still searching atm. Also for a better way to do the netline parse check. there must be something faster that that.

btw i now changes all test and code and check against more dbc files from:
https://github.com/jamiejones85/DBC-files
the only fail is BU_ being empty in some cases which we have a ticket for.

Edit:
Did this now:

        // Sequence of return codes was taken from the internals of "String.ReplaceLineEndings" method.
        private const string NewLineCharsExceptLineFeed = "\r\f\u0085\u2028\u2029\n";
        private static readonly string pattern = $"[{Regex.Escape(NewLineCharsExceptLineFeed)}]+";

        public static string ReplaceNewlinesWithSpace(this string input)
        {
            // Would like to use "String.ReplaceLineEndings" but its unavailable because of the target frameworks
            // Feel free to optimate
            return Regex.Replace(input, pattern, " ");
        }

and added a test for using \n only which works

@Uight
Copy link
Contributor

Uight commented Sep 1, 2024

Got the multiline parser running with the method read to nextDefinition. (WIP)
Some stuff is unclear atm:

Is it valid for the a comment to have a line ending on ";"?

var dbcString = @"CM_ SG_ 75 channelName ""This is the first line;
this is the second line
this is the third line"";";

This would fail with the current parser but should be possible according to @Adhara3's comment where he stated
that ";" can occure anywhere and therefor not used as as definite syntax in a comment. If a comment can contain a ";"
why not at the end?

And then theres some additional cases i didnt bother to implement (yet):

var dbcString = @"CM_ SG_ 75 channelName ""This is the first line
this is the second line""; BO_ some correctly formatted message";

I assume this would be possible but very unlikely. (Atm if in multiline mode i dont check multidefinitions two)
This applies to everything so also to something like this:

var dbcString = @"VAL_ 134 TEST_BuckleSwitch 0 ""Buckled "" 1 "" Unbuckle "" 2 ""Not Used"" 3 ""Not Used
Default value: 0x0""; VAL_ 1160 DAS_steeringControlType 1 ""ANGLE_CONTROL"" 3 ""DISABLED"" 0 ""NONE"" 2 ""RESERVED"" ";

This cases could be handled but are they even allowed?

var dbcString = @"CM_ SG_ 75 channelName ""This is is a comment containing a ; char""; BO_ some correctly formatted message";

reading the first ; in the line and seeing that its not proceeded by a keyword i assume the whole line is a comment;

Some other querk of this solution is visible in this case:

var dbcString = @"BU_ Node1, Node1

BO_ some correctly formatted message";

This would be a parsing fail due to duplicated nodes in line one. However the observer would observe
the error in line two as the empty line is also read together with the first definition. (Needed for comments containing empty lines)
remark: This in my opinion can not be handled without removing the LineProvider in big parts and just giving full control to the individual parsers
or:
We could agree that an empty line always terminates a definition except for comments in which case i could handle that explicitly in the comment parser;

@Adhara3
Copy link
Collaborator

Adhara3 commented Sep 2, 2024

From the docs, this is comment definition

comment = 'CM_' (char_string |
         'BU_' node_name char_string |
         'BO_' message_id char_string |
         'SG_' message_id signal_name char_string |
         'EV_' env_var_name char_string)
         ';' ; 

So comment text is defined as a char_string which is defined as follows

char_string: an arbitrary string consisting of any printable characters except double hyphens ('"') and backslashes ('\'). Control
        characters like Line Feed, Horizontal Tab, etc. are tolerated,
        but their interpretation depends on the application.

The CANdb++ editor e.g. interprets the character combination ‘0x0D 0x0A’ as a Microsoft Windows specific newline
and displays this information accordingly in the comment dialog page of the objects. In contrast to this behavior in the
‘Unit’-property field of signals the newline couldn’t be interpreted meaningful. 

So any char seems to be tolerated except " and \, thus ; is valid (no limits on positioning are specified so I guess it could be at the end too)
Plese note that in the spec the ; MUST close a message tag

In my opinion, as already stated before, support multiline isn't the right definition of what we need, the right one would be parse treating newlines and indenting as a human only thing which would imply supporting multiline and multi-definitions per line (which is the other side of the coin).
To me any work that does not lead to these is quite worthless. After all we had just one mention about failing multiline definition and even going through all the dbcs in the repo didn't highlight a high occurrence rate, so the multiline definition is not more unlikely to happen that the multi def per line.

And when I say wothless I know that every improvement is welcome but, if the users pattern is:

  1. I see a failure in DbcParserLib
  2. I open the dbc with CANdb++
  3. I see there it works
  4. DbcParserLib is then bugged, I open an issue

then a whole core parser rewrite to have half of the problem fixed has a low the cost benefit ratio to me.

Current implementation is lacking a feature in a coherent way, which is bad but also good (the coherent part), so it could be listed among known issues knowing that the cases where this happen are quite few.
On the other hand we are supporting comment text the right way, even more, the Vector CANdb++ way, which is good.

A

@Uight
Copy link
Contributor

Uight commented Sep 2, 2024

@Adhara3 i have it running in https://github.com/Uight/DbcParser/tree/TestBranchMultiLineSupport
its can parse all cases that are listed in here. The observe reports right error lines. Pretty much its working.

No idea what the current performance is compared to the old code didnt check that. (i would guess bad by the number of string compared => maybe improvable?)

You should maybe check out the code at some point but my opinion would be to probably not use it. I think youll get it when you see the code. I mean its readable but the NextLineProvider is a complex mess. (it provides all the lines allready put together for the other parsers to use).
Atleast its pretty much only the NextLineProvider that had changes.
Im not sure if anybody wants something like this in a open source code. Its just not maintainable. Maybe theres ways to make it more readable? but im not sure.

If we decide to not use it we could move it to a stale branch or something or just delete it. Maybe keep the testcases around if someone wants to try it at some point. And then at that point remove it from 1.8.

Edit: Its a bit better now ;) i removed some stuff that wasnt needed anymore

Another Edit:
If we would use this i would also change

internal interface ILineParser
{
bool TryParse(string line, IDbcBuilder builder, INextLineProvider nextLineProvider);
}

to:

internal interface ILineParser
{
bool TryParse(string line); //Only check id the line should be parsed by this parser;
bool Parse((string line, IDbcBuilder builder) //Actually parses the line and returns false if not parsable (keep it similar so test changes arent to big)
}

This would allow to cleanup LineParsing even more as you could Trim() and remove newlines chars dependend on the parser that is going to parse it meaning that each line parser can fully focus on parsing the line.

@Adhara3
Copy link
Collaborator

Adhara3 commented Sep 8, 2024

Dear all,

I upgraded home page readme with specifically addressing this issue and there are ongoin discussions about how to improve parsing. I close this specific issue to avoid noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants