Add a wasm browser based playground #41

Open
wants to merge 40 commits into
base: main
mingodad
Contributor

This is the first version of a WASM browser-based playground for lalr.

…w it only work on literals and only for ASCII
@cwbaker
Owner

cwbaker commented Jul 16, 2023

Thanks very much Domingo. There are lots of great changes here.

I had a go with the playground and it's amazing. Great work! Do you mind if I share the playground link with a few people?

Having seen the railroad diagram generator at https://www.bottlecaps.de/rr/ui I'm convinced that's a useful addition to lalrc. Cheers for pointing that out.

It's much easier for me, and you're more likely to get a prompt response, if I can deal with these queries and changes in smaller chunks. If you email small queries directly, instead of commenting on #11, I'll be able to respond faster. If you make smaller PRs, we can also get things merged or reviewed faster.

I've taken the changes that I could and merged them to main. I've rebased the remaining changes to the branch playground-2023-07-16 to hopefully make it convenient for you. But I'll also reply with my thoughts to the remaining commits here. Some of those changes I think are better placed in a separate repository with the playground itself rather than in lalr.

Thanks,
Charles

@@ -387,6 +387,7 @@ void Parser<Iterator, UserData, Char, Traits, Allocator>::parse( Iterator start,
const ParserSymbol* symbol = reinterpret_cast<const ParserSymbol*>( lexer_.symbol() );
while ( parse(symbol, lexer_.lexeme(), lexer_.line(), lexer_.column()) )
{
if(lexer_.full()) break;

Owner

Is this fixing a bug?

Contributor Author

Yes, there are some grammars that enter an endless loop because the lexer doesn't advance.
I don't know exactly which ones trigger the bug but you can try it with this script:

#!/bin/sh

basep=playground
checkGrammar() {
	echo "Now testing $1 $2"
	/usr/bin/time ./grammar_test-clang -g "$basep/$1" -i "$basep/$2"
}

checkGrammar json3.g test.json.txt
checkGrammar lua.g test.lua
checkGrammar carbon-lang.g prelude.carbon
checkGrammar postgresql-16.g test.sql
#checkGrammar cxx-parser.g test.cpp
checkGrammar lsl_ext.g test.lsl
checkGrammar bison.g carbon-lang.y
checkGrammar bison-bug.g carbon-lang.y
checkGrammar dparser.g test.dparser
checkGrammar parse_gen.g test.parse_gen
checkGrammar tameparser.g test.tameparser
checkGrammar javascript.g test.js
checkGrammar javascript-core.g test.js
checkGrammar cparser.g test.c
checkGrammar java11.g test.java
checkGrammar rust.g test.rs
checkGrammar go.g test.go
checkGrammar php-8.2.g test.php
checkGrammar gringo-ng.g test.clingo
checkGrammar ada-adayacc.g test.adb

Build script:

#!/bin/sh

umask 022

myflags="-O2 -g"
#myflags="-O2 -g -m32"
#myflags="-g"

clang-16-env clang++ \
	-std=c++17 $myflags -Wall -Wextra -Wno-unused-function -pedantic \
	-Isrc -DLALR_NO_THREADS \
	src/lalr/ErrorPolicy.cpp \
	src/lalr/Grammar.cpp \
	src/lalr/GrammarCompiler.cpp \
	src/lalr/GrammarGenerator.cpp \
	src/lalr/GrammarParser.cpp \
	src/lalr/GrammarState.cpp \
	src/lalr/GrammarSymbol.cpp \
	src/lalr/GrammarSymbolSet.cpp \
	src/lalr/GrammarTransition.cpp \
	src/lalr/RegexCompiler.cpp \
	src/lalr/RegexGenerator.cpp \
	src/lalr/RegexItem.cpp \
	src/lalr/RegexNode.cpp \
	src/lalr/RegexParser.cpp \
	src/lalr/RegexState.cpp \
	src/lalr/RegexSyntaxTree.cpp \
	src/lalr/RegexToken.cpp \
	src/lalr/lalr_examples/grammar_test.cpp \
	-o grammar_test-clang

grammar_test.cpp:


#include <stdio.h>
#include <stdlib.h>   // EXIT_SUCCESS, EXIT_FAILURE
#include <stdarg.h>
#include <string>
#include <vector>
#include <lalr/GrammarCompiler.hpp>
#include <lalr/Parser.hpp>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

static int errors_ = 0;

typedef unsigned char mychar_t;

static void show_error( const char* format, ... )
{
    ++errors_;
    va_list args;
    va_start( args, format );
    vfprintf( stderr, format, args );
    va_end( args );
}

int read_file(const char *fname, std::vector<mychar_t> &content)
{
        struct stat stat;
        int result = ::stat( fname, &stat );
        if ( result != 0 )
        {
            show_error( "Stat failed on '%s' - result=%d\n", fname, result );
            return EXIT_FAILURE;
        }

        FILE* file = fopen( fname, "rb" );
        if ( !file )
        {
            show_error( "Opening '%s' to read failed - errno=%d\n", fname, errno );
            return EXIT_FAILURE;
        }

        int size = stat.st_size;
        content.resize( size+1 );
        int read = int( fread(&content[0], sizeof(mychar_t), size, file) );
        fclose( file );
        file = nullptr;
        if ( read != size )
        {
            show_error( "Reading grammar from '%s' failed - read=%d\n", fname, int(read) );
            return EXIT_FAILURE;
        }
        content[size] = '\0';
	return EXIT_SUCCESS;
}

static clock_t start_time;
clock_t myShowDiffTime(const char *title)
{
    clock_t now = clock();
    clock_t diff = now - start_time;

    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("%s: Time taken %d seconds %d milliseconds\n", title, msec/1000, msec%1000);
    start_time = now;
    return now;
}

struct C_MultLineCommentLexer
{
	static lalr::PositionIterator<const mychar_t*> string_lexer( const lalr::PositionIterator<const mychar_t*>& begin,
							const lalr::PositionIterator<const mychar_t*>& end,
							std::basic_string<mychar_t>* lexeme,
							const void** /*symbol*/ )
	{
		LALR_ASSERT( lexeme );

		lexeme->clear();
                //printf("C_MultLineCommentLexer : %s\n", lexeme->c_str());

		bool done = false;
		lalr::PositionIterator<const mychar_t*> i = begin;
		while ( i != end && !done)
		{
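			if ( *i == '*' )
			{
				// Step past '*' and peek at the next character without consuming it,
				// so that "**/" still ends the comment on the second '*'.
				++i;
				if ( i != end && *i == '/' ) done = true;
			}
			else
			{
				++i;
			}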
		}
		if ( i != end )
		{
			LALR_ASSERT( *i == '/' );
			++i;
		}
		return i;
	}
};

struct AstUserDataDbg {
    int index;
    int stack_index;
    static int next_index;
    static int total;
    AstUserDataDbg():index(total++), stack_index(next_index++) {}
};
int AstUserDataDbg::next_index = 0;
int AstUserDataDbg::total = 0;


static bool astMakerDbg( AstUserDataDbg& result, const AstUserDataDbg* start, const lalr::ParserNode<mychar_t>* nodes, size_t length )
{
//    //printf("astMaker: %s\n", nodes[0].lexeme().c_str());
//    const char *lexstr = (length > 0 ? (const char *)nodes[0].lexeme().c_str() : "::lnull");
//    const char *idstr = (length > 0 ? nodes[0].symbol()->identifier : "::inull");
//    int line = (length > 0 ? nodes[0].line() : 0);
//    int column = (length > 0 ? nodes[0].column() : 0);
//    //const char *stateLabel = (length > 0 ? nodes[0].state()->label : "::inull");
//    printf("astMaker: %p\t%zd:%d:%d\t%p\t%zd\t->\t%s : %s :%d:%d\n", start, length,
//                length ? start->index : -1, length ? start->stack_index : -1,
//                nodes, length, idstr, lexstr, line, column);
    printf("----\n");
    for(size_t i=0; i< length; ++i)
        // %zu is the size_t conversion; %p expects void*, and %s expects char*.
        printf("%zu:%d\t%p\t%d:%d\t%p <:> %s <:> %s <:> %s <:> %d:%d\n", i, (int)nodes[i].symbol()->type,
                (const void*)(start+i), start[i].index, start[i].stack_index, (const void*)(nodes+i),
                nodes[i].symbol()->identifier, nodes[i].symbol()->lexeme,
                (const char*)nodes[i].lexeme().c_str(), nodes[i].line(), nodes[i].column());
    return true;
}

struct ParseTreeUserData {
    std::vector<ParseTreeUserData> children;
    const lalr::ParserSymbol *symbol;
    std::basic_string<mychar_t> lexeme; ///< The lexeme at this node (empty if this node's symbol is non-terminal).
    ParseTreeUserData():children(),symbol(nullptr) {}
};


static bool parsetreeMaker( ParseTreeUserData& result, const ParseTreeUserData* start, const lalr::ParserNode<mychar_t>* nodes, size_t length )
{
    if(length == 0) return false;
    result.symbol = nodes[length-1].state()->transitions->reduced_symbol;
    for(size_t i_node = 0; i_node < length; ++i_node)
    {
        const lalr::ParserNode<mychar_t>& the_node = nodes[i_node];
        switch(the_node.symbol()->type)
        {
            case lalr::SymbolType::SYMBOL_TERMINAL:
            {
                ParseTreeUserData& udt = result.children.emplace_back();
                udt.symbol = the_node.symbol();
                udt.lexeme = the_node.lexeme();
                //printf("TERMINAL: %s : %s\n", udt.symbol->identifier, udt.lexeme.c_str());
            }
            break;
            case lalr::SymbolType::SYMBOL_NON_TERMINAL:
            {
                if(the_node.symbol() == result.symbol)
                {
                    const ParseTreeUserData& startx = start[i_node];
                    for (std::vector<ParseTreeUserData>::const_iterator child = startx.children.begin(); child != startx.children.end(); ++child)
                    {
                        result.children.push_back( *child ); // copies: start is const, so std::move could not move here
                    }
                }
                else
                {
                    ParseTreeUserData& udt = result.children.emplace_back();
                    udt.symbol = the_node.symbol();
                    if(udt.symbol == start[i_node].symbol)
                    {
                        udt.children = start[i_node].children;
                    }
                    else
                        udt.children.push_back(start[i_node]); // copies: start is const, so std::move could not move here
                }
                //printf("NON_TERMINAL: %s\n", result.symbol->identifier);
            }
            break;
            default:
                //LALR_ASSERT( ?? );
                printf("Unexpected symbol %p\n", the_node.symbol());
        }
    }
    return true;
}

static void indent( int level )
{
    for ( int i = 0; i < level; ++i )
    {
        printf( " |" );
    }
}

static void print_parsetree( const ParseTreeUserData& ast, int level )
{
    if(ast.symbol)
    {
        indent( level );
        switch(ast.symbol->type)
        {
            case lalr::SymbolType::SYMBOL_TERMINAL:
                if(ast.lexeme.size())
                {
                    //indent( level -1);
                    printf("%s -> %s\n", ast.symbol->identifier, ast.lexeme.c_str());
                }
                break;
            case lalr::SymbolType::SYMBOL_NON_TERMINAL:
                //indent( level );
                printf("%s\n", ast.symbol->lexeme);
                break;
        }
    }

    for (std::vector<ParseTreeUserData>::const_iterator child = ast.children.begin(); child != ast.children.end(); ++child)
    {
        print_parsetree( *child, ast.symbol ? (level + 1) : level );
    }
}

#include <locale.h>

int main(int argc, char *argv[])
{
	const char *grammar_fn = nullptr;
	const char *input_fn = nullptr;
        bool dumpLexer = false;
        start_time = clock();

        setlocale(LC_NUMERIC, "");

	std::vector<char> grammar_txt;
	std::vector<mychar_t> input_txt;

	if ( argc < 2 )
	{
		printf( "%s -g|--grammar grammar_fname -i|--input input_fname -d|--dumpLex\n", argv[0] );
		printf( "\n" );
		return EXIT_FAILURE;
	}

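	int argi = 1;
	while ( argi < argc )
	{
		// Options that take a value need a following argument.
		bool has_value = argi + 1 < argc;
		if ( has_value && (strcmp(argv[argi], "-g") == 0 || strcmp(argv[argi], "--grammar") == 0) )
		{
		    grammar_fn = argv[argi + 1];
		    argi += 2;
		}
		else if ( has_value && (strcmp(argv[argi], "-i") == 0 || strcmp(argv[argi], "--input") == 0) )
		{
		    input_fn = argv[argi + 1];
		    argi += 2;
		}
		else if ( strcmp(argv[argi], "-d") == 0 || strcmp(argv[argi], "--dumpLex") == 0 )
		{
		    dumpLexer = true;
		    argi += 1;
		}
		else
		{
		    // Without this branch an unrecognized argument never advances argi
		    // and the loop spins forever.
		    show_error( "Unknown or incomplete option '%s'\n", argv[argi] );
		    return EXIT_FAILURE;
		}
	}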

	if(grammar_fn != nullptr)
	{
		int rc = read_file(grammar_fn, (std::vector<mychar_t>&)grammar_txt);
		if(rc != EXIT_SUCCESS) return rc;
                size_t grammar_txt_size = grammar_txt.size()-1; //-1 to account for the '\0' terminator
                myShowDiffTime("read grammar");
		printf("Grammar size = %d\n", (int)grammar_txt_size);
		lalr::GrammarCompiler compiler;
		lalr::ErrorPolicy error_policy;
		int errors = compiler.compile( &grammar_txt[0], &grammar_txt[0] + grammar_txt_size, &error_policy );
                myShowDiffTime("compile grammar");
		if(errors != 0)
		{
			printf("Error count = %d\n", errors);
			return EXIT_FAILURE;
		}
                compiler.showStats();
		if(input_fn != nullptr)
		{
			rc = read_file(input_fn, input_txt);
			if(rc != EXIT_SUCCESS) return rc;
                        size_t input_txt_size = input_txt.size()-1; //-1 to account for the '\0' terminator
                        myShowDiffTime("read input");
			printf("Input size = %d\n", (int)input_txt_size);
			lalr::ErrorPolicy error_policy_input;
                        lalr::Parser<const mychar_t*, ParseTreeUserData> parser( compiler.parser_state_machine(), &error_policy_input );
                        parser.set_default_action_handler(parsetreeMaker);
                        //lalr::Parser<const mychar_t*, AstUserDataDbg> parser( compiler.parser_state_machine(), &error_policy_input );
                        //parser.set_default_action_handler(astMakerDbg);
			//lalr::Parser<const mychar_t*, int> parser( compiler.parser_state_machine(), &error_policy_input );
                        parser.lexer_action_handlers()
                            ( "C_MultilineComment", &C_MultLineCommentLexer::string_lexer )
                            ;
                        if(dumpLexer) parser.dumpLex( &input_txt[0], &input_txt[0] + input_txt_size );
			else parser.parse( &input_txt[0], &input_txt[0] + input_txt_size );
                        myShowDiffTime("parse input");
			printf( "accepted = %d, full = %d\n", parser.accepted(),  parser.full());
                        if(parser.accepted() &&  parser.full())
                        {
                            print_parsetree( parser.user_data(), 0 );
                        }
		}
	}
	return EXIT_SUCCESS;
}
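
The endless loop described at the top of this reply (a parse loop whose lexer stops advancing) can be illustrated in isolation. The following is only a sketch: `ToyLexer` and `drive` are hypothetical names, not lalr classes, and the guard shown is a generic progress check rather than the `lexer_.full()` test added in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer; it is not lalr's Lexer class.
class ToyLexer
{
    std::string input_;
    std::size_t position_;

public:
    explicit ToyLexer( const std::string& input ) : input_( input ), position_( 0 ) {}
    bool full() const { return position_ >= input_.size(); }
    std::size_t position() const { return position_; }

    // Consumes one lower-case ASCII letter; any other character is left
    // unconsumed, which models a lexer that has stopped advancing.
    void advance()
    {
        if ( !full() && input_[position_] >= 'a' && input_[position_] <= 'z' )
        {
            ++position_;
        }
    }
};

// Runs the lexer to the end of input, or stops as soon as one call to
// advance() makes no progress. Returns the number of advance() calls made.
std::size_t drive( const std::string& input, bool* stalled )
{
    ToyLexer lexer( input );
    std::size_t iterations = 0;
    *stalled = false;
    while ( !lexer.full() )
    {
        const std::size_t before = lexer.position();
        lexer.advance();
        ++iterations;
        if ( lexer.position() == before )
        {
            *stalled = true; // no progress: bail out instead of looping forever
            break;
        }
    }
    return iterations;
}
```

With the input `"ab1c"` the loop stops on the third call and sets `stalled`, rather than spinning on the `'1'` that the toy lexer cannot consume.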

@mingodad
Contributor Author

Of course I don't mind you sharing the playground link; ideally it'll be moved to GitHub Pages.
I've now got close to a good parse tree dump; have another look at the playground and my last commit, mingodad@eb7ff4c.

I'm glad that we can join efforts to build an amazing tool that makes it easier to write, debug, and develop grammars.

Thank you again for your great work!

@cwbaker
Owner

cwbaker commented Jul 16, 2023

Actually I can't comment on individual commits from the PR so I'll just do it here:

Fix to detect identifiers referenced in rules but not defined:
This error check existed until it became possible to use tokens for precedence only. Let me go over what is happening here and make sure there aren't two conflicting use cases.
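
To make the two use cases concrete, here is a minimal sketch of such a check over a simplified grammar model. The `Rule` struct and `undefined_identifiers` function are hypothetical and do not reflect lalr's Grammar classes; the sketch assumes that `rhs` holds only identifiers (quoted literals are excluded) and that an identifier named purely for precedence is stored separately.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified grammar model; lalr's own classes differ.
struct Rule
{
    std::string lhs;               // non-terminal being defined
    std::vector<std::string> rhs;  // identifiers used in the production
    std::string precedence;        // identifier used for precedence only, or empty
};

// Returns identifiers that appear on a right-hand side but are never defined
// as the left-hand side of any rule. Precedence-only identifiers are not
// looked at, so they are never reported.
std::set<std::string> undefined_identifiers( const std::vector<Rule>& rules )
{
    std::set<std::string> defined;
    for ( const Rule& rule : rules )
    {
        defined.insert( rule.lhs );
    }

    std::set<std::string> undefined;
    for ( const Rule& rule : rules )
    {
        for ( const std::string& identifier : rule.rhs )
        {
            if ( defined.count(identifier) == 0 )
            {
                undefined.insert( identifier );
            }
        }
    }
    return undefined;
}
```

Under these assumptions a misspelled non-terminal is reported, while an identifier that exists only to carry precedence is not, which is the distinction the check has to preserve.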

Make possible to accept associativity/precedence syntax like bison/byacc:
Unless I've misunderstood, it's already possible to specify precedence without associativity using the %none directive. I don't want the %prec and %nonassoc keywords from Bison/YACC in lalr. Lalr is supposed to be different, and hopefully better, rather than the same.

Check if '%whitespace' directive is present in the grammar and if not…:
I'm not sure that it's an error to leave out the whitespace directive. Let me think about it for a while.

Add code to allow generate an EBNF for railroad diagram generation:
The railroad diagrams are great but code to generate them belongs outside of the library. I believe there is enough information available from Grammar to do that.
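
As a rough sketch of what generating the EBNF outside the library could look like, the function below renders productions as `name ::= a b | c` lines. It assumes the diagram generator accepts that W3C-style notation, and it works on a hypothetical `Productions` map instead of lalr's Grammar API, which is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical grammar model: each non-terminal maps to its alternatives,
// and each alternative is a sequence of already-formatted symbols.
using Alternative = std::vector<std::string>;
using Productions = std::map<std::string, std::vector<Alternative>>;

// Renders every production as one "name ::= a b | c" line.
std::string to_ebnf( const Productions& productions )
{
    std::string out;
    for ( const auto& entry : productions )
    {
        out += entry.first + " ::=";
        bool first = true;
        for ( const Alternative& alternative : entry.second )
        {
            out += first ? " " : " | ";
            first = false;
            for ( std::size_t i = 0; i < alternative.size(); ++i )
            {
                if ( i > 0 )
                {
                    out += ' ';
                }
                out += alternative[i];
            }
        }
        out += '\n';
    }
    return out;
}
```

A playground could fill `Productions` from whatever the library exposes and paste the resulting text into the generator, which keeps the formatting code out of lalr itself.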

Add method to dump the input from the lexer:
This should also be outside the library. I'd accept it as a debug feature to match what Parser::set_debug_enabled() does but not as its own method on Parser.

Add a method to show grammar compilation stats.:
This should also be outside of the library. I think the playground itself should implement this.

Add a naive implementation of "%case_insensitive" directive, right no…:
The case insensitive lexer will take some work to get in. I think I'm more interested in seeing a) being able to specify case (in)-sensitivity per token and b) what the syntax for that will look like in the grammar. I like the simplicity of lalr not having to deal with case sensitivity itself, i.e. "[Ss][Ee][Ll][Ee][Cc][Tt]".
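
The character-class spelling mentioned above can be produced mechanically outside the library. This is a small sketch, with `case_insensitive_pattern` as a hypothetical helper name; it handles ASCII letters only (in the default C locale) and does not escape regex metacharacters.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Expands each letter of a keyword into a two-character class,
// e.g. 's' becomes "[Ss]". Other characters are copied unchanged.
std::string case_insensitive_pattern( const std::string& keyword )
{
    std::string pattern;
    for ( unsigned char c : keyword )
    {
        if ( std::isalpha(c) )
        {
            pattern += '[';
            pattern += static_cast<char>( std::toupper(c) );
            pattern += static_cast<char>( std::tolower(c) );
            pattern += ']';
        }
        else
        {
            pattern += static_cast<char>( c );
        }
    }
    return pattern;
}
```

For example, `case_insensitive_pattern("select")` returns `[Ss][Ee][Ll][Ee][Cc][Tt]`, so a grammar can be preprocessed per token without the lexer knowing about case.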

Make trivial methods inline.:
This is okay, but please define the functions outside of the class and preserve any documentation comments. I like the classes to provide a concise summary of the API, and that gets lost when functions are defined within the class definition. Also preserve the one-class-per-file layout, e.g. RegexNodeLess should be in RegexNodeLess.hpp, not RegexNode.hpp.
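
The layout requested here might look like the following sketch. `Counter` is a hypothetical class used only for illustration and is not taken from lalr: the class body stays a short summary of the API, while the trivial methods are defined `inline` below it with their documentation comments kept.

```cpp
#include <cassert>

// Hypothetical class; the body lists the API without any definitions.
class Counter
{
    int count_;

public:
    Counter();
    int count() const;
    void increment();
};

/**
 * Construct a counter that starts at zero.
 */
inline Counter::Counter()
: count_( 0 )
{
}

/**
 * Get the current count.
 *
 * @return The number of times increment() has been called.
 */
inline int Counter::count() const
{
    return count_;
}

/**
 * Increase the count by one.
 */
inline void Counter::increment()
{
    ++count_;
}
```

Because the definitions are marked `inline`, they can live in the header after the class and be included from several translation units without violating the one-definition rule.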

@cwbaker
Owner

cwbaker commented Jul 16, 2023

Generally I think the playground directory should be a separate repository that uses a submodule or some dependency mechanism to bring in lalr. Then all of the output specific to the playground can go there too.

I like that because that keeps lalr as a smaller, simpler C++ library. I think that also frees you up to not depend on me for PRs and feedback in a lot of cases.

Thanks heaps,
Charles
