
backtrack stack overflow (current limit is 400) #64

Closed
yausteve-ibm opened this issue Oct 16, 2017 · 20 comments


@yausteve-ibm

Hi, Jamie,
I encountered the following error when loading MANIFEST:
Traceback (most recent call last):
File "/opt/log-analytics/bin/mytest", line 33, in
extract()
File "/opt/log-analytics/bin/mytest", line 31, in extract
r = engine.load_manifest("$sys/vmware.MANIFEST")
File "/opt/log-analytics/module/rosie.py", line 154, in load_manifest
retvals = self.rosie.get_retvals(r)
File "/opt/log-analytics/module/rosie.py", line 102, in get_retvals
return self._get_retvals(messages, self.rosie.rosieL_free_stringArray)
File "/opt/log-analytics/module/rosie.py", line 113, in _get_retvals
raise RuntimeError(retvals[1]) # exception indicating that the call failed
RuntimeError: src/core/engine.lua:121: backtrack stack overflow (current limit is 400)

It seems this happens when I have a large number of patterns defined in a number of rpl files.
Could you advise whether there is a way to work around this?
Appreciate your advice. Many thanks.
-Steve

@jamiejennings
Owner

Steve, the change you want is to a constant that limits the amount of backtracking that a single pattern is able to do.

I will create a v0.99k-mod5 containing a change of this value from 400 to 800 (later today when I get back to the office). Meanwhile, if you want to do it yourself, instructions are below.

I do want to understand, at some point, how your patterns combine to need more than 400 possible backtracking points. It's certainly possible with a lot of patterns combined into a grand "this or that or the next or the next..." pattern, and I'm interested in the use case.
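As a rough structural intuition (a simplified sketch, not lpeg's actual compilation strategy), a grand OR built by chaining `a / b / c / ...` nests one choice per alternative, and each nesting level can, in the worst case, correspond to one live backtrack entry:

```python
# Simplified structural model (NOT lpeg's real compiler): build a grand OR
# by repeatedly wrapping choice(previous, next), then measure how deeply
# the resulting tree nests. Each level is, worst case, one pending
# backtrack entry at match time.

def build_left_nested(names):
    tree = names[0]
    for name in names[1:]:
        tree = ("choice", tree, name)   # tree / name
    return tree

def depth(tree):
    # For a left-nested chain the deepest path is the left spine, so walk
    # it iteratively (a recursive version would overflow Python's own
    # stack at ~1200 levels, which is rather fitting).
    d = 0
    while not isinstance(tree, str):
        d += 1
        tree = tree[1]
    return d

pats = ["vmware.log%d" % i for i in range(1, 1201)]   # ~1200 alternatives
print(depth(build_left_nested(pats)))                 # 1199
```

This is only a model of the nesting, not of lpeg's VM, but it shows why the required stack limit scales with the number of chained alternatives.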

If you want to make the change yourself, you can edit as shown below and then follow the 2 steps below. Here, I've changed it from 400 to 800 in rosie-pattern-language/submodules/rosie-lpeg/src/lptypes.h:

$ git --no-pager diff lptypes.h 
diff --git a/src/lptypes.h b/src/lptypes.h
index 5eb7987..b031cbe 100644
--- a/src/lptypes.h
+++ b/src/lptypes.h
@@ -51,7 +51,7 @@
 
 /* default maximum size for call/backtrack stack */
 #if !defined(MAXBACK)
-#define MAXBACK         400
+#define MAXBACK         800
 #endif
 
$ 

After editing as shown above, and while in the directory containing lptypes.h, run make and then copy the resulting lpeg.so to rosie's lib directory. (Substitute linux for macosx in the make command if needed.)

Note that running rosie's make clean and then make will reset all the submodules to the official versions, so you don't want to use those here.

$ make macosx
make lpeg.so "DLLFLAGS = -bundle -undefined dynamic_lookup"
gcc -Wall -Wextra -pedantic -Waggregate-return -Wcast-align -Wcast-qual -Wdisabled-optimization -Wpointer-arith -Wshadow -Wsign-compare -Wundef -Wwrite-strings -Wbad-function-cast -Wdeclaration-after-statement -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wunreachable-code  -O2 -std=c99 -I../lua//include -fPIC   -c -o lpvm.o lpvm.c
gcc -Wall -Wextra -pedantic -Waggregate-return -Wcast-align -Wcast-qual -Wdisabled-optimization -Wpointer-arith -Wshadow -Wsign-compare -Wundef -Wwrite-strings -Wbad-function-cast -Wdeclaration-after-statement -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wunreachable-code  -O2 -std=c99 -I../lua//include -fPIC   -c -o lpcap.o lpcap.c
gcc -Wall -Wextra -pedantic -Waggregate-return -Wcast-align -Wcast-qual -Wdisabled-optimization -Wpointer-arith -Wshadow -Wsign-compare -Wundef -Wwrite-strings -Wbad-function-cast -Wdeclaration-after-statement -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wunreachable-code  -O2 -std=c99 -I../lua//include -fPIC   -c -o lptree.o lptree.c
gcc -Wall -Wextra -pedantic -Waggregate-return -Wcast-align -Wcast-qual -Wdisabled-optimization -Wpointer-arith -Wshadow -Wsign-compare -Wundef -Wwrite-strings -Wbad-function-cast -Wdeclaration-after-statement -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wunreachable-code  -O2 -std=c99 -I../lua//include -fPIC   -c -o lpcode.o lpcode.c
env gcc -bundle -undefined dynamic_lookup rcap.o rbuf.o lpvm.o lpcap.o lptree.o lpcode.o -o lpeg.so
$ cp lpeg.so ../../../lib/

@jamiejennings
Owner

Ok, I managed to create release v0.99k-mod5 from my remote location. 😄

I'm going to close this issue for now, but of course let me know if you have any problems with the fix. Btw, to upgrade, you'll need to:

git checkout v0.99k-mod5
make clean
make

@yausteve-ibm
Author

Hi, Jamie,

Thank you very much for your prompt response and creating v0.99k-mod5 for me.

After the limit was raised, I ran into another problem further down.

The following are in the MANIFEST:
$sys/rpl/vmware.1.rpl
$sys/rpl/vmware.2.rpl
$sys/rpl/vmware.3.rpl
$sys/rpl/vmware.4.rpl
$sys/rpl/vmware.5.rpl
$sys/rpl/vmware.6.rpl
$sys/rpl/vmware.7.rpl
$sys/rpl/vmware.8.rpl
$sys/rpl/vmware.9.rpl
$sys/rpl/vmware.10.rpl
$sys/rpl/vmware.11.rpl
$sys/rpl/vmware.12.rpl
$sys/rpl/vmware.rpl

Each of the vmware.*.rpl files contains the following:

  1. Definition of 100 patterns
    e.g. vmware.log1=
  2. A 'matchall' pattern to match all the 100 patterns
    vmware1.matchall = vmware.log1 / vmware.log2 / vmware.log3 / ... / vmware.log98 / vmware.log99 / vmware.log100

The vmware.rpl contains the following:
vmware.matchall = vmware1.matchall / vmware2.matchall / vmware3.matchall / vmware4.matchall / vmware5.matchall / vmware6.matchall / vmware7.matchall / vmware8.matchall / vmware9.matchall / vmware10.matchall / vmware11.matchall / vmware12.matchall

When the limit was 400, I was getting the following:

Traceback (most recent call last):
File "./mytest", line 23, in
r = engine.load_manifest("$sys/vmware.MANIFEST")
File "/opt/log-analytics/module/rosie.py", line 154, in load_manifest
retvals = self.rosie.get_retvals(r)
File "/opt/log-analytics/module/rosie.py", line 102, in get_retvals
return self._get_retvals(messages, self.rosie.rosieL_free_stringArray)
File "/opt/log-analytics/module/rosie.py", line 113, in _get_retvals
raise RuntimeError(retvals[1]) # exception indicating that the call failed
RuntimeError: src/core/engine.lua:121: backtrack stack overflow (current limit is 400)

After the limit has been raised to 800, I am now getting the following:
Traceback (most recent call last):
File "./mytest", line 23, in
r = engine.load_manifest("$sys/vmware.MANIFEST")
File "/opt/log-analytics/module/rosie.py", line 154, in load_manifest
retvals = self.rosie.get_retvals(r)
File "/opt/log-analytics/module/rosie.py", line 102, in get_retvals
return self._get_retvals(messages, self.rosie.rosieL_free_stringArray)
File "/opt/log-analytics/module/rosie.py", line 113, in _get_retvals
raise RuntimeError(retvals[1]) # exception indicating that the call failed
RuntimeError: src/core/compile.lua:528: Internal error (compiler): src/core/compile.lua:310: too many Lua values in pattern

If I remove the very last entry with 'vmware.rpl', the MANIFEST loads ok.

Appreciate your advice on whether this can be worked around. Many thanks.

The use case is to extract unique 'message types' (messages with the same pattern belong to a message type) from a log file, so that a specific action can be coded for each message type. The above limit would probably not be encountered for most log files, but in this case we have a log which is relatively complex (rich in patterns), and that causes this problem.
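The message-type extraction described above can be sketched independently of Rosie: try each pattern in order and bucket lines by the first one that matches. Here Python regexes stand in for the RPL patterns (the names and patterns are illustrative only):

```python
import re
from collections import defaultdict

# Illustrative stand-ins for RPL patterns; in the real setup these would
# come from the vmware.*.rpl definitions.
message_types = {
    "login":  re.compile(r"User \S+ logged in"),
    "logout": re.compile(r"User \S+ logged out"),
}

def classify(lines):
    """Bucket (line_number, line) pairs by the first matching type."""
    buckets = defaultdict(list)
    for n, line in enumerate(lines, 1):
        for type_name, pat in message_types.items():
            if pat.search(line):
                buckets[type_name].append((n, line))
                break
        else:
            buckets["nomatch"].append((n, line))
    return buckets

sample = ["User alice logged in", "kernel: oops", "User alice logged out"]
b = classify(sample)
print(sorted(b))  # ['login', 'logout', 'nomatch']
```

Once lines are bucketed by type, a per-type action is a simple dictionary dispatch.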

-Steve

@jamiejennings
Owner

So the final pattern that is used for matching is an OR of around 1200 smaller (but nontrivial) patterns? I think I see what's happening internally to cause the "too many Lua values" error in that case. I'll work up a fix.

@jamiejennings jamiejennings reopened this Oct 18, 2017
@jamiejennings
Owner

Steve, I have not been able to replicate this, even by generating 16,000 patterns, each of which references the previous one. However, it's possible that I am missing something about your use case.

One thought: Is it possible there is a typo that is creating a circular set of definitions? Rosie v1.0.0 detects such situations, but v0.99 does not. In theory, that could cause this error.
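The circular-definition hypothesis can be checked mechanically by treating the RPL definitions as a graph and running a depth-first cycle search. A minimal sketch (extracting the name references from real .rpl files is left out):

```python
def find_cycle(refs):
    """refs: dict mapping pattern name -> set of names it references.
    Returns one cycle as a list (first element repeated at the end),
    or None if the definitions are acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {name: WHITE for name in refs}
    path = []

    def visit(name):
        color[name] = GRAY
        path.append(name)
        for dep in refs.get(name, ()):
            if color.get(dep, WHITE) == GRAY:      # back edge: a cycle
                return path[path.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE and dep in refs:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[name] = BLACK
        return None

    for name in refs:
        if color[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None

defs = {"a": {"b"}, "b": {"c"}, "c": {"a"}}        # a -> b -> c -> a
print(find_cycle(defs))                            # ['a', 'b', 'c', 'a']
```

A typo like `vmware3.matchall` accidentally referencing `vmware.matchall` would show up as such a cycle.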

Another thought: When the python module for Rosie v1.0.0 is released, will you consider moving to that? There will be two kinds of changes you will have to make:

  1. Pattern names cannot contain a dot in them (but they can be organized into packages)
  2. The output JSON format is a little different (see, e.g. the new Rosie blog)

There are many benefits, though, including built-in unit tests, better tracing, and a significant increase in speed. Not to mention a lot of new features. And, you'll only have to do those changes (above) once: I designed v1.0.0 to keep a stable RPL syntax and data format even as all the planned new features are added.

@yausteve-ibm
Author

Hi, Jamie,
Sorry for the delay in replying; I was busy working around the load-manifest problem. In the end, I managed to get around it by reducing the number of patterns to around 600 as well as bumping up the backtrack limit to 1200.
Just in case you still want to look into the "too many Lua values in pattern" symptom I reported, I have created a testcase for you to reproduce the problem (see below).
Thanks.
-Steve

NOTE: the backtrack limit is at 800 when this testcase is run.
The following script will generate the required rpl files (and put them in ROSIE_HOME/rpl):
$ cat gentestrpl.sh
#!/bin/bash
pattern="datetime.shortdate datetime.simple_time basic.network_patterns basic.punctuation basic.punctuation common.number basic.punctuation basic.punctuation basic.punctuation common.number basic.punctuation common.identifier_not_word basic.punctuation"

typeset -i i=0
typeset -i j=0
typeset -i k=0
typeset -i limit=100
RPLDIR=${ROSIE_HOME}/rpl
matchall=''
matchall2=''
while [ $i -lt 1105 ]
do
    if [ $j -eq 0 ]; then
        k=$k+1
        >tempfile
    fi
    i=$i+1
    j=$j+1
    echo vmware.string$i='"'This is a string'"' >> tempfile
    echo vmware.log$i="{ ${pattern} vmware.string$i } !." >> tempfile
    if [ $j -eq 1 ]; then
        matchall=vmware.log$i
    else
        matchall=${matchall}" / "vmware.log$i
    fi
    if [ $j -eq ${limit} ]; then
        echo vmware$k.matchall=${matchall} >> tempfile
        cp tempfile ${RPLDIR}/vmware.$k.rpl
        if [ $k -eq 1 ]; then
            matchall2=vmware$k.matchall
        else
            matchall2=${matchall2}" / "vmware$k.matchall
        fi
        matchall=""
        j=0
    fi
done
if [ $j -ne 0 ]; then
    echo vmware$k.matchall=${matchall} >> tempfile
    cp tempfile ${RPLDIR}/vmware.$k.rpl
fi
echo vmware.matchall=${matchall2} > ${RPLDIR}/vmware.rpl

The following is the Python testcase (to be placed in ${ROSIE_HOME}/ffi/samples/python):
$ cat testcase
#!/usr/bin/env python
import os
import rosie
ROSIE_HOME = os.getenv("ROSIE_HOME")
Rosie = rosie.initialize(ROSIE_HOME, ROSIE_HOME + "/ffi/librosie/librosie.so")
engine = Rosie.engine()
r = engine.load_manifest("$sys/MANIFEST")

The following are added to ${ROSIE_HOME}/MANIFEST:
$sys/rpl/vmware.1.rpl
$sys/rpl/vmware.2.rpl
$sys/rpl/vmware.3.rpl
$sys/rpl/vmware.4.rpl
$sys/rpl/vmware.5.rpl
$sys/rpl/vmware.6.rpl
$sys/rpl/vmware.7.rpl
$sys/rpl/vmware.8.rpl
$sys/rpl/vmware.9.rpl
$sys/rpl/vmware.10.rpl
$sys/rpl/vmware.11.rpl
$sys/rpl/vmware.rpl

To reproduce the problem,

  1. set ROSIE_HOME
  2. execute $ROSIE_HOME/testcase
    [root@localhost python]# ./testcase
    Traceback (most recent call last):
    File "./testcase", line 7, in
    r = engine.load_manifest("$sys/MANIFEST")
    File "/opt/log-analytics/rosie-pattern-language/ffi/samples/python/rosie.py", line 154, in load_manifest
    retvals = self.rosie.get_retvals(r)
    File "/opt/log-analytics/rosie-pattern-language/ffi/samples/python/rosie.py", line 102, in get_retvals
    return self._get_retvals(messages, self.rosie.rosieL_free_stringArray)
    File "/opt/log-analytics/rosie-pattern-language/ffi/samples/python/rosie.py", line 113, in _get_retvals
    raise RuntimeError(retvals[1]) # exception indicating that the call failed
    RuntimeError: src/core/compile.lua:528: Internal error (compiler): src/core/compile.lua:310: too many Lua values in pattern

@yausteve-ibm
Author

By the way, given all the goodies you have mentioned, I am looking forward to migrating to Rosie v1.0.0 when it's released.
-Steve

@jamiejennings
Owner

Steve, thanks for the script! It is exactly what I needed to reproduce the problem. I converted your patterns to Rosie v1.0.0 to confirm that the problem exists there as well, and I fixed it. The fix is in 1.0.0-alpha-3.

Backporting the fix to v0.99k is possible, but will require a little work. I should be able to get to it this weekend.

@jamiejennings
Owner

FYI, in the process of replicating the bug, I updated your pattern to Rosie 1.0.0, and made one simplifying change. The result was this script that writes to stdout:

#!/bin/bash
pattern="date.us_long time.simple net.any {[:punct:]{2} num.any [:punct:]{2}} {[:punct:] num.any [:punct:]} id.any [:punct:]"

echo "package vmware"
echo "import date, time, net, num, id" 
echo "alias common_fields=${pattern}"

typeset -i i=0
allstrings=''

while [ $i -lt 1105 ]
do
    i=$i+1
    echo string$i='"'This is a string'"'
    if [ $i -eq 1 ]; then
        allstrings=string${i}
    else
        allstrings=${allstrings}" / "string${i}
    fi
done
echo "alias allstrings=( ${allstrings} ) "
echo "matchall=common_fields allstrings $"

And an example of using it: (note how fast it compiles in Rosie 1.0.0)

$ ./genalternative.sh >aa.rpl
$ cat sampledata 
Oct 23 1:02:19 1.2.3.4 [[1234]] -1- foo_1 * This is a string
Oct 23 1:11:48 1.2.3.4 [[1234]] -2- bar_2 ! This is a string
Oct 23 1:15:59 1.2.3.4 [[1234]] -3- baz_3 % This is a string
$ rosie -f aa.rpl match vmware.matchall sampledata 
Oct 23 1 02 19 1.2.3.4 1234 1 foo_1 This is a string
Oct 23 1 11 48 1.2.3.4 1234 2 bar_2 This is a string
Oct 23 1 15 59 1.2.3.4 1234 3 baz_3 This is a string
$ /usr/bin/time -p rosie -f aa.rpl match vmware.matchall sampledata 
Oct 23 1 02 19 1.2.3.4 1234 1 foo_1 This is a string
Oct 23 1 11 48 1.2.3.4 1234 2 bar_2 This is a string
Oct 23 1 15 59 1.2.3.4 1234 3 baz_3 This is a string
real         0.54
user         0.41
sys          0.11
$ wc aa.rpl 
    1110    6655   45540 aa.rpl
$ 

@yausteve-ibm
Author

Great! Now I don't need to break the pattern definitions into a number of RPL files anymore. I did that because it seemed to help in working around the backtrack limit at one point. And what a speedup in compilation time! I will start looking into migrating to 1.0.0.

@yausteve-ibm
Author

Hi, Jamie,
I did some profiling with the program I am developing using Rosie and found that the Rosie API match() is the biggest time consumer. Currently, my program is single-threaded. When run on my system using the 'basic.matchall' pattern, 20,000 match() calls consumed ~30s.
I have to meet a user requirement of processing on the order of 1 million lines/minute of streaming data. Could you give me an idea of the order of performance improvement we could expect from v1.0.0 (vs 0.99), specifically for the match() API?
Appreciate your advice. Many thanks.
-Steve

@jamiejennings
Owner

Hi Steve,

Performance with v1.0.0 is much better. There are many permutations to consider, so I'll give you a few data points and you can let me know which is closest to your situation.

I'm using this input file (syslog.rpl):

import all, date, time, net, word, num

message = .*
datetime = { date.rfc3339 "T" time.rfc3339 }
syslog = datetime net.ip word.any "["num.int"]" message

And my input data is a file of 2M log entries, all different, and taken from a running IBM Bluemix application. A sample log line looks like this:

2015-08-23T03:36:25-05:00 10.111.69.99 sshd[16666]: Did not receive identification string from 200.40.110.11

My measurements are being done on an Early 2015 Macbook Pro with the specs below. And I'm running each test just 3 times and showing here the one that had the middle User time. (So, not scientific, but indicative of what you can expect.)

  Processor Name:	Intel Core i5
  Processor Speed:	2.9 GHz
  Number of Processors:	1
  Total Number of Cores:	2
  L2 Cache (per Core):	256 KB
  L3 Cache:	3 MB
  Memory:	8 GB

Case 1: Rosie CLI

$ /usr/bin/time -p rosie -o json -f syslog.rpl match syslog ~/Data/syslog2M.log >/tmp/out
real        27.65
user        20.58
sys          5.74

2,000,000 entries in 20.58sec, or just under 100k entries/sec

Note that if you do not need the JSON output, it is much faster. In this example, the input line is returned when there is a match (like grep does):

$ /usr/bin/time -p rosie -o line -f syslog.rpl match syslog ~/Data/syslog2M.log >/tmp/out
real         7.96
user         5.47
sys          2.30

2,000,000 entries in 5.47s, or around 363k entries/sec

Case 2: Python calling Rosie one line at a time, no JSON decoding

Here is file ex.py, though I left out the processing of the two command line args that control whether the JSON is decoded or not, and whether the match is printed or not:

engine = rosie.engine()
ok, pkgname, errs = engine.loadfile("syslog.rpl")
if not ok:
    print "syslog.rpl failed to load"
    print errs
    sys.exit(-1)
syslog, errs = engine.compile("syslog")
if not syslog:
    print "Pattern failed to compile:"
    print errs
    sys.exit(-1)

f = open(datafile)
for line in f:
    # Python insists on leaving a newline at the end of the line read from the file
    match, left, abend, tt, tm = engine.match(syslog, line[:-1], 1, "json")
    if match: 
        if decode_flag:
            obj = json.loads(match[:])
            if print_flag: print obj
        else:
            if print_flag: print match
    else:
        if print_flag: print "No match"

Processing the same data file and printing the resulting matches:

$ /usr/bin/time -p python ex.py ~/Data/syslog2M.log false true | wc -l
real        34.55
user        31.96
sys          2.35
 2000002

2,000,000 entries in 31.96sec, or around 62k entries/sec (Python is slow.)

Processing the same data file and NOT printing the matches saves a small amount of time, but not much.

Note: If you only need to know if the line matched or not, you can change ex.py above to use the output format "line" (instead of "json"), and the results are about 2x better:

$ /usr/bin/time -p python ex.py ~/Data/syslog2M.log false true | wc -l
real        17.68
user        15.67
sys          1.87
 2000002

2,000,000 entries in 16sec, or around 125k entries/sec

Case 3: Python calling Rosie one line at a time, and JSON decoding each match

Here's where Python loses big time. Its JSON decoding is horribly slow. I can recommend a way to avoid it, though -- let me know if you are interested.

Also, Python's ability to print a small data structure to the console is slow as well. In this run, Python is using json.loads to decode the match returned by Rosie, but it is NOT printing it out:

$ /usr/bin/time -p python ex.py ~/Data/syslog2M.log true false | wc -l
real       195.94
user       186.11
sys          3.55
       2

2,000,000 entries in 186sec, or around 10.7k entries/sec

Finally, Python processing 2M entries, JSON-decoding each match, and printing the match. This is not a realistic use case, since a real application would do something with the data that is presumably faster than asking Python to create and write a human-readable representation of the resulting data structure (a representation that takes ~ 1500 chars/entry).

$ /usr/bin/time -p python ex.py ~/Data/syslog2M.log true true | wc -l
real       349.69
user       343.00
sys          3.78
 2000002

2,000,000 entries in 343sec, or around 5800 entries/sec

Conclusion

Python is slow. I suspect Go or C/C++ would be significantly faster here. However, it depends on what you are doing with the data.

If you mostly need to know whether there was a match or not, then you can get around 125k entries/sec from Python (under similar conditions as my test, of course). If you need to decode the JSON to get the details for every match, you are looking at a ceiling of around 10.7k entries/sec according to my tests, or around 650k/min.

It is likely possible to get this figure over 1M/min by avoiding Python's json.loads, as I mentioned above. The technique is to avoid JSON altogether. Rosie can return a very compact byte encoding to Python. It takes Rosie about half as long to generate that encoding as it does to generate JSON, so this is the first time saver. Writing a routine in Python to decode this format and create a native Python data structure is pretty straightforward, because I designed the byte encoding to be dead simple. I would bet that such a routine would be faster than json.loads because Rosie's matches contain only a subset of the JSON data types. (This decoding is the second, and probably much bigger, time saver.)
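The actual byte encoding is not documented in this thread, so the following is purely hypothetical: a toy length-prefixed format for (name, start, end) capture triples, illustrating why a flat binary layout can be decoded with much less work than json.loads:

```python
import struct

# Hypothetical wire format (NOT Rosie's real encoding): for each capture,
# a 4-byte little-endian name length, the UTF-8 name bytes, then two
# 4-byte little-endian integers for start and end positions.

def encode(captures):
    out = bytearray()
    for name, start, end in captures:
        raw = name.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw + struct.pack("<II", start, end)
    return bytes(out)

def decode(buf):
    captures, i = [], 0
    while i < len(buf):
        (nlen,) = struct.unpack_from("<I", buf, i); i += 4
        name = buf[i:i + nlen].decode("utf-8"); i += nlen
        start, end = struct.unpack_from("<II", buf, i); i += 8
        captures.append((name, start, end))
    return captures

caps = [("syslog", 0, 107), ("datetime", 0, 25), ("net.ip", 26, 37)]
assert decode(encode(caps)) == caps   # lossless round trip
```

Decoding here is straight-line byte slicing with no parser state machine, which is the source of the speedup over general-purpose JSON parsing.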

If you are interested in this approach, let me know and I'll make it a priority to implement the decoder in Python and we can each try it on our own data, to see how it performs.

There is also the option of moving away from Python (which, although popular, has many challenges for production use). For Rosie v0.99, I created a sample Go program and a sample C program, and my TODO list includes writing new and better ones for v1.0.0. If C or Go would help you, let me know and I'll make it a priority to write the librosie client.

@yausteve-ibm
Author

Hi, Jamie,

Thank you very much for your prompt reply and comprehensive information/advice!

To get a very rough estimate based on the v1.0.0 information provided, I tested case #1 using Rosie v0.99k-mod5 with a test file containing 2 million lines of your sample log line:

time rosie -encode json basic.matchall testfile | wc -l
2000000
real 1m18.355s
user 1m18.025s
sys 0m4.180s

My hardware configuration is roughly the same as yours. So, leaving aside the difference between syslog.rpl and basic.rpl, it looks like v1.0.0 needs only one third of the time v0.99 requires in this scenario.

I did some profiling on json.loads(), it takes around 30s for one million calls.
I then used ujson (an ultra-fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3) and achieved a 75% speedup. So I think we could skip the byte encoding/decoding for the time being and revisit it when necessary. Anyway, thanks for the suggestion and offer to help!
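The per-call cost of json.loads can be measured directly with the standard library; a minimal sketch (swap in ujson.loads for the comparison if it is installed; the payload shape is illustrative, not Rosie's exact output):

```python
import json
import timeit

# A small payload shaped roughly like a JSON match result (illustrative).
payload = json.dumps({"syslog": {"text": "Oct 23 1:02:19 ...", "pos": 1,
                                 "subs": [{"datetime": {"pos": 1}}]}})

n = 100_000
seconds = timeit.timeit(lambda: json.loads(payload), number=n)
print("%.2f us/call" % (seconds / n * 1e6))
```

Multiplying the measured per-call figure by one million gives the expected decode time for a 1M-line run, which is how the ~30s figure above can be sanity-checked.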

I also tested case #2 with the following program 'test1':

engine = Rosie.engine()
r = engine.load_manifest("$sys/MANIFEST")
config = json.dumps( {'expression': "basic.matchall", 'encode': 'json'} )
r = engine.configure(config)
f = open("testfile", "r")
for line in f:
    r = engine.match(line[:-1], 1)

But it took ages... still running after 10 minutes!
This led me to look into a performance issue I encountered when I first started using Rosie:
In version v0.99, given the same number of lines to be matched, matching one line at a time using match() is a lot slower than match_file().

Using the above program with 'testfile' reduced to 10,000 lines (instead of 2,000,000 lines):

time ./test1

real 0m13.259s
user 0m13.167s
sys 0m0.086s

I profiled the execution as follows:
python -m cProfile -o prof1.out ./test1

Using the following Python script 'prtstat' to print out the statistics:
import pstats
p = pstats.Stats('prof1.out')
p.strip_dirs().sort_stats(-1).print_stats()

prtstat

ncalls tottime percall cumtime percall filename:lineno(function)
...
1 0.264 0.264 0.265 0.265 rosie.py:151(load_manifest)
9999 0.007 0.000 13.009 0.001 rosie.py:157(match)
9999 12.870 0.001 13.002 0.001 rosie.py:163(_match_or_eval)
...

Therefore, the bulk of the time was spent in _match_or_eval().
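One side note on the prtstat helper: sort_stats(-1) sorts by standard name, so hot spots are scattered through the listing; sorting by cumulative time surfaces them at the top. A small self-contained sketch:

```python
import cProfile
import io
import pstats

def work():
    # Dummy workload standing in for the match loop.
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

out = io.StringIO()
# 'cumulative' puts the most expensive call chains first; show the top 5.
pstats.Stats(pr, stream=out).strip_dirs().sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

With this ordering, a function like _match_or_eval() lands near the top of the report instead of wherever its name sorts alphabetically.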

Now with a similar program 'test2', using match_file() instead:
engine = Rosie.engine()
r = engine.load_manifest("$sys/MANIFEST")
config = json.dumps( {'expression': "basic.matchall", 'encode':'json'} )
r = engine.configure(config)
r = engine.match_file('testfile', 'testfile.json', 'testfile.err', False)

time ./test2

real 0m0.765s
user 0m0.737s
sys 0m0.028s

prtstat

ncalls tottime percall cumtime percall filename:lineno(function)
...
1 0.264 0.264 0.265 0.265 rosie.py:151(load_manifest)
1 0.000 0.000 0.383 0.383 rosie.py:177(match_file)
1 0.383 0.383 0.383 0.383 rosie.py:183(_match_file)
...
As you can see, test1 (i.e. match() called) is 16+ times slower than test2 (i.e. match_file() called).

Could you explain why there is such a big difference? This doesn't look right.

When I first started using Rosie, I just had to deal with log files. So I could use match_file() instead of match(). But now I have to deal with streaming data, this will become an issue that has to be resolved.

Appreciate your advice. Many thanks.
-Steve
PS: Could you advise when the Python API and some preliminary documentation will be available for v1.0.0?

@yausteve-ibm
Author

Btw, my case is the worst case (i.e. case #3). At this point, I would like to get it to work and meet the user requirement (or close to it) ASAP without rewriting the whole thing.

@jamiejennings
Owner

Steve, the Python API is available now in the current release on the master branch. Doc for this is still needed, although it's a much simpler API than v0.99 -- see test.py.

Thanks for pointing me to ujson, btw. The Python one is so slow!

Why is match_file() so much faster than match() in v0.99?

The implementation of match_file() is the same code that the Rosie CLI uses to process a file. As far as performance goes, most of the low-hanging fruit has been picked; without some non-trivial effort, it won't go much faster. By contrast, match() has a lot of overhead per line of input data: (1) a Python call, (2) marshaling across the FFI interface, which includes a lot of copying, and (3) allocation/deallocation of memory for the input string and the match data.

Why is v1.0.0 much faster than v0.99?

The relationship between match_file() and match() is the same, in that match_file() is all done in Rosie in a tight loop, and the easy performance improvements have been made in that loop. But there are significant improvements in v1.0.0, and, crucially, they apply to both match_file() and match():

  • The generation of JSON output (and a few other output formats) is done by hand-written code in C, and is much faster than the implementation that v0.99 uses.
  • The implementation of librosie for v1.0.0 is new, and it is much more memory efficient than the old one in v0.99. The new librosie does not copy the input data (Rosie reads it directly from Python), and a single resizable output buffer is used to hold the match data (results). Finally, the match data is returned in a Python (CFFI) buffer, where it can be accessed directly or converted into a string as needed.

@yausteve-ibm
Author

Jamie,
Thank you very much for the clarification.
For the time being, I will use match_file(), writing out chunks of lines to a temporary file to avoid using match(). I found that gets my program to a processing rate of 0.5M lines/minute on my system. Although this is still far from the user requirement, it will do for now and buy me time to migrate to v1.0.0.
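The chunking workaround can be sketched generically: buffer incoming lines, flush each chunk to a temporary file, and hand the file path to a file-oriented matcher (engine.match_file in this case; a plain callback here so the sketch is self-contained):

```python
import os
import tempfile

def process_in_chunks(lines, process_file, chunk_size=10_000):
    """Write chunk_size lines at a time to a temp file and pass its path
    to process_file (e.g. a wrapper around engine.match_file)."""
    buf = []
    for line in lines:
        buf.append(line)
        if len(buf) >= chunk_size:
            _flush(buf, process_file)
            buf = []
    if buf:                      # final partial chunk
        _flush(buf, process_file)

def _flush(buf, process_file):
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(buf) + "\n")
        process_file(path)
    finally:
        os.unlink(path)

# Usage with a dummy matcher that just counts lines per chunk:
counts = []
process_in_chunks((f"line {i}" for i in range(25)),
                  lambda p: counts.append(sum(1 for _ in open(p))),
                  chunk_size=10)
print(counts)  # [10, 10, 5]
```

The chunk size trades latency against per-call overhead: larger chunks amortize the file handoff better but delay results for streaming input.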

As I need to return a list of line patterns to the user in input order, I need to find out the line numbers of the lines that have not been matched, so that I can insert error indicators back at the end. match_file() writes the unmatched lines to the error file only. Is it possible to have the line number written together with each unmatched line in the error file?
e.g. 39:This is line 39
Appreciate your help. Many thanks.
-Steve

@jamiejennings
Owner

I can add this to v0.99 for you. An alternative would be to create a "no match" pattern that is a last resort, e.g.

nomatch = .*
pat = pat1 / pat2 / pat3 / nomatch

This way you are guaranteed to get a match, and you can use the name of the match to determine whether it was one you wanted (e.g. pat1) or it was nomatch. The basic.matchall uses this technique to skip to the next token (not skip an entire line), so that every item in the input will appear in the output.
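With the catch-all alternative in place, classification reduces to reading the top-level key of each decoded JSON match. This sketch assumes the match object is keyed by the matched pattern's name (check that against your actual v0.99 output):

```python
import json

def classify_match(match_json):
    """Return the name of the pattern that matched, e.g. 'pat1' or 'nomatch'."""
    obj = json.loads(match_json)
    (name,) = obj.keys()          # assumed: one top-level key, the pattern name
    return name

# Hypothetical match outputs for illustration:
print(classify_match('{"pat1": {"text": "GET /index"}}'))   # pat1
print(classify_match('{"nomatch": {"text": "???"}}'))       # nomatch
```

Since every line yields some match, the input order is preserved for free, and the 'nomatch' entries mark exactly the lines that would otherwise have gone to the error file.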

Let me know if this workaround is sufficient. If it is not, I'll create a v0.99k-mod6 for you with line numbers in the error output.

@yausteve-ibm
Author

The workaround will do. Thanks, Jamie.

@yausteve-ibm
Author

Hi, Jamie,
I am starting to work on migrating to v1.0.0. I would like to rewrite my program in Go at the same time, as you suggested. Are you planning to have the v1.0.0 Go API available soon?
-Steve
PS: This has also become an incentive for me to pick up a language that I have planned to learn for quite some time but never did. :) I will play around with Go using v0.99 for the time being.

@jamiejennings
Owner

With 1.0.0 about to enter Beta, we have no plans to fix v0.99 issues. (Also, the suggested workaround was acceptable for this issue.)
