Note: Because of the additional processing of query results in SeaCOW, it is significantly slower than NoSketchEngine or corpquery. This cannot be avoided unless you drop the extra processing, which is the whole point of having SeaCOW in the first place. Please do not file bug reports or complaints about the speed penalty involved in using SeaCOW. It is designed primarily for running unattended queries on a server with built-in processing, filtering, etc. If you are a Python wizard, you are invited to help us make the additional processing more efficient, of course.
- SeaCOW is a class-based rewrite of the old ManaCOW project.
- It uses an efficient Bloom filter for deduplication.
- It does not create huge memory structures but processes concordances on the fly.
- If you want custom processing, create an implementation of the Processor class.
- Several ready-to-use processors are included, such as ConcordanceWriter and DependencyBuilder.
There is currently no supported or recommended installation procedure. In any case, you need a running Manatee installation with corpora, and you have to make the SeaCOW Python files visible to your own code.
Get an account on https://www.webcorpora.org/ to use SeaCOW with COW.
For each Processor class, there is a straightforward and annotated demo in the samples folder!
- Create a SeaCOW.Query object.
- Set the relevant attributes, including the search string (see below).
- Create an object of a descendant class of SeaCOW.Processor and set its attributes.
- Set the processor as the processor attribute of the query object.
- Call the query's run() method.
NOTE! This is currently Python 2.7 only. Please get in contact with us if you need Python 3, and we will assist you in creating a Python 3 version.
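To illustrate these steps, here is a minimal sketch which runs a query and writes the concordance to a CSV file using ConcordanceWriter. The corpus identifier, attribute lists, and query string are just the examples used on this page; adapt them to your own setup.

```python
# -*- coding: utf-8 -*-
# Minimal SeaCOW workflow sketch (Python 2.7): run a query and write the
# concordance to a CSV file. All values below are illustrative.

from SeaCOW import Query, ConcordanceWriter

q = Query()
q.corpus     = 'decow16a-nano'                  # lower-case corpus identifier
q.string     = '[lemma="Chuzpe"]'               # the query string
q.max_hits   = 100                              # export at most 100 hits
q.attributes = ['word', 'tag', 'lemma']         # token attributes to export
q.structures = ['s']                            # structures to export
q.references = ['doc.id', 'doc.url', 's.idx']   # reference attributes to export
q.container  = 's'                              # container structure (one sentence per hit)
q.set_deduplication()                           # enable Bloom-filter deduplication

p = ConcordanceWriter()
p.filename = 'chuzpe.csv'                       # set to None to print to stdout

q.processor = p                                 # attach the processor ...
q.run()                                         # ... and run the query
```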
cow_region_to_conc(region, attrs = True)
Formats a Manatee region (as returned within Query objects and passed to Processor objects) into a usable structure. Decodes UTF-8. Set attrs to False if your concordance contains no structures and only one positional attribute (pure token stream).
Query(object)
Performs queries and pipes the data into a processor.
If you pass an instance of Nonprocessor as the processor attribute, Query will call the prepare() and finalise() methods as usual. However, the stream returned by Manatee will not be processed, and the process() method is never called. Except for corpus and string, you do not need to set any attributes; even container can be left unset.
Using a Nonprocessor is intended for those who only want to read the count attribute after Manatee has executed the query (like Manatee's own corpquery -n).
- corpus: The string which identifies the corpus (lower case), such as 'decow16a-nano'.
- subcorpus: A string which identifies the full path to a subcorpus file (ending in '.subc'), or just a subcorpus name (such as 'Forum').
- attributes: A list of attributes of tokens to be exported, such as ['word', 'tag', 'lemma', 'depind', 'dephd', 'deprel'].
- structures: A list of structures to be exported, such as ['s', 'nx'].
- references: A list of reference attributes to be exported, such as ['doc.id', 'doc.url', 's.idx'].
- container: The container structure to be exported, such as 's'.
- string: The query string, such as '[lemma="Chuzpe"]'.
- max_hits: The maximum number of hits to be exported.
- random_subset: A float between 0 and 1 representing the proportion of hits to be exported (chosen randomly).
- context_left: The number of container structures to be exported to the left of the matching one.
- context_right: The number of container structures to be exported to the right of the matching one.
- processor: The processor object which takes care of the returned results.
set_deduplication(self, off = False)
Enable or disable deduplication of concordances based on a Bloom filter. Call it without an argument to activate the filter.
run(self)
Execute the query and process the results after everything has been set up.
Processor(object)
The 'abstract' class from which processors should be derived.
- __init__(self): Standard init.
- prepare(self, query): Code executed before the query results are processed.
- finalise(self, query): Code executed after the query results are processed.
- process(self, query, region, meta, match_offset, match_length): The callback called for each hit returned for the query. query is the query object. region is the Manatee region, which should always be processed with cow_region_to_conc. meta is a list of all meta information for the hit (reference attributes; look in query.references for what they are). match_offset and match_length locate the actual matching structure in region.
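As a sketch of what a derived processor could look like, the following hypothetical HitCounter class (not part of SeaCOW) decodes each region with cow_region_to_conc and keeps some simple bookkeeping; the hits and sizes attributes are introduced purely for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch of a custom processor (Python 2.7). HitCounter is a hypothetical
# example class, not part of SeaCOW itself.

from SeaCOW import Processor, cow_region_to_conc

class HitCounter(Processor):

    def prepare(self, query):
        # Runs once before any hit is processed.
        self.hits  = 0
        self.sizes = []

    def process(self, query, region, meta, match_offset, match_length):
        # Runs for every hit. Always decode the raw Manatee region first.
        line = cow_region_to_conc(region, attrs = True)
        self.hits += 1
        # meta holds the reference attributes in the order of query.references.
        self.sizes.append(len(line))

    def finalise(self, query):
        # Runs once after all hits have been processed.
        print('Processed %d hits.' % self.hits)
```

Attach it like any other processor (q.processor = HitCounter()) before calling q.run().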
ConcordanceLoader(Processor)
A Processor which loads a concordance into a Python list. Each element represents one hit and is organised as a dictionary. The keys are meta (meta data as requested when setting up the Query), left (left context), match (matching region), and right (right context). The last three members are lists of strings and dictionaries. Structural markers like <s> are always encoded as strings. Tokens are either a string (attributes concatenated) or a dictionary. See full_structure.
- full_structure: If True, then each token in the matching region and the context will also be a dictionary with annotation names as keys and the corresponding values (token, lemma, POS tag, etc.). Otherwise, everything will be flattened into one string using the pipe symbol |. Default is False.
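A sketch of how loading a concordance into memory could look. Note that the name of the ConcordanceLoader attribute holding the loaded list is assumed here to be concordance; check the annotated demo in the samples folder for the actual interface.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2.7): load a concordance into memory and inspect the
# per-hit dictionaries (keys: meta, left, match, right). The name of the
# attribute holding the loaded list ('concordance') is an assumption.
from __future__ import print_function

from SeaCOW import Query, ConcordanceLoader

q = Query()
q.corpus     = 'decow16a-nano'
q.string     = '[lemma="Chuzpe"]'
q.max_hits   = 10
q.attributes = ['word', 'lemma']
q.structures = ['s']
q.references = ['doc.id', 's.idx']
q.container  = 's'

p = ConcordanceLoader()
p.full_structure = True              # tokens become dicts keyed by annotation name
q.processor = p
q.run()

for hit in p.concordance:            # attribute name assumed, see above
    print(hit['meta'])               # the reference attributes, as requested
    for token in hit['match']:       # matching region: strings and dicts
        if isinstance(token, dict):  # structural markers like <s> stay strings
            print(token['word'], token['lemma'])
```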
ConcordanceWriter(Processor)
A Processor which writes the results of a query to a nicely formatted CSV file (or to the terminal).
- filename: Set this to save a CSV file. If None, output goes to stdout.
DependencyBuilder(Processor)
A Processor which re-creates the dependency information contained in COW corpora and represents it as trees (in anytree format). This is a base class which only writes trees to the terminal, stores them as JSON, or draws Graphviz graphs to DOT or PNG files. It is intended for refinement in custom classes.
- column_index: The 0-based index into the attribute list locating the dependency index (see where in Query.attributes you specified something like 'depind').
- column_head: The 0-based index into the attribute list locating the dependency head index (see where in Query.attributes you specified something like 'dephd').
- column_relation: The 0-based index into the attribute list locating the dependency relation (see where in Query.attributes you specified something like 'deprel').
- column_token: The 0-based index into the attribute list locating the token (see where in Query.attributes you specified something like 'word').
- fileprefix: The path prefix defining the location where the (potentially many) data files will be saved.
- savejson: Set to True to export full JSON for the dependency trees (including meta data) as one large file.
- saveimage: Set to 'dot' to export Graphviz DOT files, 'png' to export PNG files, or None to export no graphics files of the dependency trees. ATTENTION! This creates one file per hit!
- printtrees: Set to True to output ASCII renderings of the trees at the terminal while processing.
- imagemetaid1: The 0-based index of the hit's meta attribute which will be used to create graphics file names, first part. Recommended: doc.id. See Query.references for where you put the reference attributes in the list.
- imagemetaid2: The 0-based index of the hit's meta attribute which will be used to create graphics file names, second part. Recommended: s.idx. See Query.references for where you put the reference attributes in the list. NOTE: imagemetaid2 is not required. However, if you only use a document identifier, subsequent sentences from the same document will overwrite those already written.
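A sketch of a possible DependencyBuilder setup. The column_* values must correspond to the positions in Query.attributes and the imagemetaid* values to the positions in Query.references; the file prefix and all query settings are only examples.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2.7): build dependency trees with DependencyBuilder.
# The column_* values must match the positions in q.attributes, and the
# imagemetaid* values the positions in q.references. Paths are examples.

from SeaCOW import Query, DependencyBuilder

q = Query()
q.corpus     = 'decow16a-nano'
q.string     = '[lemma="Chuzpe"]'
q.max_hits   = 25
q.attributes = ['word', 'tag', 'lemma', 'depind', 'dephd', 'deprel']
q.structures = ['s']
q.references = ['doc.id', 's.idx']
q.container  = 's'

p = DependencyBuilder()
p.column_token    = 0                # position of 'word'   in q.attributes
p.column_index    = 3                # position of 'depind' in q.attributes
p.column_head     = 4                # position of 'dephd'  in q.attributes
p.column_relation = 5                # position of 'deprel' in q.attributes
p.fileprefix      = 'trees/chuzpe'   # where data files are written (example)
p.savejson        = True             # one large JSON file with all trees
p.saveimage       = None             # or 'dot' / 'png' -- one file per hit!
p.printtrees      = True             # ASCII trees on the terminal
p.imagemetaid1    = 0                # position of 'doc.id' in q.references
p.imagemetaid2    = 1                # position of 's.idx'  in q.references

q.processor = p
q.run()
```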
Nonprocessor(Processor)
A Processor which does nothing. All four methods simply pass. Use this to read Query.count after executing a query if you just need query result counts. See the Query() documentation about the implications.
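A sketch of a count-only query using the Nonprocessor; as described above, only corpus, string, and the processor need to be set.

```python
# -*- coding: utf-8 -*-
# Sketch (Python 2.7): get only the number of hits for a query,
# comparable to Manatee's corpquery -n.

from SeaCOW import Query, Nonprocessor

q = Query()
q.corpus    = 'decow16a-nano'        # example corpus identifier
q.string    = '[lemma="Chuzpe"]'     # example query
q.processor = Nonprocessor()         # do not process the result stream
q.run()

print('Number of hits: %d' % q.count)
```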