- Adds `Context` structure and `AnnotateWithContextService`.
- Ingesters for `kbp2015` have been moved into the `ingesters/kbp` module
- Added code for mapping TAC KBP 2017 knowledge bases into Java objects
- Updated for Concrete schema 4.14
  - Removed `constituent` field from `MentionArgument` and `SituationMention`.
  - Added `dependencies` and `constituent` fields to `TokenRefSequence`
- Added summarization service utility code to `services` submodule
- Folds in the `concrete-zip` submodule.
- Now includes `Summarization` structs and services
- Adds some typedness to the `lid` package inside dictum.
- Fixes some issues with `redis` submodule.
- Depends upon concrete `4.11`.
- Contains utilities for CoNLL conversion and merging communications.
- Folds in the `dictum` submodule.
- Removes the `log4j2.json` file from the `util` submodule.
- Contains a new submodule, `sql`, for SQL-based tasks.
- Contains numerous fixes for BOLT and Webposts ingesters (thanks @tongfei)
- Fix dependency issues in `CoNLL` ingester
- Delete some buggy and deprecated code
- Contains a new submodule, `lucene`, for building Lucene indexes over Communication objects.
- Added functionality to Miscommunication.
- Miscellaneous fixes and improvements.
- Update to `4.10-rework` concrete, which removes the annotator service.
- Update to thrift `0.9.3` via new concrete parent version, `4.10`.
- Several updates to Tift, including cleaning up the API and adding superior capture of twitter-specific tokens.
- Point to the correct parent project.
- Upgraded to concrete `4.9`.
- Upgrade supporting libraries to latest versions.
- Added quality hashcode methods to Concrete objects.
- Fix exceptions that occurred when long file names were added to archives.
- Improved most ingesters with scripts and better READMEs.
- Bugfixes to `acere` and `conll` ingesters.
- Fix an issue with the `simple-ingesters` package, where it did not have the proper package name. The package has been moved from `edu.jhu.hlt.ingesters.simple` to `edu.jhu.hlt.concrete.ingesters.simple`
- Add support for CoNLL in the `ingesters/conll` package. See the README for details.
- Add support for ACERE in the `ingesters/acere` package. See the README for details.
- Fix an issue with visibility of classes and constructors in utility libraries.
Updates to `UUID` generation to better support compression. Now includes a class that should be used by analytics to generate `UUID`s.
One factory should be created per communication, and a new generator should be created from that factory for each analytic processing the communication. Usually each program represents a single analytic, so common usage is:
```java
Communication c = ... // from an existing Communication
AnalyticUUIDGeneratorFactory f = new AnalyticUUIDGeneratorFactory(c); // per-comm
AnalyticUUIDGenerator gen = f.create(); // per-analytic
UUID newUUID = gen.next(); // concrete UUID
TokenTagging tt = new TokenTagging();
tt.setUuid(newUUID);
```
or if you're creating a new communication:
```java
AnalyticUUIDGeneratorFactory f = new AnalyticUUIDGeneratorFactory(); // no-arg ctor
AnalyticUUIDGenerator gen = f.create();
Communication c = new Communication();
c.setUuid(gen.next());
```
In the first example, `TokenTagging` stands in for any annotation object, such as `Parse`, `DependencyParse`, `TokenTagging`, or `CommunicationTagging`.
Additionally, add some serialization calls to ingesters to ensure valid `Communication` objects are produced.
- Gigaword ingester: fix an issue where zero-length text spans were being propagated through.
- Util: add some utility predicates to `TextSpanWrapper` and `SectionWrapper` for easier filtering of these types.
- Util: add a utility, `FilterArchiveByCommunicationType`, that allows dropping communications of a particular type from an archive.
- Fix an issue with empty sections in BOLT ingester
- Deprecate `SuperTextSpan` in favor of `TextSpanWrapper`
Bugfix release: truly fix up the `concrete-parent` issue.
Bugfix release: depend upon fixed `concrete-parent`.
Contains `GigawordDocumentBatchConverter`, capable of taking output from `xargs` for bulk `.sgml` file ingest.
First cut at a web post ingester.
Bug fix release: Switch `bolt` ingester to Woodstox API, fixing a few underlying issues.
Add an ingester for BOLT forum posts.
Fix up a bad release.
Updates to support concrete v4.6.
Update the `ingesters/gigaword` library to take a `.gz` file from English Gigaword v5 and create a `.tar.gz` archive of `Communication` objects.
See this class for details.
Misc:
- Updated dependencies for `acute`, `joda-time`, and `gigaword`.
- Improved documentation and added some default implementations for serialization classes.
Minor release containing a patched Gigaword ingester library. Should provide additional safety against StackOverflowErrors.
Minor release: add utility factories for creating Parse and DependencyParse objects; also fix an issue in the TokenTaggingFactory where NPEs could occasionally fire.
Update the `gigaword` library dependency and rework `ingesters/gigaword` to use the new API.
Also fix an issue in Tift where Strings were being concatenated naively; now uses a StringBuilder.
Tiny update to make validation more verbose to track down a downstream bug.
Fix some UTF-8 encoding landmines, make a few inner classes static, and depend upon the latest acute and utilt dependencies.
Add `NoEmptySentenceListOrTokenizedCommunication`, a `miscommunication` implementation for analytics that depend upon section objects with either an unset sentence list or a sentence list with more than zero members. Primarily to support concrete-stanford.
Changes include:
- Add implementations supporting `EntityMention`s and `SituationMention`s.
- Fix an issue where `NonTokenizedSentencedCommunication` did not actually have anything implemented.
- Add a package for lemmas.
- Refactor the interface to allow production of generic `WrappedCommunication` implementations.
- Add an analytic interface, `NonSentencedSectionedCommunicationAnalytic`, that enforces input Communications have `Section`s, but no `Sentence`s.
- Fixed warnings for deprecated classes across numerous packages.
- Add `NonTokenizedSentencedCommunication`, an implementation of `MappedSentenceCommunication` that enforces no `Sentence` objects have `Tokenization`s set.
- Add `NonSentencedSectionedCommunication`, an implementation of `MappedSectionCommunication` that enforces no `Section` objects have `Sentence`s set.
- Build against concrete v4.5 (ConstituentRef addition)
- Update to the latest annotated-nyt dependency, fixing an ingest issue.
- Fixes an issue with validation code that uses deprecated libraries - now using the Miscommunication API.
The `miscommunication` module attempts to add some type discipline to Concrete `Communication` objects. Previously this functionality was handled in an uber-object, `SuperCommunication`. This cleaner API allows for more modular implementations (e.g., aggressively cached vs. not).
See more in the `miscommunication` directory.
Various interfaces have been added to the `analytics-base` library that utilize `miscommunication` interfaces for safer annotation.
For example, if an `Analytic` implementation produces a `SectionedCommunication` object, there exists an interface that allows for type-safe `Communication` objects to be produced. As a result, interfaces in `analytics-base` have been updated.
Additionally, the `ingesters-simple` and `analytics-simple` modules now have example implementations of these more strongly typed `Communication`s.
- Small improvements to `validation` module when working with `Tokenizations` and their children
- Fix a few bugs with respect to `analytics-simple` analytics not correctly validating their inputs
- Fix a bug with the `SingleSectioningAnalytic`'s validity check
- Add `TokenizationFactory` and `TextSpanFactory` utility classes
Contains a bug fix for the Annotated NYT Concrete ingester.
Contains additional methods in SuperCommunication to support entity linking tasks.
Notes coming soon
An ingester for the Annotated NYT Corpus has been added. See the `ingesters/annotated-nyt` package.
Documentation is now built alongside the project and can be accessed here.
Javadocs can be found by clicking on a module, then looking under the Project Reports section.
Ingesters for English Gigaword v5 and the ALNC corpus are now available. They can be found in the `ingesters` folder.
Tool names have been improved to include the class, project, and version.
The `safe` module now has support for `Communication` objects via `SafeCommunication`.
Consumers can now implement Stream-based ingesters via the `edu.jhu.hlt.concrete.ingesters.base.stream` package interfaces.
- Small update to the `validation` library to include testing a facet of `Tokenization` objects.
- Began adding more `package-info.java` to various packages.
- Use latest `acute` library
This update contains the latest `edu.jhu.hlt/acute` library.
Utility IO classes that were originally in the `base` module have been moved to a different Maven project.
A new module, `safe`, has been added. This project will attempt to map required Concrete fields to Java interfaces, allowing consumers to use these implementations without fear of write-time errors due to missing fields.
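The idea can be sketched roughly as follows. This is a minimal illustration, not the `safe` module's actual API: the interface `SafeComm`, the class `CheckedComm`, and the choice of `id`, `uuid`, and `type` as the required fields are all assumptions made for the example, and `java.util.UUID` stands in for Concrete's own UUID type.

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical interface exposing only the fields assumed to be required.
interface SafeComm {
  String getId();
  UUID getUuid();
  String getType();
}

// Hypothetical implementation: every required field is checked when the
// object is built, so a missing field fails here rather than at write time.
final class CheckedComm implements SafeComm {
  private final String id;
  private final UUID uuid;
  private final String type;

  CheckedComm(String id, UUID uuid, String type) {
    this.id = Objects.requireNonNull(id, "id is required");
    this.uuid = Objects.requireNonNull(uuid, "uuid is required");
    this.type = Objects.requireNonNull(type, "type is required");
  }

  @Override
  public String getId() {
    return this.id;
  }

  @Override
  public UUID getUuid() {
    return this.uuid;
  }

  @Override
  public String getType() {
    return this.type;
  }
}
```

Code that accepts a `SafeComm` can then rely on the required fields being present without re-checking them before serialization.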
The `MetadataTool` interface supports more detailed and easier-to-parse strings for `AnnotationMetadata` objects. Consumers can use implementers to easily parse and read output from tools that are then added to `AnnotationMetadata` objects.
Simple ingesters now implement `MetadataTool` and utilize the `safe` code from `SafeAnnotationMetadata`.
Thrift fields are no longer public. Code that depends upon a thrift field, such as `comm.text`, will need to be changed to use the getter methods.
In most cases, accesses can be changed with the addition of `get` or `set`, followed by camel case. For example, `comm.getText()` or `comm.setText(myText)`.
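As a small sketch of that migration, the stand-in class below mimics a generated type whose `text` field is private. `Comm` and `FieldAccessMigration` are hypothetical names used only for illustration; the real Thrift-generated `Communication` class has many more fields and methods.

```java
// Hypothetical stand-in for a Thrift-generated class with a private field.
class Comm {
  private String text; // previously reachable as comm.text

  public String getText() {
    return this.text;
  }

  public void setText(String text) {
    this.text = text;
  }
}

class FieldAccessMigration {
  // Before: String t = comm.text;   (no longer compiles)
  // After:  String t = comm.getText();
  static String readText(Comm comm) {
    return comm.getText();
  }

  // Before: comm.text = myText;     (no longer compiles)
  // After:  comm.setText(myText);
  static void writeText(Comm comm, String myText) {
    comm.setText(myText);
  }
}
```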
Ingesters for Concrete are being moved into this project. Currently, a simple ingester is included, as well as a library with common ingester code. The simple ingester allows consumers to take character-based files and convert them to `Communication` objects.
Currently, two implementations exist: `CompleteFileIngester`, which ingests complete text files into a `Communication`, and `DoubleLineBreakFileIngester`, which creates sections for each double line break (platform independent) in a character-based file.
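One way to picture the double-line-break sectioning is the sketch below. It is not the code used by `DoubleLineBreakFileIngester`; the class name `DoubleLineBreakSketch`, the method `sectionSpans`, and the choice to return `[start, end)` character offsets are assumptions for illustration. It treats `\n`, `\r\n`, and `\r` as line breaks, and does not handle lines that contain only whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleLineBreakSketch {
  // Two or more consecutive line breaks. The atomic group (?>...) keeps a
  // single "\r\n" from being re-read as "\r" followed by "\n".
  private static final Pattern DOUBLE_BREAK =
      Pattern.compile("(?>\\r\\n|\\r|\\n){2,}");

  /** Returns [start, end) character offsets of each non-empty section. */
  static List<int[]> sectionSpans(String text) {
    List<int[]> spans = new ArrayList<>();
    Matcher m = DOUBLE_BREAK.matcher(text);
    int start = 0;
    while (m.find()) {
      // Skip empty sections, e.g. when the text begins with blank lines.
      if (m.start() > start) {
        spans.add(new int[] {start, m.start()});
      }
      start = m.end();
    }
    // Whatever follows the last double break is the final section.
    if (start < text.length()) {
      spans.add(new int[] {start, text.length()});
    }
    return spans;
  }
}
```

Returning offsets rather than substrings keeps each section tied to its position in the original text, which is what a text span needs.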
In the future, additional ingesters for other corpora will be relocated into this project.
Consult the README.md for information on how to run the simple ingester utilities.
Iterators for creating and reading archives with generic Thrift-like objects (e.g., `Clustering` objects) now exist in the `serialization` package. These use reflection to read and write Thrift-like objects; any class generated by the Thrift compiler (e.g., any class in concrete-core) can be used as the type bound.
Consumers working with `Communication` objects should maintain their dependency on `CommunicationSerializer` and related implementations; these do not use reflection.
Consult `BoundedThriftAPITest` for example usage, located here.
Packages for thrift-specific datatypes (`Communication`, `Section`, etc.) have been created. These contain utilities for working with these data types. For example, the `SectionFactory` class allows a consumer to create a `Section` with a UUID already assigned.
The following `Factory` classes are now in the library:
- `SectionFactory` (source)
- `AnnotationMetadataFactory` (source)
- `CommunicationFactory` (source)
- `UUIDFactory` (source)
Functionality for generating mock Concrete objects has moved to the `edu.jhu.hlt.concrete.random` package. The class `RandomConcreteFactory` contains numerous tools to generate synthetic Concrete objects. It replaces the now deprecated `ConcreteFactory`.
The `SingleSectionSegmenter` class introduces a method that will convert an entire `String` of text into a `Section` object, with the correct `TextSpan`, and the assigned `sectionKind`.
The class can be viewed here.
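Roughly, a single section covering a whole string spans offset 0 through the string's length. The sketch below shows that idea with stand-in types; `SimpleSection` and `wholeTextAsSection` are hypothetical names, not the `SingleSectionSegmenter` API, and the end-exclusive span convention is an assumption.

```java
class SingleSectionSketch {
  // Hypothetical stand-in for the span and kind carried by a Section.
  static final class SimpleSection {
    final String kind;
    final int start; // inclusive character offset
    final int end;   // exclusive character offset (assumed convention)

    SimpleSection(String kind, int start, int end) {
      this.kind = kind;
      this.start = start;
      this.end = end;
    }
  }

  /** Wraps the entire text in one section of the given kind. */
  static SimpleSection wholeTextAsSection(String text, String sectionKind) {
    if (text == null || text.isEmpty()) {
      throw new IllegalArgumentException("text must be non-empty");
    }
    return new SimpleSection(sectionKind, 0, text.length());
  }
}
```

With this convention, `text.substring(section.start, section.end)` recovers the full original text.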
This package did not belong in `concrete-java` and has been removed. If consumers need access to this code, it will appear in another library in the near future.
The following classes are deprecated and will be removed in a future release.
- `ConcreteFactory` - replaced by `RandomConcreteFactory`
- `ConcreteUUIDFactory` - replaced by `UUIDFactory`
- `Util` - replaced by method in `UUIDFactory`
- `Serialization` - replaced by `ThriftSerializer`