Merge pull request #304 from dkpro/feature/250-Convenience-for-setting-the-document-language

#250 - Convenience for setting the document language
reckart authored Feb 4, 2024
2 parents ef1d5f4 + e2a8513 commit 5ffc2ba
Showing 4 changed files with 85 additions and 9 deletions.
45 changes: 38 additions & 7 deletions README.rst
@@ -72,8 +72,10 @@ Usage

Example CAS XMI and type system files can be found under :code:`tests\test_files`.

Loading a CAS
~~~~~~~~~~~~~
.. _reading_a_cas_file:

Reading a CAS file
~~~~~~~~~~~~~~~~~~

**From XMI:** A CAS can be deserialized from the UIMA CAS XMI (XML 1.0) format either
by reading from a file or string using :code:`load_cas_from_xmi`.
@@ -98,8 +100,10 @@ Most UIMA JSON CAS files come with an embedded typesystem, so it is not necessar

    with open('cas.json', 'rb') as f:
        cas = load_cas_from_json(f)

Writing a CAS
~~~~~~~~~~~~~
.. _writing_a_cas_file:

Writing a CAS file
~~~~~~~~~~~~~~~~~~

**To XMI:** A CAS can be serialized to XMI either by writing to a file or be
returned as a string using :code:`cas.to_xmi()`.
@@ -126,6 +130,30 @@ returned as a string using :code:`cas.to_xmi()`.

    # Written to file
    cas.to_json("my_cas.json")

.. _creating_a_cas:

Creating a CAS
~~~~~~~~~~~~~~

A CAS (Common Analysis System) object typically represents a (text) document. When using cassis,
you will most often be :ref:`reading <reading_a_cas_file>` existing CAS files, modifying them, and then
:ref:`writing <writing_a_cas_file>` them out again. You can also create CAS objects from scratch,
e.g. to convert some data into a CAS object in order to create a pre-annotated text.
If you do not have a pre-defined typesystem to work with, you will have to :ref:`define one <creating_a_typesystem>`.

.. code:: python

    typesystem = TypeSystem()
    cas = Cas(
        sofa_string = "Joe waited for the train . The train was late .",
        document_language = "en",
        typesystem = typesystem)

    print(cas.sofa_string)
    print(cas.sofa_mime)
    print(cas.document_language)

Adding annotations
~~~~~~~~~~~~~~~~~~

@@ -237,6 +265,8 @@ The same goes for setting:

    assert lst["tail.tail.head"] == "newer_baz"

.. _creating_a_typesystem:

Creating types and adding features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -269,12 +299,13 @@ properties of the Sofa can be read and written:

.. code:: python

    cas = Cas()
    cas.sofa_string = "Joe waited for the train . The train was late ."
    cas.sofa_mime = "text/plain"
    cas = Cas(
        sofa_string = "Joe waited for the train . The train was late .",
        document_language = "en")

    print(cas.sofa_string)
    print(cas.sofa_mime)
    print(cas.document_language)

Array support
~~~~~~~~~~~~~
35 changes: 35 additions & 0 deletions cassis/cas.py
@@ -210,6 +210,7 @@ def __init__(
        lenient: bool = False,
        sofa_string: str = None,
        sofa_mime: str = None,
        document_language: str = None,
    ):
        """Creates a CAS with the specified typesystem. If no typesystem is given, then the default one
        is used which only contains UIMA-predefined types.
@@ -241,6 +242,9 @@ def __init__(
        else:
            self.sofa_mime = "text/plain"

        if document_language is not None:
            self.document_language = document_language

    @property
    def typesystem(self) -> TypeSystem:
        return self._typesystem
@@ -512,6 +516,19 @@ def get_sofa(self) -> Sofa:
        """
        return self._current_view.sofa

    def get_document_annotation(self) -> FeatureStructure:
        """Get the DocumentAnnotation feature structure associated with this CAS view. If none exists, one is created.

        Returns:
            The DocumentAnnotation associated with this CAS view.
        """
        try:
            return self.select(TYPE_NAME_DOCUMENT_ANNOTATION)[0]
        except IndexError:
            document_annotation = self.typesystem.get_type(TYPE_NAME_DOCUMENT_ANNOTATION)()
            self.add(document_annotation)
            return document_annotation
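
The get-or-create lookup in get_document_annotation can be sketched without cassis. The following is a minimal, self-contained illustration and not the library's code: ToyView, Annotation, and the list-backed select are hypothetical stand-ins, and only the try/except IndexError shape mirrors the method in this diff.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Type name used as a lookup key in this toy example.
DOCUMENT_ANNOTATION = "uima.tcas.DocumentAnnotation"


@dataclass
class Annotation:
    """Hypothetical stand-in for a feature structure."""
    type_name: str
    language: Optional[str] = None


@dataclass
class ToyView:
    """Hypothetical stand-in for a CAS view; not part of cassis."""
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, type_name: str) -> List[Annotation]:
        # Return every stored annotation of the requested type.
        return [a for a in self.annotations if a.type_name == type_name]

    def get_document_annotation(self) -> Annotation:
        try:
            # Indexing an empty result raises IndexError.
            return self.select(DOCUMENT_ANNOTATION)[0]
        except IndexError:
            # Nothing found: create one, store it, and return it.
            annotation = Annotation(DOCUMENT_ANNOTATION)
            self.annotations.append(annotation)
            return annotation


view = ToyView()
first = view.get_document_annotation()   # creates the annotation
second = view.get_document_annotation()  # finds the existing one
```

In this sketch the second call returns the same object as the first, so repeated lookups do not accumulate duplicates.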

    @property
    def sofas(self) -> List[Sofa]:
        """Finds all sofas that this CAS manages
@@ -598,6 +615,24 @@ def sofa_array(self, value):
        """
        self.get_sofa().sofaArray = value

    @property
    def document_language(self) -> str:
        """The document language contains the language code for the document.

        Returns: The document language.
        """
        return self.get_document_annotation().get(FEATURE_BASE_NAME_LANGUAGE)

    @document_language.setter
    def document_language(self, value) -> None:
        """Sets document language.

        Args:
            value: The document language
        """
        self.get_document_annotation().set(FEATURE_BASE_NAME_LANGUAGE, value)
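
The property pair routes reads and writes through the document annotation, and the constructor assigns the property only when a value was passed. A minimal sketch of that shape follows, using a hypothetical ToyDocument class and a plain dict in place of a feature structure (neither is part of cassis):

```python
from typing import Dict, Optional


class ToyDocument:
    """Hypothetical stand-in for a CAS; not the cassis implementation."""

    def __init__(self, text: str, document_language: Optional[str] = None):
        self.text = text
        self._document_annotation: Optional[Dict[str, Optional[str]]] = None
        # Only touch the property when the caller supplied a value.
        if document_language is not None:
            self.document_language = document_language

    def _get_document_annotation(self) -> Dict[str, Optional[str]]:
        # Lazily create the backing annotation on first access.
        if self._document_annotation is None:
            self._document_annotation = {"language": None}
        return self._document_annotation

    @property
    def document_language(self) -> Optional[str]:
        return self._get_document_annotation()["language"]

    @document_language.setter
    def document_language(self, value: Optional[str]) -> None:
        self._get_document_annotation()["language"] = value


german = ToyDocument("Ich bin ein Test!", document_language="de")
unset = ToyDocument("No language given")

created_before_read = unset._document_annotation is not None
language_when_unset = unset.document_language
created_after_read = unset._document_annotation is not None
```

In this toy version, merely reading document_language creates the backing annotation. The getter in the diff also goes through get_document_annotation, so the same may hold there, though this sketch does not verify it.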

    def to_xmi(self, path: Union[str, Path, None] = None, pretty_print: bool = False) -> Optional[str]:
        """Creates an XMI representation of this CAS.
8 changes: 8 additions & 0 deletions tests/test_cas.py
@@ -127,6 +127,14 @@ def test_sofa_string_and_mime_type_can_be_set_using_constructor():
    assert cas.sofa_mime == "text/html"


def test_document_language_can_be_set_using_constructor():
    cas = Cas(sofa_string="Ich bin ein test!", document_language="de")

    assert cas.sofa_string == "Ich bin ein test!"
    assert cas.sofa_mime == "text/plain"
    assert cas.document_language == "de"


# Select


6 changes: 4 additions & 2 deletions tests/test_documentation.py
@@ -10,5 +10,7 @@ def test_readme_is_proper_rst():
    with path_to_readme.open() as f:
        rst = f.read()

    errors = list(rstcheck.check(rst))
    assert len(errors) == 0, "; ".join(str(e) for e in errors)
    errors = [str(e) for e in list(rstcheck.check(rst))]
    # https://github.com/rstcheck/rstcheck-core/issues/4
    errors = [s for s in errors if not ("Hyperlink target" in s and "is not referenced." in s)]
    assert len(errors) == 0, "; ".join(errors)
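
The filter added to test_readme_is_proper_rst keeps every message except those that contain both substrings. Below is a small sketch of that predicate, run on made-up sample strings. The exact message format produced by rstcheck is an assumption here, and drop_unreferenced_target_errors is a hypothetical helper name.

```python
from typing import Iterable, List


def drop_unreferenced_target_errors(errors: Iterable[object]) -> List[str]:
    """Stringify the errors and drop 'Hyperlink target ... is not referenced.' ones."""
    messages = [str(e) for e in errors]
    return [
        m for m in messages
        if not ("Hyperlink target" in m and "is not referenced." in m)
    ]


# Illustrative inputs only; real rstcheck output may be shaped differently.
sample_errors = [
    '(INFO/1) Hyperlink target "creating_a_cas" is not referenced.',
    '(ERROR/3) Unknown directive type "cod".',
    "(INFO/1) Hyperlink target is malformed",
]

remaining = drop_unreferenced_target_errors(sample_errors)
```

Because both substrings must match, a message that merely mentions a hyperlink target for some other reason still fails the test.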
