Task Schemata
Task schemata for opinion extraction from text annotated without labels, based on the RuSentRel collection of mass-media articles written in Russian with document-level sentiment attitude annotations. Entity annotations are taken from the BRAT markup and grouped into a synonyms collection by their stemmed form (Yandex Mystem). Opinion annotation is pair-based: all entity pairs within a sentence whose distance does not exceed 50 words receive a non-assigned label (NoLabel):
# 1. Text parser pipeline.
text_parser = BaseTextParser(pipeline=[
    BratTextEntitiesParser(),
    DefaultTextTokenizer(keep_tokens=True),
])

# Initialize the synonyms collection from the RuSentRel synonym groups.
# Stemming is used to group values.
stemmer = MystemWrapper()
synonyms = StemmerBasedSynonymCollection(
    RuSentRelSynonymsHelper.iter_groups(),
    stemmer, is_read_only=False)

# Initialize the provider of the documents.
doc_provider = RuSentrelDocumentProvider()

# 2. Text opinion annotation pipeline.
opinion_annotation = text_opinion_extraction_pipeline(
    # Function that provides a document by its id.
    get_doc_func=doc_provider.by_id,
    # Pipeline of the text processing.
    text_parser=text_parser,
    # List of annotators.
    annotators=[
        # Value-based annotation.
        AlgorithmBasedTextOpinionAnnotator(
            PairBasedOpinionAnnotationAlgorithm(
                dist_in_terms_bound=50,
                label_provider=ConstantLabelProvider(NoLabel())),
            value_to_group_id_func=lambda v: stemmer.lemmatize_to_str(v),
            get_doc_existed_opinions_func=None,
            create_empty_collection_func=lambda: OpinionCollection(synonyms))
    ])
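To make the pair-based step above more concrete, here is a minimal, library-independent sketch of what such an annotation does. It is not AREkit's implementation: `Entity`, `iter_pair_opinions` and the `NO_LABEL` constant are hypothetical names, and it assumes that pairs are ordered (both directions are emitted) and that distance is the difference between term positions.

```python
from dataclasses import dataclass
from itertools import permutations

NO_LABEL = "NoLabel"  # hypothetical stand-in for a non-assigned label


@dataclass(frozen=True)
class Entity:
    value: str       # surface text of the entity
    term_index: int  # position of the entity among the sentence terms


def iter_pair_opinions(entities, dist_in_terms_bound=50):
    """Yield (source, target, label) for every ordered pair of distinct
    entities whose distance in terms does not exceed the bound."""
    for source, target in permutations(entities, 2):
        if abs(source.term_index - target.term_index) <= dist_in_terms_bound:
            yield source.value, target.value, NO_LABEL


# Entities of a single sentence; "C" is too far from both "A" and "B".
sentence_entities = [Entity("A", 0), Entity("B", 7), Entity("C", 80)]
pairs = list(iter_pair_opinions(sentence_entities))
```

With the default bound of 50, only the two directions of the ("A", "B") pair are produced, each carrying the non-assigned label.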
Declaration of a text opinion annotator that converts the document-level attitudes of the RuSentRel collection into context-level opinions:
# Custom labels declaration.
class PositiveLabel(Label):
    pass


class NegativeLabel(Label):
    pass


# Label formatting declaration.
label_formatter = RuSentRelLabelsFormatter(
    pos_label_type=PositiveLabel,
    neg_label_type=NegativeLabel)

annot = AlgorithmBasedTextOpinionAnnotator(
    PredefinedOpinionAnnotationAlgorithm(
        doc_provider=doc_provider,
        # Document-level opinions, read with the formatter declared above.
        get_opinions_by_doc_id_func=lambda doc_id: OpinionCollection(
            RuSentRelOpinions.iter_from_doc(doc_id, label_formatter)),
        # Map an entity value onto the id of its synonym group.
        value_to_group_id_func=lambda value: GroupingProviders.provide_value(
            synonyms=synonyms, value=value),
        create_empty_collection_func=lambda: OpinionCollection(synonyms))
)
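The conversion from document-level attitudes to context-level opinions can be illustrated with a small, self-contained sketch. It is an assumption-laden simplification rather than the library's algorithm: `value_to_group_id` and `project_to_contexts` are hypothetical helpers, the entity names are made up, and synonym lookup uses lowercased strings in place of stemming.

```python
def value_to_group_id(value, synonym_groups):
    """Return the index of the synonym group containing the value, or None."""
    key = value.lower()
    for group_id, group in enumerate(synonym_groups):
        if key in group:
            return group_id
    return None


def project_to_contexts(doc_opinions, sentences, synonym_groups):
    """Yield (sentence_index, source, target, label) for every ordered pair
    of sentence entities that matches a document-level opinion by group id."""
    labels = {}
    for source, target, label in doc_opinions:
        key = (value_to_group_id(source, synonym_groups),
               value_to_group_id(target, synonym_groups))
        labels[key] = label

    for sent_ind, entities in enumerate(sentences):
        for i, source in enumerate(entities):
            for j, target in enumerate(entities):
                if i == j:
                    continue
                key = (value_to_group_id(source, synonym_groups),
                       value_to_group_id(target, synonym_groups))
                if None not in key and key in labels:
                    yield sent_ind, source, target, labels[key]


# Hypothetical data: two synonym groups and one document-level opinion.
synonym_groups = [{"acme", "acme corp"}, {"globex", "globex inc"}]
doc_opinions = [("Acme", "Globex", "neg")]
sentences = [["ACME Corp", "Globex Inc"], ["Globex"], ["Globex", "Acme"]]
contexts = list(project_to_contexts(doc_opinions, sentences, synonym_groups))
```

Each sentence that mentions both participants (in any synonym form) yields one context-level opinion in the direction of the document-level attitude; the sentence with a single entity yields none.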
Application to the large RuAttitudes collection (252K documents), whose attitudes were annotated using a distant supervision technique:
pipeline = text_opinion_extraction_pipeline(
    annotators=[
        # Index-based annotation.
        PredefinedTextOpinionAnnotator(
            doc_provider=doc_provider,
            label_formatter=RuAttitudesLabelFormatter(RuAttitudesLabelScaler()))
    ],
    # `doc_provider` is expected to serve RuAttitudes documents here.
    get_doc_by_id_func=doc_provider.by_id,
    text_parser=text_parser)
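As a rough illustration of index-based annotation, the sketch below resolves opinions that refer to entities by their position in a sentence. The collection's actual storage format and label scale are not described here, so the integer codes in `LABEL_BY_INT` and the helper `resolve_index_based` are assumptions made purely for the example.

```python
# Hypothetical mapping from integer codes to label names.
LABEL_BY_INT = {1: "positive", -1: "negative"}


def resolve_index_based(sentence_entities, indexed_opinions):
    """Turn (source_index, target_index, int_label) triples into
    (source_value, target_value, label_name) triples."""
    for source_index, target_index, int_label in indexed_opinions:
        yield (sentence_entities[source_index],
               sentence_entities[target_index],
               LABEL_BY_INT[int_label])


# Hypothetical sentence entities and predefined index-based opinions.
sentence_entities = ["A", "B", "C"]
indexed_opinions = [(0, 1, -1), (2, 0, 1)]
resolved = list(resolve_index_based(sentence_entities, indexed_opinions))
```

Since the opinions are already attached to entity positions, no pair enumeration or synonym grouping is needed at this step.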
© Nicolay Rusnachenko 2016-Present. Released under the MIT license.