Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new command: annotate #292

Merged
merged 15 commits into from
Aug 1, 2022
31 changes: 31 additions & 0 deletions sssom/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@

from sssom.constants import (
DEFAULT_VALIDATION_TYPES,
MAPPING_SET_SLOTS,
MAPPING_SLOTS,
PREFIX_MAP_MODES,
SchemaValidationType,
Expand All @@ -34,6 +35,7 @@
from . import __version__
from .cliques import split_into_cliques, summarize_cliques
from .io import (
annotate_file,
convert_file,
filter_file,
parse_file,
Expand Down Expand Up @@ -644,5 +646,34 @@ def filter(input: str, output: TextIO, **kwargs):
filter_file(input=input, output=output, **kwargs)


@main.command()
@input_argument
@output_option
# TODO Revist the option below.
# If a multivalued slot needs to be partially preserved,
# the users will need to type the ones they need and
# set --replace-multivalued to True.
@click.option(
"--replace-multivalued",
default=False,
type=bool,
help="Multivalued slots should be replaced or not. [default: False]",
)
@dynamically_generate_sssom_options(MAPPING_SET_SLOTS)
def annotate(input: str, output: TextIO, replace_multivalued: bool, **kwargs):
"""Annotate metadata of a mapping set.

:param input: Input path of the SSSOM tsv file.
:param output: Output location.
:param replace_multivalued: Multivalued slots should be
replaced or not, defaults to False
:param **kwargs: Options provided by user
which are added to the metadata (e.g.: --mapping_set_id http://example.org/abcd)
"""
annotate_file(
input=input, output=output, replace_multivalued=replace_multivalued, **kwargs
)


if __name__ == "__main__":
main()
23 changes: 23 additions & 0 deletions sssom/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@
from .typehints import Metadata
from .util import (
MappingSetDataFrame,
are_params_slots,
augment_metadata,
is_curie,
is_iri,
raise_for_bad_path,
Expand Down Expand Up @@ -361,3 +363,24 @@ def filter_file(input: str, output: TextIO, **kwargs) -> MappingSetDataFrame:
if multiple_params and idx != len(params):
query += " AND ("
return run_sql_query(query=query, inputs=[input], output=output)


def annotate_file(
input: str, output: TextIO, replace_multivalued: bool = False, **kwargs
) -> MappingSetDataFrame:
"""Annotate a file i.e. add custom metadata to the mapping set.

:param input: SSSOM tsv file to be queried over.
:param output: Output location.
:param replace_multivalued: Multivalued slots should be
replaced or not, defaults to False
:param **kwargs: Options provided by user
which are added to the metadata (e.g.: --mapping_set_id http://example.org/abcd)
:return: Annotated MappingSetDataFrame object.
"""
params = {k: v for k, v in kwargs.items() if v}
hrshdhgd marked this conversation as resolved.
Show resolved Hide resolved
are_params_slots(params)
input_msdf = parse_sssom_table(input)
msdf = augment_metadata(input_msdf, params, replace_multivalued)
write_table(msdf, output)
return msdf
59 changes: 59 additions & 0 deletions sssom/util.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
ENTITY_REFERENCE_SLOTS,
MAPPING_JUSTIFICATION,
MAPPING_SET_ID,
MAPPING_SET_SLOTS,
MAPPING_SET_SOURCE,
MULTIVALUED_SLOTS,
OBJECT_CATEGORY,
Expand Down Expand Up @@ -1250,3 +1251,61 @@ def get_all_prefixes(msdf: MappingSetDataFrame) -> list:
prefix_list = list(set(prefix_list))

return prefix_list


def augment_metadata(
msdf: MappingSetDataFrame, meta: dict, replace_multivalued: bool = False
) -> MappingSetDataFrame:
"""Augment metadata with parameters passed.

:param msdf: MappingSetDataFrame (MSDF) object.
:param meta: Dictionary that needs to be added/updated to the metadata of the MSDF.
:param replace_multivalued: Multivalued slots should be
replaced or not, defaults to False.
:raises ValueError: If type of slot is neither str nor list.
:return: MSDF with updated metadata.
"""
are_params_slots(meta)

if msdf.metadata:
for k, v in meta.items():
# If slot is multivalued, add to list.
if k in MULTIVALUED_SLOTS and not replace_multivalued:
tmp_value: list = []
if isinstance(msdf.metadata[k], str):
tmp_value = [msdf.metadata[k]]
elif isinstance(msdf.metadata[k], list):
tmp_value = msdf.metadata[k]
else:
raise ValueError(
f"{k} is of type {type(msdf.metadata[k])} and \
as of now only slots of type 'str' or 'list' are handled."
)
tmp_value.extend(v)
msdf.metadata[k] = list(set(tmp_value))
elif k in MULTIVALUED_SLOTS and replace_multivalued:
msdf.metadata[k] = list(v)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this is that if have have to creators in the metadata and want to add a third one.. it wont work. Would it make sense to make the method strictly additive for multi valued metadata? Maybe not. Do you have an opinion on this? Maybe it is correct the way it is..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! We will have to think about it. When is it additive and when is it replacement? Should there be a flag?

Copy link
Contributor Author

@hrshdhgd hrshdhgd Jul 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what I have implemented now is if a slot to be annotated is multivalued , add new values to the list. But what if the entire slot value needs to be changed? How does the user indicate that?

Possible solutions:

  • create a new method remove_annotation which does exactly the opposite of annotate? It seems inconvenient in my opinion.
  • Have a keyword in params called remove which will be handled separately and would look like this:
params = {
            "mapping_set_id": ("http://w3id.org/my/mapping.sssom.tsv",),
            "mapping_set_version": ("2021-01-01",),
            "creator_id": ("orcid:2345"),
            "remove": { "creator_id": "orcid:1234" }
        }

Basically , remove will be a dict with key = slot to be edited and value will be the one to be removed.

  • Have a --remove CLI option that lists the metadata that needs to be removed. The only allowed values here would be MULTIVALUED_SLOTS or maybe all slots)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a boolean flag --replace-multivalued which defaults to false, and if it is true, it will remove any existing values from a multivalued value?

Copy link
Contributor Author

@hrshdhgd hrshdhgd Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what if the users want to retain some of the contents of the list and add new stuff? They'll have to type the ones they want again. If that's the case, the first solution would have worked and was a simple one too.

Copy link
Contributor Author

@hrshdhgd hrshdhgd Jul 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've implemented replace_multivalued. I've added a TODO comment next to the click option. If we can think of a cleaner way to do this, we can always go back and improve the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine. good one :)

else:
msdf.metadata[k] = v[0]

return msdf


def are_params_slots(params: dict) -> bool:
"""Check if parameters conform to the slots in MAPPING_SET_SLOTS.

:param params: Dictionary of parameters.
:raises ValueError: If params are not slots.
:return: True/False
"""
empty_params = {k: v for k, v in params.items() if v is None or v == ""}
if len(empty_params) > 0:
logging.info(f"Parameters: {empty_params.keys()} has(ve) no value.")

legit_params = all(p in MAPPING_SET_SLOTS for p in params.keys())
if not legit_params:
invalids = [p for p in params if p not in MAPPING_SET_SLOTS]
raise ValueError(
f"The params are invalid: {invalids}. Should be any of the following: {MAPPING_SET_SLOTS}"
)
return True
Loading