-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tracking to data file type column names 2. #553
Conversation
Testing that the changes fix what was raised in ISA-tools#509.
as discussed today:
|
Changed the added test to be more realistic. Had to track name columns like Data Transformation Name so values would be handled correctly.
I did an example more like what was discussed in the meeting. I had to also add tracking for the columns that are added for known protocol types. The problem that this PR solves is that for any ".* File" column, if there is more than 1 then the values don't get written out correctly. Solving this wasn't as easy as it was for a similar issue with Protocol REF columns. Due to how graph nodes were previously handled. Old Table Output:
New Table Output:
|
PRS to test this situation, i.e., several files generated by one data acquisition:
|
@@ -260,7 +263,7 @@ def get_node_by_label_and_key(labl, this_key): | |||
fv_set.add(fv) | |||
material.factor_values = list(fv_set) | |||
|
|||
elif object_label in _LABELS_DATA_NODES: | |||
elif object_label in _LABELS_DATA_NODES or ' File' in object_label: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
object_label="foo File" would cause issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I don't think "foo File" would validate.
- There are other instances of patterns like
' File' in
orendswith('File')
where "foo File" would cause an issue that were already present. I just made things consistent. - When this change was originally made not every File column was in _LABELS_DATA_NODES.
- Having a list of specific acceptable file names is pretty fragile anyway and I would have generalized to columns ending in " File" a while ago, or a pattern like what "Protocol REF" does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing in meeting. I will remove this and make the code always just look in _LABELS_DATA_NODES.
isatools/model/utils.py
Outdated
def _build_paths_and_indexes(process_sequence=None): | ||
"""Returns the paths from source/sample to end points and a mapping of sequence_identifier to object.""" | ||
|
||
def _compute_combinations(identifier_list, identifiers_to_objects): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed code using ' File' to find data nodes. Added comments and broke up code to be more readable and understood.
I took a small break from what I got pulled away to work on to address the comments raised when we last met. I added comments and broke up _build_paths_and_indexes() into smaller pieces and removed all of the logic that identified data nodes with 'File'. Let me know if this is still not enough. |
isatools/isatab/dump/write.py
Outdated
elif node.executes_protocol.protocol_type.term.lower() \ | ||
in protocol_types_dict["nucleic acid hybridization"][SYNONYMS]: | ||
columns.extend( | ||
["Hybridization Assay Name", | ||
"Array Design REF"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ptth222: doing the code review, and trying to merge, caused 2 tests to fail.
There are several issues we need to discuss but the PR can not be merged as is:
- we never enter this
elif
at line 320 Hybridization Assay Name
,Array Design REF
are appended with .0 when there is one occurrence only. this prevents the df_dict to retrieve the right key, raising aKeyError
. We suggest a first pass to count the number of headers and only append the process number when there is more than one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why I made that an "elif". I think It's been too long and I can't remember. I found a dataset that uses "nucleic acid hybridization" and used that to test with, so now it should work. I'm not sure what you are talking about with the KeyError. If you have a specific dataset to illustrate that would be helpful.
isatools/isatab/dump/write.py
Outdated
|
||
df_dict[new_oname_label][-1] = node.name | ||
name_label_in_path_counts[oname_label] += 1 | ||
elif node.executes_protocol.protocol_type.term.lower() in \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above, same logic
test_core had to change because the config file in the test data changed.
validate/test_core mtbls1846 lost all of the errors for middle initial being required, but is no longer required. Added validate_first=False to some conversion tests so they run faster. assert_tab_content_equal in utils.py was not checking things correctly. Lists of dataframes cannot be sorted without defining a "key" parameter. I simply removed the sorting and added returning False on an error rather than True.
The only conflict is where I deleted a line in testing that was printing. Can this be resolved? |
@ptth222 there is indeed an outstanding merge conflict but resolving it wouldn't allow to merge still as the PR does not solve the issue. We have not found a fix yet unfortunately |
What issue? I addressed everything mentioned in the history above, and all of the tests pass. If there is an issue can we create a test for it? |
@ptth222 finally managed to find time and issue causing the problem while merging into my local branch. All tests are now passing locally but need to investigate one more case. Apologies for the delay |
@ptth222 closing this as your changes have been integrated in issue-511. |
Rebasing the original branch for this did not go well, so I manually created another one. This PR should replace #510. I was able to get this to pass all of the tests and now multiple ".* File" columns won't cause an issue. We don't need this capability like I thought we did originally, but I figured I would go on and make it work anyway.