Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tracking to data file type column names 2. #553

Closed
wants to merge 8 commits into from

Conversation

ptth222
Copy link

@ptth222 ptth222 commented Mar 17, 2024

Rebasing the original branch for this did not go well, so I manually created another one. This PR should replace #510. I was able to get this to pass all of the tests and now multiple ".* File" columns won't cause an issue. We don't need this capability like I thought we did originally, but I figured I would go on and make it work anyway.

@coveralls
Copy link

coveralls commented Mar 17, 2024

Coverage Status

coverage: 81.341% (+0.08%) from 81.257%
when pulling e50e65e on ptth222:fix-data-file-name-bug2
into 7d5f19f on ISA-tools:issue-511.

@proccaserra
Copy link
Member

as discussed today:

  • ISA Tab specification allow only 1 "Raw Data File" per Assay Table
  • "Assay Name" is needed as "Raw Data File" field can not stand alone. To ensure output "Assay Name" is output at serialization time, the protocol type should be data acquisition or a synonym has been added to the ./resources/yaml/protocol-types.yml" file
  • similarly, "Derived Data File" can not stand alone and should be associated with a "Data Transformation Name" field, which is output when serializing to ISA-Tab if a protocol type "data transformation" is found or a string declared as synonym to "data transformation" in the ./resources/yaml/protocol-types.yml" file
  • There can be more than one occurrence of "Derived Data File"
  • "Assay Name" header is the default value found associated with "Raw Data File", but 'specialization' exist:

"NMR Assay Name" with "Free Induction Decay Data File" with 'protocol type' = "NMR spectroscopy" (or declared synonyms)
"MS Assay Name" with "Raw Spectral Data File" with 'protocol type' = "mass spectrometry" (or declared synonyms)

@ptth222 ptth222 mentioned this pull request Mar 20, 2024
Changed the added test to be more realistic. Had to track name columns like Data Transformation Name so values would be handled correctly.
@ptth222
Copy link
Author

ptth222 commented Mar 21, 2024

I did an example more like what was discussed in the meeting. I had to also add tracking for the columns that are added for known protocol types. The problem that this PR solves is that for any ".* File" column, if there is more than 1 then the values don't get written out correctly. Solving this wasn't as easy as it was for a similar issue with Protocol REF columns. Due to how graph nodes were previously handled.

Old Table Output:

Sample Name Protocol REF MS Assay Name Raw Data File Protocol REF Data Transformation Name Derived Data File Protocol REF Data Transformation Name Derived Data File
sample1 protocol1 process1 datafile1.raw protocol2 process3 datafile3.raw protocol3 process3 datafile3.raw
sample2 protocol1 process4 datafile4.raw protocol3 process5 datafile5.raw process5 datafile5.raw

New Table Output:

Sample Name Protocol REF MS Assay Name Raw Data File Protocol REF Data Transformation Name Derived Data File Protocol REF Data Transformation Name Derived Data File
sample1 protocol1 process1 datafile1.raw protocol2 process2 datafile2.raw protocol3 process3 datafile3.raw
sample2 protocol1 process4 datafile4.raw protocol3 process5 datafile5.raw

@proccaserra
Copy link
Member

PRS to test this situation, i.e., several files generated by one data acquisition:

Sample Name Protocol REF MS Assay Name Raw Data File Protocol REF Data Transformation Name Derived Data File Protocol REF Data Transformation Name Derived Data File
sample1 protocol1 process1 datafile1.raw protocol2 process2 datafile2.raw protocol3 process3 datafile3.raw
sample1 protocol1 process1 datafile4.raw protocol2 process2 datafile2.raw

@@ -260,7 +263,7 @@ def get_node_by_label_and_key(labl, this_key):
fv_set.add(fv)
material.factor_values = list(fv_set)

elif object_label in _LABELS_DATA_NODES:
elif object_label in _LABELS_DATA_NODES or ' File' in object_label:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object_label="foo File" would cause issue

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I don't think "foo File" would validate.
  2. There are other instances of patterns like ' File' in or endswith('File') where "foo File" would cause an issue that were already present. I just made things consistent.
  3. When this change was originally made not every File column was in _LABELS_DATA_NODES.
  4. Having a list of specific acceptable file names is pretty fragile anyway and I would have generalized to columns ending in " File" a while ago, or a pattern like what "Protocol REF" does.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing in meeting. I will remove this and make the code always just look in _LABELS_DATA_NODES.

def _build_paths_and_indexes(process_sequence=None):
"""Returns the paths from source/sample to end points and a mapping of sequence_identifier to object."""

def _compute_combinations(identifier_list, identifiers_to_objects):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptth222 please refactor to avoid nested function and revisit the nested for loops before @terazus can review.

Removed code using ' File' to find data nodes. Added comments and broke up code to be more readable and understood.
@ptth222
Copy link
Author

ptth222 commented May 20, 2024

I took a small break from what I got pulled away to work on to address the comments raised when we last met. I added comments and broke up _build_paths_and_indexes() into smaller pieces and removed all of the logic that identified data nodes with 'File'. Let me know if this is still not enough.

Comment on lines 320 to 324
elif node.executes_protocol.protocol_type.term.lower() \
in protocol_types_dict["nucleic acid hybridization"][SYNONYMS]:
columns.extend(
["Hybridization Assay Name",
"Array Design REF"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ptth222: doing the code review, and trying to merge, caused 2 tests to fail.
There are several issues we need to discuss but the PR can not be merged as is:

  • we never enter this elif at line 320
  • Hybridization Assay Name, Array Design REF are appended with .0 when there is one occurrence only. this prevents the df_dict to retrieve the right key, raising a KeyError. We suggest a first pass to count the number of headers and only append the process number when there is more than one.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why I made that an "elif". I think It's been too long and I can't remember. I found a dataset that uses "nucleic acid hybridization" and used that to test with, so now it should work. I'm not sure what you are talking about with the KeyError. If you have a specific dataset to illustrate that would be helpful.


df_dict[new_oname_label][-1] = node.name
name_label_in_path_counts[oname_label] += 1
elif node.executes_protocol.protocol_type.term.lower() in \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above, same logic

test_core had to change because the config file in the test data changed.
validate/test_core mtbls1846 lost all of the errors for middle initial being required, but is no longer required. Added validate_first=False to some conversion tests so they run faster. assert_tab_content_equal in utils.py was not checking things correctly. Lists of dataframes cannot be sorted without defining a "key" parameter. I simply removed the sorting and added returning False on an error rather than True.
@ptth222
Copy link
Author

ptth222 commented Jun 28, 2024

The only conflict is where I deleted a line in testing that was printing. Can this be resolved?

@proccaserra
Copy link
Member

The only conflict is where I deleted a line in testing that was printing. Can this be resolved?

@ptth222 there is indeed an outstanding merge conflict but resolving it wouldn't allow to merge still as the PR does not solve the issue. We have not found a fix yet unfortunately

@ptth222
Copy link
Author

ptth222 commented Jul 1, 2024

What issue? I addressed everything mentioned in the history above, and all of the tests pass. If there is an issue can we create a test for it?

@proccaserra
Copy link
Member

@ptth222 finally managed to find time and issue causing the problem while merging into my local branch. All tests are now passing locally but need to investigate one more case. Apologies for the delay

@proccaserra
Copy link
Member

@ptth222 closing this as your changes have been integrated in issue-511.

@proccaserra proccaserra mentioned this pull request Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants