Add tracking to data file type column names 2. #553

ptth222 · 2024-03-17T01:02:25Z

Rebasing the original branch for this did not go well, so I manually created another one. This PR should replace #510. I was able to get this to pass all of the tests and now multiple ".* File" columns won't cause an issue. We don't need this capability like I thought we did originally, but I figured I would go on and make it work anyway.

Testing that the changes fix what was raised in ISA-tools#509.

coveralls · 2024-03-17T01:25:47Z

coverage: 81.341% (+0.08%) from 81.257%
when pulling e50e65e on ptth222:fix-data-file-name-bug2
into 7d5f19f on ISA-tools:issue-511.

proccaserra · 2024-03-18T21:34:41Z

as discussed today:

The ISA-Tab specification allows only one "Raw Data File" per Assay Table.
"Assay Name" is needed because a "Raw Data File" field cannot stand alone. To ensure "Assay Name" is output at serialization time, the protocol type should be "data acquisition" or a synonym that has been added to the ./resources/yaml/protocol-types.yml file.
Similarly, "Derived Data File" cannot stand alone and should be associated with a "Data Transformation Name" field, which is output when serializing to ISA-Tab if a protocol type "data transformation" is found, or a string declared as a synonym of "data transformation" in the ./resources/yaml/protocol-types.yml file.
There can be more than one occurrence of "Derived Data File".
"Assay Name" is the default header associated with "Raw Data File", but specializations exist:

"NMR Assay Name" with "Free Induction Decay Data File" with 'protocol type' = "NMR spectroscopy" (or declared synonyms)
"MS Assay Name" with "Raw Spectral Data File" with 'protocol type' = "mass spectrometry" (or declared synonyms)
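
The pairing rules listed above could be summarized as a small lookup table. This is only a sketch under stated assumptions, not the isatools implementation: `HEADERS_BY_PROTOCOL_TYPE` and `headers_for` are hypothetical names, and synonym handling (which the library reads from protocol-types.yml) is reduced to a plain dict supplied by the caller.

```python
# Hypothetical lookup table summarizing the rules discussed above.
# Maps a canonical protocol type to (name header, data file header).
HEADERS_BY_PROTOCOL_TYPE = {
    "data acquisition": ("Assay Name", "Raw Data File"),
    "NMR spectroscopy": ("NMR Assay Name", "Free Induction Decay Data File"),
    "mass spectrometry": ("MS Assay Name", "Raw Spectral Data File"),
    "data transformation": ("Data Transformation Name", "Derived Data File"),
}


def headers_for(protocol_type, synonyms=None):
    """Return (name header, file header) for a protocol type, or None.

    `synonyms` is an assumed mapping of synonym -> canonical type; the real
    synonyms live in protocol-types.yml and are not reproduced here.
    """
    synonyms = synonyms or {}
    wanted = synonyms.get(protocol_type, protocol_type).lower()
    for canonical, headers in HEADERS_BY_PROTOCOL_TYPE.items():
        if canonical.lower() == wanted:
            return headers
    return None
```

A protocol type that is neither listed nor declared as a synonym returns None, which mirrors the point that the name header is only output for known protocol types.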

Changed the added test to be more realistic. Had to track name columns like Data Transformation Name so values would be handled correctly.

ptth222 · 2024-03-21T08:07:11Z

I did an example more like what was discussed in the meeting. I also had to add tracking for the columns that are added for known protocol types. The problem this PR solves is that for any ".* File" column, if there is more than one, the values don't get written out correctly. Solving this wasn't as easy as it was for a similar issue with Protocol REF columns, due to how graph nodes were previously handled.

Old Table Output:

Sample Name	Protocol REF	MS Assay Name	Raw Data File	Protocol REF	Data Transformation Name	Derived Data File	Protocol REF	Data Transformation Name	Derived Data File
sample1	protocol1	process1	datafile1.raw	protocol2	process3	datafile3.raw	protocol3	process3	datafile3.raw
sample2	protocol1	process4	datafile4.raw	protocol3	process5	datafile5.raw		process5	datafile5.raw

New Table Output:

Sample Name	Protocol REF	MS Assay Name	Raw Data File	Protocol REF	Data Transformation Name	Derived Data File	Protocol REF	Data Transformation Name	Derived Data File
sample1	protocol1	process1	datafile1.raw	protocol2	process2	datafile2.raw	protocol3	process3	datafile3.raw
sample2	protocol1	process4	datafile4.raw	protocol3	process5	datafile5.raw
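
One way to picture the difference between the two outputs: when the same header appears more than once in a path, each occurrence needs its own key, otherwise a later value overwrites an earlier one. The following is a minimal sketch of that idea, assuming a path is a list of (header, value) pairs. The function `row_from_path` and the ".N" key suffix are illustrative assumptions, not the code in write.py.

```python
from collections import defaultdict


def row_from_path(path):
    """Build one output row from a list of (header, value) pairs.

    Repeated headers get an occurrence index appended to the key
    ("Derived Data File.0", "Derived Data File.1", ...), so the second
    "Derived Data File" cannot overwrite the first.
    """
    seen = defaultdict(int)  # header -> occurrences so far in this path
    row = {}
    for header, value in path:
        row[f"{header}.{seen[header]}"] = value
        seen[header] += 1
    return row


# Values taken from the sample1 row of the table above.
path = [
    ("Sample Name", "sample1"),
    ("Raw Data File", "datafile1.raw"),
    ("Derived Data File", "datafile2.raw"),
    ("Derived Data File", "datafile3.raw"),
]
row = row_from_path(path)
```

With a plain dict keyed only by header, both derived files would collapse into a single entry, which is the kind of duplication visible in the old table.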

proccaserra · 2024-03-22T11:03:30Z

PRS to test this situation, i.e., several files generated by one data acquisition:

Sample Name	Protocol REF	MS Assay Name	Raw Data File	Protocol REF	Data Transformation Name	Derived Data File	Protocol REF	Data Transformation Name	Derived Data File
sample1	protocol1	process1	datafile1.raw	protocol2	process2	datafile2.raw	protocol3	process3	datafile3.raw
sample1	protocol1	process1	datafile4.raw	protocol2	process2	datafile2.raw

proccaserra · 2024-04-12T10:54:36Z

isatools/isatab/load/ProcessSequenceFactory.py

@@ -260,7 +263,7 @@ def get_node_by_label_and_key(labl, this_key):
                            fv_set.add(fv)
                            material.factor_values = list(fv_set)

-            elif object_label in _LABELS_DATA_NODES:
+            elif object_label in _LABELS_DATA_NODES or ' File' in object_label:


object_label="foo File" would cause an issue

I don't think "foo File" would validate.

There were already other instances of patterns like ' File' in or endswith('File') where "foo File" would cause an issue. I just made things consistent.

When this change was originally made, not every File column was in _LABELS_DATA_NODES.

Having a list of specific acceptable file names is pretty fragile anyway, and I would have generalized to columns ending in " File" a while ago, or to a pattern like the one "Protocol REF" uses.

After discussing this in the meeting, I will remove this and make the code always look only in _LABELS_DATA_NODES.
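
The trade-off discussed in this thread can be illustrated with a small comparison. This is a sketch only: `LABELS_DATA_NODES` below is a shortened, assumed stand-in for the library's _LABELS_DATA_NODES, and both helper functions are hypothetical.

```python
# Assumed, shortened stand-in for the library's _LABELS_DATA_NODES.
LABELS_DATA_NODES = frozenset({
    "Raw Data File",
    "Derived Data File",
    "Raw Spectral Data File",
})


def is_data_node_by_substring(label):
    """Permissive check: any header containing ' File' counts."""
    return " File" in label


def is_data_node_by_membership(label):
    """Strict check: only headers in the known set count."""
    return label in LABELS_DATA_NODES
```

The substring check accepts "foo File", while the membership check rejects it. The price of the strict check is that every valid file header has to be present in the set.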

proccaserra · 2024-04-19T14:22:04Z

isatools/model/utils.py

+def _build_paths_and_indexes(process_sequence=None):
+    """Returns the paths from source/sample to end points and a mapping of sequence_identifier to object."""
+
+    def _compute_combinations(identifier_list, identifiers_to_objects):


@ptth222 please refactor to avoid nested function and revisit the nested for loops before @terazus can review.

Removed code using ' File' to find data nodes. Added comments and broke up code to be more readable and understood.

ptth222 · 2024-05-20T21:45:50Z

I took a small break from what I got pulled away to work on to address the comments raised when we last met. I added comments and broke up _build_paths_and_indexes() into smaller pieces and removed all of the logic that identified data nodes with 'File'. Let me know if this is still not enough.
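
For readers unfamiliar with what a helper like _build_paths_and_indexes() has to produce, the general idea of enumerating paths from start nodes to end points can be sketched with module-level functions and no nesting. This is not the isatools code: the graph is assumed to be a plain dict of node identifier to successor identifiers, and `find_start_nodes` and `build_paths` are hypothetical names.

```python
def find_start_nodes(graph):
    """Return nodes that never appear as a successor (e.g. sources/samples)."""
    targets = {t for successors in graph.values() for t in successors}
    return [n for n in graph if n not in targets]


def build_paths(graph):
    """Return every path from a start node to an end point.

    Assumes the graph is acyclic; a cycle would make this loop forever.
    """
    paths = []
    stack = [[start] for start in find_start_nodes(graph)]
    while stack:
        path = stack.pop()
        successors = graph.get(path[-1], [])
        if not successors:
            # No outgoing edges: this node is an end point.
            paths.append(path)
            continue
        for nxt in successors:
            stack.append(path + [nxt])
    return paths


# One acquisition producing two files, as in the test case requested earlier.
graph = {
    "sample1": ["process1"],
    "process1": ["datafile1.raw", "datafile4.raw"],
}
paths = build_paths(graph)
```

Each branch yields its own path, so one sample with two raw files produces two rows, which is the shape of the table proposed for testing.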

proccaserra · 2024-05-31T09:52:21Z

isatools/isatab/dump/write.py

+                        elif node.executes_protocol.protocol_type.term.lower() \
+                                in protocol_types_dict["nucleic acid hybridization"][SYNONYMS]:
+                            columns.extend(
+                                ["Hybridization Assay Name",
+                                 "Array Design REF"])


@ptth222: doing the code review, and trying to merge, caused 2 tests to fail.
There are several issues we need to discuss but the PR can not be merged as is:

we never enter this elif at line 320

"Hybridization Assay Name" and "Array Design REF" are appended with .0 even when there is only one occurrence. This prevents the right key from being retrieved from df_dict, raising a KeyError. We suggest a first pass to count the number of headers, appending the process number only when there is more than one.
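
The two-pass suggestion could look something like the sketch below. It is only an illustration of the idea: `label_headers` is a hypothetical function, and the ".N" suffix format is assumed from the ".0" mentioned in this comment.

```python
from collections import Counter


def label_headers(headers):
    """Suffix a header with its occurrence index only if it repeats."""
    totals = Counter(headers)  # first pass: count each header
    seen = Counter()           # occurrences labelled so far
    labelled = []
    for header in headers:     # second pass: assign labels
        if totals[header] > 1:
            labelled.append(f"{header}.{seen[header]}")
            seen[header] += 1
        else:
            labelled.append(header)
    return labelled
```

Headers that occur once keep their plain name, so a lookup by "Hybridization Assay Name" still finds its key.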

I'm not sure why I made that an "elif". It's been too long and I can't remember. I found a dataset that uses "nucleic acid hybridization" and used that to test with, so now it should work. I'm not sure what you mean about the KeyError. If you have a specific dataset that illustrates it, that would be helpful.

proccaserra · 2024-05-31T09:52:43Z

isatools/isatab/dump/write.py

+
+                                df_dict[new_oname_label][-1] = node.name
+                                name_label_in_path_counts[oname_label] += 1
+                            elif node.executes_protocol.protocol_type.term.lower() in \


see comment above, same logic

test_core had to change because the config file in the test data changed.

In validate/test_core, mtbls1846 lost all of the errors about the middle initial being required, since it is no longer required. Added validate_first=False to some conversion tests so they run faster. assert_tab_content_equal in utils.py was not checking things correctly: lists of dataframes cannot be sorted without defining a "key" parameter. I simply removed the sorting and made it return False on an error rather than True.

ptth222 · 2024-06-28T18:44:24Z

The only conflict is where I deleted a line in testing that was printing. Can this be resolved?

proccaserra · 2024-07-01T17:47:47Z

> The only conflict is where I deleted a line in testing that was printing. Can this be resolved?

@ptth222 there is indeed an outstanding merge conflict, but resolving it still wouldn't allow us to merge, as the PR does not solve the issue. Unfortunately, we have not found a fix yet.

ptth222 · 2024-07-01T22:01:01Z

What issue? I addressed everything mentioned in the history above, and all of the tests pass. If there is an issue can we create a test for it?

proccaserra · 2024-07-09T10:34:49Z

@ptth222 I finally managed to find the time, and the issue causing the problem, while merging into my local branch. All tests are now passing locally, but I need to investigate one more case. Apologies for the delay.

proccaserra · 2024-07-15T11:37:59Z

@ptth222 closing this as your changes have been integrated in issue-511.

ptth222 added 2 commits March 16, 2024 19:41

Changed how paths, nodes, and indexes are handled

2cb095b

Added new test

ca12fe4

ptth222 mentioned this pull request Mar 17, 2024

Add tracking to data file type column names. #510

Closed

ptth222 mentioned this pull request Mar 20, 2024

Protocol types refactor #556

Merged

Name headers also tracked

e50e65e

proccaserra reviewed Apr 12, 2024

proccaserra reviewed Apr 19, 2024

proccaserra assigned knirirr May 17, 2024

proccaserra requested a review from knirirr May 17, 2024 13:16

Addressed comments in ISA-tools#553

bc33a2f

Merge branch 'issue-511' into fix-data-file-name-bug2

ed25daa

proccaserra reviewed May 31, 2024

ptth222 added 3 commits June 3, 2024 05:12

Changed write based on comments in ISA-tools#553

ad22cae

Removed commented out code

770db3c

proccaserra added a commit that referenced this pull request Jul 9, 2024

fixes issue-511 and incorporates code from @ptth222 from PR #553'

84942ce

proccaserra closed this Jul 15, 2024

This was referenced Jul 22, 2024

rewrite test for test_get_ontology in test_isatools_utils.py #528

Closed

Comments[] attached to assays aren't serialized #511

Closed

proccaserra mentioned this pull request Jul 22, 2024

OLS3->OLS4 update #506

Closed
