Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADNI] Handle reading new format of clinical csv #1016

Merged
merged 18 commits into from
Nov 20, 2023

Conversation

MatthieuJoulot
Copy link
Contributor

@MatthieuJoulot MatthieuJoulot commented Nov 15, 2023

Closes #1002

ADNI has changed the format of the clinical csv, thus breaking the converter. This PR aims at fixing this problem.

@pep8speaks
Copy link

pep8speaks commented Nov 15, 2023

Hello @MatthieuJoulot! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1594:29: W605 invalid escape sequence '\d'
Line 1594:47: W605 invalid escape sequence '\d'
Line 1606:5: E722 do not use bare 'except'

Line 374:18: W605 invalid escape sequence '\d'
Line 374:36: W605 invalid escape sequence '\d'

Comment last updated at 2023-11-17 13:42:50 UTC

adni_merge_path = path.join(clinical_data_dir, "ADNIMERGE.csv")
participants = set(
read_csv(adni_merge_path, sep=",", usecols=["PTID"], squeeze=True).unique()
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one

Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @MatthieuJoulot !
I think this is looking good. I made some suggestions to have clearer error messages.
I think it would be great to add one or two subject/session with the new format to the CI data such that we can test that we are able to handle both formats. WDYT ?

import re
from pathlib import Path

pattern = filename + "(_\d{1,2}[A-Za-z]{3}\d{4})?.csv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we expecting to have exactly one file matching this pattern ?

If yes, I think we should verify this and raise if more than one file was found. Currently the function returns the last one found. Maybe something like this:

files_matching_pattern = [
    f for f in Path(clinical_dir).rglob("*.csv") if re.search(pattern, (z.name))
]
if len(files_matching_pattern) != 1:
    raise IOError(
        f"Expecting to find exactly one file in folder {clinical_dir} "
        f"matching pattern {pattern}. {len(files_matching_pattern)} "
        f"files were found instead : {'\n- '.join(files_matching_pattern)}"
    )
try:
    return ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you download the data twice, yes, we expect only one file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done !

try:
return pd.read_csv(adni_merge_path, sep=",", low_memory=False)
except:
raise ValueError(f"{filename}.csv was not found. Please check your data.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be a bit more specific here. If you get to this line, the file exists but the reading as a CSV file failed:

raise ValueError(
    f"File {adni_merge_path} was found but could not "
    "be loaded as a DataFrame. Please check your data."
)

Copy link
Contributor Author

@MatthieuJoulot MatthieuJoulot Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not what happens. The file is actually not found (because of the name change), and it errors because adni_merge_path is therefore not defined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, if you implement the suggestion above, the file has to exist otherwise an error would have been raised before. So you would get to this line only if the CSV file wasn't formatted as expected. Right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I have been reading from bottom to top, I'll come back afterwards

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as well, works better reading from the top


file_to_read_path = path.join(clinical_data_dir, location)
cprint(f"\tReading clinical data file: {location}")
pattern = location.split(".")[0] + "(_\d{1,2}[A-Za-z]{3}\d{4})?.csv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use your load_clinical_csv() function here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must have missed that one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha yes, actually there was. The way this was built before I changed things was different and I did not see how to use load_clinical_csv

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something, but I don't understand why you couldn't do:

for location in files:
    location = location.split("/")[0]
    df_file = load_clinical_csv(clinical_data_dir, location.split(".")[0])
    df_filtered = filter_subj_bids(df_file, location, bids_ids).copy()
    ...

Copy link
Contributor Author

@MatthieuJoulot MatthieuJoulot Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it looks like this should work. I did not use it, because this part of the code needs to be able not to find a file and this suggestion would make the code crash. Maybe we can work something out though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I nested this part in a try. I'm not fond of it, but it works. WDYT

clinica/iotools/converters/adni_to_bids/adni_utils.py Outdated Show resolved Hide resolved

file_to_read_path = path.join(clinical_data_dir, location)
cprint(f"\tReading clinical data file: {location}")
pattern = location.split(".")[0] + "(_\d{1,2}[A-Za-z]{3}\d{4})?.csv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something, but I don't understand why you couldn't do:

for location in files:
    location = location.split("/")[0]
    df_file = load_clinical_csv(clinical_data_dir, location.split(".")[0])
    df_filtered = filter_subj_bids(df_file, location, bids_ids).copy()
    ...

cprint(f"\tReading clinical data file: {location}")

df_file = pd.read_csv(file_to_read_path, dtype=str)
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put as few lines as possible in the try-catch and catch explicit errors. In this case, we need to catch the IOError which happens when the number of found files isn't exactly one. It could be a good idea to give a warning and continue the loop. WDYT ?

try:
    df_file = load_clinical_csv(clinical_data_dir, location.split(".")[0])
except IOError as e:
    warnings.warn(e)
    continue
 df_filtered = ...

@MatthieuJoulot MatthieuJoulot marked this pull request as ready for review November 16, 2023 14:48
Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @MatthieuJoulot !

Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MatthieuJoulot !
Just a small suggestion on removing parametrization for tests using a single value.
LGTM otherwise !

assert_frame_equal(load_clinical_csv(tmp_path, csv_to_look_for), input_df)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the parametrization if you have only one value.

remove csv_to_look_for from the arguments of the test function and simply use "adnimerge" in the test function's body.

load_clinical_csv(tmp_path, csv_to_look_for)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here

with open(tmp_path / "adnimerge.csv", "w") as fp:
fp.write("col1,col2,col3\n1,2,3\n1,2,3,4")

# input_df.to_csv(tmp_path / csv_name, sep="\t", index=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# input_df.to_csv(tmp_path / csv_name, sep="\t", index=False)

Copy link
Member

@NicolasGensollen NicolasGensollen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the linter is unhappy... Could you format ?

input_df.to_csv(tmp_path / csv_name, index=False)
assert_frame_equal(load_clinical_csv(tmp_path, csv_to_look_for), input_df)

def test_load_clinical_csv_error(tmp_path, ):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_load_clinical_csv_error(tmp_path, ):
def test_load_clinical_csv_error(tmp_path):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange I thought I did. I will again then

@NicolasGensollen NicolasGensollen merged commit 21f179b into aramis-lab:dev Nov 20, 2023
16 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ADNI converter unable to handle new csv name
3 participants