Problem creating JSONL file with newick_to_taxonium #576

Open
NicolaDM opened this issue Feb 26, 2024 · 8 comments

Comments

@NicolaDM

Hi, I have a tree with 3M tips that seems too large for the browser version of Taxonium, so I am trying to visualize it locally.
I want to visualize mutations on the tree, and for this I have either a Nexus tree or Newick + TSV metadata.
This worked fine for trees with 1M tips in the browser version of Taxonium.
Now, however, I have to convert the tree to JSONL to run the Taxonium desktop app.
When I run

newick_to_taxonium -i 3M_tree.tree -m 3M_metaData.tsv -o 3M_tree.jsonl -c mutationsInf,errors,Ns

I get the following error message:

File "/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 1302, in validate_usecols_names
    raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['strain']

Do certain columns have to be present in the metadata file to make this work? My data is from MAPLE, not UShER, so the headings of the metadata file are different. Is this an issue?

@theosanderson
Owner

theosanderson commented Feb 26, 2024

Hi @NicolaDM, if you are supplying metadata, we need to know how to match the metadata rows to the node names in the tree. You need either to have the node names in a column called strain, or to supply a --key_column myAlternativeColumnName parameter naming another column to use.
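
For anyone hitting the same error, here is a minimal sketch of how to check which columns a metadata file offers before choosing a key column. It uses pandas; the helper name `metadata_columns` is made up for this illustration and is not part of Taxonium, and the header is the one from the metadata file in this issue, inlined as a string.

```python
import io
import pandas as pd

def metadata_columns(handle, sep="\t"):
    """Return the column names of a delimited file (hypothetical helper).

    nrows=0 makes pandas parse only the header row.
    """
    return list(pd.read_csv(handle, sep=sep, nrows=0).columns)

# Header of the metadata file discussed in this issue, inlined for the example.
header = "node\tcollapsedTo\tmutationsInf\tNs\terrors\n"
columns = metadata_columns(io.StringIO(header))

default_key = "strain"  # column name expected when --key_column is not given
if default_key not in columns:
    print(f"No '{default_key}' column; candidates for --key_column: {columns}")
```

In this file the node names sit in the node column, so that is the natural value to pass to --key_column.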

@NicolaDM
Author

Thank you very much Theo, that works!
Now I get this error though:

ValueError: Error: The key column 'node' contains non-unique values in the metadata file.

This happens even though the node names are unique in my file. Is it because names are not allowed to contain each other (e.g. a node name should not be a prefix of another node name)?

@theosanderson
Owner

I really suspect your metadata file does have genuinely duplicated entries in the key column - there isn't any complex logic in the code on prefixes or anything. Feel free to email the file if helpful.
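
One way to test for duplicated entries directly is sketched below with pandas. The helper `duplicated_keys` and the sample rows are invented for illustration; this is not the check that newick_to_taxonium itself runs.

```python
import io
import pandas as pd

def duplicated_keys(df, key_column):
    """Return the distinct values that appear more than once in key_column.

    Hypothetical helper. pandas treats missing values (NaN) as equal to each
    other here, so a column that is entirely empty also counts as duplicated.
    """
    column = df[key_column]
    repeated = column[column.duplicated(keep=False)]
    return repeated.drop_duplicates().tolist()

# Made-up metadata with one repeated node name.
tsv = "node\tNs\nsampleA\t1-10\nsampleB\t\nsampleA\t20-30\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(duplicated_keys(df, "node"))
```

If this comes back empty for the raw names but the tool still complains, it is worth inspecting what pandas actually parsed into the key column.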

@theosanderson
Owner

Hi Nicola,

How are you making your TSV?

Here I have replaced the tabs with pipes for clarity:

node|collapsedTo|mutationsInf|Ns|errors
SRR11578335|||5297-5586,22878-23144||

You'll see that the first line has 4 pipes (tabs) but the second has 5 pipes (tabs). These should be equal in a normal TSV.
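
A quick way to find such rows is to compare the field count of every line against the header. The sketch below uses only the standard library; the function name `check_field_counts` is an assumption made for this example, and the two lines are the ones shown above with the pipes turned back into tabs.

```python
def check_field_counts(lines, sep="\t"):
    """Compare each row's field count with the header's (hypothetical helper).

    Returns (expected, bad), where bad lists (line_number, n_fields) for
    every data row whose field count differs from the header's.
    """
    rows = iter(lines)
    expected = next(rows).rstrip("\r\n").count(sep) + 1
    bad = []
    for number, line in enumerate(rows, start=2):
        n_fields = line.rstrip("\r\n").count(sep) + 1
        if n_fields != expected:
            bad.append((number, n_fields))
    return expected, bad

sample = [
    "node\tcollapsedTo\tmutationsInf\tNs\terrors\n",
    "SRR11578335\t\t\t5297-5586,22878-23144\t\t\n",
]
expected, bad = check_field_counts(sample)
print(expected, bad)
```

For a real file, passing the open file object works the same way, since it iterates line by line.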

As a result, pandas is assuming that SRR11578335 here is not a value in the node column but belongs to an extra index column. I can fix this by setting index_col=False, which will probably cause less confusion in general, but you might also want to look at the TSV generation script.
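
The effect can be reproduced in a few lines. This is only a sketch of the pandas behaviour, not the actual newick_to_taxonium code; the header and first row are the ones above, and the second row (`sampleB`) is made up so that there is more than one key to compare.

```python
import io
import pandas as pd

# Every data row ends with one extra tab, so it has 6 fields against 5 headers.
tsv = (
    "node\tcollapsedTo\tmutationsInf\tNs\terrors\n"
    "SRR11578335\t\t\t5297-5586,22878-23144\t\t\n"
    "sampleB\t\t\t\t\t\n"  # made-up second row
)

# Default: pandas takes the first field of each row as an implicit index,
# so the remaining fields shift left and "node" ends up empty (NaN).
default_df = pd.read_csv(io.StringIO(tsv), sep="\t")
print(default_df.index.tolist())
print(default_df["node"].is_unique)

# index_col=False: the first field stays in "node" and the surplus
# trailing field is dropped (recent pandas versions may emit a warning).
fixed_df = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=False)
print(fixed_df["node"].tolist())
print(fixed_df["node"].is_unique)
```

With the default settings the node column holds only missing values, which is why it is reported as non-unique even though the names themselves are distinct.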

@NicolaDM
Author

I see - indeed I carelessly added an extra tab at the end of each non-title row. Thanks, I'll fix it!

@NicolaDM
Author

Indeed I confirm this fixed the problem and now it works - thanks again!

@theosanderson
Owner

Great!

@theosanderson
Owner

Reopening to consider a change that avoids the same confusion in future.

@theosanderson theosanderson reopened this Feb 27, 2024