Allow import and export of sequences with three letter amino acid codes #5556

Closed
ljubica-milovic opened this issue Sep 20, 2024 · 1 comment · Fixed by #5889
Labels: Export to Sequence (Bucket: Bugs related to Export to Sequence mode), Test Cases Written (Test cases have been written for this issue)

Comments

@ljubica-milovic
Collaborator

ljubica-milovic commented Sep 20, 2024

Background

In addition to the single-letter codes for amino acids, there are standard three-letter codes that some biochemists and molecular biologists prefer.
This ticket covers only the import and export of amino acid sequences (sequence format) with three-letter codes. Displaying three-letter codes on the canvas is out of scope.

Three letter codes

| Amino acid(s) name | Single-letter code | Three-letter code | Note |
| --- | --- | --- | --- |
| Alanine | A | Ala | |
| Aspartic acid or Asparagine | B | Asx | The "alternatives" ambiguous amino acid. |
| Cysteine | C | Cys | |
| Aspartic acid | D | Asp | |
| Glutamic acid | E | Glu | |
| Phenylalanine | F | Phe | |
| Glycine | G | Gly | |
| Histidine | H | His | |
| Isoleucine | I | Ile | |
| Leucine or Isoleucine | J | Xle | The "alternatives" ambiguous amino acid. |
| Lysine | K | Lys | |
| Leucine | L | Leu | |
| Methionine | M | Met | |
| Asparagine | N | Asn | |
| Pyrrolysine | O | Pyl | |
| Proline | P | Pro | |
| Glutamine | Q | Gln | |
| Arginine | R | Arg | |
| Serine | S | Ser | |
| Threonine | T | Thr | |
| Selenocysteine | U | Sec | |
| Valine | V | Val | |
| Tryptophan | W | Trp | |
| Any amino acid | X | Xaa | The "alternatives" ambiguous amino acid. |
| Tyrosine | Y | Tyr | |
| Glutamine or Glutamic acid | Z | Glx | The "alternatives" ambiguous amino acid. |
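
For reference, the table above could be captured as a pair of lookup dictionaries. This is only a sketch: the names `ONE_TO_THREE`, `THREE_TO_ONE`, and `STANDARD_AMBIGUOUS` are illustrative and are not taken from the Ketcher or Indigo code base.

```python
# Three-letter codes keyed by single-letter code, transcribed from the table above.
# All names here are hypothetical, chosen for illustration only.
ONE_TO_THREE = {
    "A": "Ala", "B": "Asx", "C": "Cys", "D": "Asp", "E": "Glu",
    "F": "Phe", "G": "Gly", "H": "His", "I": "Ile", "J": "Xle",
    "K": "Lys", "L": "Leu", "M": "Met", "N": "Asn", "O": "Pyl",
    "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr",
    "U": "Sec", "V": "Val", "W": "Trp", "X": "Xaa", "Y": "Tyr",
    "Z": "Glx",
}

# Reverse lookup used when importing three-letter sequences.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}

# The four entries the table marks as "alternatives" ambiguous amino acids.
STANDARD_AMBIGUOUS = {"B", "J", "X", "Z"}
```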

Requirements

1. Import logic described below:

1.1. On import/paste from clipboard an additional drop down menu should appear if "Sequence" and then "Peptide" is selected. The options in that menu should be "1-letter code" and "3-letter code".

Image for better understanding:
image

1.2. A valid input string may contain only the 52 uppercase and lowercase letters of the English alphabet (26 × 2), spaces, and line breaks.

1.3. Spaces should be interpreted as separating different sequences.

AlaAla CysCys
should result in:
image
(same behavior for current one letter sequence import)

1.4. Line breaks should be ignored.

AlaAla
CysCys
should result in:
image
(same behavior for current one letter sequence import)
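
Requirements 1.3 and 1.4 together could be sketched as follows. The function name `split_sequences` is hypothetical. Two points are assumptions not stated in the requirements: both `\n` and `\r` are treated as line breaks, and runs of consecutive spaces do not produce empty sequences.

```python
def split_sequences(text: str) -> list[str]:
    """Split pasted text into separate sequences.

    Line breaks are dropped first (requirement 1.4), then the remaining
    text is split on spaces (requirement 1.3).
    """
    # Assumption: "\r" counts as a line break alongside "\n".
    joined = text.replace("\r", "").replace("\n", "")
    # Assumption: empty pieces caused by repeated spaces are skipped.
    return [part for part in joined.split(" ") if part]
```

With this sketch, `split_sequences("AlaAla CysCys")` yields two sequences, while `split_sequences("AlaAla\nCysCys")` yields the single sequence `AlaAlaCysCys`.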

1.5. If an invalid symbol is used an error message should appear.

image
(same behavior for current one letter sequence import)
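
A minimal sketch of the character check implied by requirements 1.2 and 1.5. The helper name `invalid_symbols` is hypothetical, and treating `\r` as a line break is an assumption.

```python
import string

# Requirement 1.2: English letters in both cases, spaces, and line breaks.
# Assumption: "\r" is accepted as part of a line break.
ALLOWED_SYMBOLS = set(string.ascii_letters) | {" ", "\n", "\r"}


def invalid_symbols(text: str) -> list[str]:
    """Return the distinct symbols in text that requirement 1.2 does not allow."""
    return sorted({ch for ch in text if ch not in ALLOWED_SYMBOLS})
```

An empty result means the input passes this check; a non-empty result would trigger the error message from requirement 1.5.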

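1.6. Within one sequence, every letter at position 3n+1 (the 1st, 4th, 7th, and so on), that is, the first letter of each triplet, has to be uppercase, and the other two letters of the triplet lowercase.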

AlaAlaCysCys is valid
AlaAl
aCysCys is valid (ignoring line breaks - requirement 1.4)
AlaAla CysCys is valid (two sequences - requirement 1.3)
AlAalaCysCys is not valid (third letter is uppercase and the fourth one is not)

1.7. If requirement 1.6. is not fulfilled an error message should appear, with a title of "Incorrect Formatting" and text "Given string cannot be interpreted as a valid three letter sequence because of incorrect formatting."
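
The formatting rule in 1.6 and the error in 1.7 could be checked with a regular expression, as sketched below. The function name and constants are hypothetical. Rejecting a sequence whose length is not a multiple of three is an assumption, since the requirements do not say which error applies in that case.

```python
import re

# Error title and text quoted from requirement 1.7.
FORMAT_ERROR_TITLE = "Incorrect Formatting"
FORMAT_ERROR_TEXT = (
    "Given string cannot be interpreted as a valid three letter sequence "
    "because of incorrect formatting."
)

# One uppercase letter followed by two lowercase letters, repeated.
TRIPLET_PATTERN = re.compile(r"(?:[A-Z][a-z]{2})+")


def is_well_formatted(sequence: str) -> bool:
    """Check a single sequence (no spaces or line breaks) against requirement 1.6.

    Assumption: a trailing partial triplet, or an empty string, is also
    treated as incorrectly formatted.
    """
    return TRIPLET_PATTERN.fullmatch(sequence) is not None
```

This check only looks at letter case. Whether each triplet is a known amino acid code is a separate step (requirement 1.8).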

1.8. Every triplet of letters in the sequence (that has the first letter uppercase and others lowercase - requirement 1.6) should be interpreted as an amino acid using the table above.

1.9. If requirement 1.8. is not fulfilled an error message should appear, with a title "Invalid Sequence" and text "Given string cannot be interpreted as a valid three letter sequence."

AlaAsxAspAdsAsnArg is invalid because Ads does not correspond to any amino acid from the table above.
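
Requirements 1.8 and 1.9 amount to a table lookup per triplet. The sketch below assumes the input has already passed the formatting check from 1.6. `InvalidSequenceError` and `three_to_one` are hypothetical names, not part of any existing API.

```python
import string

# Three-letter codes in single-letter order A..Z, taken from the table above.
_CODES = (
    "Ala Asx Cys Asp Glu Phe Gly His Ile Xle Lys Leu Met "
    "Asn Pyl Pro Gln Arg Ser Thr Sec Val Trp Xaa Tyr Glx"
).split()
THREE_TO_ONE = dict(zip(_CODES, string.ascii_uppercase))


class InvalidSequenceError(ValueError):
    """Hypothetical error type carrying the title and text from requirement 1.9."""

    title = "Invalid Sequence"

    def __init__(self) -> None:
        super().__init__(
            "Given string cannot be interpreted as a valid three letter sequence."
        )


def three_to_one(sequence: str) -> str:
    """Translate one well-formatted three-letter sequence to single-letter codes."""
    triplets = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    for triplet in triplets:
        if triplet not in THREE_TO_ONE:
            raise InvalidSequenceError()
    return "".join(THREE_TO_ONE[triplet] for triplet in triplets)
```

For example, `three_to_one("AlaAsxAspAsnArg")` returns `"ABDNR"`, whereas `AlaAsxAspAdsAsnArg` raises the error because `Ads` is not in the table.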

2. Export logic described below:

2.1. On export "Sequence" should be replaced with two options "Sequence (1-letter code)" and "Sequence (3-letter code)"

Image for better understanding:
image

2.1.1. The current export to sequence is the new export to "Sequence (1-letter code)"

2.2. Only purely amino acid sequences without non-standard ambiguous amino acids can be exported to "Sequence (3-letter code)"

2.3. If one of the amino acids is a non-standard ambiguous amino acid an error message should appear, with the title "Non-standard amino acid" and text "Non-standard ambiguous amino acids cannot be exported to the selected format".

By standard ambiguous amino acids we consider amino acids that we have in the library (B, J, X, and Z).

2.4. If the sequence is not a purely amino acid sequence on export an error message should appear.

image
(same behavior for current one letter sequence export)

2.5. All amino acids should be exported as the three letter code of their natural analogue.

2.6. All amino acids with the natural analogue of X should be exported as Xun.

2.7. All standard ambiguous amino acids should be exported as the appropriate three letter code from the table above.

2.8. The different sequences should be separated by space.
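
A partial sketch of the export path, covering only requirements 2.5, 2.7, and 2.8. It assumes each sequence is already available as a string of single-letter codes for standard or standard ambiguous amino acids. The error cases in 2.3 and 2.4 and the special case in 2.6 depend on the monomer model and are deliberately left out. The function name is hypothetical.

```python
import string

# Three-letter codes in single-letter order A..Z, taken from the table above.
_CODES = (
    "Ala Asx Cys Asp Glu Phe Gly His Ile Xle Lys Leu Met "
    "Asn Pyl Pro Gln Arg Ser Thr Sec Val Trp Xaa Tyr Glx"
).split()
ONE_TO_THREE = dict(zip(string.ascii_uppercase, _CODES))


def export_three_letter(sequences: list[str]) -> str:
    """Render sequences of single-letter codes as three-letter text.

    Each letter becomes its three-letter code (2.5, 2.7) and the
    sequences are joined with a single space (2.8).
    Not handled in this sketch: requirements 2.3, 2.4, and 2.6.
    """
    return " ".join(
        "".join(ONE_TO_THREE[letter] for letter in sequence)
        for sequence in sequences
    )
```

For example, `export_three_letter(["AA", "CC"])` returns `"AlaAla CysCys"`, which mirrors the import example in requirement 1.3.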

UX

image

@AlexeyGirin
Collaborator

Import part is tested.
Export is blocked by epam/Indigo#2660

  • Ketcher Version 2.27.0-rc.1 Build at 2024-11-05; 14:55:18
  • Indigo Toolkit Version 1.26.0-rc.1.0-g904d2d992-wasm32-wasm-clang-19.0.0
  • Chrome Version 130.0.6723.117 (Official Build) (64-bit)
  • Win10
