DNA, the carrier of genetic information, has been a cornerstone in forensic science for decades. This project demonstrates how DNA profiling works by identifying to whom a given DNA sequence belongs using Short Tandem Repeats (STRs).
STRs are short sequences of DNA bases that repeat consecutively at specific locations in a genome. The number of repeats at each location varies among individuals, so a single STR only narrows down the candidates. Comparing the counts for several STRs at once makes it far less likely that two people share the same profile, which is what makes a match meaningful.
- Parses a CSV database containing individuals' STR counts.
- Reads a DNA sequence from a text file.
- Computes the longest run of consecutive STR repeats in the DNA sequence.
- Matches the STR counts against a database to identify the individual or determine if no match exists.
Run the program as follows:
python3 dna.py <database.csv> <sequence.txt>
Examples:
$ python3 dna.py databases/small.csv sequences/1.txt
Bob
$ python3 dna.py databases/small.csv sequences/2.txt
No match
$ python3 dna.py databases/large.csv sequences/5.txt
Lavender
If the wrong number of arguments is provided, the program prints a usage message:
$ python dna.py
Usage: python dna.py <database.csv> <sequence.txt>
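The argument check described above could look roughly like the sketch below. The function name `main`, the `USAGE` constant, and the exit code `1` are assumptions made for illustration, not necessarily what dna.py does.

```python
import sys

# Usage text copied from the message shown above.
USAGE = "Usage: python dna.py <database.csv> <sequence.txt>"


def main(argv):
    """Validate the command line; return a process exit code (hypothetical helper)."""
    # argv[0] is the script name, so two user-supplied paths mean len(argv) == 3.
    if len(argv) != 3:
        print(USAGE)
        return 1  # assumed non-zero exit code for a usage error
    database_path, sequence_path = argv[1], argv[2]
    # ... load both files and run the matching here ...
    return 0
```

A script built this way would typically end with `sys.exit(main(sys.argv))`, which keeps the check easy to call from a test without spawning a subprocess.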
├── dna/
│   ├── databases/
│   │   ├── small.csv
│   │   └── large.csv
│   ├── dna.py
│   └── sequences/
│       ├── 1.txt
│       ├── 2.txt
│       ├── ...
│       └── 20.txt
└── README.md
- Input:
  - A CSV file containing individuals' STR counts.
  - A text file containing a DNA sequence.
- Output:
  - The name of the individual whose STR counts match the DNA sequence.
  - "No match" if no individual matches the DNA sequence.
- Steps:
  - Parse the CSV file to extract STRs and their counts.
  - Analyze the DNA sequence to calculate the longest consecutive repeats for each STR.
  - Compare the computed STR counts against the database.
  - Print the matching individual's name or "No match."
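The parse and compare steps can be sketched as follows. This is a minimal illustration, not the actual contents of dna.py: the helper names `load_database` and `find_match` are hypothetical, the database is an in-memory string holding this README's sample rows instead of a file on disk, and the STR counts are passed in as an already-computed dictionary.

```python
import csv
import io


def load_database(csv_file):
    """Return (STR names, list of row dicts) read from an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    # Every column after the first one ("name") is treated as an STR.
    strs = reader.fieldnames[1:]
    return strs, rows


def find_match(rows, strs, counts):
    """Return the name whose counts equal `counts` for every STR, else "No match"."""
    for row in rows:
        # CSV values are read as strings, so convert them before comparing.
        if all(int(row[s]) == counts[s] for s in strs):
            return row["name"]
    return "No match"


# In-memory stand-in for a database file (sample rows from this README).
sample = io.StringIO(
    "name,AGAT,AATG,TATC\n"
    "Alice,28,42,14\n"
    "Bob,17,22,19\n"
    "Charlie,36,18,25\n"
)
strs, rows = load_database(sample)
print(find_match(rows, strs, {"AGAT": 17, "AATG": 22, "TATC": 19}))  # Bob
print(find_match(rows, strs, {"AGAT": 1, "AATG": 1, "TATC": 1}))     # No match
```

Requiring every STR to match exactly means a single differing count rules a person out, so the function can stop at the first row that matches on all columns.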
name,AGAT,AATG,TATC
Alice,28,42,14
Bob,17,22,19
Charlie,36,18,25
AAGATAGATAGATAGATAATGTATC
$ python dna.py databases/small.csv sequences/4.txt
Alice
- The program uses Python's csv module to read and process the CSV database.
- String slicing is used to identify and count STR sequences within the DNA string.
- A dictionary is used to store STR counts for easy comparison with the database.
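As an illustration of the slicing and dictionary notes above, one possible way to count the longest run of an STR is sketched below. The function name `longest_run` is an assumption and the actual dna.py may be organized differently. The input is the short sequence shown in this README.

```python
def longest_run(sequence, str_unit):
    """Return the largest number of back-to-back copies of str_unit in sequence."""
    if not str_unit:
        return 0  # guard: an empty unit would otherwise match forever
    best = 0
    n = len(str_unit)
    for start in range(len(sequence)):
        count = 0
        # Step forward one unit at a time while the next slice equals the STR.
        while sequence[start + count * n : start + (count + 1) * n] == str_unit:
            count += 1
        best = max(best, count)
    return best


sequence = "AAGATAGATAGATAGATAATGTATC"
# Dictionary mapping each STR to its longest run, ready to compare with a database row.
counts = {s: longest_run(sequence, s) for s in ["AGAT", "AATG", "TATC"]}
print(counts)  # {'AGAT': 4, 'AATG': 1, 'TATC': 1}
```

This version checks every starting position, which is simple but repeats some work on long sequences. Skipping ahead past a run that has already been counted would be a natural optimization.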
- Python 3.x
Developed by Shahir Ahmed.
This project is licensed under the MIT License.