Computing GC Content
A Rosalind Bioinformatics Stronghold Problem
Background on the problem and its importance to bioinformatics can be found on the Rosalind problem page.
Summary
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string.
Rosalind_0808
60.919540
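GC-content here means the percentage of symbols in a DNA string that are 'G' or 'C'. As a minimal sketch of that calculation for a single string (the helper name `gc_content` is an illustrative assumption, not something taken from the problem statement or the solution below):

```python
def gc_content(dna: str) -> float:
    """Return the percentage of bases in `dna` that are G or C.

    `gc_content` is a hypothetical helper name used for illustration.
    An empty string is treated as 0% to avoid dividing by zero.
    """
    if not dna:
        return 0.0
    gc = dna.count("G") + dna.count("C")
    return gc / len(dna) * 100


# The Rosalind_0808 string from the sample dataset above.
sample = (
    "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC"
    "TGGGAACCTGCGGGCAGTAGGTGGAAT"
)
print(f"{gc_content(sample):.6f}")  # 53 of the 87 bases are G or C
```

For the sample string, 53 of 87 bases are G or C, so this prints 60.919540, matching the sample output.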
Solution
from collections import defaultdict

with open("./rosalind_gc.txt", "r") as in_file, open("./out_gc.txt", "w") as out_file:
    lines = in_file.readlines()

    # Parse the FASTA records: each ">" line starts a new record, and the
    # sequence lines that follow it are concatenated into a single string.
    keys = []
    values = []
    for line in lines:
        if line.startswith(">"):
            values.append("")
            keys.append(line.lstrip('>').rstrip('\n'))
        else:
            values[-1] += line.rstrip('\n')
    d = dict(zip(keys, values))

    # Replace each sequence in d with its GC-content as a percentage.
    for k, dna in d.items():
        base_counts = defaultdict(int)
        base_pct = defaultdict(int)
        for base in dna:
            base_counts[base] += 1
        for base in base_counts.keys():
            base_pct[base] = base_counts[base] / sum(base_counts.values()) * 100
        d[k] = base_pct['G'] + base_pct['C']

    # ID of the record with the highest GC-content.
    max_key = max(d.keys(), key=(lambda key: d[key]))

    # Will just print instead of write file
    # out_file.write("{}\n{}".format(max_key, d[max_key]))
    print("Input file (ellipsis indicates continued DNA string):")
    for line in lines[:3]:
        print(line, end='')
    print("...")
    # In this input file, lines[18:21] are the header and first two
    # sequence lines of the second record.
    for line in lines[18:21]:
        print(line, end='')
    print("...")
    print("etc.\n")
    print("Output file:")
    print("{}\n{}".format(max_key, d[max_key]))
Output
Input file (ellipsis indicates continued DNA string):
>Rosalind_3714
AATCCGTATCACGCCGTGATCTACAGTTGAAAAGAGTTATTGGGCACCCTTCCTAGCCAC
TGATGAGAGCGCGTGGGCTGGTTGTCCCTTTTCCTCAGAAGTGCTTAGGCCATACGGCCT
...
>Rosalind_3190
TCTTTGTCAAATGTAGCTATCGACGGGCCGAGTCGTACTAGCATTCAGTCCGTGGGCCTA
CACGTATAGCTAAGTAGTAAGTGCGAATGCCGAGTTGGCTGTAGCCGTAAGCATTATCTT
...
etc.
Output file:
Rosalind_3055
51.31004366812227
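The parsing and selection steps of the solution can also be sketched as two small functions that accept any iterable of lines, which makes them easy to try without an input file. The names `parse_fasta` and `highest_gc` are hypothetical and not part of the solution above, and the sketch assumes that every sequence line follows a ">" header and that no record is empty.

```python
import io


def parse_fasta(lines):
    """Map each FASTA ID to its sequence, joining wrapped sequence lines.

    Hypothetical helper; assumes every sequence line follows a ">" header.
    """
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current_id = line[1:]
            records[current_id] = ""
        else:
            records[current_id] += line
    return records


def highest_gc(records):
    """Return (id, GC percentage) for the record with the highest GC-content.

    Hypothetical helper; assumes every sequence is non-empty.
    """
    def gc(dna):
        return (dna.count("G") + dna.count("C")) / len(dna) * 100

    best_id = max(records, key=lambda rid: gc(records[rid]))
    return best_id, gc(records[best_id])


# Small made-up input standing in for a file such as rosalind_gc.txt.
demo = io.StringIO(">seq_a\nATAT\nGC\n>seq_b\nGGCC\nAT\n")
best_id, pct = highest_gc(parse_fasta(demo))
print(f"{best_id}\n{pct:.6f}")
```

With the made-up input, seq_b (4 of 6 bases are G or C) is selected. Formatting with `:.6f` mirrors the six decimal places used in the sample output; an open file object can be passed to `parse_fasta` in place of the `io.StringIO` stand-in.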