Computing GC Content
A Rosalind Bioinformatics Stronghold Problem
Background on the problem and its importance to bioinformatics can be found on the Rosalind problem page.
Summary
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string.
Rosalind_0808
60.919540
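As a sanity check on the sample, GC-content is the percentage of bases in a string that are G or C. Counting by hand, Rosalind_0808 is 87 bases long and 53 of them are G or C, which reproduces the expected value. The symbols below (n_G, n_C, n) are notation introduced here for illustration, not taken from the problem page.

```latex
\[
\mathrm{GC} \;=\; 100 \times \frac{n_G + n_C}{n}
\;=\; 100 \times \frac{53}{87}
\;\approx\; 60.919540
\]
```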
Solution
from collections import defaultdict

with open("./rosalind_gc.txt", "r") as in_file, open("./out_gc.txt", "w") as out_file:
    lines = [x for x in in_file.readlines()]

    # Parse the FASTA records: a line starting with ">" opens a new record,
    # and every following line is appended to that record's sequence.
    keys = []
    values = []
    for line in lines:
        if line.startswith(">"):
            values.append("")
            keys.append(line.lstrip('>').rstrip('\n'))
        else:
            values[len(keys)-1] += line.rstrip('\n')
    d = dict(zip(keys, values))

    # Replace each sequence in d with its GC-content (in percent).
    for k, dna in d.items():
        base_counts = defaultdict(int)
        base_pct = defaultdict(int)
        for base in dna:
            base_counts[base] += 1
        for base in base_counts.keys():
            base_pct[base] = base_counts[base] / sum(base_counts.values()) * 100
        d[k] = base_pct['G'] + base_pct['C']

    # ID of the record with the highest GC-content.
    max_key = max(d.keys(), key=(lambda key: d[key]))

    # Will just print instead of write file
    # out_file.write("{}\n{}".format(max_key, d[max_key]))
    print("Input file (ellipsis indicates continued DNA string):")
    for line in lines[:3]:
        print(line, end='')
    print("...")
    for line in lines[18:21]:
        print(line, end='')
    print("...")
    print("etc.\n")
    print("Output file:")
    print("{}\n{}".format(max_key, d[max_key]))
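The solution above tallies every base with a defaultdict. As an alternative, here is a minimal sketch that counts only G and C with str.count. It is not the method used above, and the function names (parse_fasta, gc_content, highest_gc) are hypothetical names chosen for illustration. It runs on the sample dataset from the Summary rather than on a downloaded input file.

```python
def parse_fasta(text):
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            current_id = line[1:]
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may span several lines, so concatenate them.
            records[current_id] += line
    return records


def gc_content(dna):
    """Return the percentage of bases in dna that are G or C."""
    if not dna:
        return 0.0  # assumption: treat an empty sequence as 0% GC
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


def highest_gc(text):
    """Return (id, gc_percent) for the record with the highest GC-content."""
    records = parse_fasta(text)
    best_id = max(records, key=lambda rid: gc_content(records[rid]))
    return best_id, gc_content(records[best_id])


# Sample dataset from the problem statement.
sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
"""

best_id, best_gc = highest_gc(sample)
print(best_id)
print(f"{best_gc:.6f}")  # six decimal places, matching the sample output format
```

Formatting with six decimal places mirrors the sample output; the unrounded float printed by the solution above carries more digits than the sample shows.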
Output
Input file (ellipsis indicates continued DNA string):
>Rosalind_3714
AATCCGTATCACGCCGTGATCTACAGTTGAAAAGAGTTATTGGGCACCCTTCCTAGCCAC
TGATGAGAGCGCGTGGGCTGGTTGTCCCTTTTCCTCAGAAGTGCTTAGGCCATACGGCCT
...
>Rosalind_3190
TCTTTGTCAAATGTAGCTATCGACGGGCCGAGTCGTACTAGCATTCAGTCCGTGGGCCTA
CACGTATAGCTAAGTAGTAAGTGCGAATGCCGAGTTGGCTGTAGCCGTAAGCATTATCTT
...
etc.

Output file:
Rosalind_3055
51.31004366812227