Chris Martin

Computing GC Content

A Rosalind Bioinformatics Stronghold Problem

Background on the problem and its importance to bioinformatics can be found on the Rosalind problem page.

Summary

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

    >Rosalind_6404
    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
    >Rosalind_5959
    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
    >Rosalind_0808
    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT

Return: The ID of the string having the highest GC-content (the percentage of its bases that are G or C), followed by the GC-content of that string.

    Rosalind_0808
    60.919540
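
As a quick check of the sample answer, GC-content can be computed directly as the number of G and C bases divided by the string length, times 100. The sketch below is only illustrative: `gc_content` is a helper name introduced here, it assumes uppercase input, and returning 0.0 for an empty string is a convention chosen for this example.

```python
def gc_content(dna):
    """Return the percentage of bases in `dna` that are G or C.

    Hypothetical helper for illustration; assumes uppercase input.
    """
    if not dna:
        return 0.0  # convention chosen here to avoid dividing by zero
    return (dna.count("G") + dna.count("C")) / len(dna) * 100


# Sequence of Rosalind_0808 from the sample dataset above.
sample = (
    "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC"
    "TGGGAACCTGCGGGCAGTAGGTGGAAT"
)

# 53 of the 87 bases are G or C, so this prints 60.919540.
print("{:.6f}".format(gc_content(sample)))
```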

Solution

    from collections import defaultdict

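    # Note: opening out_gc.txt in "w" mode creates (or truncates) it even
    # though the write further down is commented out.
    with open("./rosalind_gc.txt", "r") as in_file, \
            open("./out_gc.txt", "w") as out_file:

        # Keep the raw lines; they are reused below to preview the input.
        lines = in_file.readlines()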
        
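        keys = []
        values = []

        # Each ">" header starts a new record; the sequence lines that
        # follow it are concatenated into that record's DNA string.
        for line in lines:
            if line.startswith(">"):
                keys.append(line.lstrip('>').strip())
                values.append("")
            else:
                values[-1] += line.strip()

        # Map each record ID to its full DNA string.
        d = dict(zip(keys, values))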
        
        # Replace each DNA string in d with its GC-content as a percentage.
        for k, dna in d.items():
            base_counts = defaultdict(int)
            for base in dna:
                base_counts[base] += 1
            total = sum(base_counts.values())
            base_pct = defaultdict(int)
            for base, count in base_counts.items():
                base_pct[base] = count / total * 100
            d[k] = base_pct['G'] + base_pct['C']

        # ID of the record with the highest GC-content.
        max_key = max(d, key=d.get)
        
        # Print the result instead of writing it to the output file:
        # out_file.write("{}\n{}".format(max_key, d[max_key]))
        
        print("Input file (ellipsis indicates continued DNA string):")
        for line in lines[:3]:
            print(line, end='')
        print("...")
        for line in lines[18:21]:
            print(line, end='')
        print("...")
        print("etc.\n")
        
        print("Output file:")
        print("{}\n{}".format(max_key, d[max_key]))
    

Output

    Input file (ellipsis indicates continued DNA string):
    >Rosalind_3714
    AATCCGTATCACGCCGTGATCTACAGTTGAAAAGAGTTATTGGGCACCCTTCCTAGCCAC
    TGATGAGAGCGCGTGGGCTGGTTGTCCCTTTTCCTCAGAAGTGCTTAGGCCATACGGCCT
    ...
    >Rosalind_3190
    TCTTTGTCAAATGTAGCTATCGACGGGCCGAGTCGTACTAGCATTCAGTCCGTGGGCCTA
    CACGTATAGCTAAGTAGTAAGTGCGAATGCCGAGTTGGCTGTAGCCGTAAGCATTATCTT
    ...
    etc.
    
    Output file:
    Rosalind_3055
    51.31004366812227
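
The same steps (group FASTA lines into records, score each record, take the maximum) can also be written more compactly. The following is a minimal sketch rather than the solution used above: `parse_fasta`, `gc_percent`, and `highest_gc` are hypothetical names introduced here, the input is made up for illustration, and the code assumes well-formed FASTA with uppercase sequences. It formats the result to six decimal places, as in the sample output.

```python
import io


def parse_fasta(handle):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_percent(dna):
    """Return the percentage of bases in `dna` that are G or C."""
    if not dna:
        return 0.0  # convention chosen here to avoid dividing by zero
    return sum(base in "GC" for base in dna) / len(dna) * 100


def highest_gc(handle):
    """Return (id, GC percentage) for the record with the highest GC-content."""
    scores = {name: gc_percent(seq) for name, seq in parse_fasta(handle)}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Small made-up input (not Rosalind data); sequences span several lines.
example = io.StringIO(">seq_a\nAATT\nGC\n>seq_b\nGGCC\nAT\n")
best_id, best_gc = highest_gc(example)
print("{}\n{:.6f}".format(best_id, best_gc))
```

To run this on a downloaded dataset, an open file object can be passed in place of the `StringIO` object, since `parse_fasta` only needs something that yields lines.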