CIS 67: Homework 11

Handed out: 04/03/07
Due: by 10pm on 04/09/07
Email program to TA

Playing with DNA

We know that DNA is structured as a double helix. If we consider just one of the strands of the helix, we see that it is a sequence of elements called nucleotides. There are 4 possible nucleotides: adenine (A), guanine (G), thymine (T), and cytosine (C). Corresponding elements in the two strands are complementary: if one side has A, the other has T; if one side has G, the other has C (and vice versa). Thus all the information of the helix is available on just one strand, and we can think of that strand as a string over the characters A, T, C, G.

Let's not worry that DNA is transcribed into RNA and other mechanisms before it can be used to synthesize proteins. Instead let's assume that DNA is used directly to specify proteins.
It goes as follows: proteins are sequences of amino acids. In our discussion we assume a total of 20 possible amino acids, and that each amino acid is identified by a letter. Thus a protein can be seen as a string on these 20 letters.
A sequence of 3 consecutive nucleotides is called a codon, and codons map to amino acids as indicated in the attached table. [One possible encoding of the information in that table is the following Java variable:

    // Each entry pairs a codon (lowercase) with its single-letter amino acid code.
    private static final String[][] CODON_AMINO =
        {
          {"att", "i"}, {"atc", "i"}, {"ata", "i"}, {"ctt", "l"},
          {"ctc", "l"}, {"cta", "l"}, {"ctg", "l"}, {"tta", "l"},
          {"ttg", "l"}, {"gtt", "v"}, {"gtc", "v"}, {"gta", "v"},
          {"gtg", "v"}, {"ttt", "f"}, {"ttc", "f"}, {"atg", "m"},
          {"tgt", "c"}, {"tgc", "c"}, {"gct", "a"}, {"gcc", "a"},
          {"gca", "a"}, {"gcg", "a"}, {"ggt", "g"}, {"ggc", "g"},
          {"gga", "g"}, {"ggg", "g"}, {"cct", "p"}, {"ccc", "p"},
          {"cca", "p"}, {"ccg", "p"}, {"act", "t"}, {"acc", "t"},
          {"aca", "t"}, {"acg", "t"}, {"tct", "s"}, {"tcc", "s"},
          {"tca", "s"}, {"tcg", "s"}, {"agt", "s"}, {"agc", "s"},
          {"tat", "y"}, {"tac", "y"}, {"tgg", "w"}, {"caa", "q"},
          {"cag", "q"}, {"aat", "n"}, {"aac", "n"}, {"cat", "h"},
          {"cac", "h"}, {"gaa", "e"}, {"gag", "e"}, {"gat", "d"},
          {"gac", "d"}, {"aaa", "k"}, {"aag", "k"}, {"cgt", "r"},
          {"cgc", "r"}, {"cga", "r"}, {"cgg", "r"}, {"aga", "r"},
          {"agg", "r"}
        };

] A specific codon, ATG, is called the start codon, i.e. the translation from DNA to protein starts at such a codon. Three codons, TAA, TAG, TGA, are called stop codons. The definition of a protein starts at a start codon (excluded) and ends at the first stop codon following it (excluded).
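
Searching a two-dimensional array for every codon is slow and awkward, so one option is to load the pairs into a map once and look codons up by key. The sketch below is only an illustration: the class name CodonLookup, the helpers buildTable and isStopCodon, and the abbreviated SAMPLE_PAIRS array are all hypothetical names, and a real solution would pass in the full table.

```java
import java.util.HashMap;
import java.util.Map;

final class CodonLookup {
    // Abbreviated, hypothetical sample; a real program would use the full table.
    static final String[][] SAMPLE_PAIRS = {
        {"att", "i"}, {"ccc", "p"}, {"cca", "p"}, {"aat", "n"}, {"atg", "m"}
    };

    // Copies {codon, amino acid} pairs into a map for constant-time lookup.
    static Map<String, String> buildTable(String[][] pairs) {
        Map<String, String> table = new HashMap<>();
        for (String[] pair : pairs) {
            table.put(pair[0], pair[1]);
        }
        return table;
    }

    // True for the three stop codons named in the text (lowercase here,
    // an assumption made to match the case used in the table).
    static boolean isStopCodon(String codon) {
        return codon.equals("taa") || codon.equals("tag") || codon.equals("tga");
    }

    public static void main(String[] args) {
        Map<String, String> table = buildTable(SAMPLE_PAIRS);
        System.out.println(table.get("ccc"));    // prints p
        System.out.println(isStopCodon("tag"));  // prints true
    }
}
```

Note that atg appears in the table as m: it acts as the start codon where a protein begins, but inside a protein it is translated like any other codon.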

You are to write a program that takes, as a command-line parameter, the name of a file containing DNA information as a string (here is an example of such a file). The string may be broken across multiple lines and may contain spaces; ignore such line breaks and spaces. You should:
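
A minimal sketch of the input step, assuming the file is plain ASCII text. The names DnaReader, normalize, and readDna are hypothetical. Converting to lowercase is also an assumption, made only so the string matches the lowercase codons in the table.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

final class DnaReader {
    // Removes spaces, tabs, and line breaks, then lowercases the result.
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Reads the whole file and returns it as one continuous DNA string.
    static String readDna(String fileName) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(fileName));
        return normalize(new String(bytes, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DnaReader <dna-file>");
            return;
        }
        System.out.println(readDna(args[0]).length() + " nucleotides read");
    }
}
```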

  1. Write to a new file, say proteins.txt, the proteins that are defined in the given file. A protein is just a string of the single-letter codes of its constituent amino acids. Break this string across lines so that no line has more than 70 characters. Separate proteins with blank lines.
  2. Print to the screen, for each amino acid found in the output file, its total number of occurrences, both as an absolute count and as a percentage of all the amino acids in the output proteins. Also print the identity and count of each codon that identified that amino acid (remember, an amino acid may be identified by more than one codon).
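
The two output steps above could be sketched as follows. This is an illustration only: ProteinReport, wrap, countAminoAcids, and printReport are hypothetical names, the report layout is an assumption, and the per-codon tally is left out (it can be kept the same way, in a map keyed by codon).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ProteinReport {
    static final int MAX_LINE = 70;

    // Splits a protein into consecutive pieces of at most 'width' characters.
    static List<String> wrap(String protein, int width) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < protein.length(); i += width) {
            lines.add(protein.substring(i, Math.min(i + width, protein.length())));
        }
        return lines;
    }

    // Counts how often each amino acid letter occurs across all proteins.
    static Map<Character, Integer> countAminoAcids(List<String> proteins) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String protein : proteins) {
            for (char amino : protein.toCharArray()) {
                counts.merge(amino, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Prints one line per amino acid: letter, absolute count, percentage.
    static void printReport(Map<Character, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            System.out.printf("%c: %d (%.1f%%)%n",
                    e.getKey(), e.getValue(), 100.0 * e.getValue() / total);
        }
    }
}
```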

Notice that if we had the DNA string ATGCCCAATAG, it would consist of the start codon ATG, the stop codon TAG, and in between CCCAA, whose length is not a multiple of 3 and so is not a whole number of codons. In this case we take the CCC and throw away the excess AA. Thus this imaginary protein would consist of just the amino acid P.
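
One way to read the rule above is sketched below. It follows the worked example, in which the stop codon TAG is located by position in the string rather than by counting in steps of 3 from the start codon. Several choices are assumptions, not part of the assignment: the names ProteinExtractor, firstStop, and extract are hypothetical, a start codon with no following stop codon is ignored, scanning resumes right after each stop codon, empty proteins are kept, and a codon missing from the table is written as "?".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ProteinExtractor {
    private static final String[] STOPS = {"taa", "tag", "tga"};

    // Index of the earliest stop codon at or after 'from', or -1 if none.
    static int firstStop(String dna, int from) {
        int best = -1;
        for (String stop : STOPS) {
            int i = dna.indexOf(stop, from);
            if (i >= 0 && (best < 0 || i < best)) {
                best = i;
            }
        }
        return best;
    }

    // Translates every start...stop region of a lowercase DNA string.
    static List<String> extract(String dna, Map<String, String> table) {
        List<String> proteins = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = dna.indexOf("atg", pos);
            if (start < 0) break;
            int bodyStart = start + 3;                 // start codon excluded
            int stop = firstStop(dna, bodyStart);
            if (stop < 0) break;                       // assumption: no stop, no protein
            int usable = (stop - bodyStart) / 3 * 3;   // drop excess nucleotides
            StringBuilder protein = new StringBuilder();
            for (int i = bodyStart; i < bodyStart + usable; i += 3) {
                String amino = table.get(dna.substring(i, i + 3));
                protein.append(amino != null ? amino : "?");
            }
            proteins.add(protein.toString());          // assumption: empty ones kept
            pos = stop + 3;                            // resume after the stop codon
        }
        return proteins;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("ccc", "p");                         // only the entry this example needs
        System.out.println(extract("atgcccaatag", table));  // prints [p]
    }
}
```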

Send to the TA a case analysis for this problem: problem statement, analysis, design, implementation, and testing.