Question

Formatted question description: https://leetcode.ca/all/187.html

 187	Repeated DNA Sequences

 All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG".
 When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

 Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

 Example:

 Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

 Output: ["AAAAACCCCC", "CCCCCAAAAA"]

Algorithm

Using HashSet can not have the characteristics of duplicate items.

A => binary: 00
C => binary: 01
G => binary: 10
T => binary: 11

Basically, this questions is more of a compression problem, how to encode the 10-char string to use less disk space.

  • plain char is 8 bits (2 bytes)
  • encoding to 2 bits (00,01,10,11)

Code

Java

  • 
    public class Repeated_DNA_Sequences {
    
        public static void main(String[] args) {
            Repeated_DNA_Sequences out = new Repeated_DNA_Sequences();
            Solution s = out.new Solution();
    
            System.out.println(s.findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
        }
    
        class Solution_optimize {
            public List<String> findRepeatedDnaSequences(String s) {
                Set<Integer> words = new HashSet<>();
    
                // need this to remove duplicates
                // Input "AAAAAAAAAAAAA", result ["AAAAAAAAAA"]
                // to avoid result ["AAAAAAAAAA","AAAAAAAAAA","AAAAAAAAAA"]
                Set<Integer> repteated = new HashSet<>();
                List<String> result = new ArrayList<>();
    
                char[] map = new char[26];
                //map['A' - 'A'] = 0; // binary: 00
                map['C' - 'A'] = 1; // binary: 01
                map['G' - 'A'] = 2; // binary: 10
                map['T' - 'A'] = 3; // binary: 11
    
                for(int i = 0; i < s.length() - 9; i++) {
                    int v = 0;
                    for(int j = i; j < i + 10; j++) {
                        v <<= 2; // every time, use the new 2 bits after shifting for current char
                        v |= map[s.charAt(j) - 'A'];
                    }
                    if(words.contains(v) && !repteated.contains(v)) {
                        repteated.add(v);
                        result.add(s.substring(i, i + 10));
                    }
    
                    words.add(v); // re-add, if already contains
                }
                return result;
            }
        }
    
        class Solution {
            HashSet<String> existed = new HashSet<>();
            HashSet<String> result = new HashSet<>();
    
            public List<String> findRepeatedDnaSequences(String s) {
    
                if (s.length() <= 10) {
                    return new ArrayList<String>();
                }
    
                for (int i = 0; i < 10; i++) {
                    for (int j = i; j + 9 < s.length(); j += 10) {
    
                        String one = s.substring(j, j + 10);
    
                        if (existed.contains(one)) {
                            result.add(one);
                        }
    
                        existed.add(one);
                    }
                }
    
                return new ArrayList<String>(result);
            }
    
        }
    }
    
  • Todo
    
  • class Solution(object):
      def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        d = {}
        ans = []
        for i in range(len(s) - 9):
          key = s[i:i + 10]
          if key in d:
            d[key] += 1
            if d[key] == 2:
              ans.append(key)
          else:
            d[key] = 1
        return ans
    
    

All Problems

All Solutions