Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
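The first idea that comes to mind is a hash table, which needs only O(n) time. However, that approach exceeds the memory limit, because using a String as the key consumes a lot of memory.
The fix is binary encoding. Since there are only 4 letters, we can assign A=00, C=01, G=10, T=11. Ten letters then take just 20 bits, and an int is 4 bytes (32 bits) on typical platforms, which is more than enough.
This way, we can represent each string with an int, which saves a lot of memory.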
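As a minimal sketch of the encoding just described, the helpers below pack a 10-letter window into the low 20 bits of an integer. The names nucleotideCode and encodeWindow are hypothetical (they are not part of the solution that follows), and the sketch assumes the input contains only the letters A, C, G, and T.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: map one nucleotide to its 2-bit code
// (A=00, C=01, G=10, T=11, as in the text).
inline std::uint32_t nucleotideCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // assumption: anything else is 'T'
    }
}

// Hypothetical helper: pack a 10-letter window into the low 20 bits.
inline std::uint32_t encodeWindow(const std::string& window) {
    assert(window.size() == 10);  // exactly ten letters, i.e. 20 bits
    std::uint32_t code = 0;
    for (char c : window) {
        // Shift the previous letters left by 2 bits and append the new one.
        code = (code << 2) | nucleotideCode(c);
    }
    return code;
}
```

For instance, encodeWindow("AAAAACCCCC") yields binary 0101010101, which is 341, and ten T's give the largest possible value, 2^20 - 1.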
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        // No 10-letter sequence can repeat in a string of 10 or fewer letters.
        if (s.length() <= 10) return result;
        // Maps the 20-bit code of a sequence to how many times it has been seen.
        unordered_map<int, int> showed;
        for (size_t i = 0; i + 10 <= s.length(); i++) {
            string temp_str = s.substr(i, 10);
            // Encode the 10 letters as a base-4 number: A=0, C=1, G=2, T=3.
            int temp = 0;
            for (int j = 0; j < 10; j++) {
                if (temp_str[j] == 'A') temp = temp * 4 + 0;
                else if (temp_str[j] == 'C') temp = temp * 4 + 1;
                else if (temp_str[j] == 'G') temp = temp * 4 + 2;
                else temp = temp * 4 + 3;
            }
            if (showed.find(temp) != showed.end()) {
                // Report a sequence only on its second occurrence.
                if (showed[temp] == 1) {
                    result.push_back(temp_str);
                }
                showed[temp]++;
            } else {
                showed[temp] = 1;
            }
        }
        return result;
    }
};
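The solution above re-encodes all ten letters for every window. One possible variation, sketched here and not taken from the original post, keeps a rolling 20-bit code: shift in 2 bits for each new letter and mask off the letter that leaves the window. The function name findRepeatedRolling and the constants kLen and kMask are hypothetical, and the sketch again assumes the input contains only A, C, G, and T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical rolling-code variant of the approach described in the text.
std::vector<std::string> findRepeatedRolling(const std::string& s) {
    std::vector<std::string> result;
    const std::size_t kLen = 10;                 // window length in letters
    if (s.size() <= kLen) return result;         // nothing can repeat
    const std::uint32_t kMask = (1u << 20) - 1;  // keep only the low 20 bits

    std::unordered_map<std::uint32_t, int> seen; // code -> occurrence count
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // 2-bit code of the incoming letter (assumption: only A, C, G, T).
        std::uint32_t bits = 3;
        if (s[i] == 'A') bits = 0;
        else if (s[i] == 'C') bits = 1;
        else if (s[i] == 'G') bits = 2;

        // Append the new letter and drop the one that left the window.
        code = ((code << 2) | bits) & kMask;

        if (i + 1 < kLen) continue;              // window not yet full
        // Report a sequence exactly once, on its second occurrence.
        if (++seen[code] == 2) {
            result.push_back(s.substr(i + 1 - kLen, kLen));
        }
    }
    return result;
}
```

With this variation each position costs a constant number of bit operations, and substr is called only when a repeated sequence is actually reported.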