  • LeetCode: Repeated DNA Sequences, a detailed solution

    Problem

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

    Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

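    Original problem link: https://oj.leetcode.com/problems/repeated-dna-sequences/

    Straightforward method (TLE)

    Algorithm analysis

    Plain string matching. Build a next array that stores, for every letter in the string, the position where the same letter occurs next; during the traversal, each comparison starts from the positions given by the next array.

    For simplicity, consider substrings of length 4.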

    case1:

    src A C G T A C G T

    next [4] [5] [6] [7] [-1] [-1] [-1] [-1]

    To match the string ACGT, it is enough to compare the 3 letters that follow position next[0].

    case2:

    src A C G T A A C G T

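    next [4] [6] [7] [8] [5] [-1] [-1] [-1] [-1]

    The letter A has several successors, so all of them must be tried: when the match at next[0] fails, the match at next[next[0]] still has to be attempted.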

    case3:

    src A A A A A A

    next [1] [2] [3] [4] [5] [-1]

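    This is the repeated case. When the match at next[0] succeeds, next[next[0]] can be set to -1: the length-4 string starting at next[0] has already been matched and does not need to be matched again. This only reduces duplicates and does not eliminate them, so a set is still needed to store the successful matches and remove duplicates. Cutting the chain this way can also hide a repeat of a different string that starts with the same letter. For example, in ACCCAGGGACCCAGGG, matching position 0 against position 8 cuts the chain that position 4 needs in order to reach position 12. It is therefore safer to simply stop scanning once a match is found.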
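
    The next arrays in the three cases above can be reproduced with a small sketch like the one below. It is only an illustration under stated assumptions: buildNext is a hypothetical helper name, and -1 marks "no later occurrence" exactly as in the tables.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: next[i] is the index of the next occurrence of s[i]
// after position i, or -1 if the letter does not occur again.
std::vector<int> buildNext(const std::string& s) {
    std::vector<int> next(s.size(), -1);
    for (std::size_t i = 0; i < s.size(); ++i) {
        // std::string::find scans forward from i + 1 for the same letter.
        std::size_t j = s.find(s[i], i + 1);
        if (j != std::string::npos) {
            next[i] = static_cast<int>(j);
        }
    }
    return next;
}
```

    Calling buildNext("ACGTAACGT") gives the chain 0 -> 4 -> 5 for the letter A, which is why case 2 has to try both next[0] and next[next[0]].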

    Time complexity

    Building the next array is O(n^2) and the traversal is O(n^2), so the total time complexity is O(n^2).

    Code

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    class Solution {
    public:
        std::vector<std::string> findRepeatedDnaSequences(std::string s);
    };

    std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
        const std::size_t n = s.length();
        if (n <= 10) {
            return {};
        }

        // next[pos]: index of the next occurrence of s[pos] after pos, or npos.
        std::vector<std::size_t> next(n);
        for (std::size_t pos = 0; pos < n; ++pos) {
            next[pos] = s.find(s[pos], pos + 1);
        }

        // The set removes duplicate results.
        std::set<std::string> found;

        for (std::size_t pos = 0; pos + 10 <= n; ++pos) {
            // Walk every later position that starts with the same letter
            // and still has room for a full 10-letter window.
            for (std::size_t nextPos = next[pos];
                 nextPos != std::string::npos && nextPos + 10 <= n;
                 nextPos = next[nextPos]) {
                // The first letters already match; compare the remaining nine.
                if (s.compare(pos + 1, 9, s, nextPos + 1, 9) == 0) {
                    found.insert(s.substr(pos, 10));
                    break;  // one match is enough for this start position
                }
            }
        }

        return std::vector<std::string>(found.begin(), found.end());
    }

    Hash table plus bit manipulation method

    (See Show Tags for the hint; runtime 10 ms!)

    Algorithm analysis

    First, encode A, C, G and T in binary:

    A -> 00

    C -> 01

    G -> 10

    T -> 11

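    With this encoding, every 10-letter string corresponds to one number, and a 10-letter string takes 20 bits. An int is typically 4 bytes (32 bits), so it can hold one 10-letter string. For example: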

    ACGTACGTAC -> 00011011000110110001

    AAAAAAAAAA -> 00000000000000000000
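
    As a quick check of the two examples above, the encoding can be sketched as follows. The names encodeBase and encodeWindow are hypothetical, and the sketch assumes the input contains only A, C, G and T.

```cpp
#include <cstdint>
#include <string>

// Map one nucleotide to its 2-bit code (A=00, C=01, G=10, T=11).
std::uint32_t encodeBase(char c) {
    switch (c) {
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 0;  // 'A' (assumption: no other letters appear)
    }
}

// Pack the first 10 letters of s into a 20-bit integer,
// with the first letter in the highest two bits.
std::uint32_t encodeWindow(const std::string& s) {
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < 10 && i < s.size(); ++i) {
        code = (code << 2) | encodeBase(s[i]);
    }
    return code;
}
```

    With this sketch, encodeWindow("ACGTACGTAC") is 0x1B1B1, which is 00011011000110110001 in binary.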

    A 20-bit binary number has at most 2^20 combinations, so the hash table needs 2^20 entries, i.e. 1024 * 1024. It is declared as bool hashMap[1024 * 1024];
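
    One possible variation on this table, shown only as a sketch: keep two flags per code, "seen" and "already reported", so that no separate set is needed for deduplication. WindowTable and isNewRepeat are hypothetical names, and std::vector<bool> is used here in place of a global bool array.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical tracker with one "seen" flag and one "reported" flag
// for each of the 2^20 possible window codes.
class WindowTable {
public:
    WindowTable() : seen_(1u << 20, false), reported_(1u << 20, false) {}

    // Returns true exactly once per code: on its second occurrence.
    bool isNewRepeat(std::uint32_t code) {
        if (!seen_[code]) {
            seen_[code] = true;  // first occurrence, nothing to report yet
            return false;
        }
        if (reported_[code]) {
            return false;        // third or later occurrence
        }
        reported_[code] = true;
        return true;
    }

private:
    std::vector<bool> seen_;
    std::vector<bool> reported_;
};
```

    Both tables together take about 256 KB, since std::vector<bool> stores one bit per entry.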

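    Traversing the string

    Moving the window one letter to the right is the same as shifting the corresponding int left by 2 bits, setting its lowest 2 bits to the code of the new letter, and finally clearing the 2 bits that overflowed past bit 19 (bits 20 and 21). For example: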

    src CAAAAAAAAAC

    subStr CAAAAAAAAA

    int 01000000000000000000

    subStr AAAAAAAAAC

    int 00000000000000000001
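
    The sliding step in this example can be written as a one-line update. The sketch below uses a hypothetical function name, rollWindow, and masks with the low 20 bits, which has the same effect as clearing the two overflow bits.

```cpp
#include <cstdint>

// Slide the window one letter to the right.
// code:     20-bit value of the current 10-letter window
// baseCode: 2-bit code of the incoming letter (A=0, C=1, G=2, T=3)
std::uint32_t rollWindow(std::uint32_t code, std::uint32_t baseCode) {
    const std::uint32_t mask = (1u << 20) - 1;  // keep only the low 20 bits
    return ((code << 2) | baseCode) & mask;
}
```

    For the example above, CAAAAAAAAA is 0x40000, and rollWindow(0x40000, 1) yields 1, the code of AAAAAAAAAC.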

    Time complexity

    Traversing the string is O(n) and each hash table operation is O(1), so the total time complexity is O(n).

    Code

    #include <cstddef>
    #include <cstring>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // One flag per possible 20-bit code: has this 10-letter window been seen?
    bool hashMap[1024 * 1024];

    class Solution {
    public:
        std::vector<std::string> findRepeatedDnaSequences(std::string s);
    };

    std::vector<std::string> Solution::findRepeatedDnaSequences(std::string s) {
        std::vector<std::string> rel;
        if (s.length() <= 10) {
            return rel;
        }

        // Map each letter (as an offset from 'A') to its 2-bit code.
        unsigned char convert[26] = {0};
        convert['A' - 'A'] = 0;  // 00
        convert['C' - 'A'] = 1;  // 01
        convert['G' - 'A'] = 2;  // 10
        convert['T' - 'A'] = 3;  // 11

        // Reset the table; it is global and survives between calls.
        std::memset(hashMap, 0, sizeof(hashMap));

        // Encode the first 10-letter window.
        int hashValue = 0;
        for (std::size_t pos = 0; pos < 10; ++pos) {
            hashValue <<= 2;
            hashValue |= convert[s[pos] - 'A'];
        }
        hashMap[hashValue] = true;

        // Codes that have already been added to the result.
        std::unordered_set<int> strHashValue;

        // Slide the window one letter at a time.
        for (std::size_t pos = 10; pos < s.length(); ++pos) {
            hashValue <<= 2;
            hashValue |= convert[s[pos] - 'A'];
            hashValue &= ~(0x300000);  // clear bits 20 and 21 to keep 20 bits

            if (hashMap[hashValue]) {
                // Seen before: report it only once.
                if (strHashValue.insert(hashValue).second) {
                    rel.push_back(s.substr(pos - 9, 10));
                }
            } else {
                hashMap[hashValue] = true;
            }
        }

        return rel;
    }
  • Original article: https://www.cnblogs.com/hzhesi/p/4285793.html