  • LeetCode: Repeated DNA Sequences

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
    
    Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
    
    For example,
    
    Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
    
    Return:
    ["AAAAACCCCC", "CCCCCAAAAA"].

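    Method 2: A further approach uses a HashSet. Scan the string in O(N) time, take each substring of length 10, and add it to the result when it has already been seen. This needs O(N) space, or more precisely O(N * 10 chars); in Java a char is 2 bytes, so O(N * 20 bytes). With a large input string this runs into Memory Limit Exceeded (MLE).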
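    Method 2 can be sketched roughly as follows. The class and method names (SubstringSetSketch, findRepeated) are illustrative and not from the original post; a second set is an assumption added here so that each repeated sequence is reported only once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical name; a minimal sketch of the substring-HashSet approach.
class SubstringSetSketch {
    // Returns every 10-letter substring that occurs more than once, each reported once.
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();      // every window seen so far
        Set<String> repeated = new HashSet<>();  // windows seen at least twice
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            if (!seen.add(window)) {             // add() returns false if already present
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }
}
```

    Every window is stored as its own String, which is where the O(N * 20 bytes) estimate above comes from.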

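    Best solution: build on Method 2 with bit operations. The idea is to map each substring to an integer and to obtain the next substring by shifting and AND-ing that integer. Updating and hashing a single int is cheaper than building and hashing a 10-character string, so this approach also saves running time.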

    First, encode A, C, G, T in binary:

    A -> 00

    C -> 01

    G -> 10

    T -> 11

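    With this encoding, every 10-letter string corresponds to one number, and that number is 20 bits long. An int usually has 4 bytes (32 bits), so a single int can represent one 10-letter string. For example: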

    ACGTACGTAC -> 00011011000110110001

    AAAAAAAAAA -> 00000000000000000000
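    The packing shown in the two examples can be reproduced with a small helper. This is only a sketch: DnaEncodeSketch, code and encode are names made up for illustration, and the input is assumed to contain only A, C, G, T.

```java
// Hypothetical helper; packs a 10-letter DNA string into the low 20 bits of an int.
class DnaEncodeSketch {
    // Maps one nucleotide to its 2-bit code: A=00, C=01, G=10, T=11.
    static int code(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Appends the 2-bit code of each letter, leftmost letter in the highest bits.
    static int encode(String letters) {
        int value = 0;
        for (int i = 0; i < letters.length(); i++) {
            value = (value << 2) | code(letters.charAt(i));
        }
        return value;
    }
}
```

    For instance, DnaEncodeSketch.encode("ACGTACGTAC") yields the integer whose 20-bit binary form is 00011011000110110001.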

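    Moving the window one letter to the right corresponds to shifting the int left by 2 bits, setting its lowest 2 bits to the code of the new letter, and finally clearing the 2 bits that now sit above the low 20 bits (the letter that just left the window).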
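    One window step can be written as a single expression. The names DnaRollSketch, MASK and roll are illustrative only, and the sketch assumes the input consists solely of A, C, G, T.

```java
// Hypothetical helper; one step of the sliding 10-letter window.
class DnaRollSketch {
    // Keeps the low 20 bits, i.e. exactly 10 letters.
    static final int MASK = (1 << 20) - 1;

    // A=0, C=1, G=2, T=3 by position in "ACGT" (assumes c is one of these four letters).
    static int code(char c) {
        return "ACGT".indexOf(c);
    }

    // Shift left by 2 bits, put the new letter in the low 2 bits, drop the letter that left.
    static int roll(int window, char next) {
        return ((window << 2) | code(next)) & MASK;
    }
}
```

    Starting from the code of ACGTACGTAC, rolling in a G gives the code of CGTACGTACG.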

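    Cost analysis:

    Time complexity: O(N). Each window update takes a constant number of bit operations.

    It saves space: at 2 bytes per Java char, 10 chars take 20 bytes, whereas the packed form takes 20 bits for all 10 chars, so O(N * 20 bits) in total.

    Space complexity: a 20-bit binary number has at most 2^20 combinations, so the HashSet holds at most 2^20 = 1024 * 1024 entries, which is O(1) with respect to N.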
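    Because there are only 2^20 possible codes, the set can also be a fixed-size bit table instead of a HashSet. The following is a sketch of that variant, not the original author's code; DnaBitSetSketch is a made-up name, java.util.BitSet is swapped in for the HashSet, and the input is assumed to contain only A, C, G, T.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical variant: two bit tables of 2^20 bits each (128 KiB per table).
class DnaBitSetSketch {
    static List<String> findRepeated(String s) {
        final int mask = (1 << 20) - 1;
        BitSet seen = new BitSet(1 << 20);      // codes seen at least once
        BitSet reported = new BitSet(1 << 20);  // codes already added to the result
        List<String> result = new ArrayList<>();
        int window = 0;
        for (int i = 0; i < s.length(); i++) {
            window = ((window << 2) | "ACGT".indexOf(s.charAt(i))) & mask;
            if (i < 9) {
                continue;                       // the window does not hold 10 letters yet
            }
            if (!seen.get(window)) {
                seen.set(window);
            } else if (!reported.get(window)) {
                reported.set(window);
                result.add(s.substring(i - 9, i + 1));
            }
        }
        return result;
    }
}
```

    The memory used by the two tables is fixed regardless of the input length, which is the O(1) bound stated above.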

    Follow-up: for the in-order case, use radix sort.

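    Follow-up: if the input comes from a Scanner:

     Scanner scanner = new Scanner(System.in);
           char a = scanner.next().charAt(0); // Scanner has no nextCharacter(); take the first char of the next token

    or String a = scanner.next(); // note: the method is next(), not nextString()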
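    When the input arrives through a Scanner, the whole string may never be held in memory, so the repeated window has to be rebuilt from its 20-bit code. A possible sketch follows; DnaStreamSketch, decode and findRepeated are hypothetical names, whitespace between tokens is assumed to be ignorable, and the letters are assumed to be only A, C, G, T.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

// Hypothetical streaming variant: reads tokens from a Scanner and keeps only the rolling code.
class DnaStreamSketch {
    // Turns a 20-bit code back into its 10 letters, lowest 2 bits = last letter.
    static String decode(int window) {
        char[] letters = new char[10];
        for (int i = 9; i >= 0; i--) {
            letters[i] = "ACGT".charAt(window & 3);
            window >>= 2;
        }
        return new String(letters);
    }

    static List<String> findRepeated(Scanner in) {
        final int mask = (1 << 20) - 1;
        Set<Integer> seen = new HashSet<>();
        Set<Integer> reported = new HashSet<>();
        List<String> result = new ArrayList<>();
        int window = 0;
        long count = 0;                          // letters consumed so far
        while (in.hasNext()) {
            String chunk = in.next();            // tokens are treated as one continuous sequence
            for (int i = 0; i < chunk.length(); i++) {
                window = ((window << 2) | "ACGT".indexOf(chunk.charAt(i))) & mask;
                count++;
                // Report a code the first time it is seen again.
                if (count >= 10 && !seen.add(window) && reported.add(window)) {
                    result.add(decode(window));
                }
            }
        }
        return result;
    }
}
```

    It could be called as DnaStreamSketch.findRepeated(new Scanner(System.in)).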

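    // LSD radix sort in base 10 (this listing is C#, not Java).
    // digit is the number of decimal digits of the largest element; values are assumed non-negative.
    public static int[] RadixSort(int[] ArrayToSort, int digit)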
    {
        //low to high digit
        for (int k = 1; k <= digit; k++)
        {
            //temp array to store the sort result inside digit
            int[] tmpArray = new int[ArrayToSort.Length];
     
            //temp array for countingsort
            int[] tmpCountingSortArray = new int[10]{0,0,0,0,0,0,0,0,0,0};
     
            //CountingSort
            for (int i = 0; i < ArrayToSort.Length; i++)
            {
                //split the specified digit from the element
                int tmpSplitDigit = ArrayToSort[i]/(int)Math.Pow(10,k-1) - (ArrayToSort[i]/(int)Math.Pow(10,k))*10;
                tmpCountingSortArray[tmpSplitDigit] += 1; 
            }
     
            for (int m = 1; m < 10; m++)
            {
                tmpCountingSortArray[m] += tmpCountingSortArray[m - 1];
            }
     
            //output the value to result
            for (int n = ArrayToSort.Length - 1; n >= 0; n--)
            {
                int tmpSplitDigit = ArrayToSort[n] / (int)Math.Pow(10,k - 1) - (ArrayToSort[n]/(int)Math.Pow(10,k)) * 10;
                tmpArray[tmpCountingSortArray[tmpSplitDigit]-1] = ArrayToSort[n];
                tmpCountingSortArray[tmpSplitDigit] -= 1;
            }
     
            //copy the digit-inside sort result to source array
            for (int p = 0; p < ArrayToSort.Length; p++)
            {
                ArrayToSort[p] = tmpArray[p];
            }
        }
     
        return ArrayToSort;
    }
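    Applied to this problem, the same idea can sort the 20-bit window codes directly. With A < C < G < T encoded as 0 < 1 < 2 < 3 and all windows the same length, ascending code order matches alphabetical order of the sequences. The sketch below is one reading of the follow-up, not the original author's code; DnaRadixSortSketch and sortCodes are made-up names, and it uses one nucleotide (2 bits) per pass instead of decimal digits.

```java
// Hypothetical Java counterpart of the radix sort above, specialised to 20-bit DNA codes.
class DnaRadixSortSketch {
    // Returns a sorted copy of codes; each of the 10 passes sorts by one 2-bit letter,
    // starting from the last letter (lowest bits).
    static int[] sortCodes(int[] codes) {
        int[] src = codes.clone();
        int[] dst = new int[src.length];
        for (int shift = 0; shift < 20; shift += 2) {
            int[] count = new int[5];
            for (int v : src) {
                count[((v >>> shift) & 3) + 1]++;       // histogram, offset by one slot
            }
            for (int d = 0; d < 4; d++) {
                count[d + 1] += count[d];               // count[d] becomes the start index of digit d
            }
            for (int v : src) {
                dst[count[(v >>> shift) & 3]++] = v;    // stable placement
            }
            int[] tmp = src;                            // the output of this pass feeds the next one
            src = dst;
            dst = tmp;
        }
        return src;
    }
}
```

    Each pass is a counting sort with 4 buckets, so sorting M codes costs O(10 * M) time.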
    

    Because the alphabet consists of only 4 letters, every 10-letter window maps to its own 20-bit value, so there is no need to worry about collisions. The hash of the current window can be found in constant time by dropping the former first character and appending the new last one.

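    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;

    public class Solution {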
        public List<String> findRepeatedDnaSequences(String s) {
            ArrayList<String> res = new ArrayList<String>();
            if (s==null || s.length()<=10) return res;
            HashMap<Character, Integer> dict = new HashMap<Character, Integer>();
            dict.put('A', 0);
            dict.put('C', 1);
            dict.put('G', 2);
            dict.put('T', 3);
            HashSet<Integer> set = new HashSet<Integer>();
            HashSet<String> result = new HashSet<String>(); // collect in a HashSet first: a sequence seen three or more times would otherwise be added to the list more than once
            int hashcode = 0;
            for (int i=0; i<s.length(); i++) {
                if (i < 9) {
                    hashcode = (hashcode<<2) + dict.get(s.charAt(i));
                }
                else {
                    hashcode = (hashcode<<2) + dict.get(s.charAt(i));
                    hashcode &= (1<<20) - 1;
                    if (!set.contains(hashcode)) {
                        set.add(hashcode);
                    }
                    else {
                        //duplicate hashcode: take the current 10-letter window from s and add it to result
                        String temp = s.substring(i-9, i+1);
                        result.add(temp);
                    }
                }
            }
            for (String item : result) {
                res.add(item);
            }
            return res;
        }
    }
    

      

  • Original article: https://www.cnblogs.com/apanda009/p/7951171.html