  • [LeetCode#187]Repeated DNA Sequences

    Problem:

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
    
    Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
    
    For example,
    
    Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
    
    Return:
    ["AAAAACCCCC", "CCCCCAAAAA"].

    Analysis:

    This problem has an ingenious solution.
    If you have not encountered it before, it is hard to come up with on your own.
    
    Idea:
    Since there are only four characters, "A", "C", "G" and "T", we can map each character to its own 2-bit code. (Note: this is not the ASCII code.)
    Each subsequence is 10 characters long, so after mapping it takes up only 20 bits. Since an int in Java has 32 bits, a subsequence fits into a single Integer, which we call its Integer hash code.
    
    Another benefit of this mapping is that whenever a new character is added, we can update the hash code with a bit shift instead of recomputing it from scratch.
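    To make the 2-bit mapping concrete, here is a minimal sketch that packs a sequence into an int and unpacks it again. The class and method names (DnaCodeSketch, encode, decode) are hypothetical and not part of the original solution, and the sketch assumes the input contains only A, C, G and T.

```java
class DnaCodeSketch {
    // Assumed letter order: the index in this string is the 2-bit code (A=0, C=1, G=2, T=3).
    private static final String LETTERS = "ACGT";

    // Pack a sequence of at most 16 letters into an int, 2 bits per letter.
    static int encode(String seq) {
        int code = 0;
        for (int i = 0; i < seq.length(); i++) {
            int bits = LETTERS.indexOf(seq.charAt(i));
            if (bits < 0) {
                throw new IllegalArgumentException("not a nucleotide: " + seq.charAt(i));
            }
            code = (code << 2) | bits;
        }
        return code;
    }

    // Unpack the lowest 2 * length bits back into letters.
    static String decode(int code, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = LETTERS.charAt(code & 3);
            code >>>= 2;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int code = encode("AAAAACCCCC");
        System.out.println(code);             // 341, i.e. binary 0101010101
        System.out.println(decode(code, 10)); // AAAAACCCCC
    }
}
```

    Because the mapping can be reversed, two different 10-letter windows never share a code, so comparing codes is equivalent to comparing the substrings themselves.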
    
    1. Prepare the HashMap for the mapping.
    
    HashMap<Character, Integer> map = new HashMap<Character, Integer> ();
    map.put('A', 0);
    map.put('C', 1);
    map.put('G', 2);
    map.put('T', 3);
    
    
    2. Move the subsequence window and compute the related hash code.
    int hash = 0;
    for (int i = 0; i < s.length(); i++) {
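        if (i < 9) {
            // Window not full yet: just append the new character's 2 bits.
            hash = (hash << 2) + map.get(s.charAt(i));
        } else{
            // Append the new character, then drop the bits that left the window.
            hash = (hash << 2) + map.get(s.charAt(i));
            hash = hash & ((1 << 20) - 1);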
            ...
            
        }
    }
    Note: once the sliding window reaches 10 characters, we need the hash code of just that window. The trick is to '&' the hash with a mask of twenty 1 bits, which keeps only the low 20 bits.
    2.1  Build the mask of twenty 1 bits.
    ((1 << 20) - 1)
    The idea is simple: subtracting 1 from a power of two sets all the lower bits, e.g. (1 << 2) - 1 = 4 - 1 = 100 - 1 = 011 in binary.
    2.2  Use the '&' operator to keep those bits.
    hash = hash & ((1 << 20) - 1);
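    The window update and the mask can be checked by hand on a small case. The sketch below is only an illustration, not the author's code: the class RollingWindowSketch and its window-length parameter k are hypothetical, and k = 2 is used in the example purely so the bits are easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

class RollingWindowSketch {
    // Returns the 2k-bit code of every length-k window of s, left to right.
    // Assumes s contains only A, C, G, T and 1 <= k <= 15.
    static List<Integer> rollingCodes(String s, int k) {
        int mask = (1 << (2 * k)) - 1;   // 2k one-bits; k = 10 gives (1 << 20) - 1
        List<Integer> codes = new ArrayList<>();
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            // Shift the window left by one letter and append the new letter's 2 bits.
            hash = (hash << 2) + "ACGT".indexOf(s.charAt(i));
            // Drop the letter that just left the window (a no-op until the window is full).
            hash &= mask;
            if (i >= k - 1) {
                codes.add(hash);
            }
        }
        return codes;
    }

    public static void main(String[] args) {
        // Windows AC, CG, GT, TA, AC
        System.out.println(rollingCodes("ACGTAC", 2)); // [1, 6, 11, 12, 1]
    }
}
```

    The first and last windows are both "AC" and both produce the code 1, which is exactly the property the solution relies on to detect repeats.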
    
    
    Errors:
    When you put a <key, value> pair into a HashMap and the new value depends on the value already stored there, you must first check whether the key exists.
    if (counted.containsKey(hash))
        counted.put(hash, counted.get(hash)+1);
    else 
        counted.put(hash, 1);
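    On Java 8 or later the same check can be written more compactly with Map.merge or Map.getOrDefault. This is a sketch under that assumption; the class CountSketch and the sample keys are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class CountSketch {
    // Counts how often each key occurs.
    static Map<Integer, Integer> count(int[] keys) {
        Map<Integer, Integer> counted = new HashMap<>();
        for (int hash : keys) {
            // merge stores 1 if the key is absent, otherwise adds 1 to the stored value.
            counted.merge(hash, 1, Integer::sum);
            // Equivalent: counted.put(hash, counted.getOrDefault(hash, 0) + 1);
        }
        return counted;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counted = count(new int[] {341, 7, 341});
        System.out.println(counted.get(341)); // 2
        System.out.println(counted.get(7));   // 1
    }
}
```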

    Solution:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;

    public class Solution {
        public List<String> findRepeatedDnaSequences(String s) {
            ArrayList<String> ret = new ArrayList<String> ();
            if (s.length() < 10)
                return ret;
            HashMap<Character, Integer> map = new HashMap<Character, Integer> ();
            map.put('A', 0);
            map.put('C', 1);
            map.put('G', 2);
            map.put('T', 3);
            
            HashMap<Integer, Integer> counted = new HashMap<Integer, Integer> ();
            int hash = 0;
            for (int i = 0; i < s.length(); i++) {
                if (i < 9) {
                    hash = (hash << 2) + map.get(s.charAt(i));
                } else{
                    hash = (hash << 2) + map.get(s.charAt(i));
                    hash = hash & ((1 << 20) - 1);
                    if (counted.containsKey(hash) && counted.get(hash) == 1) {
                        ret.add(s.substring(i-9, i+1));
                        counted.put(hash, 2);
                    } else{
                        if (counted.containsKey(hash))
                            counted.put(hash, counted.get(hash)+1);
                        else 
                            counted.put(hash, 1);
                    }
                }
            }
            return ret;
        }
    }
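    Actually, since we only care whether a subsequence has appeared at least twice, we can use two HashSets and avoid the clumsy counting code above.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;

    public class Solution {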
        public List<String> findRepeatedDnaSequences(String s) {
            ArrayList<String> ret = new ArrayList<String> ();
            if (s.length() < 10)
                return ret;
            HashMap<Character, Integer> map = new HashMap<Character, Integer> ();
            map.put('A', 0);
            map.put('C', 1);
            map.put('G', 2);
            map.put('T', 3);
            HashSet<Integer> appeared = new HashSet<Integer> ();
            HashSet<Integer> counted = new HashSet<Integer> ();
            
            int hash = 0;
            for (int i = 0; i < s.length(); i++) {
                if (i < 9) {
                    hash = (hash << 2) + map.get(s.charAt(i));
                } else{
                    hash = (hash << 2) + map.get(s.charAt(i));
                    hash = hash & ((1 << 20) - 1);
                    if (appeared.contains(hash) && !counted.contains(hash)) {
                        ret.add(s.substring(i-9, i+1));
                        counted.add(hash);
                    } else{
                        appeared.add(hash);
                    }
                }
            }
            return ret;
        }
    }
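    The two-set version can be shortened a little further. The following is a sketch rather than the author's solution: it relies on Set.add returning false when the element is already present, replaces the HashMap lookup with "ACGT".indexOf, and uses a hypothetical class name RepeatedDnaSketch. It assumes the input contains only A, C, G and T.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RepeatedDnaSketch {
    static List<String> findRepeatedDnaSequences(String s) {
        List<String> ret = new ArrayList<>();
        Set<Integer> appeared = new HashSet<>();
        Set<Integer> counted = new HashSet<>();
        int mask = (1 << 20) - 1;
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = ((hash << 2) + "ACGT".indexOf(s.charAt(i))) & mask;
            if (i < 9) {
                continue; // window is not yet 10 letters long
            }
            // add returns false when the element was already present, so a window
            // is reported the second time it is seen and never again.
            if (!appeared.add(hash) && counted.add(hash)) {
                ret.add(s.substring(i - 9, i + 1));
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        System.out.println(findRepeatedDnaSequences("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
        // [AAAAACCCCC, CCCCCAAAAA]
    }
}
```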
  • Original article: https://www.cnblogs.com/airwindow/p/4759694.html