  • Supervised analysis with ADMIXTURE

    Overview

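    ADMIXTURE estimates ancestry proportions by maximum likelihood (block relaxation by default, with EM available as an alternative). It is typically run in one of two ways: with the number of subpopulations K specified in advance, or, when nothing is known about the population structure of the material, with cross-validation over a range of K values, taking the K with the lowest CV error as the recommended number of subpopulations. If the population of origin of some individuals is already known (with complete certainty), supervised analysis is worth considering: labelling those individuals can improve the accuracy of the assignment.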
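    The cross-validation route can be sketched in shell as follows. This is a minimal sketch, not the author's procedure: it assumes the admixture binary is on PATH, that each run's output is saved to a log file named log<K>.out (a naming choice made here), and that each log contains a line of the form "CV error (K=3): 0.52305". The function names run_cv and best_k are hypothetical.

```shell
#!/usr/bin/env bash

# Run ADMIXTURE with cross-validation for K = 1..max_k, keeping one log per K.
# Assumption: admixture is on PATH; log file names are chosen here for illustration.
run_cv() {
    local bed="$1" max_k="$2" k
    for k in $(seq 1 "$max_k"); do
        admixture --cv "$bed" "$k" | tee "log${k}.out"
    done
}

# Print the K whose "CV error (K=...): value" line has the smallest value.
best_k() {
    grep -h '^CV error' "$@" |
        awk '{ k = $3; gsub(/[^0-9]/, "", k)
               if (n++ == 0 || $4 + 0 < min) { min = $4 + 0; best = k } }
             END { print best }'
}

# Example (not executed here):
#   run_cv hapmap3.bed 5
#   best_k log*.out
```

    Keeping the parsing in its own function means the logs can be re-examined without rerunning ADMIXTURE, which is the slow part.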

    At the time of writing, the current ADMIXTURE release is 1.3.0, and its documentation was updated fairly recently.

    To avoid misrepresenting it, here is the relevant passage from the official documentation:

    Estimating P and Q from the SNP matrix G, without any additional information, can be
    viewed as an unsupervised learning problem. However it is not uncommon that some or
    all of the individuals in our data sample will have known ancestries, allowing us to set
    some rows in the matrix Q to known constants. This allows more accurate estimation of
    the ancestries of the remaining individuals, and of the ancestral allele frequencies. Viewing
    these reference individuals as training samples, the problem is transformed into a supervised
    learning problem.

    Supervised learning mode is enabled with the flag --supervised and requires an additional
    file with a .pop suffix, specifying the ancestries of the reference individuals. It is assumed
    that all reference samples have 100% ancestry from some ancestral population. Each line
    of the .pop file corresponds to individual listed on the same line number in the .fam or
    .ped file. If the individual is a population reference, the .pop file line should be a string
    (beginning with an alphanumeric character) designating the population. If the individual
    is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to
    indicate that the ancestry should be estimated.

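    According to the documentation, you need to prepare a population file with a .pop suffix that assigns each individual a label (a string); individuals of unknown type can be marked with "-". I would not recommend creating this file on Windows, because its line endings differ.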
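    If the .pop file did come from Windows, the line endings can be checked and normalised before use. The sketch below is a precaution rather than something the ADMIXTURE documentation prescribes; the function names has_crlf and strip_crlf and the file names in the example are made up for illustration.

```shell
#!/usr/bin/env bash

# Succeed if the file contains at least one carriage return (Windows CRLF endings).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of the input file with all carriage returns removed.
strip_crlf() {
    tr -d '\r' < "$1" > "$2"
}

# Example (not executed here):
#   has_crlf hapmap3.pop && strip_crlf hapmap3.pop hapmap3.unix.pop
```

    Writing to a second file instead of editing in place keeps the original around until the cleaned copy has been checked.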

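    How do you validate the .pop file? The author suggests running paste on the .fam and .pop files to check that the number of individuals matches (though wouldn't wc -l be simpler?).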
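    Both checks fit in a few lines of shell. This is a minimal sketch under the assumption that the .fam and .pop files share a prefix; check_pop is a hypothetical helper name. Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline will appear one line short.

```shell
#!/usr/bin/env bash

# Compare the line counts of <prefix>.fam and <prefix>.pop.
# Prints a summary and returns non-zero when the counts differ.
check_pop() {
    local prefix="$1" n_fam n_pop
    # tr strips the padding that some wc implementations add
    n_fam=$(wc -l < "${prefix}.fam" | tr -d ' ')
    n_pop=$(wc -l < "${prefix}.pop" | tr -d ' ')
    if [ "$n_fam" -ne "$n_pop" ]; then
        echo "mismatch: ${n_fam} lines in .fam, ${n_pop} in .pop" >&2
        return 1
    fi
    echo "ok: ${n_fam} individuals"
}

# Example (not executed here):
#   check_pop hapmap3
#   paste hapmap3.fam hapmap3.pop | head    # eyeball IDs next to labels
```

    The count catches missing or extra lines; the paste view is still useful for spotting labels that ended up on the wrong individual.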

    Here is the catch: the author never actually explains how to run it. I tried it out and am recording the steps here.

    Hands-on

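    Download the example data from the official site:
    http://dalexander.github.io/admixture/download.html
    After unpacking, the data are in PLINK binary format (matching .bed, .bim and .fam files), but there is no .ped/.map pair. That looks like an oversight by the author, though PLINK can do the conversion (ADMIXTURE can also read the .bed file directly, so the conversion is only needed for the .ped route used below):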

    wget http://dalexander.github.io/admixture/hapmap3-files.tar.gz
    tar -xvf hapmap3-files.tar.gz
    plink --bfile hapmap3 --recode --out hapmap3 --noweb
    wc -l hapmap3*
    

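    Prepare hapmap3.pop (the prefix must match the PLINK data, and the file must sit in the same directory). R, awk or similar tools will do; here is an arbitrary simulated one: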

    # Simulated labels: six blocks of 54 individuals (324 in total), alternating known and unknown
    dat = data.frame(V1 = rep(c("A","-","B","-","C","-"),each=54))
    write.table(dat,"hapmap3.pop",row.names=F,col.names=F,quote=F,sep="\t")
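    The same kind of simulated file can be produced with awk directly from the .fam file, so the number of lines matches by construction. This is a sketch of an equivalent, not the author's code: make_block_pop is a hypothetical name, and unlike the R version it cycles through the label sequence again if the .fam file has more lines than six blocks cover.

```shell
#!/usr/bin/env bash

# Emit one label per .fam line, walking through the sequence A - B - C -
# in blocks of `size` consecutive individuals.
make_block_pop() {
    local fam="$1" size="$2"
    awk -v size="$size" '
        BEGIN { n = split("A - B - C -", labels, " ") }
        { print labels[int((NR - 1) / size) % n + 1] }
    ' "$fam"
}

# Example (not executed here):
#   make_block_pop hapmap3.fam 54 > hapmap3.pop
```

    As with the R snippet, these labels are arbitrary and only serve to exercise the supervised mode; real labels have to come from known ancestry.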
    


    Add --supervised and run admixture:

    admixture hapmap3.ped 3 --supervised
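    For this input and K=3, ADMIXTURE writes the estimated ancestry fractions to hapmap3.3.Q in the working directory, one row per individual in the same order as the .fam file. To read the result, it helps to put the labels next to those rows. The sketch below assumes that layout; label_q and unknown_q are hypothetical helper names, and nothing here assumes which Q column corresponds to which label.

```shell
#!/usr/bin/env bash

# Print each .pop label followed by a tab and that individual's row of the Q file.
label_q() {
    paste "$1" "$2"
}

# Print line number and Q row for individuals whose ancestry was estimated,
# i.e. whose .pop line does not begin with an alphanumeric character.
unknown_q() {
    paste "$1" "$2" | awk -F'\t' '$1 !~ /^[A-Za-z0-9]/ { print NR "\t" $2 }'
}

# Example (not executed here):
#   label_q hapmap3.pop hapmap3.3.Q | head
#   unknown_q hapmap3.pop hapmap3.3.Q
```

    Splitting on tabs rather than on whitespace keeps blank .pop lines, which the documentation also allows for unknown individuals, from shifting the columns.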
    

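    To see the difference, compare runs with and without --supervised. Result without it:
    [screenshot: output without --supervised]

    Result with it:
    [screenshot: output with --supervised]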

    The two differ considerably. How this affects downstream results is not examined here.

  • Original article: https://www.cnblogs.com/jessepeng/p/14148988.html