zoukankan      html  css  js  c++  java
  • pysam

    Introduction

    Pysam is a python module that makes it easy to read and manipulate mapped short read sequence data stored in SAM/BAM files. It is a lightweight wrapper of the htslib C-API.

    This page provides a quick introduction in using pysam followed by the API. See Working with BAM/CRAM/SAM-formatted files for more detailed usage instructions.

    To use the module to read a file in BAM format, create a AlignmentFile object:

    import pysam
    samfile = pysam.AlignmentFile("ex1.bam", "rb")
    

    Once a file is opened you can iterate over all of the read mapping to a specified region using fetch(). Each iteration returns a AlignedSegment object which represents a single read along with its fields and optional tags:

    for read in samfile.fetch('chr1', 100, 120):
         print read
    
    samfile.close()
    

    To give:

    EAS56_57:6:190:289:82       0       99      <<<7<<<;<<<<<<<<8;;<7;4<;<;;;;;94<;     69      CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA     0       192     1
    EAS56_57:6:190:289:82       0       99      <<<<<<;<<<<<<<<<<;<<;<<<<;8<6;9;;2;     137     AGGGGTGCAGAGCCGAGTCACGGGGTTGCCAGCAC     73      64      1
    EAS51_64:3:190:727:308      0       102     <<<<<<<<<<<<<<<<<<<<<<<<<<<::<<<844     99      GGTGCAGAGCCGAGTCACGGGGTTGCCAGCACAGG     99      18      1
    ...
    

    You can also write to a AlignmentFile:

    import pysam
    samfile = pysam.AlignmentFile("ex1.bam", "rb")
    pairedreads = pysam.AlignmentFile("allpaired.bam", "wb", template=samfile)
    for read in samfile.fetch():
         if read.is_paired:
                 pairedreads.write(read)
    
    pairedreads.close()
    samfile.close()
    

    An alternative way of accessing the data in a SAM file is by iterating over each base of a specified region using the pileup() method. Each iteration returns a PileupColumn which represents all the reads in the SAM file that map to a single base in the reference sequence. The list of reads are represented as PileupRead objects in the PileupColumn.pileups property:

    import pysam
    samfile = pysam.AlignmentFile("ex1.bam", "rb" )
    for pileupcolumn in samfile.pileup("chr1", 100, 120):
        print ("
    coverage at base %s = %s" %
               (pileupcolumn.pos, pileupcolumn.n))
        for pileupread in pileupcolumn.pileups:
            if not pileupread.is_del and not pileupread.is_refskip:
                # query position is None if is_del or is_refskip is set.
                print ('	base in read %s = %s' %
                      (pileupread.alignment.query_name,
                       pileupread.alignment.query_sequence[pileupread.query_position]))
    
    samfile.close()
    

    The above code outputs:

    coverage at base 99 = 1
        base in read EAS56_57:6:190:289:82 = A
    
    coverage at base 100 = 1
        base in read EAS56_57:6:190:289:82 = G
    
    coverage at base 101 = 1
        base in read EAS56_57:6:190:289:82 = G
    
    coverage at base 102 = 2
        base in read EAS56_57:6:190:289:82 = G
        base in read EAS51_64:3:190:727:308 = G
    ...
    

    Commands available in csamtools are available as simple function calls. For example:

    pysam.sort("-o", "output.bam", "ex1.bam")
    

    corresponds to the command line:

    samtools sort -o output.bam ex1.bam
    

    Analogous to AlignmentFile, a TabixFile allows fast random access to compressed and tabix indexed tab-separated file formats with genomic data:

    import pysam
    tabixfile = pysam.TabixFile("example.gtf.gz")
    
    for gtf in tabixfile.fetch("chr1", 1000, 2000):
        print (gtf.contig, gtf.start, gtf.end, gtf.gene_id)
    

    TabixFile implements lazy parsing in order to iterate over large tables efficiently.

    More detailed usage instructions is at Working with BAM/CRAM/SAM-formatted files.

    Note

    Coordinates in pysam are always 0-based (following the python convention). SAM text files use 1-based coordinates.

    Note

    The above examples can be run in the tests directory of the
    installation directory. Type ‘make’ before running them.

    See also

    https://github.com/pysam-developers/pysam

    The pysam code repository, containing source code and download instructions

    http://pysam.readthedocs.org/en/latest/

    The pysam website containing documentation

    API

    SAM/BAM files

    Objects of type AlignmentFile allow working with BAM/SAM formatted files.

    class pysam.AlignmentFile

    AlignmentFile(filepath_or_object, mode=None, template=None, reference_names=None, reference_lengths=None, text=NULL, header=None, add_sq_text=False, check_header=True, check_sq=True, reference_filename=None, filename=None, index_filename=None, filepath_index=None, require_index=False, duplicate_filehandle=True, ignore_truncation=False)

    A SAM/BAM/CRAM formatted file.

    If filepath_or_object is a string, the file is automatically opened. If filepath_or_object is a python File object, the already opened file will be used.

    If the file is opened for reading and an index exists (if file is BAM, a .bai file or if CRAM a .crai file), it will be opened automatically. index_filename may be specified explicitly. If the index is not named in the standard manner, not located in the same directory as the BAM/CRAM file, or is remote. Without an index, random access via fetch() and pileup() is disabled.

    For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):

    1. If template is given, the header is copied from a another AlignmentFile (template must be a AlignmentFile).
    2. If header is given, the header is built from a multi-level dictionary.
    3. If text is given, new header text is copied from raw text.
    4. The names (reference_names) and lengths (reference_lengths) are supplied directly as lists.

    When reading or writing a CRAM file, the filename of a FASTA-formatted reference can be specified with reference_filename.

    By default, if a file is opened in mode ‘r’, it is checked for a valid header (check_header = True) and a definition of chromosome names (check_sq = True).

    Parameters:
    • mode (string) –

      mode should be r for reading or w for writing. The default is text mode (SAM). For binary (BAM) I/O you should append b for compressed or u for uncompressed BAM output. Use h to output header information in text (TAM) mode. Use c for CRAM formatted files.

      If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb, wbu, wb0, rc and wc. For instance, to open a BAM formatted file for reading, type:

      f = pysam.AlignmentFile('ex1.bam','rb')
      

      If mode is not specified, the method will try to auto-detect in the order ‘rb’, ‘r’, thus both the following should work:

      f1 = pysam.AlignmentFile('ex1.bam')
      f2 = pysam.AlignmentFile('ex1.sam')
      
    • template (AlignmentFile) – when writing, copy header frem template.
    • header (dict) – when writing, build header from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.
    • text (string) – when writing, use the string provided as the header
    • reference_names (list) – see reference_lengths
    • reference_lengths (list) – when writing or opening a SAM file without header build header from list of chromosome names and lengths. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option can be changed by unsetting the flag add_sq_text.
    • add_sq_text (bool) – do not add ‘SQ’ and ‘LN’ tags to header. This option permits construction SAM formatted files without a header.
    • add_sam_header (bool) – when outputting SAM the default is to output a header. This is equivalent to opening the file in ‘wh’ mode. If this option is set to False, no header will be output. To read such a file, set check_header=False.
    • check_header (bool) – obsolete: when reading a SAM file, check if header is present (default=True)
    • check_sq (bool) – when reading, check if SQ entries are present in header (default=True)
    • reference_filename (string) – Path to a FASTA-formatted reference file. Valid only for CRAM files. When reading a CRAM file, this overrides both $REF_PATH and the URL specified in the header (UR tag), which are normally used to find the reference.
    • index_filename (string) – Explicit path to the index file. Only needed if the index is not named in the standard manner, not located in the same directory as the BAM/CRAM file, or is remote. An IOError is raised if the index cannot be found or is invalid.
    • filepath_index (string) – Alias for index_filename.
    • require_index (bool) – When reading, require that an index file is present and is valid or raise an IOError. (default=False)
    • filename (string) – Alternative to filepath_or_object. Filename of the file to be opened.
    • duplicate_filehandle (bool) – By default, file handles passed either directly or through File-like objects will be duplicated before passing them to htslib. The duplication prevents issues where the same stream will be closed by htslib and through destruction of the high-level python object. Set to False to turn off duplication.
    • ignore_truncation (bool) – Issue a warning, instead of raising an error if the current file appears to be truncated due to a missing EOF marker. Only applies to bgzipped formats. (Default=False)
    check_index(self)

    return True if index is present.

    Raises:
    • AttributeError – if htsfile is SAM formatted and thus has no index.
    • ValueError – if htsfile is closed or index could not be opened.
    close(self)

    closes the pysam.AlignmentFile.

    count(self, contig=None, start=None, stop=None, region=None, until_eof=False, read_callback='nofilter', reference=None, end=None)

    count the number of reads in region

    The region is specified by contig, start and stop. reference and end are also accepted for backward compatiblity as synonyms for contig and stop, respectively. Alternatively, a samtools region string can be supplied.

    A SAM file does not allow random access and if region or contig are given, an exception is raised.

    Parameters:
    • contig (string) – reference_name of the genomic region (chromosome)
    • start (int) – start of the genomic region (0-based inclusive)
    • stop (int) – end of the genomic region (0-based exclusive)
    • region (string) – a region string in samtools format.
    • until_eof (bool) – count until the end of the file, possibly including unmapped reads as well.
    • read_callback (string or function) –

      select a call-back to ignore reads when counting. It can be either a string with the following values:

      all
      skip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
      nofilter
      uses every single read

      Alternatively, read_callback can be a function check_read(read) that should return True only for those reads that shall be included in the counting.

    • reference (string) – backward compatible synonym for contig
    • end (int) – backward compatible synonym for stop
    Raises:

    ValueError – if the genomic coordinates are out of range or invalid.

    count_coverage(self, contig=None, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None)

    count the coverage of genomic positions by reads in region.

    The region is specified by contig, start and stop. reference and end are also accepted for backward compatiblity as synonyms for contig and stop, respectively. Alternatively, a samtools region string can be supplied. The coverage is computed per-base [ACGT].

    Parameters:
    • contig (string) – reference_name of the genomic region (chromosome)
    • start (int) – start of the genomic region (0-based inclusive)
    • stop (int) – end of the genomic region (0-based exclusive)
    • region (int) – a region string.
    • quality_threshold (int) – quality_threshold is the minimum quality score (in phred) a base has to reach to be counted.
    • read_callback (string or function) –

      select a call-back to ignore reads when counting. It can be either a string with the following values:

      all
      skip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
      nofilter
      uses every single read

      Alternatively, read_callback can be a function check_read(read) that should return True only for those reads that shall be included in the counting.

    • reference (string) – backward compatible synonym for contig
    • end (int) – backward compatible synonym for stop
    Raises:

    ValueError – if the genomic coordinates are out of range or invalid.

    Returns:

    four array.arrays of the same length in order A C G T

    Return type:

    tuple

    fetch(self, contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)

    fetch reads aligned in a region.

    See AlignmentFile.parse_region() for more information on genomic regions. reference and end are also accepted for backward compatiblity as synonyms for contig and stop, respectively.

    Without a contig or region all mapped reads in the file will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file. This mode of iteration still requires an index. If there is no index, use until_eof=True.

    If only reference is set, all reads aligned to reference will be fetched.

    A SAM file does not allow random access. If region or contig are given, an exception is raised.

    FastaFile IteratorRow IteratorRow IteratorRow IteratorRow

    Parameters:
    • until_eof (bool) – If until_eof is True, all reads from the current file position will be returned in order as they are within the file. Using this option will also fetch unmapped reads.
    • multiple_iterators (bool) – If multiple_iterators is True, multiple iterators on the same file can be used at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.
    Returns:  
    Return type:

    An iterator over a collection of reads.

    Raises:

    ValueError – if the genomic coordinates are out of range or invalid or the file does not permit random access to genomic coordinates.

    find_introns(self, read_iterator)

    Return a dictionary {(start, stop): count} Listing the intronic sites in the reads (identified by ‘N’ in the cigar strings), and their support ( = number of reads ).

    read_iterator can be the result of a .fetch(...) call. Or it can be a generator filtering such reads. Example samfile.find_introns((read for read in samfile.fetch(...) if read.is_reverse)

    get_reference_name(self, tid)

    return reference name corresponding to numerical tid

    get_tid(self, reference)

    return the numerical tid corresponding to reference

    returns -1 if reference is not known.

    getrname(self, tid)

    deprecated, use get_reference_name() instead

    gettid(self, reference)

    deprecated, use get_tid() instead

    has_index(self)

    return true if htsfile has an existing (and opened) index.

    head(self, n, multiple_iterators=True)

    return an iterator over the first n alignments.

    This iterator is is useful for inspecting the bam-file.

    Parameters: multiple_iterators (bool) – is set to True by default in order to avoid changing the current file position.
    Returns:  
    Return type: an iterator over a collection of reads
    header

    two-level dictionay with header information from the file.

    This is a read-only attribute.

    The first level contains the record (HD, SQ, etc) and the second level contains the fields (VN, LN, etc).

    The parser is validating and will raise an AssertionError if if encounters any record or field tags that are not part of the SAM specification. Use the pysam.AlignmentFile.text attribute to get the unparsed header.

    The parsing follows the SAM format specification with the exception of the CL field. This option will consume the rest of a header line irrespective of any additional fields. This behaviour has been added to accommodate command line options that contain characters that are not valid field separators.

    is_valid_tid(self, tid)

    return True if the numerical tid is valid; False otherwise.

    lengths

    tuple of the lengths of the reference sequences. This is a read-only attribute. The lengths are in the same order as pysam.AlignmentFile.references

    mapped

    int with total number of mapped alignments according to the statistics recorded in the index. This is a read-only attribute.

    mate(self, AlignedSegment read)

    return the mate of AlignedSegment read.

    Note

    Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.

    Note

    This method is too slow for high-throughput processing. If a read needs to be processed with its mate, work from a read name sorted file or, better, cache reads.

    Returns: :class:`~pysam.AlignedSegment`
    Return type: the mate
    Raises: ValueError – if the read is unpaired or the mate is unmapped
    next
    nocoordinate

    int with total number of reads without coordinates according to the statistics recorded in the index. This is a read-only attribute.

    nreferences

    “int with the number of reference sequences in the file. This is a read-only attribute.

    pileup(self, contig=None, start=None, stop=None, region=None, reference=None, end=None, **kwargs)

    perform a pileup within a region. The region is specified by contig, start and stop (using 0-based indexing). reference and end are also accepted for backward compatiblity as synonyms for contig and stop, respectively. Alternatively, a samtools ‘region’ string can be supplied.

    Without ‘contig’ or ‘region’ all reads will be used for the pileup. The reads will be returned ordered by contig sequence, which will not necessarily be the order within the file.

    Note that SAM formatted files do not allow random access. In these files, if a ‘region’ or ‘contig’ are given an exception is raised.

    Note

    ‘all’ reads which overlap the region are returned. The first base returned will be the first base of the first read ‘not’ necessarily the first base of the region used in the query.

    Parameters: stepper (string) –

    The stepper controls how the iterator advances. Possible options for the stepper are

    all
    skip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
    nofilter
    uses every single read
    samtools
    same filter and read processing as in csamtools pileup. This requires a ‘fastafile’ to be given.

    fastafile : FastaFile object.

    This is required for some of the steppers.
    max_depth : int
    Maximum read depth permitted. The default limit is ‘8000’.

    truncate : bool

    By default, the samtools pileup engine outputs all reads overlapping a region. If truncate is True and a region is given, only columns in the exact region specificied are returned.
    Returns:  
    Return type: an iterator over genomic positions.
    references

    tuple with the names of reference sequences. This is a read-only attribute

    text

    string with the full contents of the sam file header as a string.

    This is a read-only attribute.

    See pysam.AlignmentFile.header to get a parsed representation of the header.

    unmapped

    int with total number of unmapped reads according to the statistics recorded in the index. This number of reads includes the number of reads without coordinates. This is a read-only attribute.

    write(self, AlignedSegment read) → int

    write a single pysam.AlignedSegment to disk.

    Raises: ValueError – if the writing failed
    Returns: int – this will be 0.
    Return type: the number of bytes written. If the file is closed,

    An AlignedSegment represents an aligned segment within a SAM/BAM file.

    class pysam.AlignedSegment

    Class representing an aligned segment.

    This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.

    For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to be resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state.

    One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.

    aend

    deprecated, reference_end instead

    alen

    deprecated, reference_length instead

    aligned_pairs

    deprecated, use get_aligned_pairs() instead

    bin

    properties bin

    blocks

    deprecated, use get_blocks() instead

    cigar

    deprecated, use cigartuples instead

    cigarstring

    the cigar alignment as a string.

    The cigar string is a string of alternating integers and characters denoting the length and the type of an operation.

    Note

    The order length,operation is specified in the SAM format. It is different from the order of the cigar property.

    Returns None if not present.

    To unset the cigarstring, assign None or the empty string.

    cigartuples

    the cigar alignment. The alignment is returned as a list of tuples of (operation, length).

    If the alignment is not present, None is returned.

    The operations are:

    M BAM_CMATCH 0
    I BAM_CINS 1
    D BAM_CDEL 2
    N BAM_CREF_SKIP 3
    S BAM_CSOFT_CLIP 4
    H BAM_CHARD_CLIP 5
    P BAM_CPAD 6
    = BAM_CEQUAL 7
    X BAM_CDIFF 8
    B BAM_CBACK 9

    Note

    The output is a list of (operation, length) tuples, such as [(0, 30)]. This is different from the SAM specification and the cigarstring property, which uses a (length, operation) order, for example: 30M.

    To unset the cigar property, assign an empty list or None.

    compare(self, AlignedSegment other)

    return -1,0,1, if contents in this are binary <,=,> to other

    flag

    properties flag

    get_aligned_pairs(self, matches_only=False, with_seq=False)

    a list of aligned read (query) and reference positions.

    For inserts, deletions, skipping either query or reference position may be None.

    Padding is currently not supported and leads to an exception.

    Parameters:
    • matches_only (bool) – If True, only matched bases are returned - no None on either side.
    • with_seq (bool) – If True, return a third element in the tuple containing the reference sequence. Substitutions are lower-case. This option requires an MD tag to be present.
    Returns:

    aligned_pairs

    Return type:

    list of tuples

    get_blocks(self)

    a list of start and end positions of aligned gapless blocks.

    The start and end positions are in genomic coordinates.

    Blocks are not normalized, i.e. two blocks might be directly adjacent. This happens if the two blocks are separated by an insertion in the read.

    get_cigar_stats(self)

    summary of operations in cigar string.

    The output order in the array is “MIDNSHP=X” followed by a field for the NM tag. If the NM tag is not present, this field will always be 0.

    M BAM_CMATCH 0
    I BAM_CINS 1
    D BAM_CDEL 2
    N BAM_CREF_SKIP 3
    S BAM_CSOFT_CLIP 4
    H BAM_CHARD_CLIP 5
    P BAM_CPAD 6
    = BAM_CEQUAL 7
    X BAM_CDIFF 8
    B BAM_CBACK 9
    NM NM tag 10

    If no cigar string is present, empty arrays will be returned.

    Returns: arrays – each cigar operation, the second contains the number of blocks for each cigar operation.
    Return type: two arrays. The first contains the nucleotide counts within
    get_overlap(self, uint32_t start, uint32_t end)

    return number of aligned bases of read overlapping the interval start and end on the reference sequence.

    Return None if cigar alignment is not available.

    get_reference_positions(self, full_length=False)

    a list of reference positions that this read aligns to.

    By default, this method only returns positions in the reference that are within the alignment. If full_length is set, None values will be included for any soft-clipped or unaligned positions within the read. The returned list will thus be of the same length as the read.

    get_reference_sequence(self)

    return the reference sequence.

    This method requires the MD tag to be set.

    get_tag(self, tag, with_value_type=False)

    retrieves data from the optional alignment section given a two-letter tag denoting the field.

    The returned value is cast into an appropriate python type.

    This method is the fastest way to access the optional alignment section if only few tags need to be retrieved.

    Parameters:
    • tag – data tag.
    • with_value_type (Optional[bool]) – if set to True, the return value is a tuple of (tag value, type code). (default False)
    Returns:
    • A python object with the value of the tag. The type of the
    • object depends on the data type in the data record.
    Raises:

    KeyError – If tag is not present, a KeyError is raised.

    get_tags(self, with_value_type=False)

    the fields in the optional aligment section.

    Returns a list of all fields in the optional alignment section. Values are converted to appropriate python values. For example:

    [(NM, 2), (RG, “GJP00TM04”)]

    If with_value_type is set, the value type as encode in the AlignedSegment record will be returned as well:

    [(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]

    This method will convert all values in the optional alignment section. When getting only one or few tags, please see get_tag() for a quicker way to achieve this.

    has_tag(self, tag)

    returns true if the optional alignment section contains a given tag.

    infer_query_length(self, always=False)

    infer query length from CIGAR alignment.

    This method deduces the query length from the CIGAR alignment but does not include hard-clipped bases.

    Returns None if CIGAR alignment is not present.

    If always is set to True, infer_read_length is used instead. This is deprecated and only present for backward compatibility.

    infer_read_length(self)

    infer read length from CIGAR alignment.

    This method deduces the read length from the CIGAR alignment including hard-clipped bases.

    Returns None if CIGAR alignment is not present.

    inferred_length

    deprecated, use infer_query_length() instead

    is_duplicate

    true if optical or PCR duplicate

    is_paired

    true if read is paired in sequencing

    is_proper_pair

    true if read is mapped in a proper pair

    is_qcfail

    true if QC failure

    is_read1

    true if this is read1

    is_read2

    true if this is read2

    is_reverse

    true if read is mapped to reverse strand

    is_secondary

    true if not primary alignment

    is_supplementary

    true if this is a supplementary alignment

    is_unmapped

    true if read itself is unmapped

    isize

    deprecated, use template_length instead

    mapping_quality

    mapping quality

    mapq

    deprecated, use mapping_quality instead

    mate_is_reverse

    true is read is mapped to reverse strand

    mate_is_unmapped

    true if the mate is unmapped

    mpos

    deprecated, use next_reference_start instead

    mrnm

    deprecated, use next_reference_id instead

    next_reference_id

    the reference id of the mate/next read.

    next_reference_name

    reference name of the mate/next read (None if no AlignmentFile is associated)

    next_reference_start

    the position of the mate/next read.

    opt(self, tag)

    deprecated, use get_tag() instead

    overlap(self)

    deprecated, use get_overlap() instead

    pnext

    deprecated, use next_reference_start instead

    pos

    deprecated, use reference_start instead

    positions

    deprecated, use get_reference_positions() instead

    qend

    deprecated, use query_alignment_end instead

    qlen

    deprecated, use query_alignment_length instead

    qname

    deprecated, use query_name instead

    qqual

    deprecated, query_alignment_qualities instead

    qstart

    deprecated, use query_alignment_start instead

    qual

    deprecated, query_qualities instead

    query

    deprecated, query_alignment_sequence instead

    query_alignment_end

    end index of the aligned query portion of the sequence (0-based, exclusive)

    This the index just past the last base in seq that is not soft-clipped.

    query_alignment_length

    length of the aligned query sequence.

    This is equal to qend - qstart

    query_alignment_qualities

    aligned query sequence quality values (None if not present). These are the quality values that correspond to query, that is, they exclude qualities of soft clipped bases. This is equal to qual[qstart:qend].

    Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.

    This property is read-only.

    query_alignment_sequence

    aligned portion of the read.

    This is a substring of seq that excludes flanking bases that were soft clipped (None if not present). It is equal to seq[qstart:qend].

    SAM/BAM files may include extra flanking bases that are not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.

    query_alignment_start

    start index of the aligned query portion of the sequence (0-based, inclusive).

    This the index of the first base in seq that is not soft-clipped.

    query_length

    the length of the query/read.

    This value corresponds to the length of the sequence supplied in the BAM/SAM file. The length of a query is 0 if there is no sequence in the BAM/SAM file. In those cases, the read length can be inferred from the CIGAR alignment, see pysam.AlignedSegment.infer_query_length().

    The length includes soft-clipped bases and is equal to len(query_sequence).

    This property is read-only but can be set by providing a sequence.

    Returns 0 if not available.

    query_name

    the query template name (None if not present)

    query_qualities

    **read sequence base qualities, including* – term* – soft clipped bases (None if not present).

    Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.

    Note that to set quality scores the sequence has to be set beforehand as this will determine the expected length of the quality score array.

    This method raises a ValueError if the length of the quality scores and the sequence are not the same.

    query_sequence

    read sequence bases, including soft clipped bases (None if not present).

    Note that assigning to seq will invalidate any quality scores. Thus, to in-place edit the sequence and quality scores, copies of the quality scores need to be taken. Consider trimming for example:

    q = read.query_qualities
    read.query_squence = read.query_sequence[5:10]
    read.query_qualities = q[5:10]
    

    The sequence is returned as it is stored in the BAM file. Some mappers might have stored a reverse complement of the original read sequence.

    reference_end

    aligned reference position of the read on the reference genome.

    reference_end points to one past the last aligned residue. Returns None if not available (read is unmapped or no cigar alignment present).

    reference_id

    reference ID

    Note

    This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.AlignmentFile.getrname()

    reference_length

    aligned length of the read on the reference genome.

    This is equal to aend - pos. Returns None if not available.

    reference_name

    reference name (None if no AlignmentFile is associated)

    reference_start

    0-based leftmost coordinate

    rlen

    deprecated, query_length instead

    rname

    deprecated, use reference_id instead

    rnext

    deprecated, use next_reference_id instead

    seq

    deprecated, use query_sequence instead

    setTag(self, tag, value, value_type=None, replace=True)

    deprecated, use set_tag() instead

    set_tag(self, tag, value, value_type=None, replace=True)

    sets a particular field tag to value in the optional alignment section.

    value_type describes the type of value that is to entered into the alignment record.. It can be set explicitly to one of the valid one-letter type codes. If unset, an appropriate type will be chosen automatically.

    An existing value of the same tag will be overwritten unless replace is set to False. This is usually not recommened as a tag may only appear once in the optional alignment section.

    If value is None, the tag will be deleted.

    set_tags(self, tags)

    sets the fields in the optional alignmest section with a list of (tag, value) tuples.

    The value type of the values is determined from the python type. Optionally, a type may be given explicitly as a third value in the tuple, For example:

    x.set_tags([(NM, 2, “i”), (RG, “GJP00TM04”, “Z”)]

    This method will not enforce the rule that the same tag may appear only once in the optional alignment section.

    tags

    deprecated, use get_tags() instead

    template_length

    the observed query template length

    tid

    deprecated, use reference_id instead

    tlen

    deprecated, use template_length instead

    tostring(self, AlignmentFile_t htsfile)

    returns a string representation of the aligned segment.

    The output format is valid SAM format.

    Parameters: -- AlignmentFile object to map numerical (htsfile) – identifiers to chromosome names.
    class pysam.PileupColumn

    A pileup of reads at a particular reference sequence position (column). A pileup column contains all the reads that map to a certain target base.

    This class is a proxy for results returned by the samtools pileup engine. If the underlying engine iterator advances, the results of this column will change.

    nsegments

    number of reads mapping to this column.

    pileups

    list of reads (pysam.PileupRead) aligned to this column

    reference_id

    the reference sequence number as defined in the header

    reference_name

    reference name (None if no AlignmentFile is associated)

    reference_pos

    the position in the reference sequence (0-based).

    class pysam.PileupRead

    Representation of a read aligned to a particular position in the reference sequence.

    alignment

    a pysam.AlignedSegment object of the aligned read

    indel

    indel length for the position following the current pileup site.

    This quantity peeks ahead to the next cigar operation in this alignment. If the next operation is an insertion, indel will be positive. If the next operation is a deletion, it will be negation. 0 if the next operation is not an indel.

    is_del

    1 iff the base on the padded read is a deletion

    is_head

    1 iff the base on the padded read is the left-most base.

    is_refskip

    1 iff the base on the padded read is part of CIGAR N op.

    is_tail

    1 iff the base on the padded read is the right-most base.

    level

    the level of the read in the “viewer” mode. Note that this value is currently not computed.

    query_position

    position of the read base at the pileup site, 0-based. None if is_del or is_refskip is set.

    query_position_or_next

    position of the read base at the pileup site, 0-based.

    If the current position is a deletion, returns the next aligned base.

    class pysam.IndexedReads(AlignmentFile samfile, int multiple_iterators=True)

    *(AlignmentFile samfile, multiple_iterators=True)

    Index a Sam/BAM-file by query name while keeping the original sort order intact.

    The index is kept in memory and can be substantial.

    By default, the file is re-openend to avoid conflicts if multiple operators work on the same file. Set multiple_iterators = False to not re-open samfile.

    Parameters:
    • samfile (AlignmentFile) – File to be indexed.
    • multiple_iterators (bool) – Flag indicating whether the file should be reopened. Reopening prevents existing iterators being affected by the indexing.
    build(self)

    build the index.

    find(self, query_name)

    find query_name in index.

    Returns: Returns an iterator over all reads with query_name.
    Return type: IteratorRowSelection
    Raises: KeyError – if the query_name is not in the index.

    Tabix files

    TabixFile opens tabular files that have been indexed with tabix.

    class pysam.TabixFile

    Random access to bgzf formatted files that have been indexed by tabix.

    The file is automatically opened. The index file of file <filename> is expected to be called <filename>.tbi by default (see parameter index).

    Parameters:
    • filename (string) – Filename of bgzf file to be opened.
    • index (string) – The filename of the index. If not set, the default is to assume that the index is called ``filename.tbi`
    • mode (char) – The file opening mode. Currently, only r is permitted.
    • parser (pysam.Parser) – sets the default parser for this tabix file. If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for example asTuple and asGTF).
    • encoding (string) – The encoding passed to the parser
    Raises:
    close(self)

    closes the pysam.TabixFile.

    contigs

    list of chromosome names

    fetch(self, reference=None, start=None, end=None, region=None, parser=None, multiple_iterators=False)

    fetch one or more rows in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

    Without reference or region all entries will be fetched.

    If only reference is set, all reads matching on reference will be fetched.

    If parser is None, the default parser will be used for parsing.

    Set multiple_iterators to true if you will be using multiple iterators on the same file at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.

    header

    the file header.

    The file header consists of the lines at the beginning of a file that are prefixed by the comment character #.

    Note

    The header is returned as an iterator presenting lines without the newline character.

    Note

    The header is only available for local files. For remote files an Attribute Error is raised.

    To iterate over tabix files, use tabix_iterator():

    pysam.tabix_iterator(infile, parser)

    return an iterator over all entries in a file.

    Results are returned parsed as specified by the parser. If parser is None, the results are returned as an unparsed string. Otherwise, parser is assumed to be a functor that will return parsed data (see for example asTuple and asGTF).

    pysam.tabix_compress(filename_in, filename_out, force=False)

    compress filename_in writing the output to filename_out.

    Raise an IOError if filename_out already exists, unless force is set.

    pysam.tabix_index(filename, force=False, seq_col=None, start_col=None, end_col=None, preset=None, meta_char='#', int line_skip=0, zerobased=False, int min_shift=-1)

    index tab-separated filename using tabix.

    An existing index will not be overwritten unless force is set.

    The index will be built from coordinates in columns seq_col, start_col and end_col.

    The contents of filename have to be sorted by contig and position - the method does not check if the file is sorted.

    Column indices are 0-based. Note that this is different from the tabix command line utility where column indices start at 1.

    Coordinates in the file are assumed to be 1-based unless zerobased is set.

    If preset is provided, the column coordinates are taken from a preset. Valid values for preset are “gff”, “bed”, “sam”, “vcf”, psltbl”, “pileup”.

    Lines beginning with meta_char and the first line_skip lines will be skipped.

    If filename does not end in ”.gz”, it will be automatically compressed. The original file will be removed and only the compressed file will be retained.

    If filename ends in gz, the file is assumed to be already compressed with bgzf.

    min-shift sets the minimal interval size to 1<<INT; 0 for the old tabix index. The default of -1 is changed inside htslib to the old tabix default of 0.

    returns the filename of the compressed data

    class pysam.asTuple

    converts a tabix row into a python tuple.

    A field in a row is accessed by numeric index.

    class pysam.asVCF

    converts a tabix row into a VCF record with the following fields:

    Column Field Contents
    1 contig chromosome
    2 pos chromosomal position, zero-based
    3 id id
    4 ref reference allele
    5 alt alternate alleles
    6 qual quality
    7 filter filter
    8 info info
    9 format format specifier.

    Access to genotypes is via index:

    contig = vcf.contig
    first_sample_genotype = vcf[0]
    second_sample_genotype = vcf[1]
    
    class pysam.asBed

    converts a tabix row into a bed record with the following fields:

    Column Field Contents
    1 contig contig
    2 start genomic start coordinate (zero-based)
    3 end genomic end coordinate plus one (zero-based)
    4 name name of feature.
    5 score score of feature
    6 strand strand of feature
    7 thickStart thickStart
    8 thickEnd thickEnd
    9 itemRGB itemRGB
    10 blockCount number of bocks
    11 blockSizes ‘,’ separated string of block sizes
    12 blockStarts ‘,’ separated string of block genomic start positions

    Only the first three fields are required. Additional fields are optional, but if one is defined, all the preceding need to be defined as well.

    class pysam.asGTF

    converts a tabix row into a GTF record with the following fields:

    Column Name Content
    1 contig the chromosome name
    2 feature The feature type
    3 source The feature source
    4 start genomic start coordinate (0-based)
    5 end genomic end coordinate (0-based)
    6 score feature score
    7 strand strand
    8 frame frame
    9 attributes the attribute field

    GTF formatted entries also define the following fields that are derived from the attributes field:

    Name Content
    gene_id the gene identifier
    transcript_id the transcript identifier

    Fasta files

    class pysam.FastaFile

    Random access to fasta formatted files that have been indexed by faidx.

    The file is automatically opened. The index file of file <filename> is expected to be called <filename>.fai.

    Parameters:
    • filename (string) – Filename of fasta file to be opened.
    • filepath_index (string) – Optional, filename of the index. By default this is the filename + ”.fai”.
    Raises:
    close(self)

    close the file.

    closed

    “bool indicating the current state of the file object. This is a read-only attribute; the close() method changes the value.

    fetch(self, reference=None, start=None, end=None, region=None)

    fetch sequences in a region.

    A region can either be specified by reference, start and end. start and end denote 0-based, half-open intervals.

    Alternatively, a samtools region string can be supplied.

    If any of the coordinates are missing they will be replaced by the minimum (start) or maximum (end) coordinate.

    Note that region strings are 1-based, while start and end denote an interval in python coordinates. The region is specified by reference, start and end.

    Returns:

    string

    Return type:

    a string with the sequence specified by the region.

    Raises:
    filename

    filename associated with this object. This is a read-only attribute.

    get_reference_length(self, reference)

    return the length of reference.

    is_open(self)

    return true if samfile has been opened.

    lengths

    tuple with the lengths of reference sequences.

    nreferences

    “int with the number of reference sequences in the file. This is a read-only attribute.

    references

    tuple with the names of reference sequences.

    Fastq files

    class pysam.FastxFile

    Stream access to fasta or fastq formatted files.

    The file is automatically opened.

    Entries in the file can be both fastq or fasta formatted or even a mixture of the two.

    This file object permits iterating over all entries in the file. Random access is not implemented. The iteration returns objects of type FastqProxy

    Parameters:
    • filename (string) – Filename of fasta/fastq file to be opened.
    • persist (bool) – If True (default) make a copy of the entry in the file during iteration. If set to False, no copy will be made. This will permit faster iteration, but an entry will not persist when the iteration continues or is not in-place modifyable.

    Notes

    Prior to version 0.8.2, this class was called FastqFile.

    Raises: IOError – if file could not be opened

    Examples

    >>> with pysam.FastxFile(filename) as fh:
    ...    for entry in fh:
    ...        print(entry.name)
    ...        print(entry.sequence)
    ...        print(entry.comment)
    ...        print(entry.quality)
    >>> with pysam.FastxFile(filename) as fin, open(out_filename, mode='w') as fout:
    ...    for entry in fin:
    ...        fout.write(str(entry))
    
    close(self)

    close the file.

    closed

    “bool indicating the current state of the file object. This is a read-only attribute; the close() method changes the value.

    filename

    string with the filename associated with this object.

    is_open(self)

    return true if samfile has been opened.

    next
    class pysam.FastqProxy

    A single entry in a fastq file.

    get_quality_array(self, int offset=33) → array

    return quality values as integer array after subtracting offset.

    name

    The name of each entry in the fastq file.

    quality

    The quality score of each entry in the fastq file, represented as a string.

    sequence

    The sequence of each entry in the fastq file.

    VCF files

    class pysam.VariantFile(*args, **kwargs)

    (filename, mode=None, index_filename=None, header=None, drop_samples=False, duplicate_filehandle=True, ignore_truncation=False)

    A VCF/BCF formatted file. The file is automatically opened.

    If an index for a variant file exists (.csi or .tbi), it will be opened automatically. Without an index random access to records via fetch() is disabled.

    For writing, a VariantHeader object must be provided, typically obtained from another VCF file/BCF file.

    Parameters:
    • mode (string) –

      mode should be r for reading or w for writing. The default is text mode (VCF). For binary (BCF) I/O you should append b for compressed or u for uncompressed BCF output.

      If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb, wbu and wb0. For instance, to open a BCF formatted file for reading, type:

      f = pysam.VariantFile('ex1.bcf','r')
      

      If mode is not specified, we will try to auto-detect the file type. All of the following should work:

      f1 = pysam.VariantFile('ex1.bcf')
      f2 = pysam.VariantFile('ex1.vcf')
      f3 = pysam.VariantFile('ex1.vcf.gz')
      
    • index_filename (string) – Explicit path to an index file.
    • header (VariantHeader) – VariantHeader object required for writing.
    • drop_samples (bool) – Ignore sample information when reading.
    • duplicate_filehandle (bool) – By default, file handles passed either directly or through File-like objects will be duplicated before passing them to htslib. The duplication prevents issues where the same stream will be closed by htslib and through destruction of the high-level python object. Set to False to turn off duplication.
    • ignore_truncation (bool) – Issue a warning, instead of raising an error if the current file appears to be truncated due to a missing EOF marker. Only applies to bgzipped formats. (Default=False)
    close(self)

    closes the pysam.VariantFile.

    copy(self)
    fetch(self, contig=None, start=None, stop=None, region=None, reopen=False, end=None, reference=None)

    fetch records in a region using 0-based indexing. The region is specified by contig, start and end. Alternatively, a samtools region string can be supplied.

    Without contig or region all mapped records will be fetched. The records will be returned ordered by contig, which will not necessarily be the order within the file.

    Set reopen to true if you will be using multiple iterators on the same file at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file incurrs some overhead, so use with care.

    If only contig is set, all records on contig will be fetched. If both region and contig are given, an exception is raised.

    Note that a bgzipped VCF.gz file without a tabix/CSI index (.tbi/.csi) or a BCF file without a CSI index can only be read sequentially.

    get_reference_name(self, tid)

    return reference name corresponding to numerical tid

    get_tid(self, reference)

    return the numerical tid corresponding to reference

    returns -1 if reference is not known.

    is_valid_tid(self, tid)

    return True if the numerical tid is valid; False otherwise.

    returns -1 if reference is not known.

    new_record(self, *args, **kwargs)

    Create a new empty VariantRecord.

    See VariantHeader.new_record()

    next
    open(self, filename, mode='r', index_filename=None, VariantHeader header=None, drop_samples=False, duplicate_filehandle=True, ignore_truncation=False)

    open a vcf/bcf file.

    If open is called on an existing VariantFile, the current file will be closed and a new file will be opened.

    reset(self)

    reset file position to beginning of file just after the header.

    subset_samples(self, include_samples)

    Read only a subset of samples to reduce processing time and memory. Must be called prior to retrieving records.

    write(self, VariantRecord record) → int

    write a single pysam.VariantRecord to disk.

    returns the number of bytes written.

    class pysam.VariantHeader

    header information for a VariantFile object

    add_line(self, line)

    Add a metadata line to this header

    add_meta(self, key, value=None, items=None)

    Add metadata to this header

    add_record(self, VariantHeaderRecord record)

    Add an existing VariantHeaderRecord to this header

    add_sample(self, name)

    Add a new sample to this header

    alts

    alt metadata (dict ID->record).

    The data returned just a snapshot of alt records, is created every time the property is requested, and modifications will not be reflected in the header metadata and vice versa.

    i.e. it is just a dict that reflects the state of alt records at the time it is created.

    contigs

    contig information (VariantHeaderContigs)

    copy(self)
    filters

    filter metadata (VariantHeaderMetadata)

    formats

    format metadata (VariantHeaderMetadata)

    info

    info metadata (VariantHeaderMetadata)

    merge(self, VariantHeader header)
    new_record(self, contig=None, start=0, stop=0, alleles=None, id=None, qual=None, filter=None, info=None, samples=None, **kwargs)

    Create a new empty VariantRecord.

    Arguments are currently experimental. Use with caution and expect changes in upcoming releases.

    records

    header records (VariantHeaderRecords)

    samples
    version

    VCF version

    class pysam.VariantRecord(*args, **kwargs)

    Variant record

    alleles

    tuple of reference allele followed by alt alleles

    alts

    tuple of alt alleles

    chrom

    chromosome/contig name

    contig

    chromosome/contig name

    copy(self)

    return a copy of this VariantRecord object

    filter

    filter information (see VariantRecordFilter)

    format

    sample format metadata (see VariantRecordFormat)

    id

    record identifier or None if not available

    info

    info data (see VariantRecordInfo)

    pos

    record start position on chrom/contig (1-based inclusive)

    qual

    phred scaled quality score or None if not available

    ref

    reference allele

    rid

    internal reference id number

    rlen

    record length on chrom/contig (aka rec.stop - rec.start)

    samples

    sample data (see VariantRecordSamples)

    start

    record start position on chrom/contig (0-based inclusive)

    stop

    record stop position on chrom/contig (0-based exclusive)

    translate(self, VariantHeader dst_header)
    class pysam.VariantHeaderRecord(*args, **kwargs)

    header record from a VariantHeader object

    attrs

    sequence of additional header attributes

    get(self, key, default=None)

    D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.

    items(self)

    D.items() -> list of D’s (key, value) pairs, as 2-tuples

    iteritems(self)

    D.iteritems() -> an iterator over the (key, value) items of D

    iterkeys(self)

    D.iterkeys() -> an iterator over the keys of D

    itervalues(self)

    D.itervalues() -> an iterator over the values of D

    key

    header key (the part before ‘=’, in FILTER/INFO/FORMAT/contig/fileformat etc.)

    keys(self)

    D.keys() -> list of D’s keys

    pop(self, key, default=_nothing)
    remove(self)
    type

    header type – FILTER, INFO, FORMAT, CONTIG, STRUCTURED, or GENERIC

    update(self, items=None, **kwargs)

    D.update([E, ]**F) -> None.

    Update D from dict/iterable E and F.

    value

    header value. Set only for generic lines, None for FILTER/INFO, etc.

    values(self)

    D.values() -> list of D’s values

  • 相关阅读:
    复杂json格式转化为javabean
    lambda常用方法
    solr设置分片和副本
    Python之Scrapy爬虫框架安装及简单使用
    40多个纯CSS绘制的图形
    windows下自动启动Redis隐藏命令行窗口
    用Webpack构建Vue项目
    上传本地代码到gitHub过程详解
    html/css常用合集
    数据库索引与b+树
  • 原文地址:https://www.cnblogs.com/Raymontian/p/7442789.html
Copyright © 2011-2022 走看看