  • String manipulation with pandas

    String object methods

    import pandas as pd
    
    import numpy as np
    
    val='a,b, guido'
    
    val.split(',') # Python's built-in str.split method
    
    ['a', 'b', ' guido']
    
    pieces=[x.strip() for x in val.split(',')];pieces  # strip whitespace
    
    ['a', 'b', 'guido']
    
    '::'.join(pieces)
    
    'a::b::guido'
    
    val.count(',')
    
    2
    
    val.count('guido')
    
    1
    
    val.replace(',',':')
    
    'a:b: guido'
    
    val.swapcase()
    
    'A,B, GUIDO'
    
    val[::-1]
    
    'odiug ,b,a'
    

    Regular expression

    The re module's functions fall into three categories: pattern matching, substitution, and splitting.

    import re
    
    text='foo   bar	 baz  	 qux'
    
    re.split(r'\s+',text)  # \s+ matches one or more whitespace characters
    
    ['foo', 'bar', 'baz', 'qux']
    
    regex=re.compile(r'\s+')
    
    regex.split(text)
    
    ['foo', 'bar', 'baz', 'qux']
    
    regex.findall(text)
    
    ['   ', '	 ', '  	 ']
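The three categories named above can be seen side by side in a minimal sketch. The sample string and variable names are illustrative assumptions rather than part of the original session.

```python
import re

# Illustrative sample text (an assumption, not from the original session).
sample = "foo   bar\t baz  \tqux"

whitespace = re.compile(r"\s+")  # one or more whitespace characters

# Pattern matching: collect every run of whitespace.
runs = whitespace.findall(sample)

# Substitution: collapse each run into a single space.
collapsed = whitespace.sub(" ", sample)

# Splitting: break the string wherever whitespace occurs.
parts = whitespace.split(sample)

print(runs)
print(collapsed)
print(parts)
```

Compiling the pattern once is mostly a convenience when the same expression is applied several times, as it is here.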
    
    • To avoid unwanted escaping within a regular expression, use raw string literals, for example r'C:\x' instead of the equivalent 'C:\\x'.
    text="""Dave dave@google.com
    Steve steve@mail.com
    Rob rob@mail.com
    Ryan ryan@yahoo.com
    """
    
    pattern=r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
    
    regex=re.compile(pattern,re.I)
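Why raw string literals matter for regular expressions can be illustrated with a small sketch. The word-boundary example and its input string are assumptions chosen for illustration.

```python
import re

# In a normal string literal, "\\" is needed to produce one backslash;
# a raw string literal keeps backslashes exactly as typed.
escaped = "\\s+"
raw = r"\s+"
same_pattern = escaped == raw

# "\b" in a normal literal is a single backspace character, while r"\b"
# is the two-character regex token for a word boundary.
backspace_len = len("\b")
boundary_len = len(r"\b")

# Only the raw form finds the standalone word "cat".
hits_raw = re.findall(r"\bcat\b", "cat concat cat")
hits_plain = re.findall("\bcat\b", "cat concat cat")

print(same_pattern, backspace_len, boundary_len)
print(hits_raw, hits_plain)
```

The non-raw pattern fails silently: it is a valid pattern, it just searches for backspace characters that are not in the text.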
    

    Using findall() produces a list of the email addresses.

    regex.findall(text)
    
    ['dave@google.com', 'steve@mail.com', 'rob@mail.com', 'ryan@yahoo.com']
    
    regex.findall(r' J.onepy+@w-m.co')
    
    ['J.onepy+@w-m.co']
    

    search() returns a special match object for the first email address in the text.

    m=regex.search(text)
    
    m
    
    <re.Match object; span=(5, 20), match='dave@google.com'>
    
    regex.match(text)
    
    text[m.start():m.end()]
    
    'dave@google.com'
    

    regex.match(text) returns None, as it matches only if the pattern occurs at the start of the string.
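The difference between the anchored and unanchored calls can be sketched on a single line of input. The one-line string and the variable names are hypothetical; `fullmatch()` is included only for contrast and is not used in the original session.

```python
import re

# Same simple address pattern as above, applied to a hypothetical one-line input.
email = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.I)
line = "Dave dave@google.com"

# match() only tries at position 0, where the text is "Dave ", not an address.
at_start = email.match(line)

# search() scans forward and returns the first match wherever it occurs.
anywhere = email.search(line)

# fullmatch() succeeds only if the entire string is an address.
whole = email.fullmatch("dave@google.com")

print(at_start)
print(anywhere.group(), anywhere.span())
print(whole is not None)
```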

    sub() returns a new string with occurrences of the pattern replaced by a new string.

    print(regex.sub('REDACTED',text))
    
    Dave REDACTED
    Steve REDACTED
    Rob REDACTED
    Ryan REDACTED
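Substitution becomes more useful once the pattern contains capture groups. The sketch below wraps the three segments of the address pattern in parentheses, which is an addition for illustration; the sample address `wesm@bright.net` and the variable names are also assumptions.

```python
import re

# Address pattern with parentheses around user name, domain, and suffix
# (the groups are an illustrative addition; the pattern above has none).
grouped = re.compile(r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})", re.I)

# groups() returns the captured segments as a tuple.
m = grouped.match("wesm@bright.net")
parts = m.groups()

# When the pattern has groups, findall() returns one tuple per match.
pairs = grouped.findall("Dave dave@google.com\nRob rob@mail.com\n")

# \1, \2, \3 in the replacement refer back to the captured groups.
labeled = grouped.sub(r"Username: \1, Domain: \2, Suffix: \3", "Rob rob@mail.com")

print(parts)
print(pairs)
print(labeled)
```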
    

    Vectorized string functions in pandas

    data={'Dave':'dave@google.com','Steve':'steve@gmeil.com','Rob':'rob@gmail.com','Wes':np.nan}
    
    data=pd.Series(data);data
    
    Dave     dave@google.com
    Steve    steve@gmeil.com
    Rob        rob@gmail.com
    Wes                  NaN
    dtype: object
    
    data.isnull()
    
    Dave     False
    Steve    False
    Rob      False
    Wes       True
    dtype: bool
    
    data.str.contains('gmail')
    
    Dave     False
    Steve    False
    Rob       True
    Wes        NaN
    dtype: object
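Because the missing value comes back as NaN, the result above cannot be used directly as a boolean filter. A minimal sketch of one way around this, assuming the `na` parameter of `str.contains` fits the use case, is to treat missing values as non-matches. The Series is rebuilt here under the hypothetical name `emails` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

emails = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmeil.com",
     "Rob": "rob@gmail.com", "Wes": np.nan}
)

# na=False fills the result for missing values with False,
# so the mask contains only booleans.
mask = emails.str.contains("gmail", na=False)

# The mask can now be used for boolean indexing.
gmail_users = emails[mask]
print(gmail_users)
```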
    
    data
    
    Dave     dave@google.com
    Steve    steve@gmeil.com
    Rob        rob@gmail.com
    Wes                  NaN
    dtype: object
    
    data.map(lambda x:x[:2],na_action='ignore')  # x is each value in data; the returned Series has the same index as the caller (data here)
    
    Dave      da
    Steve     st
    Rob       ro
    Wes      NaN
    dtype: object
    
    help(data.map)
    
    Help on method map in module pandas.core.series:
    
    map(arg, na_action=None) method of pandas.core.series.Series instance
        Map values of Series using input correspondence (a dict, Series, or
        function).
        
        Parameters
        ----------
        arg : function, dict, or Series
            Mapping correspondence.
        na_action : {None, 'ignore'}
            If 'ignore', propagate NA values, without passing them to the
            mapping correspondence.
        
        Returns
        -------
        y : Series
            Same index as caller.
        
        Examples
        --------
        
        Map inputs to outputs (both of type `Series`):
        
        >>> x = pd.Series([1,2,3], index=['one', 'two', 'three'])
        >>> x
        one      1
        two      2
        three    3
        dtype: int64
        
        >>> y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])
        >>> y
        1    foo
        2    bar
        3    baz
        
        >>> x.map(y)
        one   foo
        two   bar
        three baz
        
        If `arg` is a dictionary, return a new Series with values converted
        according to the dictionary's mapping:
        
        >>> z = {1: 'A', 2: 'B', 3: 'C'}
        
        >>> x.map(z)
        one   A
        two   B
        three C
        
        Use na_action to control whether NA values are affected by the mapping
        function.
        
        >>> s = pd.Series([1, 2, 3, np.nan])
        
        >>> s2 = s.map('this is a string {}'.format, na_action=None)
        0    this is a string 1.0
        1    this is a string 2.0
        2    this is a string 3.0
        3    this is a string nan
        dtype: object
        
        >>> s3 = s.map('this is a string {}'.format, na_action='ignore')
        0    this is a string 1.0
        1    this is a string 2.0
        2    this is a string 3.0
        3                     NaN
        dtype: object
        
        See Also
        --------
        Series.apply : For applying more complex functions on a Series.
        DataFrame.apply : Apply a function row-/column-wise.
        DataFrame.applymap : Apply a function elementwise on a whole DataFrame.
        
        Notes
        -----
        When `arg` is a dictionary, values in Series that are not in the
        dictionary (as keys) are converted to ``NaN``. However, if the
        dictionary is a ``dict`` subclass that defines ``__missing__`` (i.e.
        provides a method for default values), then this default is used
        rather than ``NaN``:
        
        >>> from collections import Counter
        >>> counter = Counter()
        >>> counter['bar'] += 1
        >>> y.map(counter)
        1    0
        2    1
        3    0
        dtype: int64
    
    pattern
    
    '[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'
    
    data.str.findall(pattern,flags=re.I)
    
    Dave     [dave@google.com]
    Steve    [steve@gmeil.com]
    Rob        [rob@gmail.com]
    Wes                    NaN
    dtype: object
    
    matches=data.str.match(pattern,flags=re.I);matches
    
    Dave     True
    Steve    True
    Rob      True
    Wes       NaN
    dtype: object
    
    matches.str.get(1)
    
    Dave    NaN
    Steve   NaN
    Rob     NaN
    Wes     NaN
    dtype: float64
    
    matches.str[0]
    
    Dave    NaN
    Steve   NaN
    Rob     NaN
    Wes     NaN
    dtype: float64
    
    data.str[:5]
    
    Dave     dave@
    Steve    steve
    Rob      rob@g
    Wes        NaN
    dtype: object
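As the output above shows, `str.match` returns booleans here, so indexing into `matches` yields only NaN. One way to pull out the individual parts of each address is `str.extract` with capture groups. This is a sketch under assumptions: the parentheses in the pattern, the name `emails`, and the column labels are additions for illustration.

```python
import re

import numpy as np
import pandas as pd

emails = pd.Series(
    {"Dave": "dave@google.com", "Steve": "steve@gmeil.com",
     "Rob": "rob@gmail.com", "Wes": np.nan}
)

# Grouped variant of the address pattern (parentheses added for illustration).
grouped = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# One row per input value, one column per capture group;
# a missing input produces a row of missing values.
parts = emails.str.extract(grouped, flags=re.I)
parts.columns = ["user", "domain", "suffix"]  # hypothetical column labels
print(parts)
```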
    
    
    
  • Original article: https://www.cnblogs.com/johnyang/p/12715387.html