  • Web crawling

    Crawler basics: the requests and BeautifulSoup modules
    http://www.cnblogs.com/wupeiqi/articles/6283017.html

    Crawler performance and the Scrapy framework
    http://www.cnblogs.com/wupeiqi/articles/6283017.html

    Python development (part 15): the Tornado web framework
    http://www.cnblogs.com/wupeiqi/articles/5702910.html

    A custom asynchronous non-blocking web framework in 200 lines
    http://www.cnblogs.com/wupeiqi/p/6536518.html

    Modules

    requests.get(url='URL path')

    BeautifulSoup

    soup = BeautifulSoup('HTML string', 'html.parser')

    tag = soup.find(name='div', attrs={'id': 't'})

    tags = soup.find_all(name='div', attrs={'id': 't'})

    tag.find('h3').text

    tag.find('h3').get('attribute name')  # e.g. get('href')

    tag.find('h3').attrs
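
    The calls above can be exercised in one minimal runnable sketch; the HTML string and the `t` id below are made-up placeholders, not from the original post:

```python
from bs4 import BeautifulSoup

# made-up HTML just to exercise find / find_all / get / attrs
html = "<div id='t'><h3 class='title'><a href='/a'>hello</a></h3></div><div id='t'>x</div>"
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find(name='div', attrs={'id': 't'})       # first matching tag
tags = soup.find_all(name='div', attrs={'id': 't'})  # list of all matching tags

print(len(tags))                  # 2
print(tag.find('h3').text)        # hello
print(tag.find('a').get('href'))  # /a
print(tag.find('h3').attrs)       # {'class': ['title']}
```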

    HTTP request basics

    requests
    GET:
    requests.get(url="http://www.oldboyedu.com")
    # raw request: "GET / HTTP/1.1\r\nHost: www.oldboyedu.com ..."

    requests.get(url="http://www.oldboyedu.com/index.html?p=1")
    # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com ..."

    requests.get(url="http://www.oldboyedu.com/index.html", params={'p': 1})
    # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com ..."

    POST:
    requests.post(url="http://www.oldboyedu.com", data={'name': 'alex', 'age': 18})  # default Content-Type: application/x-www-form-urlencoded
    # raw request: "POST / HTTP/1.1\r\nHost: www.oldboyedu.com ... name=alex&age=18"


    requests.post(url="http://www.oldboyedu.com", json={'name': 'alex', 'age': 18})  # default Content-Type: application/json
    # raw request: 'POST / HTTP/1.1\r\nHost: www.oldboyedu.com ... {"name": "alex", "age": 18}'


    requests.post(
        url="http://www.oldboyedu.com",
        params={'p': 1},
        json={'name': 'alex', 'age': 18}
    )  # default Content-Type: application/json
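
    How `params` and `data` end up on the wire can be inspected offline with requests' `PreparedRequest`; the sketch below builds the request object without sending anything (the URL is just the placeholder from above):

```python
import requests

# build the request object without sending it over the network
req = requests.Request(
    method='POST',
    url='http://www.oldboyedu.com',
    params={'p': 1},
    data={'name': 'alex', 'age': 18},
)
prepared = req.prepare()

print(prepared.url)                      # query string comes from params
print(prepared.body)                     # name=alex&age=18
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```

    Passing `json=` instead of `data=` would serialize the dict to a JSON body and set `Content-Type: application/json`, matching the comments above.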

    GET requests

    Example with parameters:

    import requests

    payload = {'key1': 'v1', 'key2': 'v2'}

    ret = requests.get("http://test.cn/get", params=payload)

    print(ret.url)

    print(ret.text)

    POST requests

    import requests
    import json

    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    headers = {'content-type': 'application/json'}

    ret = requests.post(url, data=json.dumps(payload), headers=headers)

    print(ret.text)
    print(ret.cookies)

    Other parameters of requests.request

    1. method
    2. url
    3. params
    4. data
    5. json
    6. headers
    7. cookies
    8. files
    9. auth
    10. timeout
    11. allow_redirects
    12. proxies
    13. stream
    14. cert

    requests.request(
        method='POST',
        url='http://127.0.0.1:8000/test/',
        data=open('data_file.py', mode='r', encoding='utf-8'),  # file content: k1=v1;k2=v2;k3=v3;k3=v4
        headers={'Content-Type': 'application/x-www-form-urlencoded'}
    )

    def param_cookies():
        # send cookies to the server
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         data={'k1': 'v1', 'k2': 'v2'},
                         cookies={'cook1': 'value1'},
                         )

    def param_auth():
        from requests.auth import HTTPBasicAuth, HTTPDigestAuth
    
        ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
        print(ret.text)
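
    What `auth=HTTPBasicAuth(...)` actually does can also be checked offline with a prepared request: it just attaches a base64-encoded `Authorization` header. The URL and credentials below are placeholders; nothing is sent:

```python
import base64
import requests
from requests.auth import HTTPBasicAuth

# placeholder URL and credentials; the request is built but never sent
req = requests.Request(
    method='GET',
    url='http://127.0.0.1:8000/test/',
    auth=HTTPBasicAuth('user', 'secret'),
)
prepared = req.prepare()

# basic auth is base64("user:password") behind a "Basic " prefix
print(prepared.headers['Authorization'])
print('Basic ' + base64.b64encode(b'user:secret').decode())  # same value
```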

    The BeautifulSoup module

    This module takes an HTML or XML string, parses it into a tree, and provides methods for quickly locating elements, which makes finding a given element in HTML or XML straightforward.

    Install:

    pip3 install beautifulsoup4

    Usage example:

    from bs4 import BeautifulSoup

    html_doc = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <div id='i1'>
                <a>sdfs</a>
            </div>
            <p class='c2'>asdfa</p>
        </body>
    </html>
    """

    Methods:

    1. name: the tag name

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')
    name = tag.name  # get
    print(name)
    tag.name = 'span'  # set
    print(soup)

    2. attrs: tag attributes

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')
    attrs = tag.attrs  # get
    print(attrs)
    tag.attrs['id'] = 'iiii'  # set
    print(soup)
    attrs = tag.attrs  # get again
    print(attrs)

    3. children: all direct child nodes

    body = soup.find('body')
    v = body.children
    print(list(v))
    

    4. descendants: all descendant nodes

    body = soup.find('body')
    v = body.descendants
    print(list(v))
    

    5. clear: remove all children of the tag (the tag itself is kept)

    tag = soup.find('body')
    tag.clear()
    print(soup)

    6. decompose: recursively remove the tag and everything inside it

    tag = soup.find('body')
    tag.decompose()
    print(soup)

    7. extract: remove the tag from the tree and return it

    tag = soup.find('body')
    tag.extract()
    print(soup)

    8. decode: serialize to a string (including the current tag); decode_contents serializes without the current tag

    tag = soup.find('body')
    v = tag.decode()   # tag as a string
    v1 = tag.encode()  # tag as bytes
    print(v, v1)

    9. find: get the first matching tag; find_all: get all matching tags

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    tags = soup.find_all('body')
    for i in tags:
        print(i)
    v = soup.find_all(name=['a','div'])
    print(v)

    10. has_attr: check whether the tag has a given attribute

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    v = tag.has_attr('class')
    v1 = tag.has_attr('id')
    print(v,v1)

    11. get_text: get the text inside the tag

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    v = tag.get_text()
    print(v)

    Other methods:

    index: the position of a tag among its parent's children

    is_empty_element: whether the tag is an empty (self-closing) element

    select, select_one: CSS selectors

    soup.select("body a")

    Tag content:

    print(tag.string)

    tag.string = 'new content'  # set

    append: append a tag inside the current tag

    insert: insert a tag at a given position inside the current tag

    insert_after, insert_before: insert after or before the current tag

    replace_with: replace the current tag with the given tag

    wrap: wrap the current tag in the given tag

    unwrap: remove the current tag, keeping its contents
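
    The tree-modification methods above can be sketched in one runnable example; the HTML fragment reuses the `<div id='i1'><a>sdfs</a></div>` snippet from the usage example earlier:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><div id='i1'><a>sdfs</a></div></body>", 'html.parser')

# CSS selector: all <a> tags anywhere under <body>
print(soup.select('body a'))  # [<a>sdfs</a>]

a = soup.find('a')
print(a.string)             # sdfs
a.string = 'new content'    # replace the text in place

# wrap the <a> in a freshly created <span>, then undo it with unwrap
a.wrap(soup.new_tag('span'))
print(soup.find('span') is not None)  # True
soup.find('span').unwrap()
print(soup.find('span'))              # None again
```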

    
    
  • Original article: https://www.cnblogs.com/liumj0305/p/7109642.html