  • Web Scraping

    Scraping basics: the requests and BeautifulSoup modules
    http://www.cnblogs.com/wupeiqi/articles/6283017.html

    Scraper performance and the Scrapy framework
    http://www.cnblogs.com/wupeiqi/articles/6283017.html

    Python Development [Part 15]: The Tornado Web Framework
    http://www.cnblogs.com/wupeiqi/articles/5702910.html

    A custom asynchronous non-blocking web framework in 200 lines
    http://www.cnblogs.com/wupeiqi/p/6536518.html

    Modules

    requests.get(url='URL')  # fetch the page at the given URL

    BeautifulSoup

    soup = BeautifulSoup('<HTML string>', 'html.parser')

    tag = soup.find(name='div', attrs={'id': 't'})       # first matching tag

    tags = soup.find_all(name='div', attrs={'id': 't'})  # all matching tags

    tag.find('h3').text                # text inside the tag

    tag.find('h3').get('attr_name')    # a single attribute, e.g. get('href')

    tag.find('h3').attrs               # all attributes as a dict
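
    Putting the two together, a minimal end-to-end sketch (the URL and the id 't' are placeholders):

    import requests
    from bs4 import BeautifulSoup

    # Fetch a page and parse the returned HTML.
    response = requests.get(url='http://www.example.com')
    soup = BeautifulSoup(response.text, 'html.parser')

    tag = soup.find(name='div', attrs={'id': 't'})
    if tag:                    # find() returns None when nothing matches
        h3 = tag.find('h3')
        if h3:
            print(h3.text)     # text inside the first <h3>
            print(h3.attrs)    # all of its attributes as a dict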

    HTTP Request Basics

    requests
    GET:
    requests.get(url="http://www.oldboyedu.com")
    # raw request: "GET / HTTP/1.1\r\nHost: oldboyedu.com ..."

    requests.get(url="http://www.oldboyedu.com/index.html?p=1")
    # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: oldboyedu.com ..."

    requests.get(url="http://www.oldboyedu.com/index.html", params={'p': 1})
    # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: oldboyedu.com ..."

    POST:
    requests.post(url="http://www.oldboyedu.com", data={'name': 'alex', 'age': 18})
    # default header: Content-Type: application/x-www-form-urlencoded
    # raw request: "POST / HTTP/1.1\r\nHost: oldboyedu.com ... name=alex&age=18"

    requests.post(url="http://www.oldboyedu.com", json={'name': 'alex', 'age': 18})
    # default header: Content-Type: application/json
    # raw request: 'POST / HTTP/1.1\r\nHost: oldboyedu.com ... {"name": "alex", "age": 18}'

    requests.post(
        url="http://www.oldboyedu.com",
        params={'p': 1},
        json={'name': 'alex', 'age': 18}
    )
    # default header: Content-Type: application/json; 'p' is appended to the query string

    GET Requests

    Example with parameters:

    import requests

    payload = {'key1': 'v1', 'key2': 'v2'}

    ret = requests.get("http://test.cn/get", params=payload)

    print(ret.url)

    print(ret.text)
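
    Beyond url and text, the Response object exposes a few other commonly used fields; a quick sketch (same placeholder URL as above):

    import requests

    ret = requests.get("http://test.cn/get", params={'key1': 'v1'})
    print(ret.status_code)  # HTTP status code, e.g. 200
    print(ret.encoding)     # encoding used to decode ret.text
    print(ret.content)      # raw body as bytes
    # ret.json() parses the body as JSON (raises ValueError if it is not JSON)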

    POST Requests

    import requests

    import json

    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    headers = {'content-type': 'application/json'}

    ret = requests.post(url, data=json.dumps(payload), headers=headers)

    print(ret.text)
    print(ret.cookies)

    Other request parameters (all accepted by requests.request):

    1. method
    2. url
    3. params
    4. data
    5. json
    6. headers
    7. cookies
    8. files
    9. auth
    10. timeout
    11. allow_redirects
    12. proxies
    13. stream
    14. cert

    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )

    def param_cookies():
        # send cookies to the server
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         data={'k1': 'v1', 'k2': 'v2'},
                         cookies={'cook1': 'value1'},
                         )

    def param_auth():
        from requests.auth import HTTPBasicAuth, HTTPDigestAuth
    
        ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
        print(ret.text)
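
    The examples above cover data, headers, cookies and auth; here is a hedged sketch of a few of the remaining parameters from the list (the URL, proxy address and file name are placeholders):

    import requests

    ret = requests.request(
        method='GET',
        url='http://127.0.0.1:8000/test/',
        timeout=(5, 10),          # (connect timeout, read timeout) in seconds
        allow_redirects=True,     # follow 3xx redirects automatically
        proxies={'http': 'http://127.0.0.1:8888'},  # route via a local proxy
        stream=True,              # do not download the body up front
    )
    for chunk in ret.iter_content(chunk_size=1024):  # read the body in chunks
        pass

    # files uploads multipart/form-data:
    # requests.post('http://127.0.0.1:8000/test/', files={'f1': open('upload.txt', 'rb')})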

    The BeautifulSoup Module

    This module takes an HTML or XML string, parses it into a document tree, and provides methods for quickly locating specific elements, which makes searching HTML or XML documents straightforward.

    Installation:

    pip3 install beautifulsoup4

    Usage example:

    from bs4 import BeautifulSoup

    html_doc = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <div id='i1'>
                <a>sdfs</a>
            </div>
            <p class='c2'>asdfa</p>
        </body>
    </html>
    """

    Methods:

    1. name, the tag's name

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')
    name = tag.name  # get
    print(name)
    tag.name = 'span'  # set
    print(soup)

    2. attrs, the tag's attributes

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')
    attrs = tag.attrs  # get
    print(attrs)
    tag.attrs['id'] = 'iiii'  # set
    print(soup)
    attrs = tag.attrs  # get again, now includes the new id
    print(attrs)

    3. children, all direct children

    body = soup.find('body')
    v = body.children
    print(list(v))
    

    4. descendants, all descendants

    body = soup.find('body')
    v = body.descendants
    print(list(v))
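
    The difference is easiest to see on the sample html_doc: children yields only direct children, while descendants walks the entire subtree, e.g.:

    soup = BeautifulSoup(html_doc, 'html.parser')
    body = soup.find('body')
    print(len(list(body.children)))     # direct children only: the div, the p, and whitespace strings
    print(len(list(body.descendants)))  # every nested tag and string, including the <a> inside the div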
    

    5. clear, empty out all of the tag's children (the tag name itself is kept)

    tag = soup.find('body')
    tag.clear()
    print(soup)

    6. decompose, recursively remove the tag and everything inside it

    tag = soup.find('body')
    tag.decompose()
    print(soup)

    7. extract, recursively remove the tag and return the removed tag

    tag = soup.find('body')
    v = tag.extract()  # the removed tag is returned
    print(soup)
    print(v)

    8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag)

    tag = soup.find('body')
    v = tag.decode()   # tag -> str
    v1 = tag.encode()  # tag -> bytes
    print(v, v1)
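
    decode_contents (and its byte counterpart encode_contents) does the same but leaves out the enclosing tag itself, e.g.:

    tag = soup.find('body')
    print(tag.decode())           # '<body>...</body>' - includes the tag
    print(tag.decode_contents())  # only what sits between <body> and </body>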

    9. find, get the first matching tag; find_all, get all matching tags

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    tags = soup.find_all('body')
    for i in tags:
        print(i)
    v = soup.find_all(name=['a','div'])
    print(v)
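
    Besides a name or a list of names, find_all also accepts a regular expression or a filter function, e.g.:

    import re

    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find_all(name=re.compile('^d')))       # tags whose name starts with 'd', e.g. div
    print(soup.find_all(attrs={'class': 'c2'}))       # filter by attribute value
    print(soup.find_all(lambda t: t.has_attr('id')))  # filter with a function over each tag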

    10. has_attr, check whether the tag has a given attribute

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    v = tag.has_attr('class')
    v1 = tag.has_attr('id')
    print(v,v1)

    11. get_text, get the text inside the tag

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    v = tag.get_text()
    print(v)
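
    get_text also takes a separator and a strip flag, which helps when a tag contains several nested text nodes:

    tag = soup.find('body')
    print(tag.get_text())                           # all text, whitespace included
    print(tag.get_text(separator='|', strip=True))  # 'sdfs|asdfa'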

    Other methods:

    index, the index position of a tag within another tag

    is_empty_element, whether the tag is an empty (void) element or a self-closing tag

    select, select_one: CSS selectors

    soup.select("body a")
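
    select returns a list of every match and select_one returns the first match (or None); the selectors are ordinary CSS, e.g.:

    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.select('div#i1 a'))    # <a> tags inside the div with id="i1"
    print(soup.select('p.c2'))        # <p> tags carrying class "c2"
    print(soup.select_one('body a'))  # first match only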

    Tag content:

    print(tag.string)

    tag.string = 'new content'  # set

    append: append a tag inside the current tag (a short sketch follows this list)

    insert: insert a tag at a given position inside the current tag

    insert_after, insert_before: insert after or before the current tag

    replace_with: replace the current tag with the given tag

    wrap: wrap the current tag in the given tag

    unwrap: remove the current tag, keeping what it wrapped
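
    A short sketch exercising a few of these on the sample document (the tag names created here are illustrative):

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')

    new_tag = soup.new_tag('b')     # build a fresh tag to insert
    new_tag.string = 'bold'
    tag.insert_after(new_tag)       # place it right after the <a>

    tag.wrap(soup.new_tag('span'))  # the <a> is now wrapped in a <span>
    tag.unwrap()                    # undo: drop the <a>, keep its text inside the span
    print(soup)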

    
    