  • Repost: Using the rvest package in R

    Several R packages can scrape web page data, but rvest combined with CSS selectors is the most convenient.

    The browser's inspector shows right away that the table data sits in nodes such as td:nth-child(1) and td:nth-child(3), so it can be extracted directly in code.
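
    To make the selector concrete, here is a minimal sketch on a hand-written table. The HTML and the names `page`, `first_col` and `third_col` are made up for illustration and are not taken from the TorrentFreak page. `td:nth-child(3)` matches every `<td>` that is the third child of its row, so it returns one value per row, which amounts to one column.

```r
library(rvest)

# Hypothetical miniature table, used only for illustration
page <- read_html('
<table>
  <tr><td>1</td><td>(5)</td><td>Movie A</td><td>7.4</td></tr>
  <tr><td>2</td><td>(1)</td><td>Movie B</td><td>8.2</td></tr>
</table>')

# Each selector picks the n-th cell of every row, i.e. one column
first_col <- page %>% html_nodes("td:nth-child(1)") %>% html_text()
third_col <- page %>% html_nodes("td:nth-child(3)") %>% html_text()
```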

    library(rvest)

    First, take a look at what is there

    freak <- html_session("http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/")

    freak

    <session> http://torrentfreak.com/top-10-most-pirated-movies-of-the-week-130304/

     Status: 200

     Type:   text/html; charset=UTF-8

     Size:   24983

    freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10]

    [1] "Silver Linings Playbook "          

    [2] "The Hobbit: An Unexpected Journey "

    [3] "Life of Pi (DVDscr/DVDrip)"        

    [4] "Argo (DVDscr)"                    

    [5] "Identity Thief "                  

    [6] "Red Dawn "                        

    [7] "Rise Of The Guardians (DVDscr)"    

    [8] "Django Unchained (DVDscr)"        

    [9] "Lincoln (DVDscr)"                  

    [10] "Zero Dark Thirty "  

    freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11]

    [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

    freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10]

    [1] "7.4 / trailer" "8.2 / trailer" "8.3 / trailer" "8.2 / trailer"

    [5] "8.2 / trailer" "5.3 / trailer" "7.5 / trailer" "8.8 / trailer"

    [9] "8.2 / trailer" "7.6 / trailer"
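
    The scraped values are still raw strings: the ratings carry a " / trailer" suffix and some titles end with a space. A possible clean-up step in base R is sketched below. The sample vectors are copied from the output above, and the names `rating_raw`, `rating`, `movie_raw` and `movie` are illustrative.

```r
# Sample values copied from the output above
rating_raw <- c("7.4 / trailer", "8.2 / trailer", "5.3 / trailer")

# Drop everything from the first space onward, then convert to numeric
rating <- as.numeric(sub(" .*$", "", rating_raw))

# Remove leading and trailing whitespace from the titles
movie_raw <- c("Silver Linings Playbook ", "Argo (DVDscr)")
movie <- trimws(movie_raw)
```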

    freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10]

    [1] "http://www.imdb.com/title/tt1045658/"

    [2] "http://www.imdb.com/title/tt0903624/"

    [3] "http://www.imdb.com/title/tt0454876/"

    [4] "http://www.imdb.com/title/tt1024648/"

    [5] "http://www.imdb.com/title/tt2024432/"

    [6] "http://www.imdb.com/title/tt1234719/"

    [7] "http://www.imdb.com/title/tt1446192/"

    [8] "http://www.imdb.com/title/tt1853728/"

    [9] "http://www.imdb.com/title/tt0443272/"

    [10] "http://www.imdb.com/title/tt1790885/?"

    # Build a data frame

    data.frame(movie=freak %>% html_nodes("td:nth-child(3)") %>% html_text() %>% .[1:10],
               rank=freak %>% html_nodes("td:nth-child(1)") %>% html_text() %>% .[2:11],
               rating=freak %>% html_nodes("td:nth-child(4)") %>% html_text() %>% .[1:10],
               imdb.url=freak %>% html_nodes("td:nth-child(4) a[href*='imdb']") %>% html_attr("href") %>% .[1:10],
               stringsAsFactors=FALSE)

                                   movie rank        rating                              imdb.url

    1            Silver Linings Playbook     1 7.4 / trailer  http://www.imdb.com/title/tt1045658/

    2  The Hobbit: An Unexpected Journey     2 8.2 / trailer  http://www.imdb.com/title/tt0903624/

    3          Life of Pi (DVDscr/DVDrip)    3 8.3 / trailer  http://www.imdb.com/title/tt0454876/

    4                       Argo (DVDscr)    4 8.2 / trailer  http://www.imdb.com/title/tt1024648/

    5                     Identity Thief     5 8.2 / trailer  http://www.imdb.com/title/tt2024432/

    6                           Red Dawn     6 5.3 / trailer  http://www.imdb.com/title/tt1234719/

    7      Rise Of The Guardians (DVDscr)    7 7.5 / trailer  http://www.imdb.com/title/tt1446192/

    8           Django Unchained (DVDscr)    8 8.8 / trailer  http://www.imdb.com/title/tt1853728/

    9                    Lincoln (DVDscr)    9 8.2 / trailer  http://www.imdb.com/title/tt0443272/

    10                  Zero Dark Thirty    10 7.6 / trailer  http://www.imdb.com/title/tt1790885/?
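
    The data frame above lines the columns up with hand-picked index ranges (.[2:11] for the rank, .[1:10] for the rest), presumably because the table contains an extra row with a single cell. A more defensive variant, sketched here on hypothetical inline HTML, selects the rows first and then pulls one cell per row with html_node(), which returns NA where a cell is missing, so the columns stay aligned. The table contents, the example.com links and the names `page`, `rows` and `df` are assumptions made for the sketch.

```r
library(rvest)

# Hypothetical table with one extra single-cell row, mimicking the layout above
page <- read_html('
<table>
  <tr><td colspan="4">torrentfreak.com</td></tr>
  <tr><td>1</td><td>(5)</td><td>Movie A</td>
      <td><a href="http://www.imdb.com/title/tt0000001/">7.4</a> /
          <a href="http://example.com/a">trailer</a></td></tr>
  <tr><td>2</td><td>(1)</td><td>Movie B</td>
      <td><a href="http://www.imdb.com/title/tt0000002/">8.2</a> /
          <a href="http://example.com/b">trailer</a></td></tr>
</table>')

rows <- page %>% html_nodes("tr")

# html_node() yields exactly one result per row (NA if the cell is absent)
df <- data.frame(
  rank     = rows %>% html_node("td:nth-child(1)") %>% html_text(),
  movie    = rows %>% html_node("td:nth-child(3)") %>% html_text(),
  imdb.url = rows %>% html_node("td:nth-child(4) a[href*='imdb']") %>% html_attr("href"),
  stringsAsFactors = FALSE
)

# Drop rows that have no movie cell (the single-cell row)
df <- df[!is.na(df$movie), ]
```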

    If you do not need the URLs, there is an even simpler way:

    freak %>% html_nodes("table") %>% html_table()

    [[1]]

               Ranking (last week)                             Movie IMDb Rating / Trailer

    1  torrentfreak.com        <NA>                              <NA>                  <NA>

    2                 1         (5)           Silver Linings Playbook         7.4 / trailer

    3                 2      (back) The Hobbit: An Unexpected Journey         8.2 / trailer

    4                 3         (9)        Life of Pi (DVDscr/DVDrip)         8.3 / trailer

    5                 4      (back)                     Argo (DVDscr)         8.2 / trailer

    6                 5         (…)                    Identity Thief         8.2 / trailer

    7                 6         (1)                          Red Dawn         5.3 / trailer

    8                 7         (2)    Rise Of The Guardians (DVDscr)         7.5 / trailer

    9                 8         (4)         Django Unchained (DVDscr)         8.8 / trailer

    10                9         (6)                  Lincoln (DVDscr)         8.2 / trailer

    11               10      (back)                  Zero Dark Thirty         7.6 / trailer
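
    As the [[1]] in the output shows, html_table() applied to a set of table nodes returns a list with one entry per table, so the table of interest still has to be picked out of that list. A small sketch on hypothetical inline HTML follows (the table contents and the names `page`, `tables` and `top` are illustrative):

```r
library(rvest)

# Hypothetical table with a header row
page <- read_html('
<table>
  <tr><th>Ranking</th><th>Movie</th><th>Rating</th></tr>
  <tr><td>1</td><td>Movie A</td><td>7.4</td></tr>
  <tr><td>2</td><td>Movie B</td><td>8.2</td></tr>
</table>')

# One list element per <table> found on the page
tables <- page %>% html_nodes("table") %>% html_table()

# Pick the first (here: only) table
top <- tables[[1]]
```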

  • Original article: https://www.cnblogs.com/lvfeilong/p/24fdsfds.html