  • Web Scraping in R: Scraping Douban Movie Data with R

    Scraping the top 25 Douban movies and their ratings

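    # stringr provides str_extract(), str_extract_all() and str_replace_all();
    # magrittr provides the %>% pipe
    library(stringr)
    library(magrittr)

    url <- 'http://movie.douban.com/top250?format=text'
    # Fetch the page source; each line of HTML becomes one element of `web`
    web <- readLines(url, encoding = "UTF-8")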
    # Find the lines that contain movie titles
    name <- str_extract_all(string = web, pattern = '<span class="title">.+</span>')
    movie.names_line <- unlist(name)
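    # Extract the title text with a regular expression; entries whose text starts
    # with "&" (e.g. "&nbsp;"-prefixed alternate titles) do not match and become NA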
    movie.names <- str_extract(string = movie.names_line, pattern = ">[^&].+<") %>% 
        str_replace_all(string = ., pattern = ">|<",replacement = "")
    movie.names <- na.omit(movie.names)
    # Extract the number of people who rated each movie
    Rating <- str_extract_all(string = web,pattern = '<span>[:digit:]+人评价</span>')
    Rating.num_line <- unlist(Rating)
    Rating.num <- str_extract(string = Rating.num_line, pattern = "[:digit:]+") %>% as.numeric(.)
    # Extract the rating scores
    # (backslashes must be doubled inside R string literals, hence "\\d")
    Score_line <- str_extract_all(string = web, 
                                  pattern = '<span class="rating_num" property="v:average">[\\d.]+</span>')
    Score_line <- unlist(Score_line)
    Score <- str_extract(string = Score_line, pattern = '\\d\\.\\d') %>% as.numeric(.)
    # Combine the results into one data frame (all three vectors must have 25 elements)
    MovieData <- data.frame(MovieName = movie.names,
                            RatingNum = Rating.num,Score = Score,
                            Rank = seq(1,25),stringsAsFactors = FALSE)
    # Visualize: score by rank, with point size showing the number of ratings
    library(ggplot2)
    ggplot(data = MovieData, aes(x = Rank,y = Score)) + 
        geom_point(aes(size = RatingNum)) + 
        geom_text(aes(label = MovieName),colour = "blue", size = 4, vjust = -0.6)
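
The three extraction steps above can be tried offline before touching the live page. The following is a minimal sketch under stated assumptions: the `web` vector is made-up sample markup that only imitates the shape the patterns expect (the real page may differ), `first_match()` is a helper defined here for illustration, and base R's `regexpr()`/`regmatches()` stand in for the stringr calls so that no packages are needed.

```r
# Hypothetical sample lines imitating the markup the patterns target (not real page data).
# "\u4eba\u8bc4\u4ef7" spells the rating-count suffix used on the page, written with escapes.
suffix <- "\u4eba\u8bc4\u4ef7"
web <- c(
  '<span class="title">Movie A</span>',
  '<span class="title">&nbsp;/&nbsp;Alternate Title A</span>',
  '<span class="rating_num" property="v:average">9.7</span>',
  paste0('<span>1200', suffix, '</span>'),
  '<span class="title">Movie B</span>',
  '<span class="rating_num" property="v:average">9.5</span>',
  paste0('<span>800', suffix, '</span>')
)

# Helper for this sketch only: first match of `pattern` in each element of `x`;
# elements without a match are dropped.
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  regmatches(x, m)
}

# Titles: keep only text that does not start with "&", which skips the alternate title
title_lines <- first_match(web, '<span class="title">.+</span>')
movie_names <- gsub(">|<", "", first_match(title_lines, ">[^&].+<"))

# Rating counts: the digits in front of the suffix
count_lines <- first_match(web, paste0("<span>[0-9]+", suffix, "</span>"))
rating_num <- as.numeric(first_match(count_lines, "[0-9]+"))

# Scores: note the doubled backslashes required inside R string literals
score_lines <- first_match(web, '<span class="rating_num" property="v:average">[\\d.]+</span>')
score <- as.numeric(first_match(score_lines, "\\d\\.\\d"))

print(data.frame(MovieName = movie_names, RatingNum = rating_num, Score = score,
                 stringsAsFactors = FALSE))
```

Running the patterns on a small fixed sample like this makes it easier to see why the three vectors end up the same length, which `data.frame()` requires.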
    
  • Original article: https://www.cnblogs.com/xihehe/p/8309023.html