  • Web scraping in R: scraping Douban movie data with R

    Scraping Douban's top 25 movies and their ratings

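    # stringr provides the str_* functions; magrittr provides the %>% pipe
    library(stringr)
    library(magrittr)
    url <- 'http://movie.douban.com/top250?format=text'
    # Fetch the page source; each line of HTML becomes one element of web
    web <- readLines(url, encoding = "UTF-8")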
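    # Find the lines that contain movie titles
    name <- str_extract_all(string = web, pattern = '<span class="title">.+</span>')
    movie.names_line <- unlist(name)
    # Extract the title text with a regular expression; entries whose text
    # starts with "&" (such as "&nbsp;"-prefixed alternate titles) do not match
    movie.names <- str_extract(string = movie.names_line, pattern = ">[^&].+<") %>%
        str_replace_all(string = ., pattern = ">|<", replacement = "")
    # Drop the NA values left by the entries that did not match
    movie.names <- na.omit(movie.names)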
    # Get the number of ratings for each movie
    Rating <- str_extract_all(string = web, pattern = '<span>[:digit:]+人评价</span>')
    Rating.num_line <- unlist(Rating)
    Rating.num <- str_extract(string = Rating.num_line, pattern = "[:digit:]+") %>% as.numeric(.)
    # Get the rating scores (backslashes must be doubled inside R string literals)
    Score_line <- str_extract_all(string = web, 
                                  pattern = '<span class="rating_num" property="v:average">[\\d.]+</span>')
    Score_line <- unlist(Score_line)
    Score <- str_extract(string = Score_line, pattern = '\\d\\.\\d') %>% as.numeric(.)
    # Combine the results into one data frame
    MovieData <- data.frame(MovieName = movie.names,
                            RatingNum = Rating.num,Score = Score,
                            Rank = seq(1,25),stringsAsFactors = FALSE)
    # Visualization
    library(ggplot2)
    ggplot(data = MovieData, aes(x = Rank,y = Score)) + 
        geom_point(aes(size = RatingNum)) + 
        geom_text(aes(label = MovieName),colour = "blue", size = 4, vjust = -0.6)
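
    The extraction steps above can be tried offline before running them against the live page. The following is a minimal sketch, assuming stringr is installed; the sample lines, titles, scores, and counts are invented for illustration and only imitate the page structure that the patterns above expect. The names title_spans, score_spans, vote_spans, and sample_movies are hypothetical, and \\d+ is used here in place of [:digit:]+.

    ```r
    library(stringr)

    # Invented sample lines that imitate the page structure; not real Douban data
    web <- c(
      '<span class="title">电影甲</span>',
      '<span class="title">&nbsp;/&nbsp;Movie A</span>',
      '<span class="rating_num" property="v:average">9.5</span>',
      '<span>1000人评价</span>',
      '<span class="title">电影乙</span>',
      '<span class="rating_num" property="v:average">9.0</span>',
      '<span>2000人评价</span>'
    )

    # Titles: spans whose text starts with "&" do not match "[^&]" and become NA
    title_spans <- unlist(str_extract_all(web, '<span class="title">.+</span>'))
    titles <- str_replace_all(str_extract(title_spans, ">[^&].+<"), ">|<", "")
    titles <- titles[!is.na(titles)]

    # Scores: one digit, a literal dot, one digit
    score_spans <- unlist(str_extract_all(
      web, '<span class="rating_num" property="v:average">[\\d.]+</span>'))
    scores <- as.numeric(str_extract(score_spans, "\\d\\.\\d"))

    # Rating counts: the run of digits in front of "人评价"
    vote_spans <- unlist(str_extract_all(web, "<span>\\d+人评价</span>"))
    votes <- as.numeric(str_extract(vote_spans, "\\d+"))

    # With unequal lengths data.frame() would fail or recycle values, so check first
    stopifnot(length(titles) == length(scores), length(scores) == length(votes))

    sample_movies <- data.frame(MovieName = titles, RatingNum = votes, Score = scores,
                                Rank = seq_along(titles), stringsAsFactors = FALSE)
    print(sample_movies)
    ```

    Checking the three vector lengths before building the data frame is a cheap way to notice when the page layout has changed and one of the patterns no longer matches every movie.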
    
  • Original article: https://www.cnblogs.com/xihehe/p/8309023.html