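  • "Good Articles" links: crawler script

     Crawler script

     Environment: run this script on a Linux system (adjust it to match the HTML source of each blog)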

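    #!/bin/bash
    # Base URL of the paginated post list
    www_link=http://blog.oldboyedu.com/page/
    for i in {1..4}   # crawl the blog page by page
    do
        # Split each matching line on runs of >, <, " and space; fields 5 and 7 are the URL and the title
        curl "${www_link}${i}/" 2>/dev/null | grep tooltip | awk -F '[><" ]+' '{print $5"@"$7}' >> a1.txt
    done
    # Turn every "url@title" record into an <a> tag
    awk -F @ '{print "<a href="$1">"$2"</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"}' a1.txt > curl.txt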
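
Because the field numbers depend on each blog's markup, it can help to look at how awk splits one sample line before picking `$5` and `$7`. The sketch below is only an illustration: `show_fields` is a hypothetical helper name, and it assumes the same separator set as the first script (runs of `>`, `<`, `"` and space).

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
# print every field of each input line together with its index,
# using the same separator set as the first script.
show_fields() {
    awk -F '[><" ]+' '{ for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }'
}
```

One possible use is `curl "${www_link}1/" 2>/dev/null | grep tooltip | head -n 1 | show_fields`. The indices printed next to the URL and the title are the ones to use in the `print` statement.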
    #!/bin/bash
    www_link=http://www.cnblogs.com/clsn/default.html?page=
    for i in {1..8}   # crawl the blog page by page
    do
        curl "${www_link}${i}" 2>/dev/null | grep homepage | grep -v "ImageLink" | awk -F '[><"]' '{print $7"@"$9}' >> a1.txt
    done
    grep -E -v "pager" a1.txt > a2.txt   # drop lines containing "pager"; the result goes to a2.txt
    # Strip the spaces from the file: the for loop below splits on whitespace,
    # so a line containing a space would be read as two items instead of one
    b=$(sed "s# ##g" a2.txt)
    for i in $b
    do
        c=$(echo $i | awk -F @ '{print $1}')   # c = post URL
        d=$(echo $i | awk -F @ '{print $2}')   # d = post title
        echo "<a href="${c}">${d}</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" >> curl.txt   # curl.txt collects the generated <a> tags
    done
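
The script above removes spaces because `for i in $b` splits on whitespace. An alternative, shown here only as a sketch, is to read the file line by line with `while read`, which keeps spaces inside titles. `make_links` is a hypothetical helper name. Unlike the original, it wraps the `href` value in double quotes.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
# read "url@title" records from stdin and print one <a> tag per record.
make_links() {
    local url title
    # IFS=@ splits at the first "@"; everything after it goes into title
    while IFS=@ read -r url title; do
        # Skip blank or incomplete records
        [ -n "$url" ] && [ -n "$title" ] || continue
        printf '<a href="%s">%s</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n' "$url" "$title"
    done
}
```

It could be called as `make_links < a2.txt >> curl.txt`, in place of the `sed` step and the `for` loop.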

    Results

    # tail curl.txt
    <a href=http://www.cnblogs.com/clsn/p/8093301.html>JIRA敏捷开发平台部署记录</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <a href=http://www.cnblogs.com/clsn/p/8087501.html>MySQL索引管理与执行计划</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <a href=http://www.cnblogs.com/clsn/p/8087417.html>MySQL-Select语句高级应用</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <a href=http://www.cnblogs.com/clsn/p/8052649.html>keepalived实现服务高可用</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  • Original post: https://www.cnblogs.com/xzy-blog/p/8734093.html