
How to Scrape Web Page Content

Suppose you are given a URL and asked to extract specific content from the page, for example the Douban movie chart. How would you do it?

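The structure of a web page closely resembles XML, so we can parse HTML in much the same way as we parse XML. The two still differ considerably, since real-world HTML is often not well-formed XML. With that said, let's start parsing HTML.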

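There are many XML parsing libraries; this article uses libxml2. Because libxml2 exposes a C API, I use hpple, a library that wraps it in Objective-C, available at https://github.com/topfunky/hpple. The page to scrape is the Douban movie chart at http://movie.douban.com/chart.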

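Next, create a new project with ARC enabled, add the libxml2 and hpple libraries, and create an entity class named Movie. The complete project structure is shown in the screenshot:

[Screenshot: project structure]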

Movie is declared as follows. It is an entity class, and its properties are determined by the content scraped from the page.

movie.h

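#import <Foundation/Foundation.h>

// Entity class: one instance per movie scraped from the chart page.
@interface Movie : NSObject
@property(nonatomic, copy) NSString   *name;
@property(nonatomic, copy) NSString   *imageUrl;
// Named movieDescription so it does not clash with -[NSObject description].
@property(nonatomic, copy) NSString   *movieDescription;
@property(nonatomic, copy) NSString   *movieUrl;
@property(nonatomic) NSInteger  ratingNumber;
@property(nonatomic, copy) NSString   *comment;
@end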

Now for the most important part. Whatever the page contains, we first have to download it. The method below uses NSURLConnection to fetch the entire page.

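- (void)loadHTMLContent
{
    // MOVIE_URL is defined elsewhere in the project; it is assumed to hold the chart URL given above.
    NSString *movieUrl = MOVIE_URL;
    // Percent-escape the URL string (this method is deprecated as of iOS 9).
    NSString *urlString = [movieUrl stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
    NSURL *url = [NSURL URLWithString:urlString];
    
    NSURLRequest *request = [NSURLRequest requestWithURL:url];
    
    [UIApplication sharedApplication].networkActivityIndicatorVisible = YES;
    
    // Capture self weakly so the completion block does not retain the view controller.
    __weak ViewController *weak_self = self;
    // The handler runs on the main queue, so parserHTML: may update the UI directly.
    [NSURLConnection sendAsynchronousRequest:request queue:[NSOperationQueue mainQueue] completionHandler:^(NSURLResponse *response, NSData *data, NSError *error) {
        if (nil == error) {
            // Uncomment to inspect the raw page while debugging:
//            NSString *retString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
//            NSLog(@"%@", retString);
            [weak_self parserHTML:data];
        } else {
            NSLog(@"Failed to load %@: %@", url, error);
        }
        
        [UIApplication sharedApplication].networkActivityIndicatorVisible = NO;
    }];
}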
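
The NSURLConnection call used in loadHTMLContent was deprecated in iOS 9. As a rough sketch only, the same fetch on a newer SDK might look like the following, with a basic HTTP status check added. FetchPage, IsSuccessStatus and the error domain are hypothetical names chosen for illustration; they are not part of the original project.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES for HTTP status codes in the 2xx range.
static BOOL IsSuccessStatus(NSInteger status)
{
    return status >= 200 && status <= 299;
}

// Hypothetical helper: downloads `url` and calls `completion` on the main queue,
// passing the body on success, or nil data plus an error on failure.
static void FetchPage(NSURL *url, void (^completion)(NSData *data, NSError *error))
{
    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
        NSData *body = data;
        NSError *failure = error;

        // Treat a non-2xx HTTP status as a failure too.
        if (failure == nil && [response isKindOfClass:[NSHTTPURLResponse class]]) {
            NSInteger status = [(NSHTTPURLResponse *)response statusCode];
            if (!IsSuccessStatus(status)) {
                failure = [NSError errorWithDomain:@"FetchPageErrorDomain"  // made-up domain
                                              code:status
                                          userInfo:nil];
            }
        }
        if (failure != nil) {
            body = nil;
        }

        // The session calls back on a background queue; hop to the main queue for UI work.
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(body, failure);
        });
    }];
    [task resume];
}
```

A view controller could then call FetchPage with the chart URL and pass any non-nil data to parserHTML:. Note that on iOS 9 and later, App Transport Security blocks plain http URLs unless the app declares an exception.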

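loadHTMLContent only fetches the page. HTTP details and error handling are outside the scope of this article, so the code is kept simple. It calls a method named parserHTML:, which parses the downloaded page. Before looking at that method, a few words about XPath.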

Suppose a simple web page looks like this:

<html>
  <head>
    <title>Some webpage</title>
  </head>
  <body>
    <p class="normal">This is the first paragraph</p>
    <p class="special">This is the second paragraph. <b>This is in bold.</b></p>
  </body>

</html>

For example, to get the content of the title, the XPath expression is /html/head/title. To get the node with class="special", the XPath is /html/body/p[@class='special'].
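
To try these two expressions against the sample page, a small command-line sketch using hpple could look like this. FirstTextForXPath is a hypothetical helper written for this example, not an hpple API, and the program assumes hpple's sources are in the project and libxml2 is linked.

```objective-c
#import <Foundation/Foundation.h>
#import "TFHpple.h"   // hpple; the project must also link against libxml2

// Hypothetical helper: returns the content of the first child node of the
// first element matching `xpath` in `html`, or nil when nothing matches.
static NSString *FirstTextForXPath(NSString *html, NSString *xpath)
{
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    TFHpple *parser = [TFHpple hppleWithHTMLData:data];
    NSArray *nodes = [parser searchWithXPathQuery:xpath];
    TFHppleElement *first = [nodes firstObject];
    return [[first firstChild] content];
}

int main(void)
{
    @autoreleasepool {
        // The sample page from the text, written as one string.
        NSString *html =
            @"<html><head><title>Some webpage</title></head>"
            @"<body>"
            @"<p class=\"normal\">This is the first paragraph</p>"
            @"<p class=\"special\">This is the second paragraph. <b>This is in bold.</b></p>"
            @"</body></html>";

        NSLog(@"title: %@", FirstTextForXPath(html, @"/html/head/title"));
        NSLog(@"special: %@", FirstTextForXPath(html, @"/html/body/p[@class='special']"));
    }
    return 0;
}
```

The helper reads only the first child node, so for the second paragraph it would be expected to return the text before the nested b element rather than the whole paragraph.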

So once you have the right XPath, you get the corresponding nodes. Here is how to parse the HTML with hpple:

- (void)parserHTML:(NSData *)data
{
    if (nil != data) {
        TFHpple *movieParser = [TFHpple hppleWithHTMLData:data];
        // Absolute path to the <a class="nbg"> link of each movie in the chart table.
        NSString *movieXpathQueryString = @"/html/body/div[@id='wrapper']/div[@id='content']/div[@class='grid-16-8 clearfix']/div[@class='article']/div[@class='indent']/table/tr/td/a[@class='nbg']";
        NSArray *movieNodes = [movieParser searchWithXPathQuery:movieXpathQueryString];
        
        for (TFHppleElement *element in movieNodes) {
            Movie *m = [[Movie alloc] init];
            // The link's attributes carry the movie title and its detail page URL.
            m.name = [element objectForKey:@"title"];
            m.movieUrl = [element objectForKey:@"href"];
            
            // The poster is the <img> child of the link. objectForKey: returns nil
            // for a missing attribute, so no exception handling is needed here.
            for (TFHppleElement *child in element.children) {
                if ([child.tagName isEqualToString:@"img"]) {
                    m.imageUrl = [child objectForKey:@"src"];
                }
            }
            
            // self.movies is assumed to be an NSMutableArray created elsewhere (e.g. in viewDidLoad).
            [self.movies addObject:m];
        }
        
        [self.movieTableView reloadData];
    }
}

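The code first works out the path to the target nodes, then calls searchWithXPathQuery:, which returns an array. Iterating over that array and building the model objects is all it takes to show the data in a table view. The result is shown in the screenshot:

[Screenshot: the scraped movies displayed in a table view]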
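
The absolute path used in parserHTML: depends on every enclosing div, so it stops matching as soon as the page layout changes. One possible alternative, sketched below, is a relative XPath that searches the whole document. This assumes that the nbg class is used only on the movie poster links, which has not been verified against the live page. PosterLinks is a hypothetical helper, not part of hpple.

```objective-c
#import <Foundation/Foundation.h>
#import "TFHpple.h"   // hpple; the project must also link against libxml2

// Hypothetical helper: collects title/href pairs from every <a class="nbg">
// anywhere in the document, regardless of the elements that enclose it.
static NSArray *PosterLinks(NSData *htmlData)
{
    NSMutableArray *links = [NSMutableArray array];
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];
    for (TFHppleElement *anchor in [parser searchWithXPathQuery:@"//a[@class='nbg']"]) {
        NSString *title = [anchor objectForKey:@"title"];
        NSString *href = [anchor objectForKey:@"href"];
        if (title != nil && href != nil) {   // skip anchors missing either attribute
            [links addObject:@{ @"title": title, @"href": href }];
        }
    }
    return links;
}
```

The trade-off is that a shorter expression is less sensitive to layout changes but may also match links elsewhere on the page if the class is reused.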

And that's it: the page content has been scraped. Real projects are more complicated than this, so treat it only as an introductory example.

Reference: http://www.raywenderlich.com/14172/how-to-parse-html-on-ios#

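Note: This article was originally written by 啸寒. Please support the original author! When reposting, include a link to the original: http://www.cnblogs.com/xiaohan-wu/p/3203932.html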
