上次项目中用到<<PHP采集淘宝商品>>
此方法有一个缺点,就是执行效率问题.一个商品采集平均需要0.8秒.那10000个商品采集完需要2个半小时.
首先想到的解决办法是并发.
但是PHP不支持并发(这里是指通过PHP命令执行PHP文件,如果通过apache或nginx等做服务器是可以并发的,是并发访问,不能在程序中实现并发).
通过Shell把对php命令推到后台执行,以达到并发的效果.
整体思路:
1.在Shell中连接数据库,取出需要更新的产品
2.Shell中对数据进行循环,把商品id,price,url传递给PHP执行,将执行过程推到后台
3.每循环20条使程序暂停5秒,达到控制并发数的目的
4.php得到id,price,url参数后,通过URL进行采集,并返回现价
5.将现价和原数据库中的价格进行比较,如果有变化或下架则更新.
6.将执行结果写入日志文件.
Shell
#!/bin/sh #updateprice.sh j=0 currcyline=0; #查询数据库 for i in `/usr/local/bin/mysql -uroot -pshop123 shop -e"SELECT id,price,url FROM s_goods WHERE url!='' AND keyid like 'taobao%' AND is_off_sale=0 ORDER BY id DESC "` do if [ $j -gt 2 ]; then #前3个循环分别为id,price,url这相当于表头,不需要进行操作,所以从第3开始. line=$(($j%3)) case $line in 0) currcyline=$(($currcyline+1)) s=$(($currcyline%20)) if [ $s -eq 0 ]; then sleep 5 #每循环20次休息5秒,以此来控制避免产生过多的后台进行,使服务器压力过大或死机. fi id=$i;; 1) price=$i;; 2) url=$i; #echo id:${id} price:${price} url:${url} { #调用php命令执行PHP文件. re=`/usr/local/bin/php /var/www/9384shop/cron/goodsupdate.php ${id} ${price} "${url}"` }& #此&为推到后台执行, 关键 esac fi j=$(($j+1)) done wait #等待后台进行执行完成
PHP:
<?php /* ============================================================================ Name : goodsupdate.php Author : 风飘无痕 Version : Copyright : Your copyright notice Description : Collect taobao goods ============================================================================ */ //$argv为获取命令中的参数 $s=microtime(1); $id=$argv[1]; $oldprice=$argv[2]; $price=getPrice($argv[3]); $t=microtime(1)-$s; $r=array(); $r[]=date('Y-m-d H:i:s'); $r[]=$id; $r[]=ceil($t*1000)/1000; if($price=='soldout'){ $r[]="OutStock"; $con=mysql_connect('localhost','shop','shop123'); mysql_select_db("shop", $con); mysql_query("UPDATE s_goods SET is_off_sale=1 WHERE id=".$id); mysql_close($con); } elseif($price===false) $r[]= 'FALSE'; elseif($price==$oldprice) $r[]='EQUAL'; else{ $r[]="UPDATE"; $r[]=$oldprice; $r[]=$price; $con=mysql_connect('localhost','shop','shop123'); mysql_select_db("shop", $con); mysql_query("UPDATE s_goods SET price=".$price." WHERE id=".$id); mysql_close($con); } //以日志的形式保存执行过程 $h=fopen('/home/staff/www/9384shop/log/goodsUpdate.log','a+'); fputcsv($h,$r); fclose($h); function getPrice($url,$time=1){ $des_url=''; $ch = curl_init(); curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0'); curl_setopt($ch, CURLOPT_REFERER,'http://www.tmall.com/');//采集淘宝商品必须设置此项 curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);//设置输出方式, 0为自动输出返回的内容, 1为返回输出的内容,但不自动输出. curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30); //timeout on connect curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout on response curl_setopt($ch, CURLOPT_HEADER, 1);//是否输出头信息,0为不输出,非零则输出 curl_setopt($ch, CURLOPT_MAXREDIRS, 10 ); curl_setopt($ch, CURLOPT_URL, $url); $content = curl_exec($ch); curl_close($ch); if(preg_match('/noitem.htm/',$content)){ return 'soldout'; }elseif(preg_match("/'reservePrice's*:s*'([d.]+?)',/",$content,$price)){ $price = (float)$price[1]; }elseif(preg_match('/price:([d.]+?),/',$content,$price)){ $price = (float)$price[1]; } if(!$price){ preg_match('/id=(d+)+/',$url,$temp); $url2="http://mdskip.taobao.com/core/initItemDetail.htm?itemId=".$temp[1]; $ch = curl_init(); curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" ); curl_setopt( $ch, CURLOPT_URL, $url2 ); curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true ); curl_setopt( $ch, CURLOPT_ENCODING, "" ); curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ); curl_setopt( $ch, CURLOPT_REFERER, 'http://www.tmall.com' ); curl_setopt( $ch, CURLOPT_AUTOREFERER, true ); curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 10 ); curl_setopt( $ch, CURLOPT_TIMEOUT, 10 ); curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 ); $price_content = curl_exec( $ch ); $response = curl_getinfo( $ch ); curl_close ( $ch ); $price_content=json_decode(iconv('gbk','utf-8',preg_replace('/(d{10,}):/','"${1}":',$price_content)),true); $priceinfo=$price_content['defaultModel']['itemPriceResultDO']['priceInfo']; $price=array(); if(is_array($priceinfo)){ foreach ($priceinfo as $v){ $price[]=$v['price']; if(is_array($v['promotionList'])){ foreach ($v['promotionList'] as $v2){ $price[]=$v2['extraPromPrice']?$v2['extraPromPrice']:$v2['price']; } } if(is_array($v['suggestivePromotionList'])){ foreach ($v['suggestivePromotionList'] as $v2){ $price[]=$v2['extraPromPrice']?$v2['extraPromPrice']:$v2['price']; } } } } $price=count($price)>0?min($price):false; } if($price) return $price; elseif($time<3) return getPrice($url,$time+1);//如果没有取到价格递归执行,最多执行3次. else return false; }
执行结果:
tail -10 goodsUpdate.log "2014-03-21 13:45:34",13357,0.273,EQUAL "2014-03-21 13:45:35",13380,5.883,EQUAL "2014-03-21 13:45:35",13343,0.914,EQUAL "2014-03-21 13:45:35",13344,0.923,EQUAL "2014-03-21 13:45:35",13347,0.927,UPDATE,599.00,181.00 "2014-03-21 13:45:35",13339,0.908,EQUAL "2014-03-21 13:45:35",13342,0.93,EQUAL "2014-03-21 13:45:35",13348,0.933,EQUAL "2014-03-21 13:45:35",13349,0.946,UPDATE,1547.00,1877.00 "2014-03-21 13:45:35",13338,0.947,EQUAL
此方法比只用PHP更新大大节约了时间,更新2万个商品大约是半小时.但是线程数非常不好控制