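PHP: a homemade crawler based on simple_html_dom, v1.0
My enthusiasm for web page parsing and building crawlers has never faded. Today I made a crawler with the open-source simple_html_dom.php parsing library: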
<?php
/*
 *.Pho spider v1.0
 *.Written by Radish.ghost 2015.1.20
 */
//error_reporting(1); //limit error reporting
//cURL mode: I will implement it in a later version
include_once("simple_html_dom.php");

$html = file_get_html("http://www.baidu.com"); //The URL you want to crawl
$tmp = array(); //URLs collected in the first pass
foreach ($html->find("a") as $e) {
    $f = $e->href;
    if ($f == "") continue; //Skip anchors without an href
    //if($f[10]==":")continue;
    if ($f[0] == "/") $f = "http://www.baidu.com" . $f; //Complete relative URLs
    if (stripos($f, "https://") === 0) continue; //Skip https:// URLs (simple_html_dom might not be able to fetch them)
    if (stripos($f, "baidu") === false) continue; //Skip URLs outside this website
    echo $f . "<br>";
    $tmp[] = $f; //Save the URL into the array
}
foreach ($tmp as $r) { //Crawl the URLs saved in $tmp
    $html2 = file_get_html($r); //Repeat the step; the return value is not checked here
    foreach ($html2->find("a") as $a) {
        $u = $a->href;
        if ($u == "") continue;
        if ($u[0] == "/") $u = "http://www.baidu.com" . $u;
        if (stripos($u, "https://") === 0) continue;
        if (stripos($u, "baidu") === false) continue;
        echo $u . "<br>";
    }
    $html2 = null;
}
?>
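At the end it always stops with "Fatal error: Call to a member function find() on a non-object in D:xampphtdocshtmlindex.php on line 21". After talking it over with a senior classmate I fixed a lot of small mistakes, but this one is still unsolved. I hope some expert can give me a pointer.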
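A likely cause of that fatal error: file_get_html() returns false instead of a DOM object when a page cannot be fetched (or the fetched content is empty or too large for the library), and calling find() on false fails. Below is a minimal sketch of a guard. The helper name safe_find_links and its $loader parameter are my own assumptions for illustration, not part of simple_html_dom.

```php
<?php
// Hypothetical helper (not part of simple_html_dom).
// $loader is any callable that behaves like file_get_html():
// it returns a DOM object on success and false on failure.
function safe_find_links($url, $loader)
{
    $dom = call_user_func($loader, $url);
    if (!is_object($dom)) {
        // Without this check, $dom->find() would raise
        // "Call to a member function find() on a non-object".
        return array();
    }
    $links = array();
    foreach ($dom->find("a") as $a) {
        $links[] = $a->href; // collect the raw href of every anchor
    }
    return $links;
}

// Usage with the real library, assuming simple_html_dom.php is loadable:
// include_once("simple_html_dom.php");
// foreach (safe_find_links("http://www.baidu.com", "file_get_html") as $href) {
//     echo $href . "<br>";
// }
```

With a guard like this, a page that fails to load is simply skipped, so one bad URL in $tmp no longer aborts the whole run.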
---------------------divider---------------------
simple_html_dom download:
https://github.com/Ph0enixxx/simple_html_dom
= = git4win doesn't work on my home computer