A homemade PHP crawler based on simple_html_dom, v1.0

Created: 2015-01-20

My enthusiasm for web page parsing and building crawlers has never faded. Today I built a crawler with the open-source simple_html_dom.php parsing library:

<?php
/*
	*.Pho spider v1.0
	*.Written by Radish.ghost 2015.1.20
*/
//error_reporting(1); //report fatal errors only
//cURL mode //I will implement it in a later version
include_once("simple_html_dom.php");
$html=file_get_html("http://www.baidu.com");//The URL you want to dig

$tmp=array();//Save the URLs found in the first dig
foreach($html->find("a") as $e)
{
	$f=$e->href;
	//if($f[10]==":")continue;
	if(strpos($f,"/")===0)$f="http://www.baidu.com".$f;//Complete a site-relative URL
	if(stripos($f,"https://")===0)continue;//Skip "https://" URLs (simple_html_dom might not be able to parse them)
	if(stripos($f,"baidu")===false)continue;//Skip URLs that are not on this website
	echo $f . "<br>";
	$tmp[]=$f; //Save the URL into the array
}

foreach($tmp as $r) //Dig the URLs in $tmp[]
{
	$html2=file_get_html($r); //Repeat the step for each saved URL
	foreach($html2->find("a") as $a)
	{
		$u=$a->href;
		if(strpos($u,"/")===0)$u="http://www.baidu.com".$u;
		if(stripos($u,"https://")===0)continue;
		if(stripos($u,"baidu")===false)continue;
		echo $u."<br>";
	}
	$html2=null;
}
?>
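
The link filtering in both loops could be pulled into one helper. The sketch below is only an illustration under my own assumptions: normalize_link() and its $host parameter are hypothetical names that come from neither the script nor simple_html_dom, and the extra handling of empty hrefs and protocol-relative links ("//host/path") is a variation on the script's rules, not what it does today.

```php
<?php
// Hypothetical helper, not part of simple_html_dom: turn an href into an
// absolute http:// URL, or return null when the link should be skipped.
function normalize_link($href, $host = "www.baidu.com")
{
    if (!is_string($href) || $href === "") {
        return null;                        // missing or empty href attribute
    }
    if (strpos($href, "//") === 0) {
        $href = "http:" . $href;            // protocol-relative link
    } elseif ($href[0] === "/") {
        $href = "http://" . $host . $href;  // site-relative link
    }
    if (stripos($href, "http://") !== 0) {
        return null;                        // https://, javascript:, mailto:, bare relative paths
    }
    if (stripos($href, "baidu") === false) {
        return null;                        // same loose "contains baidu" test as the script
    }
    return $href;
}
```

Inside either loop this would be used as `$f = normalize_link($e->href); if ($f === null) continue;`.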

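The crawler always ends with the warning "Fatal error: Call to a member function find() on a non-object in D:\xampp\htdocs\html\index.php on line 21". After talking it over with a senior classmate I fixed a lot of small mistakes, but this one is still unsolved. I hope an expert can give me some pointers.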
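
A possible explanation, offered as a guess rather than a confirmed diagnosis: file_get_html() can return false instead of a DOM object when a page cannot be loaded, and calling find() on false produces exactly this kind of error. One way that might happen here is a protocol-relative link such as "//news.baidu.com/": it starts with "/", so the script rewrites it to "http://www.baidu.com//news.baidu.com/", which may fail to load. A minimal sketch of a guard follows. crawl_links() and its $fetch parameter are hypothetical names used for illustration; in the crawler, $fetch would be 'file_get_html'.

```php
<?php
// Sketch: skip pages that could not be loaded instead of calling find() on false.
// crawl_links() is a hypothetical helper; $fetch is any callable that returns
// either a DOM object with a find() method or false (e.g. 'file_get_html').
function crawl_links(array $urls, $fetch)
{
    $found = array();
    foreach ($urls as $url) {
        $page = call_user_func($fetch, $url);
        if (!is_object($page)) {
            $found[$url] = null;        // record the failed page and move on
            continue;
        }
        $hrefs = array();
        foreach ($page->find("a") as $a) {
            $hrefs[] = $a->href;        // collect every link on the page
        }
        $found[$url] = $hrefs;
        $page = null;                   // drop the reference, as the script does
    }
    return $found;
}
```

With this in place, the second loop would become `$result = crawl_links($tmp, 'file_get_html');`, and any URL mapped to null marks a page that failed to load and is worth inspecting by hand.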

---------------------Divider---------------------

Download simple_html_dom:

https://github.com/Ph0enixxx/simple_html_dom

= = git4win doesn't work on my home computer
