How a web crawler works:
Given an initial URL, download the page at that URL, then find every link on that page that meets the download criteria.
Next, download the pages those links point to, and extract the links found on those pages in turn, repeating the process.
The following code implements this with breadth-first search and is offered for reference.
#!/usr/bin/python
import urllib2
import re

def downURL(url, filename):
    """Download the page at url and save it to filename."""
    print url
    print filename
    try:
        fp = urllib2.urlopen(url)
    except:
        print 'download exception'
        return 0
    op = open(filename, "wb")
    while 1:
        s = fp.read()
        if not s:
            break
        op.write(s)
    fp.close()
    op.close()
    return 1

# downURL('http://www.sohu.com', 'http.log')
def getURL(url):
    """Return all matching links found on the page at url."""
    try:
        fp = urllib2.urlopen(url)
    except:
        print 'get url exception'
        return []
    pattern = re.compile(r'http://sports\.jb200\.com/[^">]+\.shtml')
    urls = []
    while 1:
        s = fp.read()
        if not s:
            break
        urls += pattern.findall(s)
    fp.close()
    return urls
def spider(startURL, times):
    """Breadth-first crawl: download at most times pages, starting from startURL."""
    urls = [startURL]
    i = 0
    while 1:
        if i > times:
            break
        if len(urls) > 0:
            url = urls.pop(0)  # FIFO queue gives breadth-first order
            print url, len(urls)
            downURL(url, str(i) + '.htm')
            i = i + 1
            if len(urls) < times:
                for u in getURL(url):
                    if urls.count(u) == 0:  # skip links already queued
                        urls.append(u)
        else:
            break
    return 1

spider('http://www.jb200.com', 10)
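The same breadth-first crawl can also be written for Python 3, where urllib2 has been replaced by urllib.request. The sketch below is a minimal rewrite under two assumptions not in the original: the page fetcher is passed in as a parameter so the crawl logic can be exercised without network access, and a visited set plus collections.deque replace the list-as-queue with its linear duplicate scan.

```python
import re
from collections import deque
from urllib.request import urlopen

# Same link pattern as the original, with the dots escaped.
LINK_RE = re.compile(r'http://sports\.jb200\.com/[^">]+\.shtml')

def fetch(url):
    """Download a page body as text; return '' on any error."""
    try:
        with urlopen(url) as fp:
            return fp.read().decode("utf-8", errors="replace")
    except Exception:
        return ""

def spider(start_url, limit, fetch=fetch):
    """Breadth-first crawl: visit at most `limit` pages, return them in visit order."""
    queue = deque([start_url])
    seen = {start_url}   # every URL ever enqueued, to avoid re-enqueuing
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()        # FIFO pop gives breadth-first order
        page = fetch(url)
        visited.append(url)
        for link in LINK_RE.findall(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

if __name__ == "__main__":
    print(spider("http://www.jb200.com", 10))
```

Injecting `fetch` also makes the queue discipline easy to check: a dictionary of canned pages stands in for the network, and the returned visit order shows the FIFO behavior directly.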