Python: Implementation Code for a Small NetEase News Crawler

Published: 2020-10-12. Editor: 脚本学堂

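This article presents a small crawler, written in Python, that scrapes news content from NetEase (www.163.com). Readers who need something similar may find it worth studying.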

Code:

# coding: utf-8

#---------------------------------------
# NetEase News mini crawler
#---------------------------------------
# Overview: parse www.163.com and pick out the links that start with news.163.com,
# fetch the content of each link, and merge it all into 163News.txt for easy reading.
# Ported to Python 3 (urllib.request, print(), != instead of <>).
#---------------------------------------

import re
import urllib.request


def fetch(url):
    # Assumption: the pages are GBK-encoded; bytes that cannot be decoded are dropped.
    return urllib.request.urlopen(url).read().decode("gbk", "ignore")


strTitle = ""
strTxtTmp = ""
strTxtOK = ""

f = open("163News.txt", "w+", encoding="utf-8")

m = re.findall(r"news\.163\.com/\d.+?</a>", fetch("http://www.163.com"), re.M)
for i in m:
    testUrl = i.split('"')[0]
    # Keep only links whose path ends in "html"
    if testUrl[-4:-1] == "htm":

        # Append the link and its anchor text to the title index
        strTitle = strTitle + "\n" + i.split('"')[0] + i.split('"')[1]

        # Rebuild the full URL
        okUrl = i.split('"')[0]
        UrlNews = "http://" + okUrl

        print(UrlNews)

        # Find the body paragraphs in the linked page.
        # Strip some of the HTML so the text is easier to read.
        n = re.findall(r"<P style=.TEXT-INDENT: 2em.>(.*?)</P>", fetch(UrlNews), re.M)
        for j in n:
            if len(j) != 0:
                j = j.replace("&nbsp", "\n")
                j = j.replace("<STRONG>", "\n_____")
                j = j.replace("</STRONG>", "_____\n")
                strTxtTmp = strTxtTmp + j + "\n"
                strTxtTmp = re.sub(r"<a href=(.*?)>", r"", strTxtTmp)
                strTxtTmp = re.sub(r"</[Aa]>", r"", strTxtTmp)

        # Combine the link title and the body text
        strTxtOK = strTxtOK + "\n\n\n===============" + i.split('"')[0] + i.split('"')[1] + "===============\n" + strTxtTmp
        strTxtTmp = ""
        print(strTxtOK)

# Once everything has been parsed, write to the file and close it
f.write(strTitle + "\n\n\n" + strTxtOK)
f.close()
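
The two parsing steps in the listing (picking out news.163.com links, then cleaning up a paragraph body) can be tried offline without touching the network. The following is only a minimal sketch under stated assumptions: the helper names `extract_news_links` and `clean_paragraph`, the sample markup, and the sample URLs are all made up for illustration and are not part of the original script. It also deviates from the listing in two small ways: it captures the link title with a regex group instead of `split('"')`, and it replaces `&nbsp;` with a space rather than a newline.

```python
import re

# Hypothetical helpers and sample data, for illustration only.

# Matches the host/path of an article link plus its anchor text.
LINK_RE = re.compile(r'(news\.163\.com/\d[^"]*?html)"[^>]*>(.*?)</a>')


def extract_news_links(html):
    """Return (full_url, title) pairs for news.163.com article links."""
    return [("http://" + path, title) for path, title in LINK_RE.findall(html)]


def clean_paragraph(fragment):
    """Strip the same tags the listing strips from one paragraph body."""
    text = fragment.replace("&nbsp;", " ")  # deviation: space, not newline
    text = text.replace("<STRONG>", "\n_____").replace("</STRONG>", "_____\n")
    text = re.sub(r"<a href=(.*?)>", "", text)  # drop opening anchor tags
    text = re.sub(r"</[Aa]>", "", text)         # drop closing anchor tags
    return text


# Made-up front-page markup: only the first link is a news article.
SAMPLE_HTML = (
    '<a href="http://news.163.com/12/0810/10/ABC123.html">First story</a>\n'
    '<a href="http://sports.163.com/12/0810/10/XYZ.html">Sports story</a>\n'
    '<a href="http://news.163.com/special/topic/">Topic page</a>\n'
)

# Made-up paragraph body, as captured by the <P ...>(.*?)</P> group.
SAMPLE_PARAGRAPH = (
    '<STRONG>Headline</STRONG>Read '
    '<a href="http://news.163.com/x.html">more</a>&nbsp;here'
)

links = extract_news_links(SAMPLE_HTML)
cleaned = clean_paragraph(SAMPLE_PARAGRAPH)
print(links)
print(cleaned)
```

Keeping the parsing in small functions like these makes the regular expressions testable against saved HTML before running them on the live site.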

Page link: http://www.jb200.com/article/11206.html
