A Brief Look at How the Baidu Crawler Handles HTTP Status Codes

I. Introduction

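An HTTP status code is a three-digit code that a web server, or the application running on it, returns to the client (for example a browser) with every response. Standard codes such as 200 OK tell the client what state the requested page is in. Baidu's crawler has to follow the same HTTP rules in order to fetch pages and store them in its database. This article therefore gives a plain-language overview of the status codes defined by HTTP and of which code is returned in which situation.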

II. HTTP Status Codes

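1. 200 OK: One of the most common and most important HTTP status codes. It indicates that the web server has processed the request successfully.

2. 301 Moved Permanently: A permanent redirect. For a given link such as www.example.com/old-page.html, the server redirects the URL to its new location, for example www.example.com/new-page.html.

3. 302 Found (Moved Temporarily): A temporary redirect. It works like 301 Moved Permanently, except that the URL change is only temporary.

4. 404 Not Found: The server cannot find the requested page.

5. 403 Forbidden: The server refuses the request. For example, a web server may block particular IP addresses, or block particular IP addresses from using particular methods (such as POST), and it returns 403 Forbidden in those cases.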
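
As a small illustration of the codes listed above, the sketch below looks each one up in Python's standard `http.HTTPStatus` enum and labels it with its class (2xx success, 3xx redirection, 4xx client error). The helper name `describe` and the `CODES` list are made up for this example and are not part of any Baidu tooling.

```python
from http import HTTPStatus

# The five status codes discussed in this section.
CODES = [200, 301, 302, 403, 404]

# The first digit of a status code identifies its class.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return the code, its standard reason phrase, and its class."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


if __name__ == "__main__":
    for code in CODES:
        print(describe(code))
```

Note that the standard reason phrase for 302 is "Found"; "Moved Temporarily" is an older name for the same code.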

III. How the Baidu Crawler Handles HTTP Status Codes

1. The Baidu crawler first sends a request to the server and waits for the response in order to obtain the page or other resource. If no response arrives within a certain time limit, the crawler treats the request as failed and stops crawling that page.

2. When a response arrives, the crawler inspects the HTTP status code returned by the server. If it is an error code such as 404 Not Found or 403 Forbidden, the crawler stops processing that page immediately. If it is a normal code such as 200 OK, the crawler continues and downloads the page content.

3. In addition to checking status codes, Baidu checks the site's robots exclusion protocol file (robots.txt) before sending requests, so that it does not waste resources on pages the site owner has disallowed for crawling.

4. Once the content of pages that returned a normal status code has been downloaded, the Baidu spider stores it in a database for later use, for example to index it so that the pages can appear in search results when users enter related keywords in the Baidu search box.

5. Finally, once all of the steps above have completed without errors, the Baidu spider moves on to the next page in its task queue and repeats the process until every queued page has been handled in turn.
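
The five steps above can be sketched as a small loop. This is only an illustration of the flow described in this article, not Baidu's actual implementation: the user-agent string `ExampleSpider`, the functions `crawl` and `fetch_with_timeout`, the 10-second timeout, and the use of a plain dictionary as the "database" are all assumptions made for the example. Redirects (301/302) are not covered by the steps above, so the sketch simply reports them as unhandled.

```python
import urllib.error
import urllib.request
from collections import deque
from typing import Callable, Dict, Iterable, Optional, Tuple
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"  # hypothetical name, not Baidu's real user agent

# A fetcher returns (status_code, body), or None if no response arrived in time.
Fetch = Callable[[str], Optional[Tuple[int, str]]]


def fetch_with_timeout(url: str, timeout: float = 10.0) -> Optional[Tuple[int, str]]:
    """Step 1: send the request and wait at most `timeout` seconds (assumed value)."""
    try:
        # Note: urlopen follows redirects itself, so 301/302 are not seen here.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""  # the server answered with an error code (403, 404, ...)
    except OSError:
        return None  # timeout or connection failure: treated as "no response"


def crawl(urls: Iterable[str], fetch: Fetch, robots: RobotFileParser,
          store: Dict[str, str]) -> Dict[str, str]:
    """Process each queued URL in order and report what happened to it."""
    results: Dict[str, str] = {}
    queue = deque(urls)  # step 5: the task queue
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(USER_AGENT, url):  # step 3: robots.txt check
            results[url] = "skipped: disallowed by robots.txt"
            continue
        response = fetch(url)  # step 1: request, with a time limit
        if response is None:
            results[url] = "failed: no response"
            continue
        status, body = response
        if status >= 400:  # step 2: error code, stop here
            results[url] = f"stopped: HTTP {status}"
        elif status == 200:  # steps 2 and 4: download and store
            store[url] = body
            results[url] = "stored"
        else:
            results[url] = f"not handled in this sketch: HTTP {status}"
    return results
```

To try it against a live site, one could load the site's rules with `RobotFileParser.set_url(...)` followed by `read()` and pass `fetch_with_timeout` as the fetcher; passing a stub function instead makes the loop easy to exercise without network access.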
