Wuhan Shanghai Dragon: An Analysis of How Search Engine Spiders Work
Like a browser, a search engine spider has a user agent name that identifies it. A webmaster can see each search engine's specific agent name in the log files and use it to identify that engine's spider.
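As a rough illustration of spotting spiders in a log file, the sketch below checks one access log line for a few agent name tokens. The token list, the `identify_spider` helper, and the sample log line are assumptions made for this example, not taken from any particular server. A user agent string can also be forged, so a match is only a hint.

```python
# Assumed agent-name tokens for illustration; check each engine's own
# documentation for the names it actually uses.
KNOWN_SPIDERS = {
    "Googlebot": "Google",
    "Baiduspider": "Baidu",
    "bingbot": "Bing",
}


def identify_spider(log_line):
    """Return the engine name if a known spider token appears in the line, else None."""
    lowered = log_line.lower()
    for token, engine in KNOWN_SPIDERS.items():
        if token.lower() in lowered:
            return engine
    return None


# A made-up access log line in the common "combined" layout.
sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)

print(identify_spider(sample))
```

Running the same function over every line of a log file gives a quick count of which spiders have visited.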
Today Wuhan Shanghai Dragon wants to talk about how search engine spiders work. First, the principle behind a search engine. A search engine stores web content from the Internet on its own servers. When a user searches for a word, the search engine looks for relevant content on its own servers, which means that only pages stored on the search engine's servers can be found. Which pages get stored there? Only the pages that the search engine's spider has fetched. The spider is the search engine's web crawling program. The whole process is divided into crawling and fetching.
When a spider visits any website, it first requests the robots.txt file in the site's root directory. If the robots.txt file forbids search engines from fetching certain files or directories, the spider abides by the protocol and does not fetch the forbidden URLs.
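A minimal sketch of this check, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleSpider` agent name, the `may_fetch` helper, and the example.com URLs are all hypothetical. The rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider is barred from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_fetch(agent, url):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(may_fetch("ExampleSpider", "http://example.com/index.html"))
print(may_fetch("ExampleSpider", "http://example.com/private/data.html"))
```

A well-behaved spider would make this check before every request and skip any URL for which it returns False.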
1, depth-first links
The entire Internet is made up of websites and pages that link to one another. This link structure is very complex, so a spider needs a crawling strategy in order to traverse all the pages.
Breadth-first means that when a spider finds several links on a page, it does not follow one link straight ahead. Instead it first crawls all the first-layer links on that page, and then follows the links found on those second-layer pages to crawl the third layer.
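The breadth-first order can be sketched with a queue. The small link graph and the `crawl_breadth_first` function below are hypothetical; a real spider would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def crawl_breadth_first(start, links):
    """Visit pages layer by layer and return them in visiting order."""
    visited = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()  # oldest discovered page first
        for target in links.get(page, []):
            if target not in visited:
                visited.append(target)
                queue.append(target)
    return visited


print(crawl_breadth_first("A", LINKS))
```

Starting from page A, both first-layer links (B and C) are visited before either of the pages they link to (D and E).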
Depth-first means that when a spider finds a link, it follows that link forward until there are no further links ahead. It then returns to the first page and follows another link, again crawling straight ahead.
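The depth-first order can be sketched with a stack. The link graph and the `crawl_depth_first` function are hypothetical and stand in for real page fetching.

```python
# Hypothetical link graph: each page maps to the links found on it.
SITE_LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}


def crawl_depth_first(start, links):
    """Follow each link as far as it goes before backing up; return visiting order."""
    visited = []
    stack = [start]
    while stack:
        page = stack.pop()  # most recently discovered page first
        if page in visited:
            continue
        visited.append(page)
        # Push in reverse so the first link on the page is followed first.
        for target in reversed(links.get(page, [])):
            if target not in visited:
                stack.append(target)
    return visited


print(crawl_depth_first("A", SITE_LINKS))
```

Starting from page A, the spider follows B all the way down to D before it comes back and follows C.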
In theory, whether the strategy is depth-first or breadth-first, as long as the spider is given enough time it can reach every page that can be reached by following links.
2, breadth-first links
The program a search engine uses to crawl and visit web pages is called a spider, also known as a robot. A spider works much like the browser we normally use to go online: it sends a request to visit a page and can browse the page only after the request is accepted. There is one difference, though. To improve quality and speed, a search engine runs many spiders at once to crawl and fetch pages.
To fetch as many pages on the Internet as possible, a search engine spider follows the links on a page, crawling from one page to the next, just like a spider crawling on a web.
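Following links starts with finding them. The sketch below pulls the `href` targets out of a page with Python's standard `html.parser` module and resolves them against the page's own address. The `LinkCollector` class, the `extract_links` helper, the sample HTML, and the example.com URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute link targets found in the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = '<p><a href="/about.html">About</a> <a href="http://example.org/">Other</a></p>'
print(extract_links(page, "http://example.com/index.html"))
```

Each extracted URL is a candidate for the spider's next visit.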
The simplest crawling strategies are depth-first and breadth-first.
From the perspective of Shanghai dragon