Understanding the difference between the robots.txt file and the Robots Tag is critical for search engine optimization and security. It can also have a profound impact on the privacy of your website and your customers. The first thing to know is what robots.txt files and Robots Tags are.

Robots.txt

Robots.txt is a file you place in your website’s top-level directory, the same folder in which a static homepage would go. Inside robots.txt, you can instruct search engines not to crawl content by disallowing file names or directories. A robots.txt directive has two parts: the user-agent and one or more disallow instructions. The user-agent specifies one or all Web crawlers or spiders. When we think of Web crawlers we tend to think of Google and Bing; however, a spider can come from anywhere, not just search engines, and there are many of them crawling the Internet. Here is a simple robots.txt file telling all Web crawlers that it is okay to spider every page:

User-agent: *
Disallow:

… [Read more...] about Have You Considered Privacy Issues When Using Robots.txt & The Robots Meta Tag?
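For contrast, a minimal sketch of a robots.txt that blocks every crawler from a hypothetical /private/ directory while leaving the rest of the site crawlable (the directory name is an example, not from the article):

```
# Block all user-agents from one directory only
User-agent: *
Disallow: /private/
```

Note that an empty Disallow: line permits everything, while Disallow: / would block the entire site, so the difference of a single slash matters a great deal.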
In the battle between search engines and some mainstream news publishers, ACAP has been lurking for several years. ACAP — the Automated Content Access Protocol — has constantly been positioned by some news executives as a cornerstone to reestablishing the control they feel has been lost over their content. However, the reality is that publishers have more control even without ACAP than is commonly believed by some. In addition, ACAP currently provides no “DRM” or licensing mechanisms over news content. But the system does offer some ideas well worth considering. Below, a look at how it measures up against the current systems for controlling search engines. ACAP started development in 2006 and formally launched a year later with version 1.0 (see ACAP Launches, Robots.txt 2.0 For Blocking Search Engines?). This year, in October, ACAP 1.1 was released and has been installed by over 1,250 publishers worldwide, says the organization, which is backed by the European … [Read more...] about ACAP Versus Robots.txt For Controlling Search Engines
The Robots Exclusion Protocol (REP) is not exactly a complicated protocol and its uses are fairly limited, and thus it’s usually given short shrift by SEOs. Yet there’s a lot more to it than you might think. Robots.txt has been with us for over 14 years, but how many of us knew that in addition to the disallow directive there’s a noindex directive that Googlebot obeys? That noindexed pages don’t end up in the index but disallowed pages do, and the latter can show up in the search results (albeit with less information since the spiders can’t see the page content)? That disallowed pages still accumulate PageRank? That robots.txt can accept a limited form of pattern matching? That, because of that last feature, you can selectively disallow not just directories but also particular filetypes (well, file extensions to be more exact)? That a robots.txt disallowed page can’t be accessed by the spiders, so they can’t read and obey a meta robots tag … [Read more...] about A Deeper Look At Robots.txt
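The limited pattern matching mentioned above can be sketched in a hypothetical robots.txt. In Google's extended syntax, `*` matches any sequence of characters and `$` anchors the end of the URL; the `Noindex:` directive the article refers to was an unofficial feature Googlebot obeyed, not part of the formal REP standard:

```
# Disallow a file extension rather than a directory:
# block Googlebot from any URL ending in .pdf
User-agent: Googlebot
Disallow: /*.pdf$

# Unofficial directive reported to keep pages out of Google's index
# (hypothetical path, for illustration only)
Noindex: /drafts/
```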
Believe it or not, I am not a huge fan of placing robots.txt files on sites unless you want to specifically block content and sections from Google or other search engines. It has always felt redundant to tell a search engine it can crawl your site, since it will do so unless you tell it not to.

Google's JohnMu confirmed this in a Google Webmaster Help thread and even recommended that one webmaster remove their robots.txt file "completely." John said:

I would recommend going even a bit further, and perhaps removing the robots.txt file completely. The general idea behind blocking some of those pages from crawling is to prevent them from being indexed. However, that's not really necessary -- websites can still be crawled, indexed and ranked fine with pages like their terms of service or shipping information indexed (sometimes that's even useful to the user :-)).

I know many SEOs feel it is mandatory to have a robots.txt file and just have it say:

User-agent: *
Allow: /

Why … [Read more...] about Google: Remove The Robots.txt File Completely
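John's point, that an allow-all robots.txt and having no robots.txt rules at all come to the same thing for crawlers, can be illustrated with Python's standard-library robots.txt parser (the example URL is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parser fed the allow-all file many SEOs feel is mandatory
allow_all = RobotFileParser()
allow_all.parse(["User-agent: *", "Allow: /"])

# Parser fed nothing at all -- equivalent to a site with no robots.txt rules
empty = RobotFileParser()
empty.parse([])

url = "https://example.com/terms-of-service"
# Both parsers permit crawling the same URL
print(allow_all.can_fetch("*", url))
print(empty.can_fetch("*", url))
```

Both calls return True, which is the behavior John describes: the file adds nothing that the default crawl permission does not already grant.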
A WebmasterWorld thread reports that Yahoo may not be fully listening to the robots.txt directive that blocks its spider, Yahoo Slurp.

The thing is, Yahoo's spider isn't all that active these days, because Bing now powers much of Yahoo and thus BingBot is most active.

The webmaster said:

Depending on the Host and UA, the official Yahoo! Slurp apparently does whatever it wants to. Note the subtle differences in the subdomains and UAs... This morning, the only Host to read/heed robots.txt was: b3091154.crawl.yahoo.net [18.104.22.168] Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) These retrieved graphics by the pageful, over 60 total: b5101137.yst.yahoo.net [22.214.171.124] b5101139.yst.yahoo.net [126.96.36.199] Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

I am not sure if this is a widespread issue or just a smaller bug. The main question is, should you care if Yahoo is … [Read more...] about Yahoo’s Crawler Not Listening To Robots.txt Directive?