What robots.txt does not do is keep files out of search engine indexes. All it does is instruct search engine spiders not to crawl pages. Keep in mind that discovery and crawling are separate: discovery happens when search engines find links in documents, and once they discover pages, they may or may not add them to their indexes. … [Read more...] about Have You Considered Privacy Issues When Using Robots.txt & The Robots Meta Tag?
Certainly the search engines need to get their act together more, however. It’s time to stop referring people to the REP site, which is run by no one. It’s time to stop having a myriad of help pages scattered across their respective sites. Yes, they should continue to have their own help pages (see Google’s webmaster help and Bing’s). But I’d like to see Google and Microsoft take the lead in also consolidating material into a common site, perhaps building off Sitemaps.org. … [Read more...] about ACAP Versus Robots.txt For Controlling Search Engines
- User-agent: the robot the following rule applies to (e.g. “Googlebot”)
- Disallow: the pages you want to block the bots from accessing (as many Disallow lines as needed)
- Noindex: the pages you want a search engine to block AND not index (or de-index if previously indexed). Unofficially supported by Google; unsupported by Yahoo and Live Search.

Each User-agent/Disallow group should be separated by a blank line; however, no blank lines should exist within a group (between the User-agent line and the last Disallow). The hash symbol (#) may be used for comments within a robots.txt file; everything after # on that line will be ignored. Comments may occupy whole lines or the end of a line. Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are all different to search engines. … [Read more...] about A Deeper Look At Robots.txt
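To illustrate the rules above, here is a hypothetical robots.txt (the bot names and paths are examples only, not taken from any real site), checked with Python’s standard-library `urllib.robotparser`. Note that Python’s parser simply ignores the unofficial Noindex line, and that a bot with its own group (Googlebot here) follows only that group, not the `*` group:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt illustrating the syntax rules described above.
robots_txt = """\
# Block all crawlers from the private directory (whole-line comment)
User-agent: *
Disallow: /private/

# A bot-specific group, separated from the previous one by a blank line
User-agent: Googlebot
Disallow: /drafts/
Noindex: /old-page.html  # unofficially supported by Google only
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The generic (*) group applies to bots without their own group.
print(parser.can_fetch("SomeBot", "/private/page.html"))   # False

# Paths are case-sensitive: /Private/ is not covered by Disallow: /private/
print(parser.can_fetch("SomeBot", "/Private/page.html"))   # True

# Googlebot follows only its own group, so /drafts/ is blocked for it...
print(parser.can_fetch("Googlebot", "/drafts/post.html"))  # False

# ...while /private/ (listed only under *) is not.
print(parser.can_fetch("Googlebot", "/private/page.html")) # True
```

The case-sensitivity check in particular trips people up: a Disallow rule is a simple path-prefix match, so it never covers differently-cased directory names.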
Today, Google, Yahoo!, and Microsoft have come together to post details of how each of them supports robots.txt and the robots meta tag. While their posts use terms like “collaboration” and “working together,” they haven’t joined together to implement a new standard (as they did with sitemaps.org). Rather, they are simply taking a joint public stand that robots.txt is the standard way of blocking search engine robot access to web sites. They have identified a core set of robots.txt and robots meta tag directives that all three engines support: … [Read more...] about Yahoo!, Google, Microsoft Clarify Robots.txt Support
One of the announcements during the week of SES was Ask.com joining Google, MSN, and Yahoo in supporting Sitemaps auto-discovery. This feature allows webmasters to specify the location of their sitemaps within their robots.txt file, eliminating the need to submit sitemaps to each engine separately. Keith Hogan of Ask.com discussed this change and its impact in his presentation. Essentially, a sitemap is a simple XML file that lists URLs, along with information about those URLs, to help spiders do a better job of crawling a site. See www.sitemaps.org for more details. … [Read more...] about Up Close & Personal With Robots.txt
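The auto-discovery mechanism described above is just a Sitemap line in robots.txt. A minimal sketch (the domain and sitemap URL are hypothetical) of how a crawler could read it, again using Python’s standard-library `urllib.robotparser` (the `site_maps()` method requires Python 3.8+):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt declaring a sitemap location for auto-discovery.
# The Sitemap line sits outside any User-agent group and takes a full URL.
robots_txt = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
print(parser.site_maps())  # ['https://www.example.com/sitemap.xml']
```

In practice a crawler would fetch robots.txt from the site root (e.g. via `parser.set_url(...)` and `parser.read()`), then retrieve each listed sitemap URL, which is exactly what makes separate per-engine submission unnecessary.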