HooaaHBot - A Brief Description of HooaaH Bot

How to Identify HooaaH Bot
Presumably, you arrived at this site because you noticed traffic from a User-Agent that identified itself with the string:


HooaaHBot; +http://www.hooaah.com/help/?HooaaHBot

If the IP address was also 50.63.119.1, then you have come to the right place to find out who was probably crawling your site.
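If you want to confirm this from your own server logs, a quick scan for the User-Agent string and IP address is usually enough. The following is a minimal Python sketch; the log path is a hypothetical example and should be adjusted to your server's configuration.

# Scan a web server access log for HooaaH Bot traffic.
# LOG_PATH is a hypothetical example; substitute your server's actual log file.
LOG_PATH = "/var/log/apache2/access.log"
USER_AGENT = "HooaaHBot"
IP_ADDRESS = "50.63.119.1"

with open(LOG_PATH) as log:
    for line in log:
        if USER_AGENT in line or IP_ADDRESS in line:
            print(line.rstrip())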

How HooaaH Bot Typically Crawls a Site
HooaaH Bot is currently run sporadically (not continuously) on a small number of machines. Each machine has about 6-8 fetcher processes, and each fetcher keeps at most 100-300 connections open at any given time. In a typical situation, these connections would not all be to the same host.

How You Can Change How HooaaH Bot Crawls Your Site
HooaaH Bot understands robots.txt files (the file has to be named robots.txt, not robot.txt). A robots.txt file must be placed in the root folder of your website for its instructions to be followed; HooaaH does not look in subfolders for robots.txt files. A simple robots.txt file to block HooaaH Bot from crawling any folders other than the cool_stuff folder and its subfolders might look like:


User-agent: HooaaHBot
Disallow: /
Allow: /cool_stuff/
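To get a rough sense of how a conforming crawler reads a robots.txt file like the one above, you can test individual URLs with Python's standard urllib.robotparser module. This is only a sketch for checking your own rules, not HooaaHBot's actual parser; in particular, the standard library resolves conflicting Allow and Disallow lines by taking the first matching rule in file order, which may differ from HooaaHBot's Allow-preference described later in this document. The www.example.com URLs are placeholders for your own site.

from urllib import robotparser

# Check which paths a crawler identifying as HooaaHBot may fetch,
# according to Python's standard robots.txt parser.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

for url in ("http://www.example.com/cool_stuff/page.html",
            "http://www.example.com/private/page.html"):
    # Note: urllib.robotparser applies the first matching rule in file order,
    # so its answers can differ from HooaaHBot's Allow-over-Disallow preference.
    print(url, rp.can_fetch("HooaaHBot", url))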
HooaaHBot also obeys HTML ROBOTS meta tags whose content attribute contains any of none, noindex, or nofollow. An example HTML page using the noindex, nofollow directives might look as follows:

<!DOCTYPE html>
<html>
<head><title>Meta Robots Example</title>
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />
<!-- The values in the content attribute must be comma-separated; whitespace is ignored. -->
</head>
<body>
<p>Stuff robots shouldn't put in their index.
<a href="/somewhere">A link that nofollow will prevent from being followed</a></p>
</body>
</html>
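As an illustration of how a crawler might act on this tag (a sketch, not HooaaHBot's actual code), the following Python snippet extracts the directives from a page's ROBOTS meta tag using the standard html.parser module:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated directives from any ROBOTS meta tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

page_html = '<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />'
parser = RobotsMetaParser()
parser.feed(page_html)
dont_index = {"noindex", "none"} & parser.directives    # skip adding the page to the index
dont_follow = {"nofollow", "none"} & parser.directives  # skip following links on the page
print(dont_index, dont_follow)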
Within HTML documents, HooaaH Bot honors anchor rel="nofollow" directives. For example, the following link would not be followed by HooaaH Bot:

<a href="/somewhere_else" rel="nofollow" >This link would not be followed by HooaaHBot</a> 

More Specifics on robots.txt and Meta Tag Handling
When processing a robots.txt file, if Disallow and Allow lines are in conflict, HooaaHBot gives preference to the Allow directive over the Disallow directive, since the default behavior of robots.txt is to allow everything except what is explicitly disallowed.
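A minimal sketch of this precedence rule (an illustration of the behavior described above, not HooaaHBot's actual implementation) might look like:

def is_allowed(path, allow_paths, disallow_paths):
    # Allow wins over Disallow when both match a path.
    if any(path.startswith(prefix) for prefix in allow_paths):
        return True
    if any(path.startswith(prefix) for prefix in disallow_paths):
        return False
    # Default: everything not explicitly disallowed is allowed.
    return True

# With the example robots.txt given earlier:
print(is_allowed("/cool_stuff/page.html", ["/cool_stuff/"], ["/"]))  # True
print(is_allowed("/private/page.html", ["/cool_stuff/"], ["/"]))     # False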
If a webpage has a noindex meta tag, then it won't show up in search results, provided that HooaaH has actually downloaded the page. If HooaaH hasn't downloaded the page, or is forbidden from downloading it by a robots.txt file, a link to the page can still show up in search results. This can happen if another page links to the given page and HooaaH has extracted that link and its text and used them in search results. You can check whether a URL has been downloaded by typing the query info:URL into HooaaH and examining the results.
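For example, a query of the following form (www.example.com is a placeholder for the page you want to check) would show whether that URL is in HooaaH's index:

info:http://www.example.com/cool_stuff/page.html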
When processing a robots.txt file, HooaaHBot first looks for User-agent blocks addressed to HooaaHBot and extracts all of the Allow and Disallow paths listed in them. If any such blocks exist, these paths are what HooaaHBot uses to restrict its access to your site. If there are no HooaaHBot-specific blocks, it falls back to the "User-Agent: *" blocks instead; so if you have a "User-Agent: *" block followed by allow and disallow rules, and no blocks for HooaaHBot, then those paths are what HooaaHBot uses and honors.
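For example, in a robots.txt file such as the following, HooaaHBot would use the paths in the HooaaHBot block and ignore the * block; if the HooaaHBot block were removed, it would fall back to the paths in the * block:

User-agent: *
Disallow: /private/

User-agent: HooaaHBot
Disallow: /
Allow: /cool_stuff/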
Per the Sitemap specification, Sitemap directives are not associated with any particular User-agent, so HooaaH processes any such directive it finds, to the extent that it processes Sitemaps at all.
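A Sitemap directive can appear anywhere in a robots.txt file and simply points at the sitemap's URL (the URL below is a placeholder):

Sitemap: http://www.example.com/sitemap.xml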

How Quickly Does HooaaH Bot Change Its Behavior
When HooaaH machines crawl for longer than one day, they cache each site's robots.txt file. For 24 hours they use the cached directives rather than re-requesting the file, after which they request the robots.txt file again. So if you change your robots.txt file, it might take up to a day before the changes are noticed by the HooaaH crawler.
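The caching behavior described above amounts to a simple 24-hour time-to-live cache. A rough sketch (an illustration, not HooaaHBot's actual code) might look like:

import time

ROBOTS_CACHE_TTL = 24 * 3600  # seconds in the 24-hour caching window
robots_cache = {}  # host -> (time fetched, parsed directives)

def get_robots_directives(host, fetch_and_parse):
    """Return cached directives for host, re-requesting robots.txt once a day."""
    cached = robots_cache.get(host)
    if cached and time.time() - cached[0] < ROBOTS_CACHE_TTL:
        return cached[1]                # reuse the cached directives
    directives = fetch_and_parse(host)  # re-request robots.txt after 24 hours
    robots_cache[host] = (time.time(), directives)
    return directives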

Adding Your Site to HooaaH Search Engine
Currently we are working on a self-service system to allow you to add your site to HooaaH. Please check back frequently, as we are working hard to finish this feature.