Crawl, index and search

Sometimes you need to search for files or pages on content-rich websites or browser-based information software/encyclopedias that don’t really have a search function, and finding what you are looking for can be a pain. I once wrote a little crawler in Python that works well for searching websites on the fly. But sometimes a real “index” is needed for searching. There are a few libraries available, among them the open source Apache Lucene, an excellent high-performance text search engine library that one can use for free. Couple Lucene with a multi-threaded web crawler and you have pretty good index and search functionality; not as good as Google, but close.

Below is an example of how you can use Lucene to build searchable indexes for websites.
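To make the idea of an index concrete, here is a minimal pure-Python sketch of an inverted index, the kind of structure Lucene maintains in a far more sophisticated form. It is not Lucene's API: the `pages` dict, its URLs, and the `tokenize` and `build_index` helpers are all made up for illustration, and the page text is assumed to have been fetched by a crawler already.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every term to the set of URLs whose text contains it.

    `pages` is a hypothetical dict of URL -> page text, standing in for
    whatever the crawler fetched.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


# Made-up sample pages, for illustration only.
pages = {
    "http://example.com/a": "Lucene is a text search library",
    "http://example.com/b": "A crawler fetches pages to index",
}
index = build_index(pages)
print(sorted(index["index"]))
```

The point of building the index up front is that a later lookup is a dictionary access per term rather than a fresh crawl of every page.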

Once the index is created, we can start searching it for content. Lucene provides the IndexSearcher class, which searches the index using a Query. Below is an example that searches the index created above and prints the URL of each website where the required content is found.
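As a rough illustration of what a search over such an index boils down to, here is a small Python sketch that looks up every query term and prints the URLs containing all of them. It is a simplified stand-in, not Lucene's IndexSearcher or Query: there is no ranking or query syntax, and the `search` helper and the sample `index` data are assumptions made for the example.

```python
import re


def search(index, query):
    """Return the URLs that contain every term of the query (AND semantics)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    # Intersect the URL sets of all query terms; unknown terms match nothing.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical term -> URLs index, as a crawler-fed indexer might produce.
index = {
    "lucene": {"http://example.com/a"},
    "search": {"http://example.com/a", "http://example.com/c"},
    "crawler": {"http://example.com/b"},
}

for url in search(index, "Lucene search"):
    print(url)
```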

The full source of this example, including the web crawler, can be found here and is available under the GPL.

Project migration from Sourceforge to Googlecode

I have been using googlecode for some of my recent open source development work, and I was surprised how much googlecode speeds up development. The SCM is very fast and gave me no trouble, and it is easy to create wiki pages and documentation for projects. It offers limited features compared to sourceforge, for example, but its real power is in its simplicity. Sourceforge offers more features, like web page hosting and shell services, and if you are smart you can also create your own little maven repository for your artifacts; one might argue that all these features make sourceforge very complex. But recently sourceforge has become slow as hell, and it is a bit of a pain to manage your work there: the SCM is slow, web pages are not served at a reasonable speed, and the shell services, although more secure, are slower, with the whole shell-creation process taking too long. So, to cut the story short, I finally decided to migrate some of my work from sourceforge to googlecode, simply because googlecode is faster and simpler.

In the beginning I had no clue how to go about this, but it was much simpler than I anticipated. My only concern was to get the code migrated fully, safely and with all the version history. This is done by syncing the project’s SVN repository on googlecode with the repository on sourceforge. First I reset the googlecode repository to enable svn syncing. This is done under the Administrator->Source tab on your project’s homepage on googlecode. Then I began the syncing process.

The first step is to initialize the googlecode Subversion repository as a mirror of the sourceforge one, which is what svnsync’s init subcommand does.

After this, we just start syncing the repositories with svnsync’s sync subcommand.
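Both steps, initialization and synchronization, can be sketched as a small Python wrapper around the svnsync command line. This is only an illustration: the two repository URLs are placeholders rather than real projects, the `svnsync_commands` and `mirror` helpers are made up for this sketch, and with `dry_run` left on the commands are only printed, never executed.

```python
import subprocess


def svnsync_commands(dest_url, source_url):
    """Build the two svnsync invocations: init the mirror, then sync it."""
    return [
        ["svnsync", "init", dest_url, source_url],
        ["svnsync", "sync", dest_url],
    ]


def mirror(dest_url, source_url, dry_run=True):
    """Print the svnsync commands in order and, unless dry_run, run them."""
    for cmd in svnsync_commands(dest_url, source_url):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


# Placeholder URLs; substitute your own project's repositories.
DEST = "https://myproject.googlecode.com/svn"
SOURCE = "https://myproject.svn.sourceforge.net/svnroot/myproject"
mirror(DEST, SOURCE)
```

Note that the destination repository comes first in both commands, and that the sync step only needs the destination, since the init step records where the source lives.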

This command fetches all the code along with the version history, including tags and branches.

And that is all there is to it. For more information on svnsync, refer to the Subversion book (the red-bean book).