Crawl, index and search

Sometimes you need to search for files or pages on content-rich websites, or in browser-based reference software and encyclopedias, that don't really have any search functionality, and finding what you are looking for can be a pain. I once wrote a little crawler in Python that works well for searching websites on the fly, but sometimes a real “index” is needed. There are a few libraries available for building one, and among them is the open source Apache Lucene, an excellent high-performance text search engine library that you can use for free. Couple Lucene with a multi-threaded web crawler and you get pretty good index-and-search functionality; not as good as Google, but close.
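To give an idea of the crawling half, below is a minimal sketch of a breadth-first crawler in Java. It is single-threaded for clarity (a multi-threaded version would just run the fetch in a thread pool), the regex-based link extraction is deliberately naive, and the class and method names are made up for illustration; this is a sketch, not the actual crawler from the full source.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal breadth-first crawler sketch (hypothetical helper class)
public class SimpleCrawler {

    // Naive link extraction; fine for a sketch, not for production
    private static final Pattern LINK = Pattern.compile( "href=\"(http[^\"]+)\"" );

    public static void crawl( String seed, int maxPages ) throws IOException {
        Queue<String> frontier = new LinkedList<String>();
        Set<String> seen = new HashSet<String>();
        frontier.add( seed );
        while( !frontier.isEmpty() && seen.size() < maxPages ) {
            String url = frontier.poll();
            if( !seen.add( url ) ) continue; // skip already-visited URLs
            String html = fetch( url );
            // hand (url, html) over to the Lucene indexer here
            Matcher m = LINK.matcher( html );
            while( m.find() ) frontier.add( m.group( 1 ) );
        }
    }

    private static String fetch( String url ) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader( new URL( url ).openStream() ) );
        try {
            String line;
            while( ( line = in.readLine() ) != null ) sb.append( line ).append( '\n' );
        } finally {
            in.close();
        }
        return sb.toString();
    }
}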

Below is an example of how you can use Lucene to build searchable indexes for websites.

// Create an index directory
FSDirectory dir = FSDirectory.open( new File( "C:/test/myindex" ) );

// Use the IndexWriter to write text documents to the above directory
// (the 'true' flag creates a new index, overwriting any existing one)
IndexWriter writer = new IndexWriter( dir, new StandardAnalyzer( Version.LUCENE_CURRENT ),
        true, IndexWriter.MaxFieldLength.LIMITED );
// ... (crawl loop runs here) ...
// For each crawled URL, create a document and add it to the index.
// You can add as many fields as you want.
Document doc = new Document();
doc.add( new Field( "contents", new StringReader( "...the content..." ) ) ); // indexed and tokenized, but not stored
doc.add( new Field( "url", "http://the-crawled-url", Field.Store.YES, Field.Index.NOT_ANALYZED ) );
doc.add( new Field( "mime", "text/plain", Field.Store.YES, Field.Index.NOT_ANALYZED ) );

writer.addDocument( doc );

writer.commit(); // Commit changes

writer.close(); // Close the writer once crawling is finished
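One thing to note: the "contents" field above takes a Reader, and what the crawler fetches is raw HTML, so you will usually want to strip the markup before indexing or tag names will pollute the index. A crude sketch (a real implementation would use a proper HTML parser; the regex below is only for illustration, and html is assumed to hold the fetched page):

// Strip tags from crawled HTML before indexing (simplistic, for illustration only)
String text = html.replaceAll( "<[^>]+>", " " );
doc.add( new Field( "contents", new StringReader( text ) ) );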

Once the index is created, we can start searching it for content. Lucene provides the IndexSearcher class, which searches the index using a Query. Below is an example that searches the index created above and prints the URL of each website where the content is found.

// Open the index directory
FSDirectory index = FSDirectory.open( new File( "C:/test/myindex" ) );

// Create a search query
String querystr = "hello world";
Query q = new QueryParser( Version.LUCENE_CURRENT, "contents",
new StandardAnalyzer( Version.LUCENE_CURRENT ) ).parse( querystr );

int hitsPerPage = 10; // Maximum number of results to collect (used for pagination)
IndexSearcher searcher = new IndexSearcher( index, true );
TopScoreDocCollector collector = TopScoreDocCollector.create( hitsPerPage, true );
searcher.search( q, collector );
ScoreDoc[] hits = collector.topDocs().scoreDocs;

// For each result, print the URL of the page where the content was found
for( int i = 0; i < hits.length; ++i ) {
    int docId = hits[i].doc;
    Document d = searcher.doc( docId ); // fetch the stored fields for this hit
    System.out.println( ( i + 1 ) + ". " + d.get( "url" ) );
}
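Since hitsPerPage is meant for pagination, here is a quick sketch of how a later page of results could be pulled out with the same collector API. This is my own illustration rather than code from the full source, and it assumes the collector is sized large enough to cover every page up to the one requested.

// Fetch page 2 of the results (sketch; pages are 1-based here)
int page = 2;
TopScoreDocCollector pager = TopScoreDocCollector.create( page * hitsPerPage, true );
searcher.search( q, pager );
ScoreDoc[] pageHits = pager.topDocs( ( page - 1 ) * hitsPerPage, hitsPerPage ).scoreDocs;

searcher.close(); // Release the searcher when done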

The full source of this example, including the web crawler, can be found here and is available under GPL.
