Sometimes you need to search for files or pages on content-rich websites, or in browser-based information software and encyclopedias, that don't offer any real search functionality, and it can be a pain to find what you are looking for. I once wrote a little crawler in Python that works well for searching websites on the fly, but sometimes a real "index" is needed for searching. A few libraries are available for this, and among them is the open source Apache Lucene, an excellent high-performance text search engine library that is free to use. Couple Lucene with a multi-threaded web crawler and you get pretty good index-and-search functionality; not as good as Google, but close.
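The crawler itself is outside the scope of this post, but to make the coupling concrete, here is a minimal sketch of what a multi-threaded fetch loop might look like. MiniCrawler and the indexPage() helper are hypothetical names of my own, and a real crawler would also extract and follow links, respect robots.txt, and so on.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MiniCrawler {

    public static void main( String[] args ) throws Exception {
        // Seed URLs; a real crawler would extract links from each fetched
        // page and feed them back into the work queue.
        List<String> seeds = Arrays.asList( "http://example.com/", "http://example.org/" );

        ExecutorService pool = Executors.newFixedThreadPool( 4 );
        for ( final String url : seeds ) {
            pool.submit( () -> {
                try {
                    // Fetch the page body
                    StringBuilder page = new StringBuilder();
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader( new URL( url ).openStream() ) );
                    for ( String line; ( line = in.readLine() ) != null; ) {
                        page.append( line ).append( '\n' );
                    }
                    in.close();
                    indexPage( url, page.toString() ); // hand off to the Lucene indexer
                } catch ( Exception e ) {
                    System.err.println( "Failed to fetch " + url + ": " + e );
                }
            } );
        }
        pool.shutdown();
    }

    // Hypothetical helper: would build a Document and call writer.addDocument(),
    // exactly as in the indexing example below. IndexWriter is thread-safe, so
    // all worker threads can share a single instance.
    static void indexPage( String url, String contents ) {
        // ...
    }
}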
Below is an example of how you can use Lucene to build searchable indexes for websites.
// Create an index directory
FSDirectory dir = FSDirectory.open( new File( "C:/test/myindex" ) );

// Use the IndexWriter to write text documents to the above directory
IndexWriter writer = new IndexWriter( dir,
        new StandardAnalyzer( Version.LUCENE_CURRENT ),
        true, IndexWriter.MaxFieldLength.LIMITED );
...
// For each crawled URL, create a document and add it to the index.
// You can add as many fields as you want.
Document doc = new Document();
doc.add( new Field( "contents", new StringReader( "...the content..." ) ) );
doc.add( new Field( "url", "http://the-crawled-url", Field.Store.YES, Field.Index.NOT_ANALYZED ) );
doc.add( new Field( "mime", "text/plain", Field.Store.YES, Field.Index.NOT_ANALYZED ) );
writer.addDocument( doc );
writer.commit(); // Commit changes
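One detail worth noting in the snippet above: the Field( String, Reader ) constructor used for "contents" indexes the text but does not store it, so the page body itself cannot be read back from the index; only the stored fields ("url" and "mime") can. Also, once the crawl is finished the writer should be closed so pending changes are flushed and the write lock is released, roughly:

writer.optimize(); // optional in Lucene 3.x: merge segments for faster searches
writer.close();    // flush pending documents and release the index write lock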
Now, once the index is created, we can start searching it for content. Lucene provides the IndexSearcher class, which searches the index using a Query. Below is an example that searches the index created above and prints the URL of each page where the queried content is found.
// Open the index directory
FSDirectory index = FSDirectory.open( new File( "C:/test/myindex" ) );

// Create a search query
String querystr = "hello world";
Query q = new QueryParser( Version.LUCENE_CURRENT, "contents",
        new StandardAnalyzer( Version.LUCENE_CURRENT ) ).parse( querystr );

int hitsPerPage = 10; // Used for pagination
IndexSearcher searcher = new IndexSearcher( index, true );
TopScoreDocCollector collector = TopScoreDocCollector.create( hitsPerPage, true );
searcher.search( q, collector );
ScoreDoc[] hits = collector.topDocs().scoreDocs;

// For each result print the URL
for( int i = 0; i < hits.length; ++i ) {
    int docId = hits[i].doc;
    Document d = searcher.doc( docId );
    System.out.println( ( i + 1 ) + ". " + d.get( "url" ) );
}
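The collector only keeps the top hitsPerPage results, so paging deeper into the result set means sizing the collector to cover the last page you want and then taking a slice of it. A small sketch, assuming the same q, searcher, and hitsPerPage as above:

int page = 2; // 1-based page number to display
TopScoreDocCollector pageCollector = TopScoreDocCollector.create( page * hitsPerPage, true );
searcher.search( q, pageCollector );
// Slice out just the hits belonging to the requested page
ScoreDoc[] pageHits = pageCollector.topDocs( ( page - 1 ) * hitsPerPage, hitsPerPage ).scoreDocs;

When you are done searching, searcher.close() releases the underlying index files.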
The full source of this example, including the web crawler, can be found here and is available under the GPL.