November 13, 2006

Component-Oriented Search

In my first blog entry I introduced the overall vision of Merobase and highlighted some of the features we’re planning to introduce in the coming months. Today, I’d like to elaborate on what we mean when we characterize Merobase as “component-oriented” and explain what benefits we think this offers our users. A recent comparison of three well-known code search engines by Nik Cubrilovic on TechCrunch provides a good starting point. In his post, Nik describes how he tried to find an implementation of a well-known algorithm using Google Code Search (GCS), Krugle and Koders:

To test Google Code Search out against both Krugle and Koders, I ran a search for “md5 in C”, hoping to find an implementation of the MD5 hash algorithm in C. In Google, I can specify the implementation language I would like in the search query, while in both Krugle and Koders I needed to select the language from a drop down. Krugle and Koders didn’t seem to filter the results based on language too well as they both had results that were implementations in other languages. One problem here is that the search engines don’t actually know you are looking for a simple implementation of md5, they are just string-matching against their indexes so you get some very poor results (such as functions that call an MD5 library). Across the 3 search engines, I could not find a good, pure MD5 implementation – just a lot of header files and functions that had the string ‘md5’ within them.

This highlights the basic problem with “conventional” code search engines, which simply look for strings in the text of source code: they make no distinction between modules that use the abstraction the developer is looking for and modules that implement it. They simply return code modules that contain the search string. And since modules that use an abstraction often mention its name (e.g. “MD5”) more frequently than modules that implement it, “using” modules tend to be ranked higher than “implementing” modules. This is why Nik couldn’t find a “good, pure MD5 implementation” using GCS, Krugle or Koders.
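The ranking effect described above can be reproduced in miniature. The following Python sketch is purely illustrative: the two-file corpus and the function name rank_by_frequency are invented for this example, and it is not how GCS, Krugle or Koders actually rank results.

```python
# Hypothetical two-file corpus; names and contents are invented for illustration.
corpus = {
    # The module that implements the abstraction mentions its name once.
    "md5.c": "void md5(const unsigned char *data, size_t len, unsigned char out[16]) { /* rounds */ }",
    # The module that merely uses it mentions the name several times.
    "login.c": "md5(pw, n, h1); md5(salt, m, h2); /* compare md5 digests */ if (md5_equal(h1, h2)) ...",
}


def rank_by_frequency(corpus, term):
    """Rank modules by how often the search term occurs in their text."""
    term = term.lower()
    scored = [(text.lower().count(term), name) for name, text in corpus.items()]
    # Highest count first; modules that never mention the term are dropped.
    return [name for count, name in sorted(scored, reverse=True) if count > 0]


print(rank_by_frequency(corpus, "md5"))
```

In this toy corpus the calling module comes out ahead of the implementing one, simply because it repeats the name more often.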

Merobase takes a fundamentally different approach. When crawling for code, Merobase’s analysis software identifies the basic abstraction implemented by a module and stores it in a language-agnostic description format. The most important element of the description is the abstraction’s name, but other key aspects of the abstraction, such as its methods and their signatures, are also stored. By storing abstract representations of software modules in this way, Merobase is able to support more sophisticated, service-oriented queries. The simplest of these is a name-based query, which searches for abstractions whose names match the user’s search string. For example, a query consisting simply of the string “MD5” (with no quotes) is treated as a name-based query that returns code modules implementing abstractions named “MD5”. Since this name is fairly distinctive, the returned modules are mainly “pure implementations” of the MD5 algorithm.
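To make the idea of a language-agnostic description more concrete, here is a minimal Python sketch. The record layout (Abstraction, Method), the sample index and the name_query function are all assumptions made for illustration; the actual Merobase description format is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    params: tuple   # parameter types, e.g. ("byte[]",)
    returns: str


@dataclass(frozen=True)
class Abstraction:
    name: str        # the most important element of the description
    language: str    # kept as metadata; the description itself is language-neutral
    methods: tuple = ()


# Hypothetical index entries, invented for this example.
index = [
    Abstraction("MD5", "Java", (Method("digest", ("byte[]",), "byte[]"),)),
    Abstraction("LoginService", "Java", (Method("checkPassword", ("String",), "boolean"),)),
    Abstraction("md5", "C", (Method("md5", ("char*", "int"), "char*"),)),
]


def name_query(index, name):
    """Return abstractions whose own name matches the query (case-insensitive here)."""
    wanted = name.lower()
    return [a for a in index if a.name.lower() == wanted]


for hit in name_query(index, "MD5"):
    print(hit.language, hit.name)
```

LoginService may well call MD5 somewhere in its source, but because the match is made on the name of the abstraction rather than on the text of the module, it is not returned.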

If the desired abstraction is essentially a function, the name can be augmented with a list of “in” and “out” parameters (or return values) to form a so-called function-oriented query. If the desired abstraction is essentially an object, the name can be augmented with a list of method names and signatures to form a so-called object-oriented query. In both cases Merobase matches as many elements of the requested abstraction as possible and ranks the candidate modules according to their “closeness of fit”. This allows users to search for abstractions or services according to the interface or API that they wish to use.
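One simple way to picture “closeness of fit” for an object-oriented query is the fraction of requested method signatures that a candidate provides. The Python sketch below uses that measure as an assumption; it is not Merobase’s actual ranking function, and the candidate names and signatures are invented.

```python
def closeness(requested, candidate):
    """Fraction of the requested signatures that the candidate offers."""
    if not requested:
        return 0.0
    return len(set(requested) & set(candidate)) / len(requested)


def rank(query_methods, candidates):
    """Score every candidate, drop non-matches, best fit first."""
    scored = [(closeness(query_methods, methods), name)
              for name, methods in candidates.items()]
    scored = [s for s in scored if s[0] > 0]
    # Sort by descending score, then by name for a stable order.
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Each signature is (method name, parameter types, return type).
query = [("push", ("Object",), "void"), ("pop", (), "Object")]

# Hypothetical candidates, invented for this example.
candidates = {
    "ArrayStack": [("push", ("Object",), "void"), ("pop", (), "Object"), ("size", (), "int")],
    "Queue": [("enqueue", ("Object",), "void"), ("dequeue", (), "Object")],
    "PushOnlyLog": [("push", ("Object",), "void")],
}

print(rank(query, candidates))
```

A candidate that offers every requested method scores 1.0, a partial match scores lower, and a module with no matching signature is left out altogether.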

Because we believe most searches will be for implementations of abstractions rather than for example uses of them, the simplest search strings in Merobase default to abstraction-oriented searches. However, we recognize that the ability to search for code modules that use a particular abstraction is also useful. For example, a user may want to see examples of how an abstraction is invoked, or may simply wish to identify code that uses a given module in order to sort out a quality or licensing issue. To request a conventional “string-matching” search, simply enclose the string in quotes. For example, the query "MD5" (typed with the quotes) is treated as a simple string-search query that returns code modules containing the string MD5 in their source code, ranked according to frequency of occurrence.
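The distinction between the two query forms can be sketched as a small dispatch step. This is only an illustration in Python: the function name classify_query and the returned labels are hypothetical, and the sketch recognizes straight double quotes only.

```python
def classify_query(query):
    """Decide how a raw search string should be interpreted.

    Quoted   -> conventional string search over source text.
    Unquoted -> name-based (abstraction-oriented) search, the default.
    """
    q = query.strip()
    if len(q) >= 2 and q[0] == '"' and q[-1] == '"':
        return ("string", q[1:-1])
    return ("name", q)


print(classify_query('MD5'))
print(classify_query('"MD5"'))
```

Keeping the abstraction-oriented form as the unquoted default means the shortest query a user can type is the one that looks for implementations.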

We believe this range of different query options, with the default focusing on “implementing” rather than “using” modules, provides the optimal way of harvesting the software resources available on the Internet and of maximizing the benefit that can be gained from software search engines.

October 11, 2006

Google Code Search vs Merobase

It was interesting to witness the buzz generated by the launch of the Google Code Search engine earlier this week, and with it the new awareness of the role that harvesting software from open source code repositories can play in mainstream software development. We at Merotronics have been convinced of the potential of open source code search for a long time, and since July have offered our own online search engine, merobase.com, which can be used to find reusable software assets.

Merobase populates its indices just like Google Code Search and similar search engines such as krugle.com and koders.com, namely by crawling publicly hosted code repositories such as CVS archives. However, in contrast with Google Code Search and the other search engines, merobase follows a more component-oriented approach to software retrieval. This means that instead of simply searching for strings in the text of a code module, as if it were a web page or Word document, merobase allows users to search for reusable components as full abstractions. This allows users to harvest software assets in terms of their interfaces rather than in terms of the way they are implemented. Of course, searches can still be constrained to particular implementations, but the retrieval of software assets is driven by their level of support for a service rather than by the contents of their source code.

Take for example a developer who has identified the need for a ShoppingCart component in her E-commerce application. Once she has nailed down the interface to the component in terms of operation signatures she can use merobase to harvest components that match the desired profile. This not only increases the likelihood of finding a suitable component, it also opens up the search to multiple implementation languages (e.g. Java, C#, C++ etc.) and non-code assets such as Web Services.

Merobase already has the largest online index of Java and C# classes, and is the only search engine to support searches for web services. Over the next few months we plan to roll out numerous innovations from our research pipeline aimed at boosting this component-oriented approach to software search. These include advanced ranking algorithms that sort search results according to software metrics rather than textual relevance measures, plugins that integrate support for component harvesting into mainstream development and modeling environments, and semantic search mechanisms that allow components to be retrieved based on what they actually do rather than merely on the syntactic form of their interfaces. Over the next few weeks I plan to elaborate on these technologies in more detail, but in the meantime, check out merobase and let us know what you think!

Until next time!

Colin Atkinson

Chief Scientist

Merotronics

colin.atkinson@merotronics.com