Component-Oriented Search
In my first blog entry I introduced the overall vision of Merobase and highlighted some of the features we’re planning to introduce in the coming months. Today, I’d like to elaborate on what we mean when we characterize Merobase as “component-oriented” and explain the benefits we think this offers our users. A recent comparison of three well-known code search engines by Nik Cubrilovic on TechCrunch provides a good starting point. In his post, Nik describes how he tried to find an implementation of a well-known algorithm using Google Code Search (GCS), Krugle and Koders:
To test Google Code Search out against both Krugle and Koders, I ran a search for “md5 in C”, hoping to find an implementation of the MD5 hash algorithm in C. In Google, I can specify the implementation language I would like in the search query, while in both Krugle and Koders I needed to select the language from a drop down. Krugle and Koders didn’t seem to filter the results based on language too well as they both had results that were implementations in other languages. One problem here is that the search engines don’t actually know you are looking for a simple implementation of md5, they are just string-matching against their indexes so you get some very poor results (such as functions that call an MD5 library). Across the 3 search engines, I could not find a good, pure MD5 implementation – just a lot of header files and functions that had the string ‘md5’ within them.
This highlights the basic problem with “conventional” code search engines, which simply look for strings in the text of source code: they make no distinction between modules that use the abstraction the developer is looking for and modules that implement it. They just return every code module that contains the search string. And since a module that uses an abstraction often mentions its name (e.g. “MD5”) more frequently than the module that implements it, “using” modules tend to be ranked higher than implementing modules. This is why Nik couldn’t find a “good, pure MD5 implementation” using GCS, Krugle or Koders.
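The ranking effect described above is easy to reproduce with a toy example. The sketch below is not how GCS, Krugle or Koders actually rank results; it assumes a plain occurrence count, and the two "modules" in the corpus (md5.c and login.c) are invented for illustration.

```python
# Toy corpus: module name -> source text. Both entries are hypothetical.
CORPUS = {
    # The module that implements the abstraction mentions its name once.
    "md5.c": "void md5(const char *msg, char *digest) { /* rounds of the hash */ }",
    # A module that merely uses it mentions the name three times.
    "login.c": "md5(user, a); md5(pass, b); if (cmp(a, b)) md5(salt, c);",
}


def rank_by_frequency(corpus, term):
    """Return the modules containing `term`, most occurrences first."""
    term = term.lower()
    counts = {name: text.lower().count(term) for name, text in corpus.items()}
    hits = [name for name, n in counts.items() if n > 0]
    return sorted(hits, key=lambda name: counts[name], reverse=True)


print(rank_by_frequency(CORPUS, "md5"))  # ['login.c', 'md5.c']
```

The caller outranks the implementation simply because it repeats the name more often.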
Merobase takes a fundamentally different approach. When crawling for code, Merobase’s analysis software identifies the basic abstraction implemented by a module and stores it in a language-agnostic description format. The most important element of the description is the abstraction’s name, but other key aspects of the abstraction, such as its methods and their signatures, are also stored. By storing abstract representations of software modules in this way, Merobase is able to support more sophisticated, service-oriented queries. The simplest of these is a name-based query, which searches for abstractions whose names match the user’s search string. For example, a query consisting simply of the string “MD5” (with no quotes) is treated as a name-based query that returns code modules implementing abstractions named “MD5”. Since this name is fairly distinctive, the returned modules are mainly “pure implementations” of the MD5 algorithm.
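To make the idea of a stored abstraction description concrete, here is a minimal sketch. The field names, the sample index entries and the case-insensitive comparison are all assumptions made for illustration; the text above does not specify Merobase's actual description format or matching rules.

```python
from dataclasses import dataclass, field


@dataclass
class Method:
    name: str
    params: tuple  # parameter type names, e.g. ("byte[]",)
    returns: str


@dataclass
class Abstraction:
    """Hypothetical language-agnostic description of one code module."""
    name: str
    language: str
    methods: list = field(default_factory=list)


# Invented index entries standing in for crawled modules.
INDEX = [
    Abstraction("MD5", "C", [Method("digest", ("char*",), "char*")]),
    Abstraction("MD5", "Java", [Method("digest", ("byte[]",), "byte[]")]),
    Abstraction("LoginHandler", "Java",
                [Method("check", ("String", "String"), "boolean")]),
]


def name_query(index, name):
    """Return abstractions whose name equals the search string (case ignored)."""
    wanted = name.lower()
    return [a for a in index if a.name.lower() == wanted]


print([(a.name, a.language) for a in name_query(INDEX, "md5")])
# [('MD5', 'C'), ('MD5', 'Java')]
```

LoginHandler is not returned even though its source might call MD5, because the query is matched against the name of the implemented abstraction rather than against the module's text.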
If the desired abstraction is essentially a function, the name can be augmented with the list of “in” and “out” parameters (or return values) to form a so-called function-oriented query. If the desired abstraction is essentially an object, the name can be augmented with a list of method names and signatures to form a so-called object-oriented query. In both cases Merobase matches as many of the elements of the requested abstraction as possible and ranks the results according to their “closeness of fit”. This allows users to search for abstractions or services according to the interface or API that they wish to use.
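One simple way to picture “closeness of fit” is as the fraction of requested elements that a candidate matches. The scoring formula, the signature encoding and the candidate modules below are assumptions chosen for illustration, not Merobase's actual ranking. A function-oriented query is the special case with a single signature.

```python
# A signature is encoded as (method name, parameter types, return type).
QUERY_NAME = "Stack"
QUERY_METHODS = {("push", ("int",), "void"), ("pop", (), "int")}

# Invented candidates: abstraction name -> set of method signatures.
CANDIDATES = {
    "ArrayStack": {("push", ("int",), "void"), ("pop", (), "int")},
    "Stack": {("push", ("int",), "void"), ("pop", (), "int"),
              ("size", (), "int")},
    "Queue": {("push", ("int",), "void")},
}


def closeness(query_name, query_methods, cand_name, cand_methods):
    """Fraction of requested elements (the name plus each signature) matched."""
    matched = 1 if query_name.lower() == cand_name.lower() else 0
    matched += len(set(query_methods) & set(cand_methods))
    return matched / (1 + len(query_methods))


def rank(candidates, name, methods):
    """Return (score, candidate name) pairs, best fit first; drop zero scores."""
    scored = [(closeness(name, methods, n, m), n) for n, m in candidates.items()]
    scored = [(s, n) for s, n in scored if s > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


for score, name in rank(CANDIDATES, QUERY_NAME, QUERY_METHODS):
    print(f"{name}: {score:.2f}")
# Stack: 1.00
# ArrayStack: 0.67
# Queue: 0.33
```

A candidate with a different name but the right interface (ArrayStack) still ranks above one that shares only a single method, which is the behavior a partial-match ranking is meant to give.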
Because we believe most searches will be for implementations of abstractions rather than for example uses of them, the simplest search strings in Merobase default to abstraction-oriented searches. However, we recognize that the ability to search for code modules that use a particular abstraction is also useful. For example, a user may want to see examples of how an abstraction is invoked, or may simply wish to identify code that uses a given module in order to sort out a quality or licensing issue. To request a conventional “string-matching” search, simply enclose the string in quotes. For example, the query “MD5” (with the quotes included) is treated as a simple string-search query that returns code modules containing the string “MD5” in their source code, ranked according to frequency of occurrence.
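The quoting rule can be sketched as a small dispatch step in front of the two kinds of search. The function name, the returned labels and the acceptance of both straight and curly quotes are assumptions for illustration; they are not taken from Merobase itself.

```python
def classify_query(raw):
    """Return (kind, term): 'string' for a quoted query, 'abstraction' otherwise."""
    q = raw.strip()
    # Accept straight quotes and typographic quotes as delimiters (an assumption).
    for open_q, close_q in (('"', '"'), ("“", "”")):
        if len(q) >= 2 and q.startswith(open_q) and q.endswith(close_q):
            return ("string", q[1:-1])
    return ("abstraction", q)


print(classify_query("MD5"))    # ('abstraction', 'MD5')
print(classify_query('"MD5"'))  # ('string', 'MD5')
```

Making the unquoted form the default means the shortest query a user can type goes to the search for implementations.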
We believe this range of different query options, with the default focusing on “implementing” rather than “using” modules, provides the optimal way of harvesting the software resources available on the Internet and of maximizing the benefit that can be gained from software search engines.