Merobase Data Sets

Welcome to the Merobase Data Sets

This page provides data that we have collected during our research on large-scale analysis of source code. The data sets are available for other researchers and individuals to use. The data is provided as-is. Please refer to the terms of use that come with each data set for any usage restrictions.

Currently available data sets:

The downloadable repository file is a bzipped tarball (approx. 15 GB compressed, approx. 50 GB unpacked) of Merobase's code repository as archived on June 2, 2010. It contains 2,429,999 .java files crawled from CVS / SVN repositories, as well as 203,689 .class files. In addition, it includes 26,947 .jar files, which together contain 40,692 .java files and 4,342,376 .class files.

All of these files, together with 689,214 files from the open web available via HTTP, are part of the currently available (December 2012) search index of

The downloadable Lucene index is a download of approx. 49 GB.
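
Because the repository tarball unpacks to approx. 50 GB, it can be convenient to inspect it without extracting it first. The following is a minimal Python sketch, assuming only that the download is a standard bzip2-compressed tar archive as described above. The function name and the archive file name in the usage note are hypothetical and not part of the Merobase distribution.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(archive_path, extensions=(".java", ".class", ".jar")):
    """Count regular files in a bzip2-compressed tarball by file extension.

    Hypothetical helper for illustration; not part of any Merobase tooling.
    """
    counts = Counter()
    # "r|bz2" reads the archive as a forward-only stream,
    # so nothing is unpacked to disk while counting.
    with tarfile.open(archive_path, mode="r|bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            suffix = PurePosixPath(member.name).suffix.lower()
            if suffix in extensions:
                counts[suffix] += 1
    return counts
```

A call such as count_extensions("merobase-repository.tar.bz2") (file name assumed) returns a Counter keyed by extension. The sketch does not open the .jar files themselves, so files packaged inside them are not included in the counts.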

Tool support:

The MeroL tool can be used to browse the index using the Merobase Query Language. An example query for a calculator implementation in Java would be:
Calculator ( add(int,int):int; ) lang:java
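
When many such queries are needed, for instance in a scripted experiment, the query strings can be generated programmatically. The following Python sketch reproduces the layout of the example query above. The helper name is hypothetical, and the way several method signatures are joined (separated by a space, each terminated by a semicolon) is an assumption inferred from that single example, not from a language specification.

```python
def build_query(class_name, methods, language=None):
    """Assemble an interface-style query string.

    methods is an iterable of (name, parameter_types, return_type) tuples.
    Hypothetical helper; the layout mirrors the single example query only.
    """
    signatures = " ".join(
        f"{name}({','.join(params)}):{ret};" for name, params, ret in methods
    )
    query = f"{class_name} ( {signatures} )"
    if language:
        # Language constraint as used in the example, e.g. lang:java
        query += f" lang:{language}"
    return query


example = build_query("Calculator", [("add", ["int", "int"], "int")], "java")
print(example)  # prints: Calculator ( add(int,int):int; ) lang:java
```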

Citation Policy

If you publish material based on data sets obtained from this repository, please acknowledge the repository in your publication. This will help others to obtain the same data sets and replicate your experiments. The primary reference for our work is:

W. Janjic, O. Hummel, M. Schumacher and C. Atkinson. An Unabridged Source Code Dataset for Research in Software Reuse. 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA, 2013, pp. 339-342.

Here is the corresponding BibTeX entry:

@inproceedings{Janjic2013,
	Title = {An Unabridged Source Code Dataset for Research in Software Reuse},
	Author = {Janjic, Werner and Hummel, Oliver and Schumacher, Marcus and Atkinson, Colin},
	Booktitle = {Proceedings of the Tenth International Workshop on Mining Software Repositories (MSR'13)},
	Address = {San Francisco, CA, USA},
	Organization = {IEEE Press},
	Pages = {339--342},
	Year = {2013}
}

Alternatively, you may cite the following publication:

O. Hummel, W. Janjic and C. Atkinson. Code Conjurer: Pulling Reusable Software out of Thin Air. IEEE Software, vol. 25, no. 5, pp. 45-52, Sept.-Oct. 2008.

Here is the corresponding BibTeX entry:

@article{Hummel2008,
	author="O. Hummel and W. Janjic and C. Atkinson",
	journal="IEEE Software",
	title="Code Conjurer: Pulling Reusable Software out of Thin Air",
	volume="25",
	number="5",
	pages="45--52",
	year="2008"
}

For the user's convenience, this page is based on the template of the Sourcerer Data Set web page.

(c) the Software-Engineering group at the University of Mannheim, Germany.