On Lily, HBase, Hadoop and SOLR

During a conference call last week, I got a request for a high-level description of the differentiators between Lily and its underlying components: basically, what sets Lily apart from HBase. Here’s a rough take on it before I fold it into the product documentation; your comments are appreciated!

The foundations of Lily are well understood: HBase, the Hadoop BigTable implementation, and the Hadoop distributed filesystem (HDFS), conveniently packaged in the Cloudera Hadoop distribution; and SOLR, the enterprise search platform based on the popular full-text search engine Lucene. In this text, we look at what sets Lily apart or, more specifically, what Lily adds on top of these powerful technologies.

The Lily Content Model

HBase is modeled after Google BigTable, a well-documented yet surprisingly uncommon data model described as a sparse, distributed, multi-dimensional sorted map. Uncommon in the sense that classically trained business software developers might struggle to map their domain models onto the BigTable/HBase model, mainly because it lacks entity relationships and indexed access and tends towards heavy denormalization. Note that these traits, though sometimes poorly understood, are exactly what gives HBase its power.
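To make the "sparse, multi-dimensional sorted map" idea more tangible, here is a minimal Python sketch. It is a single-process toy, not the HBase API: the class name `SparseSortedTable`, its methods and the sample row keys are all invented for illustration, and the "distributed" part is left out entirely.

```python
from collections import defaultdict


class SparseSortedTable:
    """Toy, in-memory stand-in for a BigTable-style map (not the HBase API)."""

    def __init__(self):
        # row key -> column name -> {timestamp: value}; absent cells cost nothing
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan(self, start_row, stop_row):
        """Yield (row, latest cells) in key order for start_row <= row < stop_row."""
        # Sorting on every scan keeps the toy short; a real store keeps rows sorted.
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {col: self.get(row, col) for col in self._rows[row]}


# Hypothetical sample data: rows need not share the same columns.
table = SparseSortedTable()
table.put("user:bob", "profile:email", 1, "bob@example.org")
table.put("user:alice", "profile:name", 1, "Alice")
table.put("user:alice", "profile:name", 2, "Alice B.")
```

Note what is missing: there is no way to ask for "all users named Alice" without scanning, and nothing ties one row to another. Those are the gaps described above.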

In our area of expertise, i.e. content management and publishing, other common storage traits are missing from the BigTable model as well: support for common data types (time and date, but also more sophisticated types such as multi-value fields and hierarchical data) and the ability to model entity schemas using these core types. Also, many if not most content repositories have intrinsic support for content versioning, preferably customizable below the entity level.

With this in mind, HBase is currently used mostly for mass data ingestion and processing, where the data structures are often really simple (log files, user profiles). We, however, chose HBase as a random-access storage engine that replaces a relational database, admittedly with a new and interesting data model and a promise to scale well in a busy work environment.

Lily offers a content model on top of HBase, which we believe to be of value for many non-specialist content application developers, in the sense that it offers slightly higher-level concepts to shape a data tier with: records and fields, a set of data types, link, multi-value and hierarchical fields, a flexible versioning scheme, and schema validation. Beyond that, we took a couple of popular content exchange models and verified whether a mapping from these into Lily would be possible: HTML, RDF, the CMIS model, NewsML. We found that a useful mapping was possible, so we’re confident that a broad range of content applications can be built on top of Lily.
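The concepts listed above (typed fields, multi-value and link fields, versioning, schema validation) can be illustrated with a short Python sketch. This is not the Lily API: `FieldType`, `RecordType`, `Record`, `Link` and the "Article" schema are hypothetical names chosen for this example, and the versioning rule (a new version only when a versioned field changes) is an assumption made to keep the example small.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Link:
    """Reference to another record by id (hypothetical link field value)."""
    record_id: str


@dataclass(frozen=True)
class FieldType:
    value_type: type
    multi_value: bool = False
    versioned: bool = True


class RecordType:
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields  # field name -> FieldType

    def validate(self, values):
        for name, value in values.items():
            if name not in self.fields:
                raise ValueError(f"unknown field: {name}")
            ftype = self.fields[name]
            if ftype.multi_value and not isinstance(value, list):
                raise TypeError(f"{name} expects a list")
            items = value if ftype.multi_value else [value]
            if not all(isinstance(item, ftype.value_type) for item in items):
                raise TypeError(f"{name} expects {ftype.value_type.__name__}")


class Record:
    def __init__(self, record_type):
        self.record_type = record_type
        self.versions = []     # one snapshot of the versioned fields per version
        self.unversioned = {}  # overwritten in place, no history kept

    def update(self, values):
        """Validate and apply `values`; return the current version number."""
        self.record_type.validate(values)
        snapshot = dict(self.versions[-1]) if self.versions else {}
        before = dict(snapshot)
        for name, value in values.items():
            if self.record_type.fields[name].versioned:
                snapshot[name] = value
            else:
                self.unversioned[name] = value
        if snapshot != before:
            self.versions.append(snapshot)
        return len(self.versions)


# Hypothetical schema and sample values.
article_type = RecordType("Article", {
    "title": FieldType(str),
    "tags": FieldType(str, multi_value=True),
    "published": FieldType(date),
    "author": FieldType(Link),
    "status": FieldType(str, versioned=False),
})
article = Record(article_type)
v1 = article.update({"title": "On Lily", "tags": ["hbase", "solr"],
                     "published": date(2020, 1, 1), "author": Link("person-7")})
v_same = article.update({"status": "in review"})  # unversioned: no new version
v2 = article.update({"title": "On Lily, HBase, Hadoop and SOLR"})
```

Keeping `status` outside the version history shows why versioning below the record level is useful: workflow state can change freely without creating a new version of the content itself.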

With these foundations, Lily is ideally suited for storing and managing semi-structured content, a mixture of simple and rich media types, and hypertext data: in short, data with a life cycle (hence versioning) and with relationships (hence link fields). Obviously, people who want low-level access to the BigTable model should use HBase directly instead. Technically, all Lily content is mapped onto HBase tables.

Flexible index management, powerful search

Having a flexible content model is one thing, but what is often cited as the key missing feature of many NoSQL stores is the ability to locate data by anything other than its primary key (there’s a pun in that, but I’ll pass). With our experience in content apps, we also had a strong need for full-text search and facet browsing, and for all of this to scale with the volume of data HBase accommodates (which is a lot).

To this end, we integrate fully with the leading enterprise search platform, SOLR, in a way that is scalable (by sharding index data and distributing search), flexible (by providing full access to SOLR functionality) and robust. That last aspect is really important because SOLR index updating is inherently asynchronous, whereas the put/get operations on HBase records are synchronous.
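Sharding index data and distributing search can be pictured with the following Python sketch. It uses plain dictionaries instead of SOLR, and both `shard_for` and `ShardedIndex` are hypothetical names; the hash-based placement is one common strategy, assumed here for illustration rather than taken from Lily.

```python
import hashlib


def shard_for(record_id, shard_count):
    """Pick a shard from a stable hash, so every node agrees on the placement."""
    digest = hashlib.sha1(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedIndex:
    """Toy index: each shard is a dict of record id -> indexed text."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def add(self, record_id, text):
        self.shards[shard_for(record_id, len(self.shards))][record_id] = text

    def search(self, term):
        """Query every shard and merge the hits into one sorted result list."""
        hits = []
        for shard in self.shards:  # a real deployment would query shards in parallel
            hits.extend(rid for rid, text in shard.items()
                        if term.lower() in text.lower())
        return sorted(hits)


# Hypothetical sample documents.
index = ShardedIndex(shard_count=3)
index.add("article-1", "Lily stores content in HBase")
index.add("article-2", "SOLR provides full-text search")
index.add("article-3", "HBase scales with the data volume")
hbase_hits = index.search("hbase")
```

Each document lives on exactly one shard, so index size and update load spread across machines, while a search fans out to all shards and merges the partial results.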

We bridged HBase and SOLR using an HBase-backed RowLog mechanism, which we implemented as a fully distributable message queue. As a result, we can now reliably pass content updates from one distributed system (HBase) to another (SOLR).
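The general idea of passing updates reliably from a synchronous store to an asynchronous index can be sketched as follows. This is not Lily's RowLog implementation: `RowLogSketch`, `FlakyIndex` and their methods are hypothetical, everything lives in memory, and the only behaviour shown is that a message stays pending until the listener has processed it successfully.

```python
class RowLogSketch:
    """Toy model: every row update also records a pending message."""

    def __init__(self):
        self.rows = {}      # stands in for the HBase table
        self.pending = []   # stands in for the durable, ordered message log

    def put(self, row_key, fields):
        # In a real system both writes would have to be made durable together.
        self.rows[row_key] = dict(fields)
        self.pending.append(row_key)

    def process(self, listener):
        """Deliver each pending message; keep the ones whose delivery failed."""
        failed = []
        for row_key in self.pending:
            try:
                listener(row_key, self.rows[row_key])
            except Exception:
                failed.append(row_key)  # retried on the next pass
        self.pending = failed
        return len(failed)


class FlakyIndex:
    """Index stub that rejects its first update, like a briefly unreachable server."""

    def __init__(self):
        self.documents = {}
        self.calls = 0

    def update(self, row_key, fields):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("index unavailable")
        self.documents[row_key] = dict(fields)


store = RowLogSketch()
index = FlakyIndex()
store.put("article-1", {"title": "On Lily"})
left_after_first_pass = store.process(index.update)  # delivery fails, message kept
left_after_retry = store.process(index.update)       # delivery succeeds, message removed
```

A scheme like this gives at-least-once delivery, so the listener has to tolerate seeing the same update twice. Re-indexing a whole record is naturally idempotent, which makes that acceptable here.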

The indexing mechanism lets you configure which data needs to be indexed and describe the mapping between the Lily and the SOLR data models. It also supports data denormalization and link dereferencing, which are needed to make up for the flexibility that the SQL query language would otherwise provide.
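As an illustration of such a mapping, the Python sketch below builds a flat index document from a record and follows a link field to copy a value from the linked record. The mapping format (index field name to a path of field names), the functions `resolve` and `build_index_document`, and the field names are assumptions made for this example; they are not Lily's indexer configuration syntax.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """Hypothetical link field value pointing at another record."""
    record_id: str


def resolve(records, record_id, path):
    """Follow link fields along `path`; return the final field value or None."""
    record = records.get(record_id)
    *link_fields, last_field = path
    for name in link_fields:
        if record is None:
            return None
        target = record.get(name)
        if not isinstance(target, Link):
            return None
        record = records.get(target.record_id)
    return None if record is None else record.get(last_field)


def build_index_document(records, record_id, mapping):
    """Produce a flat document: one entry per mapped field that has a value."""
    document = {"id": record_id}
    for index_field, path in mapping.items():
        value = resolve(records, record_id, path)
        if value is not None:
            document[index_field] = value
    return document


# Hypothetical repository content and mapping.
records = {
    "article-1": {"title": "On Lily", "author": Link("person-7")},
    "person-7": {"name": "Ada"},
}
mapping = {
    "title_t": ("title",),                # plain field
    "author_name_s": ("author", "name"),  # dereference the author link
    "editor_name_s": ("editor", "name"),  # no such link: left out of the document
}
document = build_index_document(records, "article-1", mapping)
```

The price of this denormalization is that a change to the linked record (here, the person's name) means the referring records have to be re-indexed too, since the copied value no longer matches.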

For search, we offer access to SOLR as-is, a familiar environment to many. In the future, keeping in mind that SOLR might not remain the only index/search solution supported by Lily, we could wrap the SOLR query language in a query language of our own, but we haven’t deemed this necessary so far.

A distributed architecture

All components of the Lily architecture are fully distributable, including the client connection points (the Lily server nodes), and we take special care not to introduce single points of failure (SPOFs) into the design through careless tool reuse. This is one of the reasons why we decided to implement our own queuing mechanism: the other candidates had no reliable, distributed storage.

The Lily remote interface

Lily can be used through its Java API (with remoting happening through Avro), and also through a REST (HTTP/JSON) interface.

Integration hooks

It is important to realize that a Lily server node consists of a set of Kauri modules deployed inside a Kauri runtime, which already ensures a very flexible, pluggable component architecture. You can, for example, easily develop your own indexer and replace the default one. We, the people behind Lily, have been developing open source software for almost 10 years now, working mainly for a technical audience (i.e. developers), so we know the importance of sensible interfaces and replaceable default behavior.

Conclusion

Lily offers a flexible, higher-level content repository model on top of HBase, HDFS and SOLR, and combines store and search in a scalable and distributed architecture. It is suited for management of semi-structured, content-centric media and data objects that need versioning, sophisticated search and complex metadata. Lily has an open and extensible architecture and provides low- and higher-level access APIs.