Ceph as an alternative filesystem for the Hadoop ecosystem

(Moving old post from blogger to wordpress – originally posted in Aug 2010)

Ceph has been gaining ground as a quality file system and is under active development. It is on its path to becoming a strong contender to replace existing systems like HDFS etc. I co-authored an article for the Usenix ;login: August 2010 edition on Ceph as an alternative file system for Hadoop. You can read it here.


Comparing HBase, Cassandra and PNUTS

(Moving old post from blogger to wordpress – originally posted in May 2010)
A lot of NoSQL systems have been sprouting up recently and an increasing number of people are using NoSQL data stores and moving away from RDBMS systems. There’s nothing wrong with relational database systems but they are optimized for certain use cases, which they handle very well. NoSQL systems (Bigtable, Dynamo, PNUTS, CouchDB, MongoDB, Keyspace, to name a few) solve different sets of problems, for which they are best suited for.
Recently, in a course that I’m taking at UC Santa Cruz, I got a chance to present the PNUTS paper and compare the system with Bigtable and Dynamo.
At a high level, here’s how these systems compare:
(This post talks about HBase instead of Bigtable and Cassandra instead of Dynamo)
Consistency Model
Fully consistent.
Eventual consistency. Divergent version trees of the same row can exist. Client can trade off between latency and consistency.
Timeline consistency. All versions of a row honor a timeline and there are no divergent version trees.
ACID Semantics
put() call is atomic at a row level. There is no concept of transactions and no notion of consistency between rows.
Scans don’t give a consistent view of the table. However, any row returned by a scan will be a consistent view of that given row. Any row updated with a timestamp older than the scanner initialization timestamp may show up in the results.
(Details available in the HBASE- 2294 jira)
None specified
Write call gives the same ACID semantics as a transaction involving a single row
Data Model
Tabular, column oriented.
Table consists of column families and each family has multiple columns. The schemas are flexible and there can be arbitrary columns in any given family. However, the families are specified on table creation. Different versions can be stored for each cell.
Storage model is columnar and is strictly ordered on the rows.
Similar to HBase. Dynamo on the other hand is a key value store. Cassandra has the Bigtable data model over the Dynamo
P2P architecture. Cassandra also has super columns, which are like columns families within a column family.
Tabular, row oriented.
The schemas are flexible and a row can have arbitrary columns, with some being empty as well. Each node stores only a single version of any given row but different versions can exist across the cluster.
Storage model is row oriented.
Underlying Storage
HDFS or any other distributed file system
Node’s local storage
Node’s local storage (Choice of Hash table or Ordered table)
Data is replicated by the file system when it is persisted
Choice between Synchronous and Asynchronous for each update.
Data is written to the master copy, which propagates it to the message broker, which takes care of replication
Fault Tolerance
Regions are restarted (on the same node or any other) if they crash.  If a region server dies, its regions
are distributed to the other servers that are functional. No re-allocation of data takes place.
Updates are first logged into a Write Ahead Log before they go to the  memstore. One WAL is maintained per region server.
Data from the failing node is re-assigned to the next node in the consistent hashing
Updates are first written to a Write Ahead Log before committed to the table.
If the master for a given record fails, either a new master is elected or the write fails. It is never the case that a write will go to a node that is not the
master for a record.
The message broker does logging when it receives the update from the master. There is no logging done at the individual nodes.
1000s of nodes.
Each table
can have millions of columns and billions of rows
10-100s of nodes
10s of sites with 1000s of nodes. Since it is row oriented storage, the number of columns would not be very high (no numbers reported in the papers)
Optimized for
Writes, Scans. The writes are kept in memory (in the memstore) and flushed to disk in chunks, which gives good write performance.
Writes. Poor scan performance as compared to the other two systems. [5]
Reads. The client has the option to read from a copy which is geographically close and
can define the level of consistency desired.
Where does it fit?
If you want a scalable system that is deployed in a single data center and you don’t care about network partitions, HBase is your friend. If you cannot tolerate loose consistency in data, this is the best option. Update: If you already have Hadoop and want to be able to read/write small objects quickly and also run some analytics over the data, HBase is the way to go.
If you can deal with eventual consistency and want a highly available system that can span across data centers, Cassandra is your friend. As of now, Cassandra is easier to get off the ground than HBase and has lesser components that you need to get running to begin with. Contrary to the popular belief, Cassandra has more moving parts than HBase but they are managed by the framework and user does not need to worry much. Update: If you dont have Hadoop and dont need it but you just need a database that scales beyond a single node, Cassandra is your friend. Keep in mind that there is no SQL here. For HBase, you need to also deploy Hadoop (HDFS atleast), which is not required by Cassandra.
If you want the system to be geographically distributed in order to serve large number of reads with low latency from across the globe, PNUTS is your friend.  You also get fine-grained control over consistency and can trade it with low latency for reads.
Access options
Native Java API, Jython, Groovy DSL, Scala, REST, Thrift
Native Java API, Ruby, Perl, Python, Scala, PHP, Clojure, Grails, C++, C#
If you are out there looking for a scalable database solution, you’ve got quite a few choices. These are just some of the more popular ones. PNUTS is not open source but is a nice system from an architectural stand point and thats why I talk about it in this post.
[5] Yahoo! YCSB Benchmark: http://research.yahoo.com/node/3202
[6] Lars George’s description of the HBase Architecture: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html