Hadoop is an open source framework that different vendors take, customize, and extend with their own products, then bring to market as a new product with its own features and functionality.
I don’t know how far this analogy holds, but it’s like Android OS. Different vendors take the same Android core, customize it, build their own functionality on top of it, and create a different product altogether.
Typically, different Hadoop distributions come with different sets of tools, support, optimizations, and additional features. So the challenge is: how do we decide which Hadoop distribution suits our requirements and serves the organization’s purpose?
You can see a list of Hadoop distributions here. Forrester recently published a market analysis in which it rated the different Hadoop-on-cloud vendors.
Here is a list of the top Hadoop distributions, the value each one adds, and my thoughts on which would work for which use case:
- Cloudera Distribution of Apache Hadoop (CDH): Cloudera was the first commercial Hadoop startup. It offers a core open source distribution along with a number of additional components, including Cloudera Search, Impala, Cloudera Navigator, and Cloudera Manager.
- Pivotal HD: includes a number of Pivotal software products such as HAWQ (SQL engine), GemFire XD (analytics), Big Data Extensions, and the USS storage abstraction. Pivotal supports building one physical platform that hosts multiple virtual clusters, as well as PaaS using Hadoop and RabbitMQ.
- IBM InfoSphere BigInsights: includes visualization and exploration, advanced analytics, security, and administration. No other vendor gives you the flexibility of working on a bare metal machine. But that comes at the price of scalability: a bare metal machine can’t be scaled up or down on the fly. IBM’s other products, BigQuality, BigIntegrate, and IBM InfoSphere Big Match, can be seamlessly integrated for mature enterprise operations.
- Amazon Elastic MapReduce (EMR): comes with EMRFS, which allows EMR to connect to S3 and use it as a storage layer. The fact that S3 is the market leader in object storage, and that many enterprises already use S3 for their Big Data storage, makes it an obvious choice.
But AWS EMR works with AWS data stores only, and I really doubt that it can be integrated with other storage options.
- Azure HDInsight: uses the Hortonworks Data Platform (HDP) distribution, which is designed for the Azure cloud. Enterprise architects can use C#, Java, and .NET to create, configure, monitor, and submit Hadoop jobs.
- Google Cloud Dataproc: has built-in integration with Google Cloud services like BigQuery and Bigtable. Unlike other vendors, Google bills you by the minute.
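To see why billing granularity matters for short-lived clusters, here is a rough sketch of the arithmetic. The hourly rate and job length are made-up numbers, and `billed_cost` is a hypothetical helper, not any vendor's actual pricing formula; it only assumes that usage is rounded up to the next billing increment.

```python
import math


def billed_cost(runtime_minutes, hourly_rate, granularity_minutes):
    """Cost when usage is rounded up to the next billing increment.

    Hypothetical helper for illustration; real vendor pricing may add
    minimum charges or other rules not modeled here.
    """
    increments = math.ceil(runtime_minutes / granularity_minutes)
    billed_minutes = increments * granularity_minutes
    return billed_minutes * hourly_rate / 60


# Assumed numbers: a 70-minute job on a cluster priced at 6.00 per hour.
per_minute = billed_cost(70, 6.00, granularity_minutes=1)
per_hour = billed_cost(70, 6.00, granularity_minutes=60)

print(f"billed by the minute: {per_minute:.2f}")
print(f"billed by the hour:   {per_hour:.2f}")
```

With hourly rounding, the 70-minute job is charged as two full hours, so the gap grows with the number of short jobs you run.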
Looking at all these features and functionality, it’s quite easy to get confused by the plethora of options available right now, with each vendor trying hard to get a bigger slice of the pie.