Category Archives: Big Data and Analytics

Shopping cart for Hadoop as a Service

Hadoop is an open source framework that different vendors take, customize and extend with their own products, then bring to market as a new product with its own features and functionality.
I don’t know how accurate this analogy is, but it’s like Android OS. Different vendors take the same Android core, customize it, build their own functionality on top of it and create a different product altogether.
Typically, different Hadoop distributions come with different sets of tools, support, optimizations and additional features. So the challenge is deciding which Hadoop service suits our requirements and can serve the organization’s purpose.
You can see a list of Hadoop distributions here. Forrester recently published a market analysis report that rates the different Hadoop-on-cloud vendors.
Here is a list of the top Hadoop distributions, the value they add, and my thoughts on what would work for which use case:

  1. Cloudera Distribution of Apache Hadoop (CDH): Cloudera was the first commercial Hadoop startup. It offers a core open source distribution along with a number of frameworks, including Cloudera Search, Impala, Cloudera Navigator and Cloudera Manager.
  2. Pivotal HD: includes a number of Pivotal software products such as HAWQ (SQL engine), GemFire, XD (analytics), Big Data Extensions and the USS storage abstraction. Pivotal supports building one physical platform to support multiple virtual clusters, as well as PaaS using Hadoop and RabbitMQ.
  3. IBM InfoSphere BigInsights: includes visualization and exploration, advanced analytics, security and administration. No other vendor gives you the flexibility of working on a bare metal machine, but that comes at the price of scalability: a bare metal machine can’t be scaled up or down on the fly. IBM’s other products, BigQuality, BigIntegrate and IBM InfoSphere Big Match, can be seamlessly integrated for mature enterprise operations.
  4. Amazon Elastic MapReduce (EMR): comes with EMRFS, which allows EMR to connect to S3 and use it as a storage layer. The fact that S3 is the market leader in object storage, and that many enterprises already use S3 for their Big Data storage, makes it an obvious choice.
    But AWS EMR works with AWS data stores only, and I really doubt it can be integrated with other storage options.
  5. Azure HDInsight: uses the HDP (Hortonworks Data Platform) distribution, which is designed for the Azure cloud. Enterprise architects can use C#, Java and .NET to create, configure, monitor and submit Hadoop jobs.
  6. Google Cloud Dataproc: has built-in integration with Google Cloud services like BigQuery and Bigtable. Unlike other vendors, Google bills you by the minute.
Looking at the features and functionality, it’s quite easy to get confused by the plethora of options available right now, and each vendor is trying hard to get a bigger piece of this cake.
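
One simple way to cut through the confusion is a weighted scoring matrix. Here is a minimal sketch in Python. The criteria, the weights, the candidate names and the 1-5 scores are all placeholders I made up for illustration; they are not real vendor ratings, so plug in your own.

```python
# Hypothetical criteria weights (they should sum to 1.0); adjust to your priorities.
WEIGHTS = {"tooling": 0.3, "support": 0.2, "cloud_integration": 0.3, "cost": 0.2}

# Placeholder scores on a 1-5 scale. These are NOT real vendor ratings.
CANDIDATES = {
    "distribution_a": {"tooling": 4, "support": 5, "cloud_integration": 2, "cost": 3},
    "distribution_b": {"tooling": 3, "support": 3, "cloud_integration": 5, "cost": 4},
}


def weighted_score(scores, weights):
    """Return the weighted sum of one candidate's criterion scores."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(candidates, weights):
    """Return candidate names sorted from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank(CANDIDATES, WEIGHTS):
        print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

The ranking is only as good as the weights, so it is worth agreeing on them with your stakeholders before scoring any vendor.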


Security Strategies for Hadoop

You know what TCP/IP and Hadoop have in common? Neither was created with security in mind. And you know what else they have in common? Both have become extremely important and ubiquitous.
The very thing we rejoice about, the massive amount of data created by sources like sensors and mobile devices, has given hackers multiple points of entry into an organization. Think about it: it’s not just servers exposed to the internet that can be hacked. Anything and everything that connects to your intranet is a potential source of a security breach.
Hadoop characteristics such as distributed computing, fragmented data, access to data and node-to-node communication present a great challenge for developers trying to prevent security breaches. The biggest issue with Hadoop security is that it’s not a single technology but an entire ecosystem of technologies: Hive, HBase, Oozie etc.
It’s important to understand the threat categories before creating a security strategy. They can be:

  1. Unauthorized access/Masquerade
  2. Insider Threat
  3. Denial of Service
  4. Threats to Data

According to Forrester, developers must consider six security properties:

  1. Confidentiality: Data is made available only to people who really need it
  2. Integrity: Data is changed only in appropriate ways and only by those authorized to change it
  3. Availability: Data is available only through applications that are allowed to make it available
  4. Authentication: A person’s identity is established before access is granted
  5. Authorization: People are explicitly allowed or not allowed to access the application
  6. Nonrepudiation: A person cannot perform an action and later deny having performed it
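
To make a few of these properties concrete, here is a minimal Python sketch using only the standard library. It is my own illustration, not Forrester’s method and not how Hadoop itself implements security. The user store, ACL, key, salt and function names are all hypothetical, and a real cluster would delegate these checks to dedicated services. It covers authentication, authorization, integrity and an audit trail, which is only a first step toward nonrepudiation.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # assumption: a shared key used for integrity tags
SALT = b"demo-salt"       # assumption: fixed salt, for brevity only


def hash_password(password):
    """Derive a password hash with PBKDF2 (standard library)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), SALT, 100_000)


# Hypothetical in-memory stores.
USERS = {"alice": hash_password("s3cret")}  # authentication data
ACL = {"alice": {"sales_table": {"read"}}}  # authorization rules
AUDIT_LOG = []                              # record of every access attempt


def authenticate(user, password):
    """Authentication: establish identity before any access is granted."""
    return hmac.compare_digest(USERS.get(user, b""), hash_password(password))


def authorize(user, resource, action):
    """Authorization: the action must be explicitly allowed."""
    return action in ACL.get(user, {}).get(resource, set())


def sign(data):
    """Integrity: tag the data so that unauthorized changes can be detected."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()


def verify(data, tag):
    """Return True only if the data still matches its integrity tag."""
    return hmac.compare_digest(sign(data), tag)


def access(user, password, resource, action):
    """Run the checks in order and log the attempt, allowed or not."""
    allowed = authenticate(user, password) and authorize(user, resource, action)
    AUDIT_LOG.append((user, resource, action, allowed))
    return allowed
```

Calling `access("alice", "s3cret", "sales_table", "write")` is denied because only `read` is listed in the ACL, and the denied attempt still lands in `AUDIT_LOG`.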

In addition, it’s important to understand the Hadoop architecture, which comprises:

  1. Network
  2. Hosts

A lot also depends on the operating environment of Hadoop. It can be one of the following:

  1. On-premises
  2. Co-location
  3. Cloud

As I have highlighted earlier, deploying Hadoop as a Service takes care of most of these concerns, so you can really focus on actionable insights and making great apps for your business.

If you want to discuss Hadoop and other Big Data solutions on cloud further, just get back to me here.

Two tangible business impacts of IoT

IoT is the new favorite of CIOs, and there is a lot of buzz in the market. So it’s really important to understand the tangible benefits of IoT.

According to Forrester, IoT impacts businesses in two ways:

1. Enhance front end customer experience: Every industry has different definitions and requirements for enhancing the customer experience, and it’s very difficult to list each and every aspect of it. Broadly, the following initiatives should be taken to enhance customer experience:

  • Build smarter products by tapping the enormous amount of data
  • Use the data to make customer order and delivery tracking more efficient and precise
  • Enhance energy management by deploying intelligent sensors
  • Enhance security and public safety monitoring or surveillance by making things talk to each other and managing them centrally
  • Make homes smarter with connected sensors that can be controlled through handheld devices or from a remote location

2. Improve backend operational efficiency: A business is only as efficient as its backend. Processes like navigation, metering, asset tracking, notifications, monitoring and ordering support the following use cases:

  • Fleet management: Navigation and asset tracking systems have made it possible to reduce costs and invest the money saved in core business processes.
  • Logistics and transport: It has become easy to monitor the inventory of non-stationary assets in near real time with the help of a chip. A report published by Forrester highlights how Cargo View uses smart global SIM cards to provide an automatic airplane mode during flight times, so that connected objects travel in a safe, FAA-compliant mode when on an aircraft.
  • Predictive and prescriptive maintenance: Real-time monitoring services enable predictive and prescriptive maintenance. It’s now possible to predict a failure or take precautionary measures so that the failure can be avoided or its aftermath minimized.
  • Supply chain management: Reliably monitoring and tracking the status of fast-moving consumer goods and assets is a competitive advantage. So don’t be surprised if tomorrow your milkman carries a bag with an embedded chip that captures location data, temperature data etc. and transmits it to a server, taking milk delivery to the next level.
  • Safety monitoring and surveillance: Location-specific data adds tremendous relevance to IoT solutions. Imagine you have a car that continuously transmits its location to your mobile device. If it ends up on an unfamiliar route, you disable the car.
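
The car example in the last bullet boils down to a distance check against known waypoints. Here is a minimal Python sketch, assuming the car reports (latitude, longitude) pairs in degrees. The function names, the waypoint list and the 1 km threshold are hypothetical; the distance formula is the standard haversine great-circle formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def is_off_route(position, familiar_route, max_km=1.0):
    """True if the position is farther than max_km from every known waypoint."""
    return all(haversine_km(position, wp) > max_km for wp in familiar_route)


if __name__ == "__main__":
    route = [(0.0, 0.0), (0.0, 0.01)]  # made-up waypoints, roughly 1.1 km apart
    print(is_off_route((0.0, 0.005), route))  # between the two waypoints
    print(is_off_route((0.0, 0.5), route))    # tens of kilometres away
```

A real solution would compare against the route segments rather than isolated waypoints and would smooth out GPS noise before deciding to disable anything.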

IoT solutions are getting smarter and more attractive because of two important facts:

1. Machines  can now talk to each other

2. There is a program sitting somewhere which makes the hardware, users and other machines understand what is being said, how and when it is being said, and by whom and to whom.

To get started, all you need is an IoT platform that already has most of the pieces ready for you; you just have to start writing code. Yes, I am talking about IoT platforms on cloud!

Essentials of Big Data and Analytics on Cloud

Starting a Big Data and Analytics (BDA) project on cloud is not only faster but also considerably cheaper. I have come across many clients who want to start their BDA project but don’t proceed, or rather delay, because they worry about the upfront cost, required skills and execution time. All of these can be taken care of if they start their project on cloud.

The challenge for them is deciding which cloud services to go for and which vendor to select. Here is a list of the services currently available on cloud that are worth considering when starting a BDA project.

1. Edge Services: These act as an interface between your users, your data and the cloud services provider. They serve the following purposes:

  • DNS resolution
  • CDN services
  • Firewall
  • Load balancers

These are typically available as IaaS from companies like IBM, AWS, Microsoft etc.

2. Data Streaming : This is primarily for data in motion. You need data streaming for

  • Real time analytical processing
  • Data Augmentation

Data streaming tools are available as SaaS on various cloud marketplaces.
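
To show what those two streaming needs look like in practice, here is a minimal Python sketch built on plain generators. It is not tied to any particular streaming product. The event shape, the device names and the lookup table are all hypothetical.

```python
from collections import deque

# Hypothetical reference data used to augment each event.
DEVICE_LOCATIONS = {"sensor-1": "warehouse", "sensor-2": "truck"}


def augment(events, lookup):
    """Data augmentation: attach reference data to each event as it passes through."""
    for event in events:
        yield {**event, "location": lookup.get(event["device"], "unknown")}


def rolling_average(events, window=3):
    """Real-time analytics: emit the mean of the last `window` values with each event."""
    recent = deque(maxlen=window)
    for event in events:
        recent.append(event["value"])
        yield {**event, "rolling_avg": sum(recent) / len(recent)}


if __name__ == "__main__":
    stream = [
        {"device": "sensor-1", "value": 10.0},
        {"device": "sensor-2", "value": 20.0},
        {"device": "sensor-9", "value": 30.0},
    ]
    for enriched in rolling_average(augment(stream, DEVICE_LOCATIONS)):
        print(enriched)
```

Because both steps are generators, each event is processed as it arrives and nothing has to be held in memory beyond the small rolling window.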

3. Data Integration: Data from different sources is delivered to the cloud service provider through the Edge services and then goes through the following steps before insights can be extracted from it:

  • Data Staging
  • Data quality checks
  • Transformation and loading

These are also available as SaaS on various cloud marketplaces.
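
The three integration steps can be sketched as a tiny pipeline in Python. This is only an illustration of the flow, not any vendor’s tool. The field names (`id`, `amount`), the rule for what counts as a bad row and the list standing in for the target store are all assumptions.

```python
def stage(raw_rows):
    """Data staging: copy incoming rows untouched into a staging area."""
    return [dict(row) for row in raw_rows]


def quality_check(rows, required=("id", "amount")):
    """Data quality checks: split rows into valid and rejected."""
    valid, rejected = [], []
    for row in rows:
        ok = all(row.get(field) not in (None, "") for field in required)
        (valid if ok else rejected).append(row)
    return valid, rejected


def transform_and_load(rows, target):
    """Transformation and loading: normalise types, then append to the target."""
    for row in rows:
        target.append({"id": int(row["id"]), "amount": float(row["amount"])})
    return target


if __name__ == "__main__":
    raw = [
        {"id": "1", "amount": "9.50"},
        {"id": "2", "amount": ""},  # fails the quality check
    ]
    valid, rejected = quality_check(stage(raw))
    warehouse = transform_and_load(valid, [])
    print(warehouse, rejected)
```

Keeping the rejected rows around, instead of silently dropping them, makes it much easier to go back to the source system and fix the data.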

4. Data Repositories: Data repositories hold both data in motion (from streaming services) and data at rest (after the data integration process) and prepare the data for the various analytical engines. They are meant for the following functions:

  • Data warehousing
  • Landing, exploration and archive
  • Deep Analytics and modelling
  • Interactive Analytics and Reporting
  • Catalog

Earlier, not many SaaS offerings were available for data repository services, but now there are many on different cloud marketplaces.

5. Actionable Insights: Data from the data repository is then fed into a variety of tools to extract insights. Typically you need different tools to perform the following:

  • Decision Management
  • Discovery and Exploration
  • Predictive Analytics
  • Analysis and Reporting
  • Content Analytics
  • Planning and Forecasting
  • Visualization

In addition to the above services, you also get Data Security and Governance services on cloud. Multiple vendors provide either all or part of the above services. The market is flooded with services, and selecting one service or vendor really requires a lot of research, with many points to be considered.