CEO Blog: Hadoop for Dummies (Meaning Senior Execs)
When I joined Prime Computer in 1984, there was an emerging market called "GIS," which stands for Geographic Information Systems. This had basically evolved from "computer mapping" and then "AM/FM" (Automated Mapping and Facilities Management), which used thematic maps mostly to overlay utilities' infrastructure onto base maps. Then came ESRI (the Environmental Systems Research Institute). While ESRI had been around since 1969, they were really pioneering the use of data in the world of computer-based mapping. The maps themselves were merely a spatial lens into the data. ESRI was a big partner of Prime, and I really liked what they were doing. By 1987, I had become a "GIS Specialist." In reality, I knew very little. But GIS was a very, very hot topic then. Everyone wanted to get in on the action. And because of this, I learned a very simple but important lesson: if you know 15% of a topic about which almost everyone else knows only 2%, you appear to almost everyone as an expert. Really.
Nowadays, everyone wants in on Hadoop. The more advanced people extend these discussions to NoSQL as well. Yet many, many people in the technology space still don't understand much about Hadoop, other than that they must need it, and need it fast. This was the GIS phenomenon in 1987. I was recently at a conference where there were several big Hadoop shops. There were also several seemingly sophisticated organizations expressing the need they felt to move in that direction, though they clearly had no idea what they were really talking about. Many of these people were higher up in their organizations, so perhaps there were much more sophisticated technologists involved below them. But decisions are often made by people who really don't understand much about what they are doing. I hear that from time to time from my own team! That is actually OK to a point, but I truly believe it makes great sense for decision-makers to understand very, very basic aspects of technology when the well-being of their organizations is dependent on technology and the decisions they either make or approve. So here goes my quick attempt to get the Hadoop 2%ers up to 10% to 15%.
What Hadoop is:
- A collection of (free) open source programs available from the Apache Foundation. These programs mainly include a file system, capabilities for writing processing logic, distributing processing over large numbers of systems and gathering back the results, and creating a data warehouse structure for summarization, query, and analysis. Hadoop is continually evolving via the Apache Foundation.
- There are a number of companies who have commercialized the distributions to provide added value capabilities and support that make Hadoop more desirable in commercial production environments. The main four are Cloudera, HortonWorks, EMC, and MapR. In these cases, it is, of course, no longer free. However, you are paying for the support and added value.
- It is a proven platform for storing very large amounts of unstructured data. This is more and more a need in industries ranging from digital advertising and social networking to financial services to telecommunications to government, especially defense.
- It is a platform that can scale, utilizing a large number of commodity distributed-processing resources effectively.
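For readers who want one level more detail, the "writing processing logic, distributing it, and gathering back the results" capability above is the MapReduce pattern. Below is a minimal, local sketch of that pattern in Python; Hadoop's value is running exactly this kind of logic across hundreds of machines, but the shape of the computation is the same. The function names here are illustrative, not part of any Hadoop API.

```python
# A minimal, in-process sketch of the MapReduce pattern that Hadoop
# distributes across many machines. The mapper emits key/value pairs,
# a shuffle step groups them by key, and the reducer combines each group.
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key -- the step Hadoop performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the full map -> shuffle -> reduce pipeline over some lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped))

print(word_count(["big data big deal", "big data"]))
# {'big': 3, 'data': 2, 'deal': 1}
```

The point for an executive is simply this: the logic is ordinary; the hard part Hadoop solves is running it reliably over data too large for any one machine.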
What Hadoop is not:
- It is not a silver bullet that will solve all your technology problems
- It is not a technology that can be deployed without administrative people to establish or maintain the environment, so there are people costs involved
- It is not an interactive environment (in and of itself), nor is it a real-time system
- It is not mature (yet), but it's definitely getting there
Where it can often be really effective:
- Your organization has to deal with very, very large amounts of unstructured data, as does a video advertising organization such as LiveRail
- Your organization needs to sort and manipulate large amounts of unstructured and semi-structured data, such as when developing and deploying mobile advertising campaigns, as inMobi does
- Your organization needs to index large amounts of data, as does an online brand management firm such as AdSafe Media
Where it is not as effective:
- When you are constrained in terms of talent. Hadoop talent is much in demand, in part because it is quite necessary for establishing and maintaining a Hadoop environment. And as a result, as you might expect, Hadoop expertise is not inexpensive.
- If you run an operation with limited numbers of servers, it is more difficult to take advantage of Hadoop's capabilities. The same is true for available storage.
- If you need to perform a very heavy computational analysis against a small amount of data
- If you need to perform interactive investigative analytics, like ad-hoc queries against large amounts of data
Basic commentary: The main takeaway I always suggest to people is that there is no silver bullet. In reality, Hadoop is often used in combination with other technologies. Of course, I was prompted to write this based on the growing number of Infobright customers using Infobright in combination with Hadoop. This is a very simple, yet powerful approach, whereby Hadoop provides the main repository for mountains of semi-structured and unstructured data, while periodic MapReduce jobs collect smaller mountains of data that are moved into Infobright. That can be automated and is very easy to do, and once the data is in Infobright, it can be interrogated very easily, including robust ad-hoc query support. It is also then accessible using the BI tool sets used in almost any company, including Jaspersoft, Pentaho, MicroStrategy, Cognos, Actuate/BIRT, Business Objects, and many, many more, not to mention Java, PHP, and other common programming languages as well.
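The pattern just described, a bulk repository periodically distilled into a query-friendly analytic store, can be sketched in a few lines. This is a hedged illustration only: SQLite stands in for a columnar analytic database such as Infobright, the in-memory aggregation stands in for a MapReduce job, and all table and field names are invented for the example.

```python
# Sketch of the combined pattern: raw semi-structured events live in the
# bulk store (Hadoop in practice), a periodic batch job aggregates them,
# and the much smaller summary lands in an analytic database where
# ad-hoc queries are fast. SQLite is a stand-in; names are illustrative.
import sqlite3
from collections import Counter

# Raw events as they might sit in HDFS (illustrative data).
raw_events = [
    {"campaign": "spring_promo", "clicks": 3},
    {"campaign": "spring_promo", "clicks": 5},
    {"campaign": "brand_launch", "clicks": 2},
]

# The "batch MapReduce job": collapse raw events into per-campaign totals.
totals = Counter()
for event in raw_events:
    totals[event["campaign"]] += event["clicks"]

# Load the summary into the query-friendly store for BI tools to hit.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE campaign_summary (campaign TEXT, clicks INTEGER)")
db.executemany("INSERT INTO campaign_summary VALUES (?, ?)", totals.items())

# An ad-hoc query -- the interactive work Hadoop alone is poor at.
rows = db.execute(
    "SELECT campaign, clicks FROM campaign_summary ORDER BY clicks DESC"
).fetchall()
print(rows)  # [('spring_promo', 8), ('brand_launch', 2)]
```

The design point is the division of labor: the bulk store absorbs everything cheaply, while the analytic store holds only the distilled summaries that people actually query interactively.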
But it all comes down to a few basic questions. What are you trying to accomplish? How much data do you have, and what does it look like? What do you need to do with the data? Who needs to use the data, and in what form? If you want to expose information through a portal for customers to interactively inquire about their operations, you will take a different approach than if you want to provide a repository for archiving documents. Again, it all comes down to the use case, and seldom is there one technology that solves all the problems...even Hadoop. But the right technologies, deployed in the right combinations, can be very, very powerful.
One last thought. There are a number of major players in the Hadoop and NoSQL market that have communicated that as knowledge of Hadoop grows, the true need for most executives to really understand it will actually diminish. The reason for this is that the true value of these technologies will ultimately be delivered as an underlying component of the applications that utilize them. I totally agree with that. In fact, more and more of our customers and revenues are a function of exactly that model, where Infobright is embedded in applications delivered by our OEM solution partners. And I think this is going to be true for many of the emerging technologies we are seeing as well.
If anything, the message to CEOs should be this: make sure you have very strong technology architectural talent in your organization. Make sure they are aware of and conversant in the technology landscape, as your real opportunity will truly come in the combinations of the right technologies for your business. Doing this right can deliver significant competitive advantages in terms of increased savings and advanced capabilities.
By the way, Infobright is used alongside Hadoop with each of the examples in this blog. And many, many more.