Hadoop - 1
I am building a Hadoop cluster to analyze massive amounts of data for a startup in San Francisco.
Client : stealth
Features
- I am designing a cluster to handle terabytes of data (billions of rows)
- We process the full data set and produce summary data that is used in everyday business
- Each day, millions of new 'logs' are created and added to the growing volume of data
- I have evaluated an HBase cluster and a Hadoop cluster
- The final solution is built on the Amazon Elastic MapReduce (EMR) framework
- This gives us several advantages:
- flexibility: the cluster's capacity is easy to modify, and we can change the instance type (small, large, x-large) for each MR job
- cost-effectiveness: we only pay for the actual run time of each MapReduce job, which is a substantial cost saving compared to a cluster running 24/7
- All infrastructure is built on the Amazon EC2 environment. Amazon S3 provides scalable, reliable storage for terabytes of data
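
To give a feel for the kind of job described above, here is a minimal local sketch of a MapReduce-style summary over log lines. It is an illustration only, not the client's actual job: the tab-separated log format (date, event type, payload) and the function names map_log_line, reduce_counts and summarize are all assumptions made for this example. The sort-and-group step stands in for the shuffle phase that Hadoop performs between map and reduce.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log format (an assumption, not the real schema):
#   "<date>\t<event type>\t<payload>"


def map_log_line(line):
    """Map step: emit ((date, event_type), 1) for one well-formed log line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return  # skip malformed lines instead of failing the whole job
    yield (parts[0], parts[1]), 1


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one (date, event_type) key."""
    return key, sum(values)


def summarize(lines):
    """Run map, a local stand-in for shuffle/sort, then reduce."""
    # Map every line, then sort by key so equal keys are adjacent,
    # which is what the shuffle phase guarantees to each reducer.
    pairs = sorted(pair for line in lines for pair in map_log_line(line))
    summary = {}
    for key, group in groupby(pairs, key=itemgetter(0)):
        reduced_key, total = reduce_counts(key, (count for _, count in group))
        summary[reduced_key] = total
    return summary
```

Calling summarize on an iterable of raw log lines returns a dictionary keyed by (date, event type), which is the shape of summary data a daily report could read. On a real cluster the same map and reduce logic would run as separate tasks over input stored in S3 rather than in a single process.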