
Hadoop - 1

I am building a Hadoop cluster to analyze massive amounts of data for a startup in San Francisco.

Client : stealth

Features

  • I am designing a cluster to handle terabytes of data (billions of rows)
  • We process this massive amount of data and produce summary data that is used in everyday business
  • Each day, millions of new 'logs' are created and added to the growing volume of data
  • I evaluated both an HBase cluster and a Hadoop cluster
  • The final solution is built on the Amazon Elastic MapReduce (EMR) framework
  • This gives us several advantages:
    • flexibility: the cluster's capacity is easy to modify, and we can change the instance type (small, large, x-large) for each MR job
    • cost-effectiveness: we pay only for the actual run time of each MapReduce job, which is a substantial cost saving compared to a cluster running 24/7
  • All infrastructure is built on the Amazon EC2 environment. Amazon S3 provides scalable, reliable storage for terabytes of data
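
To give a feel for the kind of summary job described above, here is a minimal sketch in plain Python of the map and reduce steps that count events per day. It is only an illustration and not the client's actual code: the tab-separated log format, the sample lines, and the function names (map_log_line, reduce_counts, summarize) are all assumptions made up for this example. On a real cluster the same two steps would run as a distributed MapReduce job over the full data set instead of over an in-memory list.

```python
from collections import defaultdict
from itertools import chain


def map_log_line(line):
    """Map step: emit ((date, event), 1) for one log line.

    Assumes a hypothetical format: ISO timestamp, a tab, then an event name.
    """
    parts = line.strip().split("\t")
    if len(parts) < 2:
        return []  # skip malformed lines instead of failing the whole job
    timestamp, event = parts[0], parts[1]
    date = timestamp[:10]  # keep only the YYYY-MM-DD prefix
    return [((date, event), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each (date, event) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def summarize(lines):
    """Run the map step over every line, then reduce the combined output."""
    mapped = chain.from_iterable(map_log_line(line) for line in lines)
    return reduce_counts(mapped)


# Made-up sample data, for illustration only.
sample_logs = [
    "2011-03-01T10:00:00\tclick",
    "2011-03-01T10:05:00\tclick",
    "2011-03-01T11:00:00\tview",
    "2011-03-02T09:00:00\tclick",
    "malformed line without a tab",
]

summary = summarize(sample_logs)
for (date, event), count in sorted(summary.items()):
    print(date, event, count)
```

Because the reduce step only sums values per key, each day's new logs can be summarized on their own and the daily results merged later, which fits the model of starting a cluster only for the duration of a job.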