Wednesday, June 4, 2014

Hadoop Directory Structure


Recently, management in our organization decided to use the Hadoop platform for new data analytics initiatives. Our early adoption use cases are in the near-real-time, data archival, data staging, and log data analytics areas. For the initial POC phase we have a 10-node virtual machine cluster, and we have started playing around with Sqoop and Flume along with Pig and Hive. As part of my learning I have also completed the “Cloudera Certified Developer for Apache Hadoop (CCDH)” certification.

Since Hadoop is new in our organization, we started from scratch: setting up a directory structure, a process for migrating code, and so on. A directory structure is needed in the local Unix file system as well as in HDFS. In the local Unix file system, directories are needed for software and code; in HDFS, they are needed for raw data, intermediate data, and other configuration files. In the Hadoop ecosystem we hear about some new software every other month, so it becomes very important to have a proper directory structure.

Since Hadoop is new for most of us, there isn’t much information out there, so I thought it would be good to write a blog post, and hopefully it will be helpful for some of you. I will try to write about a new topic as I learn something new. Please do write your comments or feedback; I am new to blogging as well, so it will help me write better posts.

So let’s get started with what we did for the directory structure. First, in the local Unix file system for software, we came up with the directory structure below with the help of the Unix admin team. The software directories are under /usr/lib; for now we have hadoop, hive, pig, hbase, sqoop, flume, oozie, java, zookeeper, hcatalog, etc. We will add a new directory under /usr/lib whenever we get new software.
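
Here is a quick sketch of what that looks like on one of the nodes; the listing is illustrative, and the exact set of directories will grow as new software is added:

    $ ls /usr/lib
    flume  hadoop  hbase  hcatalog  hive  java  oozie  pig  sqoop  zookeeper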

The second directory structure is in the local Unix file system for code such as Pig, Hive, Flume, and Sqoop scripts. We created our application directory under /home, and under the application we have common, module1, and module2. Under each of these we have conf, lib, bin, util, and data. I have tried to explain each directory below; a sample command to create the layout follows the list.
  • Application1: It’s good to have a separate directory for each application.
  • common: All the common libraries, configs, and scripts go under this directory.
  • module1 & module2: Module-specific directories can be created if needed; otherwise common can be used.
  • conf: For all the properties or config files.
  • lib: For all the external libraries and executable JAR files.
  • bin: For all the Pig, Hive, Flume, and Sqoop scripts.
  • util: For all the wrapper scripts.
  • data: A placeholder for data during processing, but technically we are not going to store any data files here.
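
As a minimal sketch, the whole tree can be created in one shot; application1 is just a placeholder name, so substitute your own application and modules:

    # Create the local code directory tree under /home
    # (application1 is a placeholder; module1/module2 are optional)
    mkdir -p /home/application1/{common,module1,module2}/{conf,lib,bin,util,data}

Keeping each application in its own self-contained tree means adding a new application is just a matter of repeating the same command with a different name.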



The third directory structure is in HDFS for raw data, intermediate files, output files, metadata, etc. The directories below will be available in HDFS under /lob/application/module/; a sample command to create them follows the list.
  • data: For all the input data files and the processed output files.
  • work: For intermediate files generated during the workflow process.
  • metadata: For all the metadata, schema, and property files.
  • archive: All the input and processed data will be moved here periodically.
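
As a minimal sketch, the HDFS directories for one module can be created like this; lob, application, and module are placeholders for your own line-of-business, application, and module names:

    # Create the HDFS directories for one application module
    # (the path components are placeholders; substitute your own names)
    hdfs dfs -mkdir -p /lob/application/module/data \
                       /lob/application/module/work \
                       /lob/application/module/metadata \
                       /lob/application/module/archive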



Please let me know what you think about this blog. 
