Recently, management in our organization decided to adopt the Hadoop platform for new data analytics initiatives. Our early adoption use cases are in the near-real-time, data archival, data staging, and log data analytics areas. For the initial POC phase we have a 10-node virtual machine cluster, and we started playing around with Sqoop and Flume along with Hive and HBase. As part of my learning I have also completed the “Cloudera Certified Developer for Apache Hadoop (CCDH)” certification.
Since Hadoop is new in our organization, we started from scratch: setting up a directory structure, a process for migrating code, and so on. A directory structure is needed in the local unix file system as well as in HDFS. In the local unix file system, directories are needed for software and code; in HDFS, they are needed for raw data, intermediate data, and other configuration files. In the Hadoop ecosystem we hear about some new software every other month, so it becomes very important to have a proper directory structure.
Since Hadoop is new for most of us, there isn't much information out there on this topic, so I thought it would be good to write a blog post; hopefully it will be helpful for some of you. I will try to write about a new topic whenever I learn something new. Please do leave your comments or feedback. I am new to blogging as well, so your input will help me write better posts.
So let's get started with what we did for the directory structure. First, for software in the local unix file system, we came up with the structure below with the help of the Unix Admin team. The software directories live under /usr/lib; for now we have hadoop, hive, pig, hbase, sqoop, flume, oozie, java, zookeeper, hcatalog, etc. We will add a new directory under /usr/lib whenever we pick up new software.
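A minimal sketch of that setup, assuming the component list we use today (extend the loop as new software arrives):

```bash
# Create one directory per ecosystem component under /usr/lib.
# The list mirrors what we have installed today; add to it as needed.
for pkg in hadoop hive pig hbase sqoop flume oozie java zookeeper hcatalog; do
  sudo mkdir -p "/usr/lib/${pkg}"
done
```

In practice the packaged installers (e.g. CDH rpm/deb packages) create these directories for you; the loop just documents the intended layout.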
The second directory structure is in the local unix file system for code: pig, hive, flume, and sqoop scripts, etc. We created our application directory under /home, and under the application we have common, module1, and module2. Under each of these we have conf, lib, bin, util, and data. I have tried to explain each directory below, and a sketch of the setup commands follows the list.
- Application1: It's good to have a separate directory for each application.
- common: All the common libraries, config files, and scripts live under this directory.
- module1 & module2: Module-specific directories can be created if needed; otherwise common can be used.
- conf: For all the properties and config files.
- lib: For all the external libraries and executable jar files.
- bin: For all the pig, hive, flume, and sqoop scripts.
- util: For all the wrapper scripts.
- data: A placeholder for data needed during processing; technically we are not going to store any data files here.
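Here is a minimal sketch of that layout; “application1” is a placeholder for your own application name:

```bash
# Placeholder application name; substitute your own.
APP=/home/application1

# common, module1, and module2 each get the same five subdirectories.
for area in common module1 module2; do
  mkdir -p "${APP}/${area}"/{conf,lib,bin,util,data}
done
```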
The third directory structure is in HDFS, for raw data, intermediate files, output files, metadata, etc. The directories below will be created in HDFS under /lob/application/module/ (a sketch of the commands follows the list).
- data: For all the input data files and the processed output files.
- work: For intermediate files generated during workflow processing.
- metadata: For all the metadata, schema, and property files.
- archive: All the input and processed data will be moved here periodically.
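And a sketch of the HDFS side, again with placeholder names for the lob, application, and module (the -p flag requires a reasonably recent Hadoop release):

```bash
# Placeholder path; substitute your own lob/application/module names.
BASE=/lob/application1/module1

# Create the four standard directories under the module path in HDFS.
hdfs dfs -mkdir -p ${BASE}/{data,work,metadata,archive}
```

The shell expands the braces before hdfs sees them, so this issues a single mkdir covering all four paths.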
Please let me know what you think about this blog.