Recently, management in our organization decided to use the Hadoop platform for new data analytics initiatives. Our early adoption use cases are in the areas of near-real-time processing, data archival, data staging, and log data analytics. For the initial POC phase we have a 10-node virtual machine cluster, and we have started playing around with Sqoop and Flume along with Pig and Hive. As part of my learning I have also completed the “Cloudera Certified Developer for Apache Hadoop (CCDH)” certification.
Since Hadoop is new in our organization, we started from scratch: setting up a directory structure, a process for code migration, and so on. A directory structure is needed in the local Unix file system as well as in HDFS. In the local Unix file system, directories are needed for software and code; in HDFS, they are needed for raw data, intermediate data, and other configuration files. In the Hadoop ecosystem we hear about some new piece of software every other month, so having a proper directory structure becomes very important.
Since Hadoop is new for most of us and there isn’t much information out there, I thought it would be good to write a blog, and hopefully it will be helpful for some of you. I will try to write a new post as I learn something new. Please do leave your comments or feedback; I am new to blogging as well, so it will help me write a better blog.
So let’s get started with what we did for the directory structure. First, for software in the local Unix file system, we came up with the directory structure below with the help of the Unix admin team. The software directories are under /usr/lib; for now we have hadoop, hive, pig, hbase, sqoop, flume, oozie, java, zookeeper, hcatalog, etc. We will add a new directory under /usr/lib whenever we get new software.
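As a minimal sketch (the component list is just what we have today, and in practice the distribution packages usually create these directories themselves when installed), the layout under /usr/lib could be set up like this:

```bash
#!/bin/bash
# Sketch: software directories under /usr/lib
# Component list is illustrative; packages normally create these on install.
for component in hadoop hive pig hbase sqoop flume oozie java zookeeper hcatalog; do
  sudo mkdir -p "/usr/lib/${component}"
done
```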
The second directory structure is in the local Unix file system for code such as Pig, Hive, Flume, and Sqoop scripts. We created our application directory under /home, and under the application we have common, module1, and module2. Under each of these we have conf, lib, bin, util, and data. I have tried to explain each directory below; a small sketch of the layout follows the list.
- Application1: It’s good to have a separate directory for each application.
- common: All the common libraries, config files, and scripts will be available under this directory.
- module1 & module2: Module-specific directories can be created if needed; otherwise common can be used.
- conf: For all the properties or config files.
- lib: For all the external libraries and executable jar files.
- bin: For all the Pig, Hive, Flume, and Sqoop scripts.
- util: For all the wrapper scripts.
- data: A placeholder for data for any processing purpose, but technically we are not going to store any data files here.
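Here is a minimal sketch of how this structure could be created; the application name is a placeholder and the exact location under /home will depend on your environment:

```bash
#!/bin/bash
# Sketch: local code directory structure under /home
# "application1" is a placeholder application name
APP_HOME=/home/application1

for area in common module1 module2; do
  for dir in conf lib bin util data; do
    mkdir -p "${APP_HOME}/${area}/${dir}"
  done
done
```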
The third directory structure is in HDFS for raw data, intermediate files, output files, metadata, etc. The directories below will be available in HDFS under /lob/application/module/ (a sketch of how to create them follows the list).
- data: For all the input data files and the processed output files.
- work: For intermediate files generated during the workflow process.
- metadata: For all the metadata, schema, and property files.
- archive: All the input and processed data will be moved here periodically.
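As a quick sketch (the line-of-business, application, and module names are placeholders), the HDFS directories could be created like this:

```bash
#!/bin/bash
# Sketch: HDFS directory structure per line of business / application / module
# "lob1", "application1", "module1" are placeholder names
HDFS_BASE=/lob1/application1/module1

for dir in data work metadata archive; do
  hdfs dfs -mkdir -p "${HDFS_BASE}/${dir}"
done
```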
Please let me know what you think about this blog.