
Saturday, June 28, 2014

My first tweet feed in Hadoop

When I started working in Hadoop, one of the things I always wanted to accomplish was to load the live Twitter feed into HDFS. Why load the Twitter feed? There are many use cases for Twitter data, such as analyzing brand sentiment, the performance of a movie, or sports events like the FIFA World Cup, the NBA, and the NFL. The soccer World Cup is becoming one of the biggest social media events ever, and social media usage keeps growing for almost everything, so having Twitter data available for analysis is important for any organization, whether for marketing or any other purpose.

When I started working in Hadoop I had no working experience in Java, so along with Hadoop I am also learning a little Java. To load the Twitter feed I used Flume. These are the steps I followed to load the Twitter feed into HDFS via Flume.
  1. First, I created a Twitter application from my own Twitter account at apps.twitter.com and generated an API key/secret and an access token/secret.
  2. A custom Twitter Flume source is used to stream Twitter data into HDFS. The Flume source uses the Twitter streaming API, which returns a JSON structure for every tweet via the Twitter4J library, and the tweets are then stored in HDFS. We got the custom source jar file from one of our support engineers. Initially I was getting the error "java.lang.UnsupportedClassVersionError: poc/hortonworks/flume/source/twitter/TwitterSource : Unsupported major.minor version 51.0", which means the class was compiled for Java 7 (class file version 51) while Flume was running on an older JRE; after switching to the correct JDK/JRE version it worked as expected.
  3. Modified the flume-env.sh file to add the custom jar to the Flume classpath.
  4. Created the Flume config file and added my key/secret along with the other Flume configuration details (a sketch of flume-env.sh and flume.conf follows this list).
  5. For execution I used the command below:
             flume-ng agent -n TwitterAgent -c /etc/flume/conf -f flume.conf 
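
For reference, this is roughly what the two files looked like. Treat it as a sketch: the agent name matches the command above, but the property names follow the commonly used Twitter Flume source pattern, and the jar path, keywords, and keys are placeholders, so the exact names may differ depending on the custom source jar you are given.

# flume-env.sh: put the custom Twitter source jar on Flume's classpath (example path)
echo 'export FLUME_CLASSPATH="/usr/lib/flume/lib/custom-twitter-source.jar"' >> /etc/flume/conf/flume-env.sh

# flume.conf: a minimal TwitterAgent configuration
cat > flume.conf <<'EOF'
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

TwitterAgent.sources.Twitter.type = poc.hortonworks.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <API key>
TwitterAgent.sources.Twitter.consumerSecret = <API secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = fifa, worldcup

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/twitter
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
EOF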

In this POC, after fixing the custom jar issue I didn't really hit any other problems and was able to stream the tweet feed into HDFS. I created an external table in Hive to view and analyze the Twitter data, using the commands below:

add jar json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE tweets (
  tweetmessage STRING,
  createddate STRING,
  geolocation STRING,
  user struct<
        userlocation:STRING,
        id:STRING,
        name:STRING,
        screenname:STRING,
        geoenabled:STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/twitter'
;
 

For the Hive external table I downloaded the JSON SerDe jar and added it as shown above; without it the table can't be created and Hive throws an error because it cannot find the SerDe class.
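
Once the table is created, the tweets can be queried directly. As a quick sanity check, something like the query below works for me; note that the ADD JAR is needed in every new Hive session, and the query itself is just an illustration using the columns defined above.

hive -e "
ADD JAR json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar;
SELECT user.screenname, count(*) AS tweet_count
FROM tweets
GROUP BY user.screenname
ORDER BY tweet_count DESC
LIMIT 10;"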
 
I would love to see your feedback/comment on this.

Wednesday, June 4, 2014

Hadoop Directory Structure


Recently, management in our organization decided to adopt the Hadoop platform for new data analytics initiatives. Our early-adoption use cases are in the near-real-time, data archival, data staging, and log data analytics areas. For the initial POC phase we have a 10-node virtual machine cluster and have started playing around with Sqoop and Flume, along with Pig and Hive. As part of my learning I have also completed the “Cloudera Certified Developer for Apache Hadoop (CCDH)” certification.

Since Hadoop is new to our organization, we started from scratch with things like setting up a directory structure and a process for migrating code. A directory structure is needed in the local Unix file system as well as in HDFS: in the local file system, directories are needed for software and code, while in HDFS they are needed for raw data, intermediate data, and other configuration files. In the Hadoop ecosystem we hear about some new piece of software every other month, so it becomes very important to have a proper directory structure.

Since Hadoop is new for most of us and there isn't much information available, I thought it would be good to write a blog post, and hopefully it will be helpful to some of you. I will try to write about a new topic as I learn something new. Please do leave your comments or feedback; I am new to blogging as well, so it will help me write better posts.

So let’s get started with what we did for the directory structure. First, for software in the local Unix file system, we came up with the directory structure below with the help of the Unix admin team. The software directory is under /usr/lib; for now we have Hadoop, Hive, Pig, HBase, Sqoop, Flume, Oozie, Java, ZooKeeper, HCatalog, etc. We will add a new directory under /usr/lib whenever we bring in new software.
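
Roughly, the layout looks like this, with one directory per component we have installed so far:

/usr/lib/hadoop
/usr/lib/hive
/usr/lib/hcatalog
/usr/lib/pig
/usr/lib/hbase
/usr/lib/sqoop
/usr/lib/flume
/usr/lib/oozie
/usr/lib/zookeeper
/usr/lib/java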

The second directory structure is in the local Unix file system for code such as Pig, Hive, Flume, and Sqoop scripts. We created our application directory under /home, and under the application we have common, module1, and module2; under each of those we have conf, lib, bin, util, and data. Each directory is described below, and a sketch of the layout follows the list.
  • application1: it's good to have a separate directory for each application.
  • common: all common libraries, config files, and scripts go under this directory.
  • module1 & module2: module-specific directories can be created if needed; otherwise common can be used.
  • conf: for all the properties or config files.
  • lib: for all the external libraries and executable jar files.
  • bin: for all the Pig, Hive, Flume, and Sqoop scripts.
  • util: for all the wrapper scripts.
  • data: a placeholder for data needed during processing; we are not actually going to store data files here.
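
A minimal sketch of how this can be created, assuming an application named application1 (the name is just an example):

# local file system layout for one application (names are illustrative)
mkdir -p /home/application1/common/{conf,lib,bin,util,data}
mkdir -p /home/application1/module1/{conf,lib,bin,util,data}
mkdir -p /home/application1/module2/{conf,lib,bin,util,data}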



The third directory structure is in HDFS for raw data, intermediate files, output files, metadata, etc. The directories below will be available in HDFS under /lob/application/module/ (a sketch of creating them follows the list).
  • data: for all the input data files and the processed output files.
  • work: for intermediate files generated during the workflow process.
  • metadata: for all the metadata, schema, and property files.
  • archive: input and processed data are moved here periodically.
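
A sketch of creating these in HDFS; the lob, application, and module names below are just placeholders for whatever line of business, application, and module you have:

# HDFS layout for one module (path components are illustrative)
hdfs dfs -mkdir -p /lob1/application1/module1/{data,work,metadata,archive}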



Please let me know what you think about this blog.