When I started working with Hadoop, one of the things I always wanted to accomplish was loading the live Twitter feed into HDFS. Why load the Twitter feed? There are many use cases for Twitter data, such as analyzing brand sentiment, a movie's performance, or sports events like the FIFA World Cup, the NBA, and the NFL. The soccer World Cup is becoming the biggest social media event ever, and social media usage keeps growing for almost everything, so analyzing Twitter data is important for any organization, whether for marketing or any other purpose.
When I started working with Hadoop I didn't have any working experience in Java, so along with Hadoop I am also learning a little Java. To load the Twitter feed I used Flume. These are the steps I followed to load the Twitter feed into HDFS via Flume:
- First, I created a Twitter application from my own Twitter account at apps.twitter.com and generated an API key/secret and access token/secret.
- A custom Twitter Flume source is used for streaming Twitter data into HDFS. The Flume source uses the Twitter streaming API, which returns a JSON structure for every tweet via the Twitter4J library, and the tweets are then stored in HDFS. We got the custom source jar file from one of our support engineers. Initially I was getting the error "java.lang.UnsupportedClassVersionError: poc/hortonworks/flume/source/twitter/TwitterSource : Unsupported major.minor version 51.0"; after switching to the correct version of the JDK/JRE it worked as expected.
- I modified the flume-env.sh file to put the custom jar file on the classpath.
- I created the Flume config file and added my key/secret along with the other Flume configuration details.
- For execution I ran the flume-ng agent command, pointing it at the config file.
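My flume.conf is not reproduced here, but the sketch below shows the general shape such a config takes, assuming an agent named TwitterAgent and the custom source class from the error message above. The keys, channel capacity, and HDFS path are placeholders, not my actual values:

```properties
# Hypothetical flume.conf sketch -- all values are placeholders
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = poc.hortonworks.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YOUR_API_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_API_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/twitter
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
```

The HDFS path here matches the location the Hive external table points at, so new tweets become queryable as they land.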
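The exact command I ran isn't preserved in the post; the standard way to start a Flume agent, assuming the agent name TwitterAgent and a config file named flume.conf, is:

```
flume-ng agent --conf conf --conf-file flume.conf --name TwitterAgent
```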
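A note on the UnsupportedClassVersionError mentioned above: every .class file records the bytecode version it targets in its header, and major version 51 means the class was compiled for Java 7, so the jar needed a newer JVM than the one Flume was initially running under. A quick sketch of how that number is laid out (the sample bytes here are hand-built for illustration, not taken from the actual jar):

```shell
# A .class file begins with the magic number CAFEBABE, followed by a
# 2-byte minor and a 2-byte major version, all big-endian.
# Major 50 = Java 6, 51 = Java 7.
# Build a sample 8-byte header with major version 51 (0x33 = octal 063):
printf '\312\376\272\276\000\000\000\063' > header.bin
# Print byte 7 (the low byte of the major version) as unsigned decimal:
od -An -tu1 -j7 -N1 header.bin   # prints 51
```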
In this POC, after fixing the custom jar issue I didn't really have any other problems; I was able to stream the tweet feed into HDFS. I created an external table in Hive to view and analyze the Twitter data. I used the commands below to create the Hive table:
add jar json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE tweets (
  tweetmessage STRING,
  createddate STRING,
  geolocation STRING,
  user STRUCT<
    userlocation:STRING,
    id:STRING,
    name:STRING,
    screenname:STRING,
    geoenabled:STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/twitter';
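Once the external table exists, the JSON fields can be queried like ordinary Hive columns, including the nested struct fields. For example, a hypothetical query against the columns defined above:

```sql
SELECT user.screenname, createddate, tweetmessage
FROM tweets
LIMIT 10;
```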
For the Hive external table I downloaded the JSON SerDe jar file and added it as shown above; otherwise the table creation fails with an error.
I would love to see your feedback/comment on this.