Wednesday, May 14, 2014

Automation of excel file load in Informatica



Microsoft Excel Spreadsheet is one of the most commonly used office file but still in ETL world we don’t have that much cases where Excel file load requirement comes up. Recently in one of my project we have come across with this requirement where business users are updating excel spreadsheet and it’s getting uploaded directly in intranet. For some business analysis purpose they want to merge excel data with other DW data and generate reports.

Informatica Power Center is getting used here for ETL needs. In Power Center you can import excel source definition using Microsoft Excel ODBC driver but if excel file is getting replaced or new file is getting added then you won't be able to use this feature. I think best way would be to convert the excel file in text file format and then use Power Center. This requirement is to do the load every week so we won’t be able to do this manually. We need to have automated scripts which will convert excel into text file. There are many ways of converting excel file into text file. Since I am familiar with Python I decided to use it for converting and use the script in Pre-Session Command of Session task.

Pre-Session Command that we used -

xls2flatfile.py -i "$HOME/ifile/InputFile.xlsx" -s Sheet2 -o $HOME/ofile/Outfile.txt -d "|"

Description of above command –
xls2flatfile.py: Python script name
-i: Input file name
-s: sheet (tab) name, in our input file we had multiple tabs.
-o: Output file name
-d: Output file delimiter type, default is comma delimited type.

Since Python was not getting used in our project I started from installation steps to writing scripts. I will provide python installation steps and script information that we used.

Python Installation -

Our server is Linux so I will be providing Linux command. If you have Unix Admin group then you can simply ask them to do the installation for you. You can skip this part.


1.       Download the latest version of python from https://www.python.org/download/.
2.       Decompress the above file. If its tar file then you could use (tar -xvf Python-2.7.6.tar)
3.       Since we don’t have root access, we decided to use our application directory to do the installation. Change directory to decompressed (Python-2.7.6) directory and execute following commands.

a.       ./configure --prefix=$HOME/python
b.       make install

After successful execution of step b Python installation is complete.


4.       In order to read the excel file in Python we need to download xlrd module. Download the latest version from https://pypi.python.org/pypi/xlrd/0.9.3
5.       Decompress the xlrd tar file and change directory to decompressed directory.
6.       Installation command –

a.       $HOME/python/bin/python setup.py build
b.      $HOME/python/bin/python install --prefix=$HOME/python

After successful execution of step b xlrd installation is complete.

Excel to Text file Converter

I tried to write a generic script as much I could think. This code can easily be adoptable for any kind of project. It’s very small script and it’s only around 60 lines of code. I think python code is very much self-explanatory.  


#!$HOME/python/bin/python

import sys
import argparse
import time
import xlrd

## define parameter ##

parser = argparse.ArgumentParser(description='This script is for to convert xls to text file format')
parser.add_argument('-i','--input', help='Input xls file name', required=True)
parser.add_argument('-s','--sheet', help='Input sheet name', required=True)
parser.add_argument('-o','--output',help='Output file name', required=True)
parser.add_argument('-d','--delimiter',help='File delimiter type', default = ',', required=False)
args = parser.parse_args()

## define variable ##

ifile = args.input
isheet = args.sheet
ofile = args.output
delimiter = args.delimiter

print ("Script execution started at %s" % time.ctime())

## show values ##

print ("Input file: %s" % ifile )
print ("Input sheet name: %s" % isheet )
print ("Output file: %s" % ofile )
print ("Output file delimiter: %s" % delimiter )

## Read excel file and write text file ##

workbook = xlrd.open_workbook(ifile)
worksheet = workbook.sheet_by_name(isheet)

out_file = open(ofile, 'w')

num_rows = worksheet.nrows - 1
curr_row = -1

print ("Reading excel file now....")

while curr_row < num_rows:
    curr_row += 1
    row = worksheet.row(curr_row)
    num_cols = worksheet.ncols - 1
    curr_col = -1
   
    while curr_col < num_cols:
        curr_col += 1
        try:
            col = worksheet.cell_value(rowx=curr_row, colx=curr_col).encode('ASCII')   
        except:
            col = worksheet.cell_value(rowx=curr_row, colx=curr_col)
       
        #print col
        out_file.write('"%s"%s' % (str(col).strip(), delimiter))
           
    out_file.write('\n')
out_file.close()

print ("Script completed at %s" % time.ctime())


exit()

I am sure there may be better ways of doing this file conversion. I would be very happy to know any new better ways of doing this so please do let me know if you have come across with something.

13 comments:

  1. Very helpful.. I was looking for this. Thanks

    ReplyDelete
  2. <a href="https://elasraa.com/%d8%b4%d8%b1%d9%83%d8%a9-%d9%85%d9%83%d8%a7%d9%81%d8%ad%d8%a9-%d8%a

    ReplyDelete