Microsoft Excel Spreadsheet is one of the most commonly used
office file but still in ETL world we don’t have that much cases where Excel
file load requirement comes up. Recently in one of my project we have come across
with this requirement where business users are updating excel spreadsheet and
it’s getting uploaded directly in intranet. For some business analysis purpose
they want to merge excel data with other DW data and generate reports.
Informatica Power Center is getting used here for ETL needs. In Power Center you can import excel source definition using Microsoft Excel ODBC driver but if excel file is getting replaced or new file is getting added then you won't be able to use this feature. I think best way would be to convert the excel file in text file format and then use Power Center. This requirement is to do the load every week so we won’t be able to do this manually. We need to have automated scripts which will convert excel into text file. There are many ways of converting excel file into text file. Since I am familiar with Python I decided to use it for converting and use the script in Pre-Session Command of Session task.
Pre-Session Command that we used -
xls2flatfile.py -i "$HOME/ifile/InputFile.xlsx"
-s Sheet2 -o $HOME/ofile/Outfile.txt -d "|"
Description of above command –
xls2flatfile.py: Python script name
-i: Input file name
-s: sheet (tab) name, in our input file
we had multiple tabs.
-o: Output file name
-d: Output file delimiter type, default
is comma delimited type.
Since Python was not getting used in our project I started
from installation steps to writing scripts. I will provide python installation
steps and script information that we used.
Python Installation
-
Our server is Linux so I will be providing Linux command. If
you have Unix Admin group then you can simply ask them to do the installation
for you. You can skip this part.
1. Download the latest version of python from https://www.python.org/download/.
2. Decompress the above file. If its tar file then you could use (tar -xvf Python-2.7.6.tar)
3. Since we don’t have root access, we decided to use our application directory to do the installation. Change directory to decompressed (Python-2.7.6) directory and execute following commands.
a.
./configure --prefix=$HOME/python
b.
make
install
After successful execution of step b Python
installation is complete.
4. In order to read the excel file in Python we need to download xlrd module. Download the latest version from https://pypi.python.org/pypi/xlrd/0.9.3
5. Decompress the xlrd tar file and change directory to decompressed directory.
6. Installation command –
a.
$HOME/python/bin/python setup.py build
b.
$HOME/python/bin/python install --prefix=$HOME/python
After successful execution of step b xlrd installation is
complete.
Excel to Text file
Converter –
I tried to write a generic script as much I could think.
This code can easily be adoptable for any kind of project. It’s very small
script and it’s only around 60 lines of code. I think python code is very much
self-explanatory.
#!$HOME/python/bin/python
import sys
import argparse
import time
import xlrd
## define parameter ##
parser = argparse.ArgumentParser(description='This script is for to convert xls to text file format')
parser.add_argument('-i','--input', help='Input xls file name', required=True)
parser.add_argument('-s','--sheet', help='Input sheet name', required=True)
parser.add_argument('-o','--output',help='Output file name', required=True)
parser.add_argument('-d','--delimiter',help='File delimiter type', default = ',', required=False)
args = parser.parse_args()
## define variable ##
ifile = args.input
isheet = args.sheet
ofile = args.output
delimiter = args.delimiter
print ("Script execution started at %s" % time.ctime())
## show values ##
print ("Input file: %s" % ifile )
print ("Input sheet name: %s" % isheet )
print ("Output file: %s" % ofile )
print ("Output file delimiter: %s" % delimiter )
## Read excel file and write text file ##
workbook = xlrd.open_workbook(ifile)
worksheet = workbook.sheet_by_name(isheet)
out_file = open(ofile, 'w')
num_rows = worksheet.nrows - 1
curr_row = -1
print ("Reading excel file now....")
while curr_row < num_rows:
curr_row += 1
row = worksheet.row(curr_row)
num_cols = worksheet.ncols - 1
curr_col = -1
while curr_col < num_cols:
curr_col += 1
try:
col = worksheet.cell_value(rowx=curr_row, colx=curr_col).encode('ASCII')
except:
col = worksheet.cell_value(rowx=curr_row, colx=curr_col)
#print col
out_file.write('"%s"%s' % (str(col).strip(), delimiter))
out_file.write('\n')
out_file.close()
print ("Script completed at %s" % time.ctime())
exit()
import sys
import argparse
import time
import xlrd
## define parameter ##
parser = argparse.ArgumentParser(description='This script is for to convert xls to text file format')
parser.add_argument('-i','--input', help='Input xls file name', required=True)
parser.add_argument('-s','--sheet', help='Input sheet name', required=True)
parser.add_argument('-o','--output',help='Output file name', required=True)
parser.add_argument('-d','--delimiter',help='File delimiter type', default = ',', required=False)
args = parser.parse_args()
## define variable ##
ifile = args.input
isheet = args.sheet
ofile = args.output
delimiter = args.delimiter
print ("Script execution started at %s" % time.ctime())
## show values ##
print ("Input file: %s" % ifile )
print ("Input sheet name: %s" % isheet )
print ("Output file: %s" % ofile )
print ("Output file delimiter: %s" % delimiter )
## Read excel file and write text file ##
workbook = xlrd.open_workbook(ifile)
worksheet = workbook.sheet_by_name(isheet)
out_file = open(ofile, 'w')
num_rows = worksheet.nrows - 1
curr_row = -1
print ("Reading excel file now....")
while curr_row < num_rows:
curr_row += 1
row = worksheet.row(curr_row)
num_cols = worksheet.ncols - 1
curr_col = -1
while curr_col < num_cols:
curr_col += 1
try:
col = worksheet.cell_value(rowx=curr_row, colx=curr_col).encode('ASCII')
except:
col = worksheet.cell_value(rowx=curr_row, colx=curr_col)
#print col
out_file.write('"%s"%s' % (str(col).strip(), delimiter))
out_file.write('\n')
out_file.close()
print ("Script completed at %s" % time.ctime())
exit()
Thanks
ReplyDelete