Sharing Knowledge
Friday, June 24, 2016
With Informatica BDM gaining pace, and since many of us are not yet familiar with the new Informatica 10.0 architecture, I thought of putting this together in a blog.
Informatica Service Architecture & Clients
Informatica Domain: It is a collection of nodes and services that defines the Informatica environment. The Administrator tool is a browser-based interface used to manage the Informatica domain. It covers services, folders, users, groups, privileges, roles, log management, and other functions.
Application Services: These provide the functionality of the Informatica platform. The following are some examples of application services:
- PowerCenter Repository Service: Manages the PowerCenter repository and performs all metadata transactions between the repository database and repository clients.
- PowerCenter Integration Service: Runs and schedules PowerCenter workflows; connects to the PowerCenter repository and to the Workflow Manager and Workflow Monitor clients.
- Reporting Service: An application service that runs the Data Analyzer application. It includes prebuilt PowerCenter repository reports, Metadata Manager reports, Data Profiling reports, and others.
- Web Services Hub: Receives requests from web service clients and exposes PowerCenter workflows as web services.
- Model Repository Service: Manages the Model repository, a relational database that stores metadata for projects created in Informatica Analyst and Informatica Developer.
- Analyst Service: Runs the Informatica Analyst web application in the Informatica domain. The Analyst Service manages the connections between the service components and the users who log in to the Analyst tool.
- Data Integration Service: Performs data integration tasks for the Informatica Analyst and Informatica Developer clients, as well as for external clients.
- Metadata Manager Service: It provides the tools for the management of technical and business metadata. The Metadata Manager application manages access to metadata in the Metadata Manager repository.
- Other Services:
- System Services: Single-instance services created out of the box with the domain; they include the Email Service, Resource Manager Service, and Scheduler Service.
- Content Management Service: Manages reference data and provides it to the Data Integration Service (DIS) and to the Developer and Analyst tools.
- PowerExchange Listener Service: Manages the PowerExchange Listener.
- PowerExchange Logger Service: Manages the PowerExchange Logger.
- SAP BW Service: Listens for RFC requests from SAP NetWeaver BI.
Automation of New Column Addition in Informatica
In my projects I see recurring requirements to add new columns to existing tables, which means modifying the Informatica mappings again and again, so I thought of automating this activity to reduce the overall implementation time and pass the cost benefit on to the customer.
In this blog I will explain the framework we implemented for this automation. Here I take a relational object as the source and a Netezza table as the target; source data is inserted or updated depending on whether the record already exists in the target. This framework can easily be customized for any relational source/target or flat file.
1. The first step is to add new dummy columns as Strings in the Source Qualifier; keep the precision on the higher side.
2. Create the target as a flat file: copy the existing attributes from the Netezza target table and add the same dummy columns as in the Source Qualifier.
3. Create the parameters as Strings, again with precision on the higher side. The number of parameters should match the number of new dummy columns added to the source and target above. The initial value should be NULL.
4. Create the transformation parameter with “IsExpression Variable” set to TRUE. This lets you add transformation logic for any new columns you introduce in the future.
5. Now modify the Source Qualifier query by adding the parameters created in the steps above, as shown below:
SELECT distinct case when t.emp_id is null then 'I' else 'U' end as flag,
s.emp_id, s.first_name, s.last_name,
$$COLUMN1 as NEWFIELD,
$$COLUMN2 as NEWFIELD1,
$$COLUMN3 as NEWFIELD2,
$$COLUMN4 as NEWFIELD3,
$$COLUMN5 as NEWFIELD4,
$$COLUMN6 as NEWFIELD5
FROM
src_newcolumnaddition s left join
tgt_newcolumnaddition t on (s.emp_id = t.emp_id)
6. In the Expression transformation, select the NEWFIELD ports as input, create variable/output ports, and assign the transformation parameters in the expressions, as shown below:
7. Now connect the ports to the target appropriately. Mapping creation is complete at this point; next, create the session and workflow.
8. In the workflow, provide the parameter file name. When there is no new column addition, there is no need to pass any parameter value; it defaults to NULL.
9. Now let’s say you have a requirement to add a new column “Date of Birth” in the source and “Age” in the target. This is how the parameter file should look:
[folder_name.WF:wkf_test_NewColumnAddition.ST:s_test_NewColumnAddition]
$$COLUMN1=DOB
$$TRNSFRM1=date_diff(SYSDATE, to_date(NEWFIELD,'yyyy-mm-dd'), 'YY')
10. Using the above steps you will be able to generate the target flat file with the new columns mentioned in the parameter file.
11. After this, we need a script to load the data into the target table. I used a Python script to automate the load process, but this can be implemented as a shell script as well.
In Netezza, nzload throws an error if your file has more columns than the table. To resolve this, the script first gets the list of target table columns, removes the extra NULL columns from the flat file, and then executes the insert and update queries against the database. These insert and update queries are generated dynamically based on the number of columns in the table.
The script takes the following parameters:
-f: Source file name generated by Informatica
-df: Delimiter Character
-t: Target table name
-d: Database/Schema where above target table exists
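Below is a minimal Python sketch of such a load script. It is only an outline under a few assumptions: nzsql is available on the path, the target column list is read from the Netezza catalog view _v_relation_column, the key column is emp_id as in the example mapping, the default delimiter is a pipe, and the generated statements are printed rather than executed. Adapt it to your environment before use.

#!/usr/bin/env python
# Hypothetical sketch of the post-session load script described above; not the
# exact production script. Assumes nzsql on the PATH and a delimited text file.
import argparse
import csv
import subprocess

def target_columns(db, table):
    # Read the ordered column list of the target table from the Netezza catalog.
    query = ("SELECT attname FROM _v_relation_column "
             "WHERE name = '%s' ORDER BY attnum" % table.upper())
    out = subprocess.check_output(["nzsql", "-d", db, "-A", "-t", "-c", query])
    return [line.strip() for line in out.decode().splitlines() if line.strip()]

def trim_file(src_file, delim, column_count):
    # Drop the trailing dummy NULL columns so the file matches the table width.
    trimmed = src_file + ".trimmed"
    with open(src_file) as fin, open(trimmed, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter=delim)
        writer = csv.writer(fout, delimiter=delim)
        for row in reader:
            writer.writerow(row[:column_count + 1])  # +1 keeps the I/U flag column
    return trimmed

def build_statements(table, columns):
    # Generate insert and update statements dynamically from the live column list.
    insert_sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), ", ".join(["?"] * len(columns)))
    set_clause = ", ".join("%s = ?" % c for c in columns if c.lower() != "emp_id")
    update_sql = "UPDATE %s SET %s WHERE emp_id = ?" % (table, set_clause)
    return insert_sql, update_sql

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Load an Informatica flat file into Netezza")
    parser.add_argument("-f", dest="src_file", required=True, help="source flat file")
    parser.add_argument("-df", dest="delim", default="|", help="delimiter character")
    parser.add_argument("-t", dest="table", required=True, help="target table name")
    parser.add_argument("-d", dest="db", required=True, help="database/schema of the target table")
    args = parser.parse_args()

    columns = target_columns(args.db, args.table)
    data_file = trim_file(args.src_file, args.delim, len(columns))
    insert_sql, update_sql = build_statements(args.table, columns)
    # The real script would now read data_file, route each row on its I/U flag,
    # and execute the generated statements (or stage the file via an external table).
    print(insert_sql)
    print(update_sql)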
If you like this idea and would like to implement it, please let me know; I would be very happy to assist you.
Taking this idea to the next level would be to create a template for the data integration itself. In the next blog I will show the template creation and execution for different source/target objects.
Please let me know your thoughts/comment on this.
Data Masking with Informatica
Data masking / obfuscation is the process of hiding original data with random characters or data. Data masking is a simple technical solution to a fundamental business problem: data leaks. Data leaks are not always triggered by security holes in systems; they can be caused by business processes as well. It is important for organizations to restrict access to sensitive data to ensure confidentiality.
There are primarily 2 types of data masking solutions:
1. Static Data Masking
2. Dynamic Data Masking
Static Data Masking takes production data as input, applies transformations to obfuscate records, and removes sensitive information. Masked records are written into a new instance of the database, a clone. Data in the clone can be used for testing different scenarios because the structure and format of the data are preserved along with referential integrity. Static data masking provides high-quality, near-realistic test data for use in non-production environments.
Dynamic Data Masking is a technology that puts data masking as a mediator between applications and underlying databases. An additional layer is created between databases and applications which selectively masks sensitive information. Obfuscation is done in real time; the underlying database and the application source are not changed.
Dynamic data masking is an effective strategy to reduce the risk of data exposure to insiders and outsiders in organizations that need access to production data, and it is a best practice for securing production databases. It limits sensitive data exposure caused by application security design flaws, inadequate testing, ever-changing regulatory requirements, aggressive production release schedules, and so on.
So, how do you decide between static and dynamic data masking? Static masking removes the risks associated with using production data in non-production environments; dynamic masking, on the other hand, reduces exposure for personnel who have access to production data by introducing an additional layer of role-based security with fine-grained access control. There is no simple answer; in most cases the ideal solution is to use both approaches. If you cannot do both, it is more important to introduce static data masking to avoid using sensitive data in non-production environments.
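To make the static approach concrete, here is a minimal Python sketch (not Informatica-specific) that masks a couple of sensitive columns in a delimited extract while preserving format and referential consistency. The column names and masking rules are assumptions chosen for illustration; a real implementation would be driven by your masking policies.

import csv
import hashlib

def mask_email(value):
    # Deterministic replacement: the same input always maps to the same masked
    # value, so joins and lookups on the column still line up after masking.
    digest = hashlib.sha256(value.lower().encode()).hexdigest()[:10]
    return "user_%s@example.com" % digest

def mask_ssn(value):
    # Keep the 999-99-9999 format but derive the digits from a hash of the original.
    digest = hashlib.sha256(value.encode()).hexdigest()
    digits = "".join(str(int(c, 16) % 10) for c in digest[:9])
    return "%s-%s-%s" % (digits[:3], digits[3:5], digits[5:])

def mask_file(in_path, out_path):
    # "email" and "ssn" are assumed column names in the extract being cloned.
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["email"] = mask_email(row["email"])
            row["ssn"] = mask_ssn(row["ssn"])
            writer.writerow(row)  # non-sensitive columns pass through unchanged

# Example: mask_file("customers.csv", "customers_masked.csv")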
As a result of these challenges, organizations are in greater need of robust data masking software to prevent breaches and enforce data security. Such a solution should empower IT organizations to:
• Mask the sensitive data exposed in production environments
• Shield production applications and databases without changes to source code
• Respond quickly to reduce the risks of data breaches and the resulting costs
• Customize database security for different regulatory or business requirements
Informatica Dynamic Data Masking solution
Informatica Dynamic Data Masking helps organizations accomplish these daunting tasks, proactively addressing data privacy challenges in real time. It de-identifies data and controls unauthorized access to production environments.
This blog post was contributed by one of my colleagues.
Informatica Data Validation Option
Informatica PowerCenter Data Validation Option (DVO) enables you to test and validate data. It requires an Informatica PowerCenter installation, which carries out the majority of the work. DVO provides complete and effective data validation and testing without having to write complicated, error-prone SQL statements or use Excel spreadsheets. It offers a reliable and independent method to verify that moves and transformations have not compromised the integrity of your data.
DVO expedites the process for developers and business analysts to create rules that test the data being transformed during the data integration process. A typical data validation workflow consists of the following tasks:
- Data Validation Option reads one or more PowerCenter metadata repositories.
- Define the validation rules in Data Validation Option.
- Run the rules to ensure the data conforms to the validation rules. When you do this, Data Validation Option performs the following tasks:
- Creates and executes all tests through PowerCenter.
- Loads results into the Data Validation Option results database and displays them in the Data Validation Option Client.
- Examine the results and identify sources of inconsistencies in the ETL process or the source systems.
- Repeat this process for new records.
Figure 1 shows the DVO architecture. The DVO client reads source and target table and file definitions from PowerCenter repositories. The test metadata is stored in the Data Validation Option repository. During test execution, Data Validation Option communicates with PowerCenter through an API to create the appropriate mappings, sessions, and workflows, and to execute them.
Traditional methods of data validation, such as writing complex SQL queries or comparing Excel sheets, are error prone, complicated, and time consuming. DVO reduces the time required for data validation and for production data auditing and verification.
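For context, the hand-written approach that DVO replaces boils down to comparisons like the ones in the sketch below (an in-memory SQLite database with toy data stands in for the real source and target). DVO generates the equivalent tests as PowerCenter mappings and stores the outcomes in its repository, so none of this has to be coded by hand.

import sqlite3

# Toy source and target tables stand in for the real systems under test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount REAL);
    CREATE TABLE tgt_orders (order_id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 20.5), (3, 7.25);
    INSERT INTO tgt_orders VALUES (1, 10.0), (2, 20.5), (3, 7.25);
""")

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

# Row-count test: did every source row land in the target?
assert scalar("SELECT COUNT(*) FROM src_orders") == scalar("SELECT COUNT(*) FROM tgt_orders")

# Aggregate test: was the measure preserved end to end?
assert abs(scalar("SELECT SUM(amount) FROM src_orders")
           - scalar("SELECT SUM(amount) FROM tgt_orders")) < 1e-6

# Outer-join test: flag keys that exist in the source but not in the target.
orphans = conn.execute("""
    SELECT s.order_id FROM src_orders s
    LEFT JOIN tgt_orders t ON s.order_id = t.order_id
    WHERE t.order_id IS NULL
""").fetchall()
print("rows missing from target:", orphans)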
Data Validation Option stores all test definitions and test results in the Data Validation Option repository. It provides three types of reports: Summary of Testing Activities, Table Pair Summary, and Detailed Test Results. All Data Validation Option reports run against database views that are set up as part of the installation process, so you can also write custom reports against the results database.
Key Features:
- Robust Data Validation Environment
- Packaged Testing Rules and Complete Expressions Library
- Support for Production Validation Testing
Benefits:
- Ensure Data’s Accuracy and Trustworthiness.
- Reduce Data Testing and Validation Time and Resources by 50 to 90 Percent.
- Make Business Decisions Knowing That Production Systems Are Accurate.
- Eliminate the manual errors caused by writing complex SQL queries during data testing.
Limitations:
- Requires basic PowerCenter knowledge. To troubleshoot errors, users need to open the Workflow Monitor, Designer, or Workflow Manager and locate the error; for many errors the tool simply points to the workflow log, so basic Informatica workflow/session log debugging skills are needed.
- An invalid test case results in an invalid mapping, which shows no result or fails at run time.
- Any modification to source/target metadata requires a folder refresh, which is usually time consuming.
- Report generation is slow, especially when the report contains hundreds of tests or test runs.
Use Case:
Following are some of the use case examples:
- Folder Comparison/Automatic Test Generation
- Compare Individual Records between Sources and Targets
- Constraint based data validation
- Expression based tests, Aggregate Functions tests
- Join View, Validity of the lookup logic, SQL View
In the overall data integration process, DVO typically fits into the data testing and validation phase. Developers and testers can leverage the tool for unit testing, integration testing, migration testing, and so on. The easy-to-use GUI includes prebuilt operators, so no programming skills are needed to create validation tests. Because business decisions are made on the basis of available, trusted data, it becomes all the more important to leverage DVO's capabilities to eliminate the risk of introducing bad data and errors: move errors, transformation errors, or operational errors.
Wednesday, November 19, 2014
Data Virtualization using Informatica Data Service
For data virtualization solutions, Informatica has a tool called Data Services. As most of you may already know, data virtualization is an approach that allows an application to retrieve and manipulate data without needing to know any technical details about the data, such as how it is formatted or where it is physically located.
In recent years, data volumes, complexity, and compliance requirements have grown dramatically, making data governance an increasingly critical element of enterprise information management strategy. Informatica Data Services gives you the power to manage and monitor data services in a single unified environment; it enables you to enforce and manage policies and rules across federated data for all data services.
Using Informatica Data Services you can create a unified data model to provide uniform access to disparate data. You can also federate data to combine data from heterogeneous data sources into a single view.
Analysts and developers can use Informatica Data Services to perform the following tasks:
Standardize access to disparate data -
Developers can create a logical data object model to standardize access to disparate data.
Federate data -
Developers can create a SQL data service to federate data from heterogeneous data sources. For example, a developer wants to create a prototype that shows how to combine external data with data in a data warehouse. The developer creates an SQL data service that federates data from both data sources.
Provide a web service -
Developers can create a web service to enable access to data from a web service client. For example, a developer creates a web service that enables each customer to use a web browser to retrieve their order details.
Define business logic for populating a target -
Analysts can create a mapping specification in the Analyst tool to define how to populate a target from source data. Source data can come from data sources, virtual tables, or logical data model objects. Developers can export the mapping specification logic as a virtual table.
Profile data -
Developers can run column profiles in the Developer tool to discover information about the column data. Developers can run column profiles on data sources, logical data model objects, virtual tables, and sources and targets in mapping specifications.
Data Flow with Informatica Data Services
With Informatica Data Services, you can create a logical data object model and a data service before an end user consumes the data.
The following figure shows how data can flow through a data model and a data service:
The figure shows the following layers:
Data source layer -
Contains the data sources, such as relational databases, mainframe sources, and flat files.
Data abstraction layer -
Contains a logical data object model. A logical data object model is a logical representation of data. The logical data object model provides a single view to access data from disparate data sources. The logical data object model also insulates data consumers from changes in the data sources.
Data service layer -
Contains data services. Data services provide access to data. You can create SQL data services and web
services. An SQL data service is a virtual database that end users can query. Create an SQL data service so that end users can run SQL queries against the virtual tables. End users can run SQL queries from a third-party client tool, like a business intelligence tool. A web service provides access to data integration functionality. Create a web service to enable a web service client to access, transform, or deliver data.
Data consumer layer -
Contains end-user applications that use the data. Data consumers can include web services and business
intelligence tools, such as Tableau and Excel. Business intelligence tools generate reports based on data. Web service clients receive data based on web service requests.
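As a sketch of the consumer side, the snippet below queries a virtual table over ODBC. It assumes the Informatica Data Services ODBC driver is installed and a DSN has been configured against a deployed SQL data service; the DSN name, schema, and table are hypothetical.

import pyodbc  # third-party package; any ODBC-capable client works the same way

# "IDS_Orders" is an assumed DSN pointing at the deployed SQL data service.
conn = pyodbc.connect("DSN=IDS_Orders", autocommit=True)
cursor = conn.cursor()

# Query the virtual table as if it were a physical one; the Data Integration
# Service federates the underlying sources at run time.
cursor.execute("SELECT customer_id, order_total FROM customer_orders WHERE order_total > 1000")
for customer_id, order_total in cursor.fetchall():
    print(customer_id, order_total)

conn.close()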
In continuation of this blog, I will show how to create SQL data services and web services in the Informatica Developer client.
Saturday, June 28, 2014
My first tweet feed in Hadoop
When I started working in Hadoop, one of the things I always wanted to accomplish was to load a live Twitter feed into HDFS. Why load the Twitter feed? There are many use cases for Twitter data, such as analyzing brand sentiment, the performance of a movie, or sports events like FIFA, the NBA, and the NFL. The soccer World Cup is becoming the biggest social media event ever, and social media usage keeps growing for almost everything, so it is very important for any organization to analyze Twitter data for marketing or other purposes.
When I started working in Hadoop I didn't have any working experience in Java, so along with Hadoop I am also learning a little bit of Java. To load the Twitter feed I used Flume. These are the steps I followed to load the Twitter feed into HDFS via Flume:
- First, I created a Twitter application from my own Twitter account at apps.twitter.com and generated the API key/secret and access token/secret.
- A custom Twitter Flume source is used to stream Twitter data into HDFS. The Flume source uses the Twitter streaming API, which returns a JSON structure for every tweet via the Twitter4J library, and the tweets are then stored in HDFS. We got the custom source jar file from one of our support engineers. Initially I was getting the error "java.lang.UnsupportedClassVersionError: poc/hortonworks/flume/source/twitter/TwitterSource : Unsupported major.minor version 51.0"; after switching to the correct version of the JDK/JRE it worked as expected.
- Modified the flume-env.sh file to add the custom jar to the Flume classpath.
- Created the Flume config file and added my key/secret along with the other Flume configuration details.
- For execution I started the Flume agent from the command line; a sketch of a typical configuration and command is shown below.
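The original configuration and command are not reproduced here, so the sketch below only shows what a typical agent definition and invocation look like. The agent name and the source property keys are assumptions: the property names must match whatever the custom TwitterSource jar actually expects, and the HDFS path is chosen to line up with the Hive table location used later.

# flume.conf (sketch)
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = poc.hortonworks.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <API key>
TwitterAgent.sources.Twitter.consumerSecret = <API secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = fifa, worldcup

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/twitter
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

The agent is then started with the standard flume-ng command (adjust the paths and agent name for your environment):

flume-ng agent --conf /etc/flume/conf --conf-file flume.conf --name TwitterAgent -Dflume.root.logger=INFO,console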
In this POC, after fixing the custom jar issue I didn't really have any other problems; I was able to stream the tweet feed into HDFS. I created an external table in Hive to view and analyze the Twitter data, using the commands below:
add jar json-serde-1.1.9.3-SNAPSHOT-jar-with-dependencies.jar;
CREATE EXTERNAL TABLE tweets (
tweetmessage STRING,
createddate STRING,
geolocation STRING,
user struct<
userlocation:STRING,
id:STRING,
name:STRING,
screenname:STRING,
geoenabled:STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/twitter'
;
For the Hive external table I downloaded the JSON SerDe jar and added it as shown above; without it the table creation fails with an error.
I would love to see your feedback/comment on this.