Pentaho Data Integration (PDI) - ETL

What is Pentaho Data Integration (PDI) or Kettle or Spoon?

Pentaho Data Integration (PDI) is an ETL tool that allows you to read data from multiple sources, extract, transform, or process the data, and load it into another form. Spoon is the tool used to design and execute transformations.

There are two basic types of activities that can be done:

1. Transformations - .ktr

2. Jobs - .kjb

Transformations are the ETL data flows: reading from various files (e.g. XML, JSON, CSV), transforming the data, and loading it into a target location (e.g. a database). Jobs, on the other hand, orchestrate transformations as a workflow by defining which should execute first, second, and so on, or by running a transformation after various validation steps, e.g. "If the source file is JSON, execute these steps".

Spoon is the tool used to develop transformations. Pentaho CE (Community Edition) can be downloaded from the link below and is free to use. It does not include a Pentaho server to automate jobs, but it does contain a simple Carte server to execute jobs from a browser.

Click the link to download the Pentaho CE

Below are the installation guides:

Windows:

Click to download the Pentaho Windows Installation guide

Linux:

Click to download the Pentaho Linux Installation guide

This guide will teach you the basic concepts of transformations and how to create a simple ETL transformation and execute it.

Prerequisites

Below are the prerequisites required to create the transformation described below.

Step 1

1. Install Java and set up the Java path

2. Install the Pentaho software

3. A source CSV file

4. A SQLite DB; if required, you can use SQLiteStudio from the SQLite Studio website
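If you don't already have a source file and a target table, a small Python sketch like the one below can set them up. The file names are hypothetical, and the empID/empName/DOJ/jobTitle columns simply match the example fields used later in this guide:

```python
import csv
import sqlite3

# Hypothetical file names; adjust to your own paths.
CSV_FILE = "employee.csv"
DB_FILE = "employee.db"

# Write a small sample CSV with the fields used later in this guide.
rows = [
    {"empID": "101", "empName": "Alice", "DOJ": "2021-01-15", "jobTitle": "Analyst"},
    {"empID": "102", "empName": "Bob", "DOJ": "2022-06-01", "jobTitle": "Engineer"},
]
with open(CSV_FILE, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["empID", "empName", "DOJ", "jobTitle"])
    writer.writeheader()
    writer.writerows(rows)

# Create the target SQLite table (empID as primary key, since it will be
# the lookup field for the "Insert / Update" step).
conn = sqlite3.connect(DB_FILE)
conn.execute(
    """CREATE TABLE IF NOT EXISTS employee (
           empID INTEGER PRIMARY KEY,
           empName TEXT,
           DOJ TEXT,
           jobTitle TEXT
       )"""
)
conn.commit()
conn.close()
```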

Step 2

Open the Spoon tool from the downloaded Pentaho software under pdi-ce-[version] --> data-integration. If you are using Windows, open Spoon.bat; if you use a Linux machine, execute spoon.sh. You will be able to see the Spoon tool like below, and you can create a new transformation.

Step 3

1. Drag and drop the steps "CSV file input", "Data grid" and "Insert / Update" from the Design tab

* "CSV file input" - used to read a CSV file from the file system

* "Data grid" - used to display the data from the source file

* "Insert / Update" - used to insert into or update the DB

2. Double-click the "CSV file input" step to configure the CSV file as shown below

* Open the "CSV file input" step

* Press "CTRL+SPACE" to select the internal directory, then add the file name

* Select "Get Fields" to check for any issues with the CSV file and to verify the date format, etc.

* Select "OK" to close the step
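Before pointing Spoon at your file, it can help to sanity-check the CSV the same way "Get Fields" does: confirm every row parses and the date column matches the expected format. A minimal sketch (the file name and the %Y-%m-%d date format are assumptions; the snippet writes a tiny sample file so it runs standalone):

```python
import csv
from datetime import datetime

# Hypothetical sample; point CSV_FILE at your real source file instead.
CSV_FILE = "employee.csv"
with open(CSV_FILE, "w", newline="") as f:
    f.write("empID,empName,DOJ,jobTitle\n101,Alice,2021-01-15,Analyst\n")

# Mirror what "Get Fields" checks: read each row and validate the date column.
with open(CSV_FILE, newline="") as f:
    for row in csv.DictReader(f):
        datetime.strptime(row["DOJ"], "%Y-%m-%d")  # raises ValueError on a bad date
        print(row["empID"], row["empName"])  # prints: 101 Alice
```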

Step 4

Create a hop between the two steps to pass data from one step to another.

Step 5

Create a new DB connection for the SQLite DB

* Go to "View"

* Select "Database connection" and then "New Connection Wizard"

* Select "General"

* Provide a connection name

* Select "Connection Type" as "Generic Database"

* Select "Access"

* Provide the "Custom connection URL"

* Provide the "Custom driver class name"

* Finally, "Test" the connection and confirm that it is successful

* Select "OK" to close the window
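For SQLite through the "Generic Database" type, the wizard typically needs a JDBC URL of the form `jdbc:sqlite:/path/to/employee.db` and the driver class `org.sqlite.JDBC` (assuming the Xerial sqlite-jdbc driver jar has been placed in the data-integration lib folder). Independently of Spoon, you can sanity-check that the file you point the URL at is a valid SQLite database; a small sketch with a hypothetical path:

```python
import sqlite3

# Values the wizard typically needs for SQLite (with the Xerial sqlite-jdbc
# driver jar placed in data-integration/lib):
#   Custom connection URL:    jdbc:sqlite:/path/to/employee.db
#   Custom driver class name: org.sqlite.JDBC

# Quick sanity check that the DB file Spoon will point at is usable.
DB_FILE = "employee.db"  # hypothetical path
conn = sqlite3.connect(DB_FILE)
conn.execute(
    "CREATE TABLE IF NOT EXISTS employee "
    "(empID INTEGER PRIMARY KEY, empName TEXT, DOJ TEXT, jobTitle TEXT)"
)
version = conn.execute("SELECT sqlite_version()").fetchone()[0]
print("Connected, SQLite version:", version)
conn.close()
```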

Step 6

Create a "Select Values" step to select the respective fields from the CSV file.

* Select "Get fields to select" to pull in all the fields from the previous step. Fields that are not required can be deleted

* Select "OK" to close the "Select Values" step
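Conceptually, "Select Values" keeps only the fields you list and drops everything else from the stream. A minimal sketch of the same idea, with rows represented as dicts (field names follow this guide's example):

```python
# Fields to keep; delete entries here to drop fields you don't need.
selected_fields = ["empID", "empName", "DOJ", "jobTitle"]

def select_values(row, fields):
    # Keep only the listed fields, in the listed order.
    return {k: row[k] for k in fields}

row = {"empID": "101", "empName": "Alice", "DOJ": "2021-01-15",
       "jobTitle": "Analyst", "extra": "x"}
print(select_values(row, selected_fields))  # "extra" is dropped
```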

Step 7

Finally, the "Insert / Update" step is configured to insert or update the data from the CSV file into the respective fields of the SQLite table.

* Open the "Insert / Update" step and change the name of the step if required

* Select the "Connection" if required; by default it will be selected

* Enter the "Target Table" name

* Select the lookup fields; these act like a WHERE condition

* In "Update Fields", select the "Table field" (from the employee table) and the "Stream field" (from the CSV file). If you want a field to be updated, select "Y" in Update; otherwise select "N"

* Select "OK" to close the step
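In SQL terms, the step looks up each incoming row by the lookup field (empID in this guide's example), inserts it if missing, and otherwise updates only the fields marked "Y". A minimal sketch of that logic (in-memory DB and field names are illustrative, not the step's actual implementation):

```python
import sqlite3

# Sketch of "Insert / Update" semantics: lookup on empID, insert if absent,
# otherwise update only the fields marked "Y" (just jobTitle here).
conn = sqlite3.connect(":memory:")  # use your employee.db file in practice
conn.execute(
    "CREATE TABLE employee "
    "(empID INTEGER PRIMARY KEY, empName TEXT, DOJ TEXT, jobTitle TEXT)"
)

def insert_update(row):
    found = conn.execute(
        "SELECT 1 FROM employee WHERE empID = ?", (row["empID"],)
    ).fetchone()
    if found:
        conn.execute("UPDATE employee SET jobTitle = ? WHERE empID = ?",
                     (row["jobTitle"], row["empID"]))
    else:
        conn.execute(
            "INSERT INTO employee (empID, empName, DOJ, jobTitle) VALUES (?, ?, ?, ?)",
            (row["empID"], row["empName"], row["DOJ"], row["jobTitle"]))
    conn.commit()

insert_update({"empID": 101, "empName": "Alice", "DOJ": "2021-01-15", "jobTitle": "Analyst"})
insert_update({"empID": 101, "empName": "Alice", "DOJ": "2021-01-15", "jobTitle": "Senior Analyst"})
print(conn.execute("SELECT empID, jobTitle FROM employee").fetchall())
# prints: [(101, 'Senior Analyst')] - one row, title updated, no duplicate
```

Running it twice with the same empID updates the row instead of duplicating it, which is exactly the behaviour Step 9 relies on.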

Step 8

Verify that all the steps and the database are properly configured. Once everything is correct, execute the transformation and check whether all the steps run without errors; if they do, you will see a tick mark on every step. If you see any error, verify the logs.

Step 9

Check that the values are inserted into the SQLite DB correctly. Even if you execute the transformation multiple times, it will not duplicate information, because the empID is checked in the lookup field of the "Insert / Update" step each time. You can modify other information such as jobTitle (anything except empID, empName, and DOJ) to verify that the modified values are updated.
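One quick way to verify the no-duplicates claim after running the transformation a few times is a GROUP BY query on the lookup field. A sketch, with a hypothetical DB path (the CREATE TABLE IF NOT EXISTS line is only there so the snippet runs standalone):

```python
import sqlite3

# After running the transformation (more than once, if you like), confirm
# that empID has no duplicates in the target table.
conn = sqlite3.connect("employee.db")  # hypothetical path
conn.execute(
    "CREATE TABLE IF NOT EXISTS employee "
    "(empID INTEGER PRIMARY KEY, empName TEXT, DOJ TEXT, jobTitle TEXT)"
)
dupes = conn.execute(
    "SELECT empID, COUNT(*) FROM employee GROUP BY empID HAVING COUNT(*) > 1"
).fetchall()
print("duplicate empIDs:", dupes)  # expect an empty list
conn.close()
```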

Step 10

Some troubleshooting tips:

1. Verify the logs to see if anything is missing. For example, I changed the name of the table and got a "no such table" error.

2. Check for any typos

3. Check for mismatches between the table fields and the stream fields

Hopefully you were able to follow all the steps provided. If you have any doubts or require clarification, please post a comment and I will respond as soon as I can.

Happy Learning :)

A YouTube video of the above information will be posted soon.
