Data Integration 005: Talend, Mapping with tMap

update-mobile-db

Talend job that matches data from 2 files and filter out specified records, using tMap.

In this example, I am creating a job that would read from two separate CSV files, one is a list of employee mobile phone numbers, the other is a list of employment status of all employees. After matching a record against both documents, if the employee is still active, the employee mobile phone number shall be updated in a MS SQL DB. (More details on how to connect to MS SQL Server here.)

Talend has provided an extremely convenient component to achieve my objective, which is the tMap component. This tMap component steps through the data records in my employee mobile number file (Main), and match the specified primary key against the reference file (Lookup), which contains employee status. When the status of a particular record meets my condition, it will be generated as an output from tMap, which can then be loaded into the DB by a separate DB input component.

tmap-settings

Mapping data inputs and outputs.

The tMap schema editor presents a simple drag and drop UI for you to relate the input and output data. To filter for certain conditions in the data, you will have to manipulate the data using expression (which appears to be in the Java language). More introduction on the tMap expression can be found here. In my example, I actually have two outputs. The second output catches all the data records rejected by the first output condition (Catch output reject = True), and presents the data to me in the console using tLogRow component.

tmap-settings-2

Schema of output to be loaded into DB.

I did run into a small problem when I run the job. The DB table allowed the field “isPrimary” to be nullable even though it holds the type Boolean. Naturally I have used the DB table schema to design the output of my Talend job, however Talend does not recognise Null value for the type Boolean. I will have to explicitly specify that the Boolean data field in my output is Nullable. (Which makes me wonder, why doesn’t Talend enforce this instead of allowing the user to specify otherwise and throwing up error?)

 

Advertisements

Data Integration 004: Talend, Connect to MSSQL

I ran into loads of problems trying to connect to a Microsoft SQL Server (2008 R2) using my Talend job to update a DB.

(Disclaimer: there was no network issues, the server was definitely reachable.)

The MSSQL Connector that does not work

First, I tried to use tMSSqlConnection to connect to the server, but the job could not even compile due to a missing library mssql-jdbc.jar. Fine, typically the open studio will prompt me to install the additional required libraries after I agree to some terms and conditions, but No! This time round the “download and install” button for said library was totally greyed out, there was no way I could download it from within the studio. After taking advise from forums and Stack Overflow, I decided to manually download the required Java Archive file and install it. Honestly, I am not even sure if it is the right file. As expected, it could not work.

Next, I turned to the tOleDbConnection as recommended by some of the posts in the forums. Nice! It managed to build and run but it threw up an error. I could not tell what was wrong with the connection from the open studio console, so I started to look at the access logs from the SQL Server Management Studio (SSMS). “The login is a SQL Server login and cannot be used with Windows Authentication”. It turns out that my SQL server was configured for Windows Authentication only, but the Talend connector inherently do not support Windows Authentication.

Some online sources had recommended that changing the authentication mode was simple enough, except that you need to restart the server. It was between this option and trying to enable Windows Authentication on the Talend connector, which I assumed was not going to be simple as it was not provided out-of-the-box. I was like, let’s go for the easy way.

Mixed Authentication Mode

And here is where the nightmare begins. After changing the setting to mixed authentication mode, the Talend connector could successfully connect to the server. However, at the same time all other existing applications that connects to DBs in the server all broke. Man, I did not see that coming. The “Mixed Authentication” had misled me into thinking that restarting the server was a good idea. There was simply too many applications for me to troubleshoot alone. The best way out of this is to revert back to the original configuration. (I did perform a clone of the server VM in the event that the development doesn’t go well, I will restore the entire server instance from the clone.)

For some weird reasons, I could not log in to the server using SSMS no matter how I tried. Then it dawn on me that the admin account was a Windows account, and if the Windows Authentication mode had been denied, I would need to use another admin account permitted for SQL Server Authentication mode. Nope. There is no such admin account (not that I know of. Did not assume there was a default admin account and that people were lazy enough to have kept that account accessible.). At this point I wished for a detonation button to wipe the server off the face of the Earth.

So there is no way for me to change the authentication mode at all. There is only one last thing to do. Restore the entire server instance to some point back in time.

And that is how the episode ended. Eventually I had to stick with using tOleDbInput with Windows Authentication. This requires the janet-win<32 or 64>.dll (depends on whether you have a 32bit or a 64bit system) to be available in the java.library.path of your JVM. I am very thankful for this guide providing all the required information.

Update or Insert

The job worked well. I wanted to modify a table in the DB on the condition that if a record do not exist, a new record shall be inserted, else the existing record will be updated. This is when I come across the convenient option:

tOleDbOutput.JPG

In my scenario, I will not often have to insert a new record, therefore the “update or insert” option is more efficient. More difference between the options are explained here.

Build and Export

After completing the job, I built and exported it as a Windows batch file (.bat) to run on a production system. That it where I run into another roadblock. After triggering the batch script with the Windows Task Scheduler, the console kept throwing an error, “UnsatisfiedLinkError: somepath\janet-win64.dll: Can’t find dependent libraries”. This does not mean that the system cannot find janet-win64.dll, as I have also imported that, but those Dynamic-Link Libraries (DLLs) that janet-win64 depends on that were missing.

That is just crazy to me, as my development environment was made from an exact VM clone of the production instance. How could there ever be anything missing? I googled so hard for two hours, then it finally hit me. I installed SSMS in the development server. And that installation, which the Talend job does not depend on, had also installed the latest Visual C++ Redistributable packages, which enabled janet-win64.dll to function and in turn enabled the Talend job. It totally blew my mind that these unrelated actions enabled the success of the Talend job. There was also no way for me to discover this dependency during development as I had installed SSMS before starting work on the Talend job.

TL; DR

Do not change the SQL server authentication mode if there are live connectors running. Always keep track of changes made in a server even though it may be unrelated to the development.

Data Integration 002: Talend, Working with FTP

Working with FTP is simple enough. This is the first project I have attempted using Talend, to download CSV files from an FTP server over the internet, perform some data manipulation, and finally upload the modified files into another folder in the same FTP server.

dl-sftp-01

The Structure

The process of my Talend job is as follows:

  1. Using a tPreJob, I begin reading from a config file to obtain the key parameters for this project, and perform a tContextLoad.
  2. The main job starts with a tJava SubJob. It contains my custom java logic to redirect stdout and stderr for writing log files, set up any necessary global variables, or anything you want. It is flexible.
  3. The deactivated SubJob contains a tContextDump for me to read the context (or config in my own terms) that I have loaded for this job, and a tLogRow connected via Iterate to write into the stdout.
  4. Next, tFTPConnection connects to the FTP server and locate the desired folder, and tFTPGet downloads the files.
  5. tStatCatcher is used to read the statistic of any components that have checked the stat catcher option. (honestly I still have not figured out how to use this)
  6. Connecting tDie OnSubJobError to SubJobs will kill the main job when that particular Subjob fails, and throw out any error messages that you have set.
  7. tPostJob will be executed at the end, regardless of whether an error killed the Job or not. I used it to tFTPClose the FTP connection and clean up any streams that have been opened.

Works perfectly fine. However, I am uncertain if this is the best way to structure a job. Will continue to improve on it.

Here is part 2 of the job. This will upload the file back to the FTP server. I have split it up into two distinct jobs so that they can be reused for other projects independently. Reusability is also my top consideration when I decided to use context loading for these two jobs.

upload-sftp-01

Data Integration 001: Talend, First Impression

As stated on the landing page of Talend.com, it is the leading open source integration software provider for data driven projects. What I like about it is, of course, the fact that the open studio is absolutely FOC, and according to this comparison, it provides almost the same features as the paid Enterprise edition.

talend-features

Features of Talend Open Studio. Looking forward to try them out.

Talend by Example is a fantastic Talend resource to guide any beginner. Use it to determine which package to download, install, and start running your very first Talend job.

I started using Talend to perform file downloads from an FTP server, with the objective of automating this daily task eventually. No one around me had even heard of Talend, but luckily open source means every problem can be googled.

Although I could have completed this job using Python, Talend is just so much easier to use overall. The ability to insert java logic provides flexibility to fit your ETL needs. The canvas allow peers to understand your thinking process at a glance. Eventually, the ETL job can be scaled much faster by tapping on other Talend features.

After working on a few jobs, questions on how to make my jobs robust, sustainable, and reusable quickly come to mind. This is a good read to quickly have a sense of the best practices when designing Talend jobs.

Data Integration 003: Talend, GPG Decryption

In my previous post, I downloaded some files from an FTP server to work with. These files have been encrypted using PGP. To work with them, I need to first decrypt them.

decrypt-pgp-01

Decrypt the files using PGP and delete the original copies.

Thankfully, Talend comes with a tGPGDecrypt component to perform the task. However, there is a catch: this component calls the GPG program that resides on your system to perform the decryption. Therefore, you need to first make sure that your system already has the software in place.

I assume my system is using Gpg4win because we have not paid any additional money for it, and I seem to see Kleopatra lingering around in my system startup programs. Interesting that someone had actually implemented an auto-password-input program, so that manual keying of passphrase is no longer necessary when decryption is invoked. Look forward to explore Gpg4win in future.