Data Integration 002: Talend, Working with FTP

Working with FTP is simple enough. This is the first project I have attempted using Talend: downloading CSV files from an FTP server over the internet, performing some data manipulation, and finally uploading the modified files into another folder on the same FTP server.

[Image: dl-sftp-01]

The Structure

The process of my Talend job is as follows:

  1. Using a tPreJob, I read a config file to obtain the key parameters for this project, and load them with a tContextLoad.
  2. The main job starts with a tJava SubJob. It contains my custom Java logic to redirect stdout and stderr into log files, set up any necessary global variables, or do anything else you want. It is flexible.
  3. The deactivated SubJob contains a tContextDump that lets me read the context (or config, in my own terms) that I have loaded for this job, and a tLogRow connected via Iterate that writes it to stdout.
  4. Next, tFTPConnection connects to the FTP server and locates the desired folder, and tFTPGet downloads the files.
  5. tStatCatcher is used to read the statistics of any components that have the stat catcher option checked. (Honestly, I still have not figured out how to use this.)
  6. Connecting a tDie to a SubJob with OnSubJobError kills the main job when that particular SubJob fails, and outputs any error message that you have set.
  7. tPostJob is executed at the end, regardless of whether an error killed the job or not. I use it to run tFTPClose, which closes the FTP connection, and to clean up any streams that have been opened.
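The stdout and stderr redirection in step 2, together with the cleanup in step 7, can be sketched in plain Java. This is a minimal standalone illustration, not the actual code from this job: the class name `LogRedirect`, the method `runWithLogs`, and the log file paths are all hypothetical. In a real tJava component only the statements would be used, without the surrounding class.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Path;

// Hypothetical helper: names are illustrative and not part of Talend.
final class LogRedirect {

    /**
     * Runs the task with stdout and stderr appended to the given log files,
     * then restores the original streams even if the task throws.
     */
    static void runWithLogs(Path outLog, Path errLog, Runnable task) throws IOException {
        PrintStream originalOut = System.out;
        PrintStream originalErr = System.err;
        // try-with-resources closes both log streams, much like the
        // stream cleanup done in tPostJob.
        try (PrintStream out = new PrintStream(
                     new FileOutputStream(outLog.toFile(), true), true, "UTF-8");
             PrintStream err = new PrintStream(
                     new FileOutputStream(errLog.toFile(), true), true, "UTF-8")) {
            System.setOut(out);
            System.setErr(err);
            task.run();
        } finally {
            // Always hand the console back, whether or not the task failed.
            System.setOut(originalOut);
            System.setErr(originalErr);
        }
    }
}
```

The files are opened in append mode so that repeated runs of a daily job add to the same logs instead of overwriting them; whether that is desirable depends on how the logs are rotated.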

It works perfectly fine. However, I am not sure this is the best way to structure a job, and I will continue to improve on it.

Here is part 2 of the job. This uploads the files back to the FTP server. I have split the work into two distinct jobs so that they can be reused independently for other projects. Reusability was also my top consideration when I decided to use context loading for these two jobs.
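To show why loading parameters from a config file helps reuse, here is a rough plain-Java sketch of the same idea outside Talend. It assumes a `key=value` file format, and the keys `ftp_host`, `ftp_port` and `ftp_remote_dir` as well as the class `JobConfig` are made up for the example; the real config file and context variables of these jobs are not shown in this post.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Hypothetical config holder: the key names below are assumptions.
final class JobConfig {
    final String host;
    final int port;
    final String remoteDir;

    private JobConfig(String host, int port, String remoteDir) {
        this.host = host;
        this.port = port;
        this.remoteDir = remoteDir;
    }

    /** Loads key=value pairs and fails fast when a required key is missing. */
    static JobConfig load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);
        String host = require(props, "ftp_host");
        String remoteDir = require(props, "ftp_remote_dir");
        // Fall back to the standard FTP port when none is configured.
        int port = Integer.parseInt(props.getProperty("ftp_port", "21").trim());
        return new JobConfig(host, port, remoteDir);
    }

    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing config key: " + key);
        }
        return value.trim();
    }
}
```

With this shape, the download job and the upload job can each read their own file, for example `JobConfig.load(new java.io.FileReader("download.properties"))` with a hypothetical file name, and nothing project-specific needs to be hard-coded in either job.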

[Image: upload-sftp-01]


Data Integration 001: Talend, First Impression

As stated on the landing page of Talend.com, Talend is the leading open source integration software provider for data-driven projects. What I like about it is, of course, the fact that Talend Open Studio is absolutely free of charge and, according to this comparison, provides almost the same features as the paid Enterprise edition.

[Image: talend-features]

Features of Talend Open Studio. Looking forward to trying them out.

Talend by Example is a fantastic Talend resource to guide any beginner. Use it to determine which package to download, install it, and start running your very first Talend job.

I started using Talend to perform file downloads from an FTP server, with the objective of automating this daily task eventually. No one around me had even heard of Talend, but luckily open source means every problem can be googled.

Although I could have completed this job using Python, Talend is just so much easier to use overall. The ability to insert Java logic provides the flexibility to fit your ETL needs. The canvas allows peers to understand your thinking process at a glance. Eventually, the ETL job can be scaled much faster by tapping into other Talend features.

After working on a few jobs, questions on how to make my jobs robust, sustainable, and reusable quickly came to mind. This is a good read to quickly get a sense of the best practices when designing Talend jobs.