of a Data Quality Automate which consists of several treatments:
- address normalization using a free API;
- data extraction from the database based on an exact match;
- data extraction from the database based on a fuzzy match.
The automate consists of a Talend job and Python scripts that manage and process all the csv files and execute the treatments. Address normalization is done by a Python script that loads a csv file, calls the API for normalization and stores the API response in another csv file. Data extraction for further use consists of querying the database based on some input data. The fuzzy match treatment is implemented in Python using a csv match module.
ETL Developer
- Development of file processing / mappings / transformations in Talend 6.4 (creation of mappings and execution of Python scripts using tMap, tReplicate, tUniqRow, tSortRow, tUnite, tAggregateRow, tSystem, tFileList, tFileCopy, tFileExist, tFileDelete, tFileInput/OutputDelimited, tSetGlobalVar, tBufferInput/Output, tHashInput/Output, tMySqlInput etc.).
- Creation of Talend jobs.
- Query optimization for better performance.
- Documentation and support for the client and fellow team members.
- Development of Python scripts (using the requests, collections and os libraries and a csv match module; creation of executables using PyInstaller).
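As an illustration of the address normalization treatment described above (load a csv file, call the API, keep the response in another csv file), a minimal sketch could look like the following. It is not the original script: the column names `address` and `normalized`, the function names, and the endpoint URL are all hypothetical, and the real API's parameters and response format are not known here.

```python
import csv


def normalize_file(input_path, output_path, normalize, address_field="address"):
    """Read addresses from a csv file, normalize each one, and write the
    original and normalized values to another csv file.

    `normalize` is any callable taking a raw address string and returning
    the normalized string (for example a function that calls an API).
    Returns the number of rows processed.
    """
    count = 0
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Output columns are an assumption made for this sketch.
        writer = csv.DictWriter(dst, fieldnames=[address_field, "normalized"])
        writer.writeheader()
        for row in reader:
            raw = row[address_field]
            writer.writerow({address_field: raw, "normalized": normalize(raw)})
            count += 1
    return count


def call_api(address, url="https://example.invalid/normalize"):
    """Send one address to a normalization API and return the raw response text.

    The URL and the `q` parameter are placeholders, not a real service.
    """
    import requests  # imported here so the csv logic works without requests installed

    response = requests.get(url, params={"q": address}, timeout=10)
    response.raise_for_status()
    return response.text
```

With a real endpoint substituted for the placeholder, the script would be run as `normalize_file("addresses.csv", "normalized.csv", call_api)`. Passing the API call in as a parameter keeps the csv handling testable without network access.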