How to improve pipeline performance

Question

Hi,
I am reading multiple files from an S3 location, and each file contains multiple JSON records. I have to look up an id value in each JSON record and, based on that value, create an individual JSON file for the record and write it to a specific directory in S3. My pipeline runs fine, but its performance is not that good. For example, processing 474 documents with a total of 78,526 records took 2 hr 30 min to write 78,526 files to another S3 directory, which I believe is not good.
I am attaching my pipelines. If any improvement can be made to them, please let me know. I really appreciate your suggestions.
This is the flow of my pipelines:
pl_psychometric_analysis_S3_child.slp → pl_psychometric_analysis_S3_split_events.slp → pl_psychometric_analysis_S3_write_events.slp
Thanks
Aditya
pl_psychometric_analysis_S3_child.slp (16.4 KB)
pl_psychometric_analysis_S3_split_events.slp (9.4 KB)
pl_psychometric_analysis_S3_write_events.slp (7.8 KB)

dwhite · Answer

Replacing your two scripts with Mappers should net you some performance gain. You don’t need a script to build a JSON string.
The script in the “Split events” pipeline does this:
data = self.input.next()
newData = {}
newData['record'] = data
newData['sequence_id'] = data['sequence']['id']
newData['event_id'] = data['event_id']
self.output.write(newData)
Replace it with a Mapper that produces the same three fields.
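The Mapper configuration itself is not included in the text of this thread. As a rough sketch of the transformation the Mapper would need to reproduce, here is the same logic as a plain Python function. The function name and the sample document are hypothetical, and the Mapper expressions in the comments are an assumption about how the fields would be mapped, not a copy of the original answer.

```python
def split_event(data):
    """Wrap the incoming document and lift two keys to the top level."""
    return {
        "record": data,                         # assumed Mapper row: $ -> $record
        "sequence_id": data["sequence"]["id"],  # assumed: $sequence.id -> $sequence_id
        "event_id": data["event_id"],           # assumed: $event_id -> $event_id
    }


# Hypothetical input document, for illustration only.
sample = {"event_id": "e-1", "sequence": {"id": "s-9"}, "payload": 42}
result = split_event(sample)
print(result["sequence_id"], result["event_id"])
```

Because the function only copies references and reads two keys, the equivalent work can be expressed declaratively, which is why a script is not needed here.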

The script in the “Write events” pipeline does this:
data = self.input.next()
self.output.write(data['record'])
Replace it with a Mapper that outputs only the contents of the record field.
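This second Mapper is also not shown in the text. A minimal sketch of the step it replaces, again in plain Python with a hypothetical function name and sample data, and with the Mapper row in the comment being an assumption:

```python
def unwrap_record(data):
    """Return only the original document stored under 'record'."""
    return data["record"]  # assumed Mapper row: $record -> $


# Hypothetical wrapped document, shaped like the output of the split step.
wrapped = {
    "record": {"event_id": "e-1", "sequence": {"id": "s-9"}},
    "sequence_id": "s-9",
    "event_id": "e-1",
}
original = unwrap_record(wrapped)
print(original)
```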

aditya_sharma · Answer

Thanks Dwhite
I’ll make that change, but I checked the average execution time for both scripts and it is very low (0.141 sec).
Also, how should we decide what value to set for the Pool Size property of the Pipeline Execute Snap?
Thanks
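For context, the figures quoted in the question can be turned into an overall throughput with a quick back-of-the-envelope calculation. The sketch below uses only the numbers stated in this thread; the variable names are illustrative.

```python
records = 78_526                       # files written, one per record
documents = 474                        # input files read from S3
elapsed_seconds = 2 * 3600 + 30 * 60   # 2 hr 30 min

files_per_second = records / elapsed_seconds
ms_per_file = 1000 * elapsed_seconds / records
records_per_document = records / documents

print(f"{files_per_second:.2f} files/s, "
      f"{ms_per_file:.1f} ms per file, "
      f"{records_per_document:.1f} records per document")
```

That works out to roughly 8.7 files per second, or about 115 ms per output file end to end.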
