
How to improve pipeline performance

aditya_sharma
New Contributor III

Hi,

I am reading multiple files from an S3 location, and each file contains multiple JSON records. For each JSON record I have to look up an id value and, based on that id, write the record out as an individual JSON file to a specific directory in S3. My pipeline runs fine, but its performance is not good: for example, processing 474 documents with 78,526 records in total took 2 hours 30 minutes to write the 78,526 files to the other S3 directory, which I believe is too slow.

I am attaching my pipelines; if any improvements can be made, please let me know. I really appreciate your suggestions.

This is the flow of my pipelines:

pl_psychometric_analysis_S3_child.slp → pl_psychometric_analysis_S3_split_events.slp → pl_psychometric_analysis_S3_write_events.slp

Thanks
Aditya

pl_psychometric_analysis_S3_child.slp (16.4 KB)
pl_psychometric_analysis_S3_split_events.slp (9.4 KB)
pl_psychometric_analysis_S3_write_events.slp (7.8 KB)

2 REPLIES

dwhite
Employee

Replacing your two scripts with Mappers should net you some performance gain. You don't need to use a script to build a JSON string.

The script in the "Split events" pipeline is doing this:

data = self.input.next()
newData = {}
newData['record'] = data
newData['sequence_id'] = data['sequence']['id']
newData['event_id'] = data['event_id']
self.output.write(newData)

Replace it with a Mapper like this:

(screenshot of Mapper settings not preserved)
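The original screenshot did not survive here, but based on the script above, the Mapper's mapping table would presumably contain something like the following (my reconstruction, not the actual image):

Expression      Target path
$               $record
$sequence.id    $sequence_id
$event_id       $event_id

Mapping $ to $record nests the whole incoming document under "record", matching what the script builds.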

The script in the "write events" pipeline is like this:
data = self.input.next()
self.output.write(data['record'])

Replace it with a Mapper like this:

(screenshot of Mapper settings not preserved)
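Again the screenshot is missing; a likely reconstruction of the mapping (assuming Pass through is left unchecked, so only the unwrapped record is emitted):

Expression      Target path
$record         $

Mapping $record to the root $ replaces each document with the contents of its "record" field.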

aditya_sharma
New Contributor III

Thanks, dwhite.

I'll make that change, but I checked the average execution time for both scripts, and it's very low (0.141 sec).
Also, how should we decide what to set for the Pool Size property of the Pipeline Execute snap?

Thanks