
How to improve pipeline performance

aditya_sharma
New Contributor III

Hi,

I am reading multiple files from an S3 location, and each file contains multiple JSON records. I have to look for an id value in each JSON record and, based on that id, write each record out as an individual JSON file to a specific directory in S3. The pipeline runs fine, but its performance is not great: for example, processing 474 documents containing 78,526 records in total took about 2.5 hours to write 78,526 files to another S3 directory (roughly 8-9 files per second), which I believe is not good.
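In plain Python terms, the flow is roughly this (a minimal sketch assuming boto3; the bucket name, prefixes, and id field are placeholders, and it assumes each source file is a JSON array of records):

import json
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"       # placeholder bucket name
SRC_PREFIX = "incoming/"   # placeholder source directory
DEST_PREFIX = "split/"     # placeholder destination directory

# List every source file under the prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=SRC_PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        records = json.loads(body)  # each file holds multiple JSON records

        # One PUT per record: 78,526 records means 78,526 separate S3
        # writes, which is where most of the elapsed time goes.
        for record in records:
            event_id = record["event_id"]  # placeholder id field
            s3.put_object(
                Bucket=BUCKET,
                Key=f"{DEST_PREFIX}{event_id}.json",
                Body=json.dumps(record).encode("utf-8"),
            )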

I am attaching my pipelines, if any improvement can be done in the pipeline please let me know. I really appreciate your suggestion.

This is the flow of my pipelines:

pl_psychometric_analysis_S3_child.slp → pl_psychometric_analysis_S3_split_events.slp → pl_psychometric_analysis_S3_write_events.slp

Thanks
Aditya

pl_psychometric_analysis_S3_child.slp (16.4 KB)
pl_psychometric_analysis_S3_split_events.slp (9.4 KB)
pl_psychometric_analysis_S3_write_events.slp (7.8 KB)

2 REPLIES

dwhite
Employee

Replacing your two scripts with Mappers should net you some performance gain. You don't need to use a script just to build a string of JSON.

The Script snap in the "Split events" pipeline is doing:

data = self.input.next()
newData = {}
newData['record'] = data                        # keep the whole original record
newData['sequence_id'] = data['sequence']['id']
newData['event_id'] = data['event_id']
self.output.write(newData)

Replace it with a Mapper like this: [screenshot of the Mapper settings, attached to the original post]
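The screenshot may not come through here, so as a sketch based only on the script above (nothing beyond those paths is implied), the Mapper's mapping table would presumably look like:

Expression      Target path
$               $record
$sequence.id    $sequence_id
$event_id       $event_id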

The Script snap in the "Write events" pipeline is like:

data = self.input.next()
self.output.write(data['record'])   # emit just the nested record

Replace it with a Mapper like this: [screenshot of the Mapper settings, attached to the original post]
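Again the screenshot is missing; judging from the script, this Mapper would presumably just promote the nested record back to the document root (a sketch, assuming "Pass through" is left unchecked so only the record survives):

Expression      Target path
$record         $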

aditya_sharma
New Contributor III

Thanks, dwhite.

I'll make that change, but I checked the average execution time for both of the scripts, and it's very low (0.141 sec).
Also, how should we decide what to set for the Pool Size property of the Pipeline Execute snap?

Thanks