Async issue with pool size option in Pipeline Execute Snap

In my pipeline, I use the Pipeline Execute Snap to run a sub-pipeline.

I like using the pool size option in this Snap so that it can run multiple pipeline executions concurrently.

However, as with anything concurrent, there is a small chance that two or more concurrent threads write the same data to the same place (e.g. a database) at the same time, even though my pipeline checks whether the data exists in the DB before doing an insert. (If there is no data, insert; if there is data, update.)

Are there any locking or thread-safe mechanisms (like a Java lock) in SnapLogic that can prevent this kind of threading issue?

Thanks

Is that an assumption on your part?

What happens is that incoming documents are distributed round robin across the
executions of this Snap when the pool size is enabled, so a given document will
never be sent to two execution instances.

Unless you have duplicates in your data, you will be safe with the pool and
still achieve true concurrency.
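To make the round-robin point concrete, here is a minimal Python sketch. It is not SnapLogic's actual implementation: `dispatch_round_robin`, the document fields, and the pool size of 2 are hypothetical and only illustrate that each document is handed to exactly one execution.

```python
from itertools import cycle

def dispatch_round_robin(documents, pool_size):
    """Assign each document to exactly one of `pool_size` workers, in turn."""
    assignments = [[] for _ in range(pool_size)]
    # zip stops when the documents run out; cycle repeats 0..pool_size-1.
    for worker, doc in zip(cycle(range(pool_size)), documents):
        assignments[worker].append(doc)
    return assignments

# Hypothetical sample documents.
documents = [
    {"id": 1, "name": "Employee1", "location": "New York"},
    {"id": 2, "name": "Employee2", "location": "New York"},
]
assignments = dispatch_round_robin(documents, pool_size=2)
for worker, docs in enumerate(assignments):
    print(worker, [d["id"] for d in docs])
```

No document is processed twice, but two different documents that share a value (here, the location) can still end up in different executions.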

That’s what I’m trying to say: we do have duplicates in our data.

In my case, it’s firm employee data, where the employee’s location is one of the columns in each row (document).

e.g.
1, Employee1, New York
2, Employee2, New York

So the sub-pipeline first checks whether the employee’s location is in the DB. If it is, it updates the location; if it is not, it inserts the location into the DB.

In the case above, the location in the two documents is the same. Let’s say the database does not have New York initially. With concurrent or even round-robin execution, if the first execution is delayed in inserting New York, so that the insert has not happened by the time the second execution does its database check, the second execution would insert the same location into the DB again.
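The interleaving described above can be reproduced with a small Python sketch. Everything here is a stand-in: `locations_db` plays the role of the remote system's location table, `upsert_location` is a hypothetical check-then-insert step, and the barrier simulates the delay by forcing both checks to finish before either insert runs.

```python
import threading

locations_db = set()                 # stand-in for the remote location table
insert_calls = []                    # every insert that was actually issued
both_checked = threading.Barrier(2)  # releases only after both checks are done

def upsert_location(location):
    exists = location in locations_db   # step 1: check
    both_checked.wait()                 # simulated delay between check and write
    if not exists:
        insert_calls.append(location)   # step 2: insert based on a stale check
        locations_db.add(location)

threads = [
    threading.Thread(target=upsert_location, args=("New York",))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(insert_calls)  # ['New York', 'New York']
```

Both executions see "not in DB" and both issue the insert, which is exactly the duplicate write the existence check was supposed to prevent.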

Thanks

If it’s the same data and you’re doing an update, it seems like it’d be a non-issue.

I’d like to understand your use case better. You have a child pipeline running to do individual updates to a database? That would eat a lot of resources spinning those parallel executions up and down, even if you reuse them. From what you’ve described here, it sounds like you’re injecting parallelism just for the sake of it.

It’d most likely be more efficient and performant to use a single DB Snap to do your updates and adjust the batch settings on your DB account.

Also, you should not think of pipelines as threads, but as processes that are each a collection of threads. In reality, each Snap in a pipeline has its own thread. Those Snaps combine to perform a semi-serial execution, and pipeline executions are generally not aware of each other.
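As a rough mental model only (this is not how SnapLogic is implemented internally), a pipeline can be pictured as stages that each run on their own thread and pass documents along through queues. The `stage` helper and the two transforms below are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the document stream

def stage(transform, inbox, outbox):
    """One 'Snap': read documents, transform them, pass them downstream."""
    while True:
        doc = inbox.get()
        if doc is SENTINEL:
            outbox.put(SENTINEL)  # propagate end-of-stream
            return
        outbox.put(transform(doc))

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
workers = [
    threading.Thread(target=stage, args=(str.strip, q_in, q_mid)),
    threading.Thread(target=stage, args=(str.upper, q_mid, q_out)),
]
for w in workers:
    w.start()

for doc in [" new york ", " boston "]:
    q_in.put(doc)
q_in.put(SENTINEL)

for w in workers:
    w.join()

results = []
while True:
    item = q_out.get()
    if item is SENTINEL:
        break
    results.append(item)

print(results)  # ['NEW YORK', 'BOSTON']
```

Within one such pipeline, documents flow through the stages in order. Two separate pipeline executions would each have their own queues and threads and share nothing, which is why they cannot coordinate writes on their own.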

Best,

Unfortunately, the data is not inserted into the DB via SQL; we have to call a SOAP API to insert/update a single location in that system’s DB.

Thanks, that approach makes more sense now.

As far as locking goes, there isn’t anything in the product at this point that would function the way you were thinking.

What Naveen mentioned is right. Is there any way to possibly preprocess the data so that you’d only have one instance of a place across your incoming documents (could be dangerous, if the one with the data fails)?

I’m not sure how your SOAP endpoint functions. Are you publishing flat employee records in one SOAP call, or are you making multiple SOAP calls per incoming document to write pieces of the employee record to different places?

Is there any danger on your side if duplicate data overwrites existing data in your application?

Thanks. In our case, we have to make multiple SOAP calls per incoming document. There’s no other way to do it.

But anyway, I reorganized the data a bit and removed the duplicate locations.

Now I am able to leverage the pool size of Pipeline Execute without worrying about the race condition.
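For anyone with the same problem, the preprocessing idea looks roughly like the Python sketch below. It is only an illustration of the approach, not what my pipeline literally does: `unique_locations` and the sample documents are hypothetical, and in the real pipeline each location is written through a SOAP call.

```python
def unique_locations(documents):
    """Return each distinct location once, in first-seen order."""
    seen = set()
    ordered = []
    for doc in documents:
        location = doc["location"]
        if location not in seen:
            seen.add(location)
            ordered.append(location)
    return ordered

# Hypothetical sample documents.
documents = [
    {"id": 1, "name": "Employee1", "location": "New York"},
    {"id": 2, "name": "Employee2", "location": "New York"},
    {"id": 3, "name": "Employee3", "location": "Boston"},
]

locations = unique_locations(documents)
print(locations)  # ['New York', 'Boston']
```

Because each location now appears only once, no two pooled executions check and insert the same location at the same time. As mentioned above, the trade-off is that if the single call for a location fails, every employee record that depends on it is affected.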

Thanks again


Yes, as long as the location is guaranteed to exist before this is written.

I am pretty sure you won’t have to insert hundreds of locations on every load. It is a one-time thing.

Glad you figured it out!