File Extraction Patterns
Hi all,

What I'm looking for with this post is some input on simple file extraction patterns (just A to B, without transformations). Below I'll detail a couple of patterns that we use, with pros and cons, and I'd like to know what others are using and what the benefits of one method are over another.

A caveat on my SnapLogic experience: I've been using it for just over four years, and everything I know about it is self-taught from the community and the documentation, so there may be knowledge gaps in the approaches below that more experienced users could fill in.

In my org, we use either a config-driven generic pattern split across multiple parent/child pipelines, or a more reductive approach with a single pipeline where the process boils down to "browse, then read, then write". If there is enough interest in this topic, I can expand it with some diagrams and documentation.

**Config-driven approach**

- A config file is created with details of source, target, file filter, account, etc.
- A parent pipeline reads the config and, for each row, passes the details to a child pipeline as parameters.
- The child pipeline uses the parameters to check whether the file exists:
  - if not, the process ends and a notification is raised;
  - if it does, another child is called to perform the source-to-target read/write.

A rough sketch of this flow, outside of SnapLogic, follows the pros/cons table below.

| Pros | Cons |
| --- | --- |
| High observability in the dashboard through execution labels in the Pipeline Execute snaps | More complex to set up |
| Child pipelines can be reused for other processes | Increased demand on the Snaplex |
| Child pipelines can be executed in isolation with the correct parameters | Might be more difficult to move through development stages |
| Easier to do versioning and to update separate components of the process | Requires some documentation for new users |
| Some auditing available via parameter visibility in the dashboard | Concurrency could cause issues with throttling/bottlenecks at source/target |
| Can run parallel extractions | |
| Child failures are isolated | |
| Pipelines follow the principle of single responsibility | |
| Easy to test each component | |
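To make the shape of the pattern concrete outside of SnapLogic, here is a minimal Python sketch of the same parent/child flow. It is purely illustrative: the config rows, the field names (`source_dir`, `target_dir`, `file_filter`, `account`), and the `notify`/`transfer_child` helpers are hypothetical stand-ins for the real config file, Pipeline Execute calls, and notification snaps.

```python
# Hypothetical sketch of the config-driven parent/child pattern, expressed as
# plain Python rather than SnapLogic snaps. All paths and field names are
# illustrative, not taken from a real project config.
import glob
import os
import shutil

CONFIG = [
    {"source_dir": "/data/inbound/sales", "target_dir": "/data/landing/sales",
     "file_filter": "*.csv", "account": "sftp_sales"},
    {"source_dir": "/data/inbound/stock", "target_dir": "/data/landing/stock",
     "file_filter": "*.json", "account": "sftp_stock"},
]

def notify(message: str) -> None:
    # Stand-in for whatever notification the child pipeline raises
    print(f"NOTIFY: {message}")

def transfer_child(source_path: str, target_dir: str) -> None:
    # Equivalent of the second child: a straight source-to-target read/write
    shutil.copy(source_path, os.path.join(target_dir, os.path.basename(source_path)))

def check_and_extract_child(row: dict) -> None:
    # Equivalent of the first child: check whether matching files exist,
    # notify and stop if not, otherwise hand each file to the transfer child
    matches = glob.glob(os.path.join(row["source_dir"], row["file_filter"]))
    if not matches:
        notify(f"No files matching {row['file_filter']} in {row['source_dir']}")
        return
    for path in matches:
        transfer_child(path, row["target_dir"])

def parent() -> None:
    # Equivalent of the parent pipeline: one child invocation per config row
    for row in CONFIG:
        check_and_extract_child(row)

if __name__ == "__main__":
    parent()
```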
**Single pipeline approach**

- Can run from a config file or have directories hard coded.
- In one pipeline, a browser checks for files, then a reader and a writer perform the transfer.
- Depending on the requirement, some filtering or routing to different targets can be done.

| Pros | Cons |
| --- | --- |
| Fewer assets to manage | No visibility of which files were picked up without custom logging (just a count in the reader/writer stats in the dashboard) |
| Faster throughput for small files | More difficult to rerun or extract specific files |
| Less demand on the Snaplex | Not as reusable as a multi-pipeline approach; less modular |
| Easier to promote through development stages | No concurrency, so the process could be long depending on file volume/size |
| | More difficult to test each component |

I think both approaches are valid; the choice seems to me to be a trade-off between operability (observability, isolation, re-runs) and simplicity/throughput. I'm interested to hear insights, patterns, and examples from the community to help refine or reinforce what we're doing.

Some other questions I think it would be useful to get input on:

- Which pattern do you default to for file extractions, and why?
- If you use a single pipeline, how do you implement file-level observability (e.g. DB tables, S3 JSON logs, Groundplex logs)?
- How do you handle retries and idempotency in practice (temporary folders, checksums, last-load tracking)?
- What limits or best practices have you found for concurrency control when fanning out with Pipeline Execute?
- Have you adopted a hybrid approach that balances clarity with operational efficiency, and how do you avoid asset sprawl?

Cheers,
Lee