Splitting zip files

jfpelletier
New Contributor III

Hello all,

As part of my pipeline I'm downloading a .zip file from a server (using the HTTP Client snap). The zip file contains documents in different languages, each stored under a folder named after its language. For example:

MyFile.zip
- japanese/document1.docx
- japanese/document2.docx
- korean/document1.docx
- korean/document2.docx
- german/document1.docx
- german/document2.docx

I need to split the content based on the language (top folder), and create new zip files named after the language. Each language specific zip file needs to contain only the files for that language of course.

Japanese.zip
- japanese/document1.docx
- japanese/document2.docx

Korean.zip
- korean/document1.docx
- korean/document2.docx

German.zip
- german/document1.docx
- german/document2.docx

I tried to play a bit with the Zip Reader and Zip Writer snaps, but I'm struggling with writing the files to the SLDB, sorting them and putting them into their respective final zip files. The best I managed to do is to include ALL the files in a single resulting zip file that was named after one of the languages in the original file.

Is there a way to manipulate files inside the zip files that's more efficient and doesn't require writing them to the SLDB?

Thanks in advance for any help!

Kind regards,

JF

koryknick
Employee

@jfpelletier - try something like this:

  • Use a ZipFile Read to get each file in the zipfile
  • Use a Binary Router to look at the $['content-location'] value to see which output zip file it should be written to
  • Use a ZipFile Write for each output file
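
For readers who want to see the routing logic outside of SnapLogic, here is a minimal Python sketch of the same idea, not the pipeline itself. It assumes the expected languages are known in advance (like the fixed outputs of a Binary Router) and that the top folder of each entry path plays the role of the $['content-location'] check. The function name `route_entries` is hypothetical.

```python
import io
import zipfile


def route_entries(source_bytes: bytes, languages: list[str]) -> dict[str, bytes]:
    """Copy each entry of the source zip into the output zip for its top folder.

    Entries whose top folder is not in `languages` are skipped, much like
    documents that match no output of a router.
    """
    buffers = {language: io.BytesIO() for language in languages}
    writers = {
        language: zipfile.ZipFile(buffers[language], "w", zipfile.ZIP_DEFLATED)
        for language in languages
    }
    with zipfile.ZipFile(io.BytesIO(source_bytes)) as source:
        for info in source.infolist():
            if info.is_dir():
                continue  # directory entries carry no file content
            top_folder = info.filename.split("/", 1)[0]
            writer = writers.get(top_folder)
            if writer is not None:
                # Keep the original path inside the per-language archive.
                writer.writestr(info.filename, source.read(info))
    for writer in writers.values():
        writer.close()  # finalizes each archive's central directory
    return {language: buffers[language].getvalue() for language in languages}
```

Every output has to be declared up front, which is why this shape becomes unwieldy with a large number of possible languages.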

koryknick_0-1705417194612.png

 

koryknick_0-1705416553334.png

Hope this helps!

PS - If you need a more generic solution that doesn't require knowing the possible languages in advance, I can think of a couple of ways to do it, depending on your requirements. They may require writing the files temporarily (not necessarily to SLDB), which would be less efficient than this solution, which performs the entire operation on streaming data.

Hello @koryknick,

Thanks a lot for your reply. This should work because I know which language(s) to expect for this particular case. My only concern is that there are over 50 potential languages, and my pipeline would look very odd if I design it with a Binary Router that has 50+ outputs. Would it be a good practice to have a Binary Router with 50+ outputs, or should the solution be more dynamic (based on actual languages included in the ZIP)?

Kind regards,

JF

koryknick
Employee

@jfpelletier - Attached is a much more generic solution that uses the execution node tmp space as a cache to unzip the files temporarily to allow for re-zipping into a file per language.  There are 2 pipelines: a parent to split the input zipfile into individual files in the tmp mount, and a child pipeline that is called to re-zip the files for each language that was contained in the input zip.

The parent pipeline reads the zipfile, writes each file to tmp, waits for the entire zip to be split out (using the Tail snap), gets the list of language directories that were created (using the Directory Browser snap), and calls the child (using the Pipeline Execute snap) to write one zip file per language.
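
As a rough Python analogue of the parent's steps (not the pipeline itself), the sketch below extracts everything into a scratch directory and only then lists the language folders that appeared. The function name `extract_and_list_languages` and the directory layout are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def extract_and_list_languages(zip_path: str, scratch_dir: str) -> list[str]:
    """Unzip into scratch_dir, then return the top-level folder names found there."""
    with zipfile.ZipFile(zip_path) as source:
        # extractall returns only after every entry has been written, which is
        # the role of waiting for the whole zip to be split out.
        source.extractall(scratch_dir)
    # Each top-level directory corresponds to one language to re-zip.
    return sorted(p.name for p in Path(scratch_dir).iterdir() if p.is_dir())
```

Each returned name would then be handed to the re-zip step, one call per language.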

koryknick_0-1705431347187.png

Note that the tmp files are written using the pipe.tmpDir built-in value.  This directory is used as scratch space of the executing pipeline and only exists as long as the pipeline is running - all contents are automatically purged as soon as the pipeline ends (either success or failure).

The child pipeline gets the list of files in the language directory, reads the file content, uses the Mapper snap to change the "content-location" value to recreate the file path, then creates a single zipfile with all of the files associated with that language.
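
The child's job can be sketched in Python as follows, again as an illustration rather than the pipeline itself. It assumes the scratch directory already contains one folder per language; the function name `zip_language_dir` and the `<language>.zip` naming are hypothetical, and capitalizing the file name is not handled.

```python
import zipfile
from pathlib import Path


def zip_language_dir(scratch_dir: str, language: str, out_dir: str) -> Path:
    """Write out_dir/<language>.zip containing every file under scratch_dir/<language>."""
    scratch = Path(scratch_dir)
    target = Path(out_dir) / f"{language}.zip"
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted((scratch / language).rglob("*")):
            if file.is_file():
                # Store the path relative to the scratch root, so the entry
                # reads "<language>/<file>" as in the original zip.
                archive.write(file, file.relative_to(scratch).as_posix())
    return target
```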

koryknick_1-1705431662527.png

Note that the Mapper snap in this pipeline was switched to use Binary input and output views.  In this context, the Mapper is only going to affect the binary header values, not the actual file content.
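
To illustrate the header-versus-content distinction, here is a small Python model. It assumes a binary document can be pictured as a header dictionary plus a byte payload, and that the rewrite amounts to stripping the scratch directory prefix from "content-location"; the exact expression used in the attached pipeline may differ. `BinaryDocument` and `relocate` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BinaryDocument:
    header: dict
    content: bytes


def relocate(doc: BinaryDocument, scratch_prefix: str) -> BinaryDocument:
    """Return a copy whose content-location is relative to scratch_prefix."""
    location = doc.header["content-location"]
    if location.startswith(scratch_prefix):
        location = location[len(scratch_prefix):].lstrip("/")
    # Only the header changes; the payload bytes are carried over untouched.
    return replace(doc, header={**doc.header, "content-location": location})
```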

koryknick_2-1705432190772.png

Note that the syntax used in these pipelines to refer to the "tmp" directory location may not work as-is on a snaplex running under Windows.  It was developed and tested on a snaplex running Linux.

Hope this helps!

Hello @koryknick,

Thanks a lot for this, it worked for me first time and I got exactly what I needed as results!

And I want to add an additional big thank you for the explanations! Your response is very well documented and it's very clear. It really helps!

Kind regards,

JF