โ06-28-2017 12:00 AM
The scenario I needed to accomplish was to pull data from SQL server, group the data sets together based on different fields and then dynamically define the file name based on those groupings.
Performance was the #1 consideration and competitive tool sets were able to complete the action in under 2 hours with a 300M record set. For my testing and documentation below, Iโm using a local data set of 3.6M records.
Initial attempts at using the SQL Read and SQL Execute snaps quickly excluded those based on required throughput (was getting somewhere between 12k and 18k records per second over the network, 26k records per second local). Calculated out, that would have taken 7-10 hours just for the query to complete.
The final solution ended up being broken down into 3 main steps:
Command line execution to call a BAT file which invokes the BCP command
Reading the dataset created from step 1 and performing the GROUP BY and a mapper to prep the data
Document2Binary snap and then a file write which utilizes the $[โcontent-locationโ] value to dynamical set the name.
Attached to this post is the ZIP export of the project which contains 3 pipelines along with a SAMPLE bat file which starts the bcp utility. You will need to edit the bat file for your specific database requirements, the pipeline parameter for where the output of the bat file is stored and the location of the file you want written out.
Pipeline: 1_Run_BCP_CLI
NOTE: There is a pipeline parameter for the BCP output file that gets generated
โReference to BATโ - points to a file on the local drive which executes the actual BCP process. (My files are located in c:\temp\bcp_testing)
โDEL BCP Outโ will delete the bcp file if it already exists (optional as the bcp process will just overwrite the file anyway)
โcli to runโ is renaming the bat key value (originally done as I was testing the cli and bcp execute - could be removed)
โremove nullsโ will clear out the results from the โDEL BCP Outโ since itโs not part of the command line that needs to be executed.
โExecute CLIโ is a script snap which will kick off the bat file and once completed, return a single record with the results.
โProcess BCPโ Outโ is a pipeline execute which calls 2_BULK_DAT_READ and passes the pipeline parameter for the file to read in the next step
Pipeline: 2_BULK_DAT_READ
โBCP Out File Readโ will use the pipeline parameter value specified for which file to read
โCSV Parserโ self explanatory - does NOT have any headers on the data file (to enhance the pipeline, you could add the second input and define the file format with column names and types)
โGroup by Fieldsโ takes the first 2 field names (field001 and field002) and will create groupings for each set. This is the results of the initials for both first and last name from the BCP SQL Query.
โMapperโ will convert the JSON payload to $content as well as define $[โcontent-locationโ] based on the grouped by fields. The expression is $groupBy.field001+"_"+$groupBy.field002
โPipeline Executeโ will provide both $content and $[โcontent-locationโ] to the next pipeline
Pipeline: 3_DynamicFile_Write
โDocument to Binaryโ with the option for โDocumentโ as the Encode or Decode setting allows the JSON records to be output as a binary stream.
โFile Writerโ will build the protocol, folder specification and the file name based on the provided $[โcontent-locationโ] value from before.
Execution Results
The 3.6M records were actually processed in 16 seconds. The BCP process itself took 24 seconds.
My group by was based on the first characters on both First name and Last name. This process ended up creating 294 files (locally) and used the naming convention of _.json
Notes and File
The file created contains a KEY for โcontentโ and is not pretty print json. For the screen cap above, Iโm utilizing the JSTool โ JSFormat plug-in for Notepad++.
This approach will only create JSON formatted data (not CSV or other formatter options)
BCP is required to be installed and this was only tested on a WINDOWS groundplex
โ08-24-2017 11:09 PM
Tried it on Linux, the script worked as is for the customer, thanks for the post!
โ11-10-2021 07:12 AM
I have to execute BCP command on Linux nodes, How to configure the credentials to run the BCP on groundplex node?