Reading first N records from a file?

PSAmmirata
Employee

I have a file with a large number of rows. I want to read the file and parse it, but stop reading after N rows have been parsed. Is this possible? I tried adding a Head snap after the parser snap, but the File Reader continues to read the entire file, and the pipeline doesn't complete until the whole file has been read, even though I'm only interested in the first N rows.
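
In plain Python terms, the behavior I'm after looks something like this (just a sketch to show the intent; `data.csv` and `N` are placeholders):

```python
import csv
from itertools import islice

N = 1000  # stop after this many parsed rows

with open("data.csv", newline="") as f:
    reader = csv.reader(f)              # parses lazily, one row at a time
    first_n = list(islice(reader, N))   # pulls exactly N rows; the rest of
                                        # the file is never read or parsed

print(len(first_n))
```

Because the reader is lazy and `islice` stops pulling after N rows, the process finishes as soon as those rows are parsed. I'd like the pipeline to behave the same way.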

1 ACCEPTED SOLUTION

Ok, thanks for the explanation. That makes sense. Your issue isn't really with the fact that the snaps upstream (the File Reader + CSV Parser, or whatever) keep running. It's with the fact that the snaps downstream (a Formatter + File Writer, perhaps) do too: they don't complete (write the file) as soon as the Head snap has written the last document it will write.

So, yes, there's actually a simple fix we can make to the Head snap to do just that: close the output view as soon as the desired number of documents has been written. This will cause the downstream snaps to finish writing their output. I just tried it and it works as expected. I think we should be able to get this fix into our forthcoming release, planned for Nov 14.
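
Outside of SnapLogic, the fix behaves like this sketch (a rough model only: threads and bounded queues stand in for snaps and views, and the names and sizes are illustrative):

```python
import queue
import threading

DONE = object()  # sentinel meaning "this view is closed"

def reader(out_q):
    # Stands in for File Reader + CSV Parser: a very long stream.
    for i in range(100_000):
        out_q.put({"row": i})
    out_q.put(DONE)

def head(in_q, out_q, n):
    # The fixed Head: forward n documents, then close the output view
    # right away instead of waiting for the upstream to finish.
    sent, saw_done = 0, False
    while sent < n:
        doc = in_q.get()
        if doc is DONE:
            saw_done = True
            break
        out_q.put(doc)
        sent += 1
    out_q.put(DONE)          # downstream can now finish and close its file
    while not saw_done:      # keep draining so the upstream never blocks
        saw_done = in_q.get() is DONE

def writer(in_q):
    # Stands in for Formatter + File Writer.
    with open("first_n.txt", "w") as f:
        while (doc := in_q.get()) is not DONE:
            f.write(f"{doc['row']}\n")
    print("output file written and closed")  # happens almost immediately

q1, q2 = queue.Queue(maxsize=100), queue.Queue(maxsize=100)
threads = [
    threading.Thread(target=reader, args=(q1,)),
    threading.Thread(target=head, args=(q1, q2, 5)),
    threading.Thread(target=writer, args=(q2,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The writer closes `first_n.txt` as soon as the head closes its output, even though the reader is still producing; that's exactly the behavior this fix gives the Head snap.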

11 REPLIES

ptaylor
Employee

Simple answer: no.

In theory, snaps like the CSV Parser could be enhanced with a new setting to limit the number of output documents, but there's little reason to burden snaps with this additional complexity when you can achieve the same result by adding a Head snap.
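
To illustrate why, here's a rough Python analogy (not SnapLogic internals; the stand-in generators are hypothetical): a single generic head stage composes with any upstream source, so no individual parser needs its own limit setting.

```python
def head(stream, n):
    # A generic Head stage: yield the first n items, then stop.
    # Laziness means nothing upstream is ever asked for item n+1.
    for i, item in enumerate(stream):
        if i >= n:
            return
        yield item

# Hypothetical composition mirroring File Reader -> CSV Parser -> Head:
lines = (f"a,b,{i}" for i in range(10_000))   # stand-in for a file reader
parsed = (line.split(",") for line in lines)  # stand-in for a CSV parser
for row in head(parsed, 5):
    print(row)
```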

May I ask why it's preferable to stop reading the file early?

In my pipeline I'm performing some analysis on those first N records and writing the analysis results to a file. The input file is very large, and the output file containing the analysis results doesn't appear to be closed until the pipeline completes (when the File Reader finishes reading the file). The delay between the analysis of the first N records finishing and the File Reader reaching the end of the file is significant; at least significant enough that our user has complained about it.

@ptaylor - Once this change is implemented, is the Exit snap the best way to programmatically stop the pipeline once the Head's output view is closed?

Why do you need the pipeline to stop if the output file containing the analysis results has been written and closed?

But, yes, I suppose you could add an Exit snap after the File Writer if you do want the pipeline to exit. But even if you don't, the File Writer will have completed and written the file that your user needs, even while the Head continues to consume its input and the pipeline continues to run.