Forum Discussion

PSAmmirata
Employee
5 years ago
Solved

Reading first N records from a file?

I have a file with a large number of rows. I want to read and parse the file, but stop reading after N rows have been parsed. Is this possible? I tried adding a Head snap after the parser snap, but the File Reader keeps reading the entire file, and the pipeline doesn’t complete until the whole file has been read, even though I’m only interested in the first N rows.

11 Replies

  • skatpally
    Former Employee

    The Exit snap can help, but it will mark the pipeline as failed in the Dashboard.

    • PSAmmirata
      Employee

      While not ideal, using the Exit snap helped. I need to ensure that the downstream processing of the N records completes before the Exit snap triggers. I can use the threshold limit in the Exit snap to “delay” the exit a bit.

  • Simple answer: no.

    In theory, snaps like the CSV Parser could be enhanced with a new setting to limit the number of output documents, but there’s little reason to burden snaps with this additional complexity when you can achieve the same result by adding a Head snap.

    May I ask why it’s preferable to stop reading the file early?
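
For readers outside SnapLogic, the behavior under discussion can be sketched in plain Python. This is only an analogy, not how any snap is implemented: `parse_rows`, `draining_head`, and `lazy_head` are hypothetical names. The first head keeps consuming its input after it has the rows it needs, which resembles what the question reports; the second stops pulling as soon as it has N rows.

```python
import csv
import io
from itertools import islice

def parse_rows(lines, counter):
    """Parse CSV lines lazily, counting how many rows were actually read."""
    for row in csv.reader(lines):
        counter["read"] += 1
        yield row

def draining_head(rows, n):
    """Keep the first n rows but still consume the input to the end."""
    kept = []
    for i, row in enumerate(rows):
        if i < n:
            kept.append(row)
    return kept

def lazy_head(rows, n):
    """Keep the first n rows and stop pulling from the input afterwards."""
    return list(islice(rows, n))

# Hypothetical sample input: 1000 two-column rows.
data = "".join(f"{i},value{i}\n" for i in range(1000))

drain_count = {"read": 0}
first_drained = draining_head(parse_rows(io.StringIO(data), drain_count), 3)

lazy_count = {"read": 0}
first_lazy = lazy_head(parse_rows(io.StringIO(data), lazy_count), 3)

print(drain_count["read"], lazy_count["read"])
```

Both heads return the same three rows; the difference is only in how much of the input gets read along the way.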

    • PSAmmirata
      Employee

      In my pipeline I’m performing some analysis on those first N records and writing the analysis results to a file. The input file is very large, and the output file containing the analysis results doesn’t appear to be closed until the pipeline completes (when the File Reader finishes reading the file). The time between the analysis of the first N records completing and the File Reader finishing is significant, at least enough that our user has complained about it.

      • ptaylor
        Employee

        Ok, thanks for the explanation. That makes sense. Your issue isn’t really that the snaps upstream (the File Reader + CSV Parser, or whatever) keep running. It’s that the snaps downstream (a Formatter + File Writer, perhaps) keep running too: they don’t complete (write the file) as soon as the Head snap has written the last document it will write.

        So, yes, there’s actually a simple fix we can make to the Head snap to do just that: close the output view as soon as the desired number of documents has been written. This will cause the downstream snaps to finish writing their output. I just tried it and it works as expected. I think we should be able to get this fix into our forthcoming release planned for Nov 14.
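
The idea of closing the output view early can be illustrated with a small thread-and-queue sketch in Python. This is an assumption-laden analogy, not SnapLogic's code: the `CLOSED` sentinel stands in for a closed view, and `reader`, `head`, `writer`, and `run` are hypothetical names. The head stage signals "closed" downstream as soon as it has forwarded N documents, then keeps draining its input, so the writer can finish while the reader is still working.

```python
import queue
import threading

CLOSED = object()  # hypothetical sentinel standing in for "view closed"

def reader(out_q, total, writer_done, state):
    """Emit documents; hold the last one back until the writer has finished."""
    for i in range(total):
        if i == total - 1:
            # Simulates a long tail of input that is still being read.
            state["writer_finished_first"] = writer_done.wait(timeout=5)
        out_q.put({"row": i})
    out_q.put(CLOSED)

def head(in_q, out_q, n):
    """Forward the first n documents, close the output, then drain the rest."""
    written = 0
    closed = False
    while True:
        if written >= n and not closed:
            out_q.put(CLOSED)  # close downstream as soon as n are written
            closed = True
        doc = in_q.get()
        if doc is CLOSED:
            break
        if written < n:
            out_q.put(doc)
            written += 1
    if not closed:  # input ended before n documents arrived
        out_q.put(CLOSED)

def writer(in_q, results, writer_done):
    """Collect documents until the input is closed, then signal completion."""
    while True:
        doc = in_q.get()
        if doc is CLOSED:
            break
        results.append(doc)
    writer_done.set()

def run(total=100, n=3):
    q1, q2 = queue.Queue(), queue.Queue()
    results, state, writer_done = [], {}, threading.Event()
    threads = [
        threading.Thread(target=reader, args=(q1, total, writer_done, state)),
        threading.Thread(target=head, args=(q1, q2, n)),
        threading.Thread(target=writer, args=(q2, results, writer_done)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, state

results, state = run()
print(results, state)
```

If the head stage only closed its output once its own input ended, the writer in this sketch could not finish before the reader, which is the delay described earlier in the thread.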