
How to add S3 path to document

Andrei_Y
New Contributor III

Hi,

I need your help with the following. For example, I have two files in S3:

 

s3:///bucket_name/file1.json
{
    "data": [
        {
            "id": 1,
            "name": "string 1"
        },
        {
            "id": 2,
            "name": "string 2"
        }
    ]
}

s3:///bucket_name/file2.json
{
    "data": [
        {
            "id": 3,
            "name": "string 3"
        },
        {
            "id": 4,
            "name": "string 4"
        }
    ]
}

 

The output should look like this:

 

[
  {
    "id": 1,
    "name": "string 1",
    "path": "s3:///bucket_name/file1.json"
  },
  {
    "id": 2,
    "name": "string 2",
    "path": "s3:///bucket_name/file1.json"
  },
  {
    "id": 3,
    "name": "string 3",
    "path": "s3:///bucket_name/file2.json"
  },
  {
    "id": 4,
    "name": "string 4",
    "path": "s3:///bucket_name/file2.json"
  }
]

 

I created a pipeline to read multiple files, but I don't know how to add the "path" from the S3 Browser to each document.

Screenshot 2024-06-03 175410.png

Thanks in advance

1 ACCEPTED SOLUTION

koryknick
Employee

@Andrei_Y - the easiest way to retain file information when reading and processing multiple files is to use a parent/child pipeline design.  In the parent, use the S3 Browser followed by a Pipeline Execute snap, and pass the filename to be processed to the child pipeline as a pipeline parameter.  Pipeline parameters are globally accessible in your pipeline, so having the child work on only one file makes it very simple to include the filename with the data after it has been parsed.
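
A rough sketch of that wiring (the parameter name and the S3 Browser output field below are assumptions, not taken from this post):

Parent:  S3 Browser  ->  Pipeline Execute
         Pipeline Execute parameter:   filepath = $Path          (the field name on the S3 Browser output is an assumption)

Child:   S3 Download  ->  JSON Parser  ->  JSON Splitter  ->  Mapper
         S3 Download File property:    _filepath
         Mapper expression:            _filepath  ->  $path      (pipeline parameters are referenced with a leading underscore)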

If you don't want to use a parent/child design, take a look at the attached pipeline, which attaches the file metadata to each document parsed from the JSON files.

koryknick_0-1717702227371.png

Here, the Binary Copy splits the binary stream and allows me to grab the metadata separately from the content.  The top path uses the JSON Parser with the "Process Array" property unchecked so that the entire file content is retained in a single JSON document on the output.  This is important when we get to the Join snap (Merge Content to metadata).  The Mapper that follows (Map content array) just takes the parsed content and puts it under a named field so I can reference it in the JSON Splitter.
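
As a minimal guess at what that Mapper does (the target field name is an assumption): with "Process Array" unchecked, each file arrives as one document shaped like the samples above, and the Mapper simply moves the array under a known name.

Mapper (Map content array):      $data  ->  $content

Input (one document per file):   { "data": [ { "id": 1, "name": "string 1" }, { "id": 2, "name": "string 2" } ] }
Output:                          { "content": [ { "id": 1, "name": "string 1" }, { "id": 2, "name": "string 2" } ] }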

The bottom path after the Binary Copy simply uses a Mapper snap with the input view set to Binary so I can capture the binary header information from the file.  The "Drop Content" Mapper snap removes the actual data from this path, since I already have it parsed into JSON from the top path.
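
To picture the bottom path: a Binary input view exposes each file's binary header as the document, so a Mapper can lift the location out of it. The header field name below is an assumption.

Mapper (Binary input view):   $['content-location']  ->  $path    (header field name is an assumption)
Mapper (Drop Content):        removes the raw content field, keeping only $path and any other metadata you want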

From here, I just use the Join snap with the Join type set to Merge, which combines the documents from the top and bottom paths record by record: 1=>1, 2=>2, 3=>3, and so on.  Finally, the JSON Splitter explodes the contents of each file and includes the file metadata with each output record.
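
Putting the two paths together, each file's merged document would look roughly like this (field names follow the assumptions above), and the JSON Splitter fans it out; checking "Include scalar parents" on the splitter is what copies the path onto every output record.

After the Join (Merge), one document per file:
{ "content": [ { "id": 1, "name": "string 1" }, { "id": 2, "name": "string 2" } ], "path": "s3:///bucket_name/file1.json" }

JSON Splitter:  Json Path = $content[*],  Include scalar parents = checked

Output:
{ "id": 1, "name": "string 1", "path": "s3:///bucket_name/file1.json" }
{ "id": 2, "name": "string 2", "path": "s3:///bucket_name/file1.json" }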

Hope this helps!

 


3 REPLIES


Andrei_Y
New Contributor III

@koryknick It works, thanks. I also found another solution for JSON files in S3: I used S3 Select instead of S3 Download:

"select s.*, '" + $path + "' as path from S3Object[*].data[*] as s"

koryknick
Employee

Thanks @Andrei_Y - I learned something new today!  I hadn't thought about that!