01-10-2019 01:37 AM
To save data in Parquet format in S3, the pipeline configuration is shown below.
This pipeline creates the metadata (schema) from the data itself, although it uses the Parquet data type "binary" for every column, which is equivalent to a string.
The first Mapper converts every value in the document to a string, replacing nulls with an empty string. We used the following arrow function in the first Mapper:
$.mapValues((value, key) => value == null ? "" : value.toString())
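For readers who do not know the expression language, here is a rough Python sketch of what that first Mapper expression does. It is not SnapLogic code. The function name `stringify_values` and the sample record are made up for illustration, and Python's `str()` does not format every type the same way `toString()` does (booleans, for example).

```python
def stringify_values(doc):
    """Return a copy of doc where every value is a string; None becomes ''."""
    return {key: "" if value is None else str(value) for key, value in doc.items()}


# Hypothetical sample document, not taken from the original pipeline.
record = {"id": 42, "name": "widget", "price": None}
print(stringify_values(record))
# {'id': '42', 'name': 'widget', 'price': ''}
```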
The second Mapper creates the metadata from the data, one entry per key. The arrow function used for this:
$.keys().map(x => {"col_name": x, "data_type": "binary"})
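As a rough illustration, the same schema-building step might look like this in Python. This is only a sketch of the logic, not SnapLogic code. The function name `build_schema` and the sample record are assumptions. The `col_name` and `data_type` fields come from the expression above.

```python
def build_schema(doc):
    """Build one schema entry per key, typing every column as 'binary'."""
    return [{"col_name": key, "data_type": "binary"} for key in doc]


# Hypothetical sample document whose values are already strings.
record = {"id": "42", "name": "widget"}
print(build_schema(record))
# [{'col_name': 'id', 'data_type': 'binary'}, {'col_name': 'name', 'data_type': 'binary'}]
```

Because every column is declared as "binary", the values have to be strings before they reach the writer, which is why the string conversion comes first.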
To read the file back, the configuration of the Parquet Reader is shown below:
Please note that the "Use old data format" option may be critical; without it, the read may fail. Select this option when the data is not nested.