Forum Discussion
I remembered a few more things I wanted to say.
For performance, you also want to take advantage of Kafka’s ability to automatically distribute partitions across all of the Consumer instances in the same consumer group. You would do this by running multiple instances of the same pipeline, typically one per node. So if your plex has 4 nodes, you would run one instance of your pipeline on each of those 4 nodes. Assuming the topic you’re processing has multiple partitions (and it certainly should if you care about performance), and you’ve left the Partition setting blank in the Consumer snap’s settings, Kafka will automatically assign the partitions across the different instances in the same group (i.e. those with the same Group ID value). So if your topic has 24 partitions, each of your 4 nodes will get 6 partitions. If one node is temporarily removed, Kafka will automatically rebalance the partitions so that each of the 3 remaining nodes gets 8 partitions. This is called horizontal scaling, and it’s the key to reliable high performance in distributed computing.
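To make that behavior concrete, here’s a rough sketch of the same idea using the plain Java Kafka client rather than the snap itself (the broker address, topic name, and group id below are just placeholders). The key point is that calling subscribe() with a shared group.id, instead of assigning partitions manually, lets the group coordinator spread partitions across all running instances and rebalance them when an instance joins or leaves:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-pipeline-group");       // same Group ID on every node
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() (not assign()) leaves partition assignment to the group coordinator,
            // so partitions are distributed across all instances and rebalanced automatically.
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Run one copy of this (or one pipeline instance) per node and the partitions divide themselves up, exactly as described above.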
I don’t think that copying data to S3 really solves anything. It’s just adding more overhead to a system that you’re trying to optimize. A well-managed Kafka cluster designed for production loads (multiple nodes, elastic disks, etc.) is a very reliable place to keep data. Data in Kafka topics can be read and re-read any number of times by any number of different applications, provided you’ve configured the retention policy for your topics appropriately.
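On retention: if you want every downstream application to be able to re-read the topic, just make sure retention.ms (or retention.bytes) on the topic covers the window you need. As a rough sketch, assuming placeholder broker and topic names, you could set a 7-day retention with the Java AdminClient (the kafka-configs tool can do the same thing from the command line):

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder topic
            // Keep records for 7 days so every application has time to read (and re-read) them.
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> updates = new HashMap<>();
            updates.put(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```

Pick the retention window based on how far back your slowest consumer might ever need to rewind, not on how fast the data arrives.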
Once you’ve reconfigured your pipeline as discussed in my last reply and are running a sufficient number of instances of it in parallel, I think you’ll find you have the reliable, high-throughput solution you’re looking for.