SnapLogic Technical Blog

Scalable Analytics Platform: A Data Engineering Journey

Indrajeet
27 days ago

Introduction

In today's data-driven landscape, the ability to build a scalable and efficient analytics platform is not just advantageous—it's essential. At SnapLogic, with a massive volume of data generated by billions of pipeline executions each month, we encountered this challenge firsthand. The need extended beyond mere visualization; we required in-depth analysis and actionable insights. This blog post will delve into our innovative approach, leveraging Medallion Architecture, to design and implement a scalable analytics platform that empowers users to extract valuable insights and make informed decisions.

Problem Statement

SnapLogic's previous analytics stack, built primarily on MongoDB, struggled with the massive data volume from billions of monthly pipeline executions, leading to scalability and cost issues. MongoDB's limitations hindered complex analytical queries, data transformation, governance, and efficient data modeling. These challenges prompted a shift to a more scalable and cost-effective solution: the Medallion Architecture with S3 and Trino, designed to handle growing data volume and complexity and to deliver better analytical insights and decision-making.

 

The Architecture

Our architecture incorporates the medallion approach with a combination of technologies to enhance data processing and analysis.


Data Sources & Ingestion:

MongoDB serves as our primary data source, and we employ two methods for data extraction, storing the data in S3 in Parquet format:

  • Bulk Loading captures a snapshot of MongoDB at a given moment and ingests it into the Bronze layer of the medallion architecture using Trino.
  • Change Data Capture handles real-time updates. SnapLogic supports over 1,000 organizations and processes around 64 million real-time data updates daily through MongoDB Change Streams. To ensure high throughput and fault tolerance, these change streams are processed by delta loaders and transmitted via Apache Kafka (Amazon MSK).

The extracted data is partitioned and stored in S3 for cost-effective and scalable storage, and subsequently accessed using Trino.
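To make the change-stream path concrete, here is a minimal sketch of the kind of flattening a delta loader might perform before events are written to S3 as Parquet. The field names (`org`, `ts`, the `dt` partition column) are hypothetical illustrations, not our actual schema.

```python
from datetime import datetime, timezone

def change_event_to_record(event: dict) -> dict:
    """Flatten a MongoDB change-stream-style event into a record
    suitable for partitioned Parquet storage (hypothetical fields)."""
    doc = event.get("fullDocument", {})
    # assume the event timestamp arrives as epoch seconds
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        "op": event["operationType"],       # insert / update / delete
        "doc_id": str(doc.get("_id", "")),
        "org": doc.get("org", "unknown"),
        # partition column: one directory per day in S3
        "dt": ts.strftime("%Y-%m-%d"),
        "payload": doc,
    }

event = {
    "operationType": "insert",
    "ts": 1700000000,
    "fullDocument": {"_id": "abc123", "org": "acme", "status": "Completed"},
}
record = change_event_to_record(event)
print(record["op"], record["org"], record["dt"])
```

In the real pipeline this shaping happens inside the delta loaders before the records transit Kafka; the sketch only shows the flattening step itself.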

Medallion Architecture / Transformations:

Medallion architecture is a data design pattern that organizes data logically to enhance its structure and quality incrementally. The architecture organizes data into different layers (Bronze, Silver, Gold) to cater to the needs of various stakeholders. Medallion architectures are sometimes also referred to as multi-hop architectures. This provides a well-governed, comprehensive view of data, from raw data ingestion to consumption-ready formats for analytics and reporting.

 

Runtime data generated by pipeline executions, along with the metadata change streams, are analyzed using Trino, a distributed SQL query engine that efficiently queries data stored in S3. We optimized the data schema in S3 for efficient querying by Trino, considering factors such as data partitioning and compression.
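The partitioning mentioned above typically relies on a Hive-style key layout (`column=value` directories), which lets Trino prune partitions that a query's filters exclude. A small sketch, with a hypothetical bucket and column names:

```python
def partition_key(prefix: str, table: str, org: str, dt: str, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (key=value directories).
    Trino's Hive-compatible connectors use this layout for partition pruning."""
    return f"{prefix}/{table}/org={org}/dt={dt}/{filename}"

key = partition_key("s3://analytics-bronze", "pipeline_runs",
                    "acme", "2024-05-01", "part-0000.snappy.parquet")
print(key)
```

With this layout, a query filtered on `org` and `dt` only scans the matching prefixes instead of the whole table, which is where much of the cost and latency savings come from.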

To implement the medallion-based approach for transformations, we use dbt-based SnapLogic pipelines that represent the various layers. These pipelines run on custom dbt Groundplexes to transform and refine the data, ensuring data quality and preparing it for consumption by downstream applications such as the Asset Catalog and knowledge graphs, which in turn enrich our downstream ML models.

The Medallion architecture comprises distinct layers, each with designated functions. The Bronze layer ingests data from S3 and transfers it to the Silver layer, where the data undergoes cleaning and refinement through normalization. Subsequently, the Gold layer consolidates data using joins performed on the Silver layer, making it readily accessible for consumption by downstream applications.
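The three layers can be illustrated with a deliberately simplified sketch in plain Python. The real transformations run as dbt models on Trino; the records and field names below are invented for illustration only.

```python
# Bronze: raw events as ingested (duplicates, inconsistent casing).
bronze = [
    {"run_id": "r1", "pipeline": "Orders Sync", "status": "COMPLETED", "ms": 1200},
    {"run_id": "r1", "pipeline": "Orders Sync", "status": "COMPLETED", "ms": 1200},  # duplicate
    {"run_id": "r2", "pipeline": "orders sync", "status": "failed", "ms": 300},
]

# Silver: deduplicated by run_id, values normalized.
seen, silver = set(), []
for row in bronze:
    if row["run_id"] in seen:
        continue
    seen.add(row["run_id"])
    silver.append({**row,
                   "pipeline": row["pipeline"].strip().lower(),
                   "status": row["status"].lower()})

# Gold: aggregated, consumption-ready view (runs and failures per pipeline).
gold = {}
for row in silver:
    agg = gold.setdefault(row["pipeline"], {"runs": 0, "failures": 0})
    agg["runs"] += 1
    agg["failures"] += row["status"] == "failed"

print(gold)  # {'orders sync': {'runs': 2, 'failures': 1}}
```

The point of the sketch is the division of responsibility: Bronze keeps raw fidelity, Silver enforces consistency, and Gold serves joins and aggregates that downstream consumers can read directly.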

Consumption

The gold layer, acting as a presentation layer, contains factual, operational, and inferred data consumed by two primary systems.
The Asset Catalog serves as the central System of Record (SoR) for all SnapLogic pipelines, enabling discovery, understanding, and governance of integration assets through exposed metadata (factual, inferred, and supplemented attributes concerning tasks, pipelines, and accounts).
The data within the gold layer tables is synchronized to a graph-based data model in Amazon Neptune. This graph-based model represents the interrelationships between assets, including pipelines, accounts, snaps, and files. Each asset type is represented by nodes, and edges signify dependencies and relationships.
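As a rough illustration of why the graph model helps, here is an in-memory sketch of asset dependency edges and a reverse-edge traversal for impact analysis. The asset names are hypothetical, and the production system stores this graph in Amazon Neptune rather than Python dictionaries.

```python
from collections import deque

# Nodes are (type, name) assets; edges point from an asset to what it depends on.
edges = {
    ("pipeline", "orders_sync"): [("account", "salesforce_prod"), ("snap", "mapper")],
    ("pipeline", "billing"): [("account", "salesforce_prod")],
    ("account", "salesforce_prod"): [],
    ("snap", "mapper"): [],
}

def impacted_by(asset):
    """Return every asset that transitively depends on `asset`
    (a reverse-edge BFS, i.e. impact analysis)."""
    reverse = {}
    for src, deps in edges.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(src)
    out, queue = set(), deque([asset])
    while queue:
        node = queue.popleft()
        for parent in reverse.get(node, []):
            if parent not in out:
                out.add(parent)
                queue.append(parent)
    return out

print(impacted_by(("account", "salesforce_prod")))
# both pipelines depend on the Salesforce account
```

This kind of traversal is what makes questions like "which pipelines break if this account is rotated?" cheap to answer, whereas the same query over flat tables would require repeated self-joins.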

 

Key Considerations During Implementation

  • Data Extraction Overhead
    Initial data extraction and ongoing change data capture require careful planning and optimization.
  • Data Consistency
    Maintaining data consistency between MongoDB and the new architecture is crucial.
  • Architecture Complexity
    Managing a multi-component architecture requires expertise and ongoing maintenance.

Benefits of the New Architecture

  • Cost Reduction: S3 data lakehouse minimizes storage and query expenses, with costs primarily tied to Trino's infrastructure.
  • Scalability and Performance: Medallion architecture with S3 and Trino enables efficient processing and analysis of large datasets, outperforming previous MongoDB-based analytics.
  • Streamlined Data Management: Medallion architecture enhances data organization with defined layers (Bronze, Silver, Gold) for resilience, reusability, and optimized performance for specific use cases.
  • Enhanced Data Insights: Neptune's graph model allows for complex relationship analysis, improving impact assessment and lineage tracking. Optimized S3 schemas for Trino further boost query speed.
  • Improved Data Governance and Compliance: The layered approach provides clear data lineage visibility, aiding governance and compliance efforts.
  • Increased Data Reusability and Development Speed: Well-modeled Silver and Gold layers can be reused across projects, reducing redundancy and accelerating the creation of business entities.

 

Conclusion

In summary, we’ve implemented a robust and scalable analytics platform utilizing the Medallion Architecture, moving from MongoDB-centric analytics to a more efficient, cost-effective, and data-driven approach using S3 and Trino. This architecture has enhanced data management, improved performance, enabled deeper data modeling, and strengthened data governance. Looking forward, we aim to further automate the transformation pipelines and integrate more machine learning models into our data processing layers to drive predictive analytics. Additionally, we are exploring real-time data processing enhancements and expanding our data lakehouse capabilities to support a wider array of data sources and analytical workloads, ensuring our platform remains at the forefront of data innovation.
