Pipeline Design and Performance Optimization Guide
Introduction This document serves as a comprehensive best practice guide for developing efficient and robust Pipelines within the SnapLogic Platform. It offers guidelines that aim to optimize performance, enhance maintainability, reusability, and provide a basis for understanding common integration scenarios and how best to approach them. The best practices encompass various aspects of Pipeline design, including Pipeline behavior, performance optimization and governance guidelines. By adhering to these best practices, SnapLogic developers can create high-quality Pipelines that yield optimal results while promoting maintainability and reuse. The content within this document is intended for the SnapLogic Developer Community or an Architect, in addition to any individuals who may have an influence on the design, development or deployment of Pipelines. Authors: SnapLogic Enterprise Architecture team Why good Pipeline Design is important The SnapLogic Pipeline serves as the foundation for orchestrating data across business systems, both within and outside of an organization. One of its key benefits is its flexibility and the broad range of "Snaps" that aim to reduce the complexity involved in performing specific technical operations. The “SnapLogic Designer”, a graphical low-code environment for building an integration use case with Snaps, provides a canvas enabling users with little technical knowledge to construct integration Pipelines. As with any user-driven environment, users must exercise careful attention to ensure they not only achieve their desired business goals but also adhere to the right approach that aligns with industry and platform best practices. When dealing with a SnapLogic Pipeline, these best practices may encompass various considerations: Is my Pipeline optimized to perform efficiently? Will the Pipeline scale effectively when there's an increase in data demand or volume? If another developer were to review the Pipeline, would they easily comprehend its functionality and intended outcome? Does my Pipeline conform to my company's internal conventions and best practices? Not considering these factors may cause undesirable consequences for the business and users concerned. Relative to the considerations stated above, these consequences could be as follows: If data is not delivered to the target system, there may be financial consequences for the business. The business may experience data loss or inconsistency when unexpected demand occurs. Development and project teams are impacted if they are unable to deliver projects in a timely fashion. Lack of internal standardization limits a company's ability to govern usage across the whole business, thus making them less agile. Therefore, it is essential that users of the Platform consider best practice recommendations and also contemplate how they can adopt and govern the process to ensure successful business outcomes. Understanding Pipeline Behavior To better understand how Pipelines can be built effectively within SnapLogic, it is essential to have an understanding of the Pipeline’s internal characteristics and behaviors. This section aims to provide foundational knowledge about the internal behavior of Pipelines, enabling you to develop a solid understanding of how they operate and help influence better design decisions. Pipeline Execution States The execution of a SnapLogic Pipeline can be initiated either via a Triggered, Ultra or Scheduled task. 
In each case, the Pipeline transitions through a number of different 'states', with each state reflecting a distinct stage in the processing lifecycle of the Pipeline, from invocation and preparation through execution to completion. The following section highlights this process in more detail and explains some of the internal behaviors.
The typical Pipeline execution flow is as follows:
Initialize Pipeline.
Send Metadata to Snaplex.
Prepare Pipeline, fetch & decrypt account credentials.
Connect to endpoint security.
Send execution metrics.
Pipeline completes, and resources are released.
The following section describes the different Pipeline state transitions & respective behavior in sequential order.
State | Purpose
NoUpdate | A pre-preparing state. This indicates a request to invoke a Pipeline has been received but the leader node or control plane is trying to establish which Snaplex node it should run on. (This state is only relevant if the Pipeline is executed on the leader node).
Preparing | Indicates the retrieval of relevant asset metadata including dependencies from the control plane relating to the invoked Pipeline. This process also carries out pre-validation of Snap configuration, alerting the user of any missing mandatory Snap attributes.
Prepared | Pipeline is prepared and is ready to be executed.
Executing | Pipeline executes and processes data, connecting to any Snap Endpoints using the specified protocols.
Completed | Pipeline execution is complete, and teardown releases the compute resources within the Snaplex node.
Final | Pipeline execution metrics are sent to the Control Plane.
Table 1.0 Pipeline state transitions
Pipeline execution flow
Pipeline Design Decision Flow
The following decision tree can be used to establish the best Pipeline Design approach for a given use case.
Snap Execution Model
Snaps can be generally categorized into these types:
Fully Streaming: Most Snaps follow a fully streaming model, i.e. read one document from the Input view (or from the source endpoint for Read Snaps), and write one document to the Output view or to the target endpoint.
Streaming with batching: Some Snaps are streaming with batching behavior. For example, the DB Insert Snap reads N documents and then makes one call to the database (where N is the batch size set in the database account).
Aggregating: Aggregating type Snaps (e.g. Aggregate, Group By, Join, Sort, Unique etc.) read all input documents before any output is written to the Output view. Aggregating Snaps can change the Pipeline execution characteristics significantly as these Snaps must receive all upstream documents before processing and sending the documents to the downstream Snaps.
Pipeline Data Buffering
Connected Snaps within a Pipeline communicate with one another using Input and Output views. An Input view accepts data passed from an upstream Snap; the Snap operates on the data and then passes it to its Output view. Each view implements a separate in-memory ring buffer at runtime. Given the following example, the Pipeline will have three separate ring buffers. These are represented by the circular connections between each Snap (diamond shaped connections for binary Snaps). The size of each ring buffer can be configured by setting the below feature flags on the org. The default values are 1024 and 128 for DOCUMENT and BINARY data formats respectively.
com.snaplogic.cc.jstream.view.publisher.AbstractPublisher.DOC_RING_BUFFER_SIZE=1024
com.snaplogic.cc.jstream.view.publisher.AbstractPublisher.BINARY_RING_BUFFER_SIZE=128
The values must be set as powers of two. The source Snap reads data from the endpoint and writes to the Output view. If the buffer is full (i.e. if the Consumer Snap is slow), then the Producer Snap will block on the write operation for the 1025th document.
Pipeline branches execute independently. However, in some cases, the data flow of a branch in a Pipeline can get blocked until another branch completes streaming the document. Example: A Join Snap might hang if its upstream Snaps (e.g. Copy, Router, Aggregator, or similar) have a blocked branch. This can be alleviated by setting Sorted streams to Unsorted in the Join Snap to buffer all documents in input views internally.
The actual number of threads that a Pipeline consumes can be higher than the number of Snaps in the Pipeline. Some Snaps such as Pipeline Execute, Bulk loaders, and Snaps performing input/output can use a higher number of threads compared to other Snaps.
Sample Pipeline illustration for threads and buffers
The following example Pipeline demonstrates how the usage and composition of Snaps within a Pipeline change the characteristics of how the Pipeline will operate once it is executed.
Segment 1
Segment 2
Six threads are initialized at Pipeline startup. There are a total of seven ring buffers. The Copy Snap has two buffers; all other Snaps have one output buffer each. There are two segments that run in parallel and are isolated (other than the fact that they run on the same node, sharing CPU/memory/IO bandwidth). The first segment has two branches. Performance of one branch can impact the other. For example, if the SOAP branch is slow, then the Copy Snap's buffer for the SOAP branch will get full. At this point, the Copy Snap will stop processing documents until there is space available in the SOAP branch's buffer. Placing an aggregating Snap like the Sort Snap in the slow branch changes the performance characteristics significantly as the Snap must receive all upstream documents before processing and sending the documents to the downstream Snaps.
Memory Configuration thresholds
Property / Threshold | Where configured | Default value | Comments
Maximum memory % | Node properties tab of the Snaplex | 85 (%) | Threshold at which no more Pipelines will be assigned to a node
Pipeline termination threshold | Internal (Can be configured by setting the feature flag com.snaplogic.cc.snap.common.SnapThreadStatsPoller.MEMORY_HIGH_WATERMARK_PERCENT at the org level) | 95 (%) | Threshold at which the active Pipeline management feature kicks in and terminates Pipelines when the node memory consumption exceeds the threshold. Ideal range: 75-99
Pipeline restart delay interval | Internal (Can be configured by setting the feature flag com.snaplogic.cc.snap.common.SnapThreadStatsPoller.PIPELINE_RESTART_DELAY_SECS at the org level) | 30 (seconds) | One Pipeline is terminated every 30 seconds until the node memory goes below the threshold (i.e. goes below 95%)
Table 2.0 Snaplex node memory configurations
The above thresholds can be optimized to minimize Pipeline terminations due to Out-of-Memory exceptions. Note that the memory thresholds are based on the Physical memory on the node, and not the Virtual / Swap memory.
Additional Reference: Optimizations for Swap Memory
Hypothetical scenario
Add 16 GB swap memory to a Snaplex node with 8 GB physical memory.
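The swap space itself is provisioned at the operating system level, outside of SnapLogic. A minimal sketch for a Linux node (assuming root access, sufficient free disk, and a hypothetical swap file path of /swapfile; this is generic Linux administration, not a SnapLogic-specific procedure):

# Create and enable a 16 GB swap file
fallocate -l 16G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Persist the swap file across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab

# Verify the new swap space
free -h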
Property | Comments
Swap Space on the server | Add 16 GB of swap / virtual memory to the node.
Total Memory | Total Memory is now = 24 GB (8 GB Physical plus 16 GB Virtual)
Maximum Heap Size | Set to 90% (of 24 GB) = 22 GB
Maximum Memory | Set to 31% rounded (of 22 GB) = 7 GB
The intent of the above calculation is to ensure that the JCC utilizes 7 GB of the available 8 GB memory for normal workloads. Beyond that, the load balancer can queue up additional Pipelines or send them to other nodes for processing. If the Pipelines that are running collectively start using over 7 GB of memory, then the JCC can utilize up to 22 GB of the total heap memory by using the OS swap space per the above configuration.
Table 3.0 Snaplex node memory configurations
By updating the memory configurations as in the above example, the JCC utilizes 7 GB of the available 8 GB memory. Beyond that value, the load balancer would queue up additional Pipelines or distribute them across other nodes. Use the default configurations for normal workloads, and use the swap-enabled configuration for dynamic workloads. When your workload exceeds the available physical memory and the swap is utilized, the JCC can become slower due to the additional IO overhead caused by swapping. Hence, configure a higher timeout for jcc.status_timeout_seconds and jcc.jcc_poll_timeout_seconds for the JCC health checks. We recommend that you limit the maximum swap used by the JCC to 16 GB. Using a larger swap configuration causes performance degradation during JRE garbage collection operations.
Modularization
Modularization can be implemented in SnapLogic Pipelines by making use of the Pipeline Execute Snap. This approach enables you to:
Structure complex Pipelines into smaller segments through child Pipelines.
Initiate parallel data processing using the Pooling option.
Reuse child Pipelines.
Orchestrate data processing across nodes, within the Snaplex or across Snaplexes.
Distribute global values through Pipeline parameters across a set of child Pipeline Snaps.
Modularization best practices:
Modularize by business or technical functions.
Modularize based on functionality and avoid deep nesting or nesting without a purpose.
Modularize to simplify overly-complex Pipelines and reduce in-page references.
Use the Pipeline Execute Snap over other Snaps such as Task Execute, ForEach, Auto-router (i.e. Router Snap with no routes defined with expressions), or Nested Pipelines.
Pipeline Reuse with Pipeline Execute
Detailed documentation with examples can be found in the SnapLogic documentation for Pipeline Execute.
Use Pipeline Execute when: The child Pipeline is CPU/memory heavy and parallel processing can help increase throughput.
Avoid when: The child Pipeline is lightweight, where the distribution overhead can be higher than the benefit.
Additional recommendations and best practices for the Pipeline Execute Snap:
Use Reuse mode to reduce child runtime creation overhead. Reuse mode allows each child Pipeline instance to process multiple input documents. Note that the child Pipeline must be a streaming Pipeline for reuse mode.
Use the batching (Batch size) option to batch data (avoid grouping records in the parent).
Use the Pool size (parallelism) option to add concurrency.
If the document count is low, then use the Pipeline Execute Snap for structuring Pipelines; else embed the child segment within the parent Pipeline instead of using Pipeline Execute.
Set the Pool Size to > 1 to enable concurrent executions up to the specified pool size.
Set Batch Size = N (where N > 1).
This sends N documents to the child Pipeline input view.
Use Execute On to specify the target Snaplex for the child Pipeline. Execute On can be set to one of the below values:
LOCAL_NODE. Runs the child Pipeline on the same node as the parent Pipeline. This is recommended when the child Pipeline is being used for Pipeline structuring and reuse rather than Pipeline workload distribution. This option is used for most child Pipeline executions.
LOCAL_SNAPLEX. Runs the child Pipeline on one of the available nodes in the same Snaplex as the parent Pipeline. The least utilized node principle is applied to determine the node where the child Pipeline will run. This has a dependency on the network, and must be used when workload distribution within the Snaplex is required.
SNAPLEX_WITH_PATH. Runs the child Pipeline on a user-specified Snaplex. This allows high workload distribution, and must be used when the child Pipeline has to run on a different Snaplex, either for endpoint connectivity restrictions or for effective workload distribution. This option also allows you to use Pipeline parameters to define relative paths for the Snaplex name.
Additional Pipeline design recommendations
This section lists some recommendations to improve Pipeline efficiency.
SLDB
Note: SLDB should not be used as a file source or as a destination in any SnapLogic orgs (Prod / Non-Prod). You can use your own Cloud storage provider for this purpose. You may encounter issues such as file corruption, Pipeline failures, inconsistent behavior, SLA violations, and platform latency if using SLDB instead of a separate Cloud storage for the file store. This applies to all File Reader / Writer Snaps and the SnapLogic API:
File Read from an SLDB File source.
File Write operations to SLDB as a destination.
Use your own Cloud storage instead of SLDB for the following (or any other) File Read / Write use-cases:
Store last run timestamps or other tracking information for processed documents.
Store log files.
Store other sensitive information.
Read files from the SLDB store.
Avoid using the Record Replay Snap in Production environments as the recorded documents are stored in an SLDB path, making them visible to users with Read access.
Snaps
Enable Pagination for Snaps where supported (e.g. REST Snaps, HTTP Client, GraphQL, Marketo, etc.). There should also always be a Pagination interval to ensure that too many requests are not made in a short time.
Use the Group By N Snap where there is a requirement to limit request sizes, e.g. Marketo API requests.
The Group By Fields Snap creates a new group every time a record with a different Group Field value is received. Place a Sort Snap before Group By Fields to avoid multiple sets of documents with the same group value.
An XML Parser Snap with a Splitter expression reduces memory overhead when reading large XML files.
Use an Email Sender Snap with a Group By Snap to minimize the number of emails that get sent out.
Pipelines
Batch Size (only available if the Reuse executions option is not enabled) is used to control the number of records that are passed into a child Pipeline. Setting this value to 1 will pass a single record for each instance of the child Pipeline. Avoid using this approach when processing large volumes of documents.
Do not schedule a chain reaction. When possible, separate a large Pipeline into smaller pieces and schedule the individual Pipelines independently. Distribute the execution of resources across the timeline and avoid a chain reaction.
Integration API limits must not be exceeded across all integrations running at the same time. Group By Snaps or Pipeline Execute can be used to achieve this.
Optimization recommendations for common scenarios
Scenario | Recommendation | Feature(s)
Multiple Pipelines with similar structure | Use parameterization with Pipeline Execute to reuse Pipelines | Pipeline Execute, Pipeline parameters
Bulk Loading to target datasource | Use Bulk Load Snaps where available (e.g. Azure SQL - Bulk Load, Snowflake - Bulk Load) | Bulk Loading
Mapper Snap contains a large number of mappings where the source & target field names are consistent | Enable the "Pass through" setting on the Mapper. | Mapper - Pass Through
Processing large data loads | Perform the target load operation within a child Pipeline using the "Pipeline Execute" Snap with "Execute On" set to "LOCAL_SNAPLEX". | Pipeline Execute
Performing complex transformations and/or JOIN/SORT operations across multiple tables | Perform transformations & operations within the SQL query | SQL Query Snaps
High Throughput Message Queue to Database ingestion | Batch polling and ingestion of messages by specifying matching values for Max Poll Record (Consumer Snap) with Batch Size (Database Account Setting), and performing database ingestion within a child Pipeline with Reuse enabled on the Pipeline Execute Snap. | Consumer Snaps, Database Load Snaps
Table 4.0 Optimization recommendations
Configuring Triggered and Ultra Tasks for Optimal Performance
Ultra Tasks
Definition and Characteristics
An Ultra Task is a type of task which can be used to execute Ultra Pipelines. Ultra Tasks are well-suited for scenarios where there is a need to process large volumes of data with low latency, high throughput, and persistent execution. While the performance of an Ultra Pipeline largely depends on the response times of the external applications to which the Pipeline connects, there are a number of best practice recommendations that can be followed to ensure optimal performance and availability.
General Ultra Best Practices
Before building an Ultra Pipeline, consult the "Snap Support for Ultra Pipelines" documentation to understand if the desired Snaps are supported.
For optimal Ultra performance, deploy a dedicated Snaplex to support Ultra workloads.
There are two modes of Ultra Tasks - Headless Ultra and Low Latency API Ultra - with each mode being characterized by the design of the Pipeline which is invoked by the Ultra Task. The modes are described in more detail below.
Headless Ultra
A Headless Ultra Pipeline is an Ultra Pipeline which does not require a FeedMaster, and where the data source is a Listener or Consumer type construct, for example Kafka Consumer, File Poller, SAP IDOC Listener (refer to the Snap Support for Ultra Pipelines documentation for a detailed list of supported Snaps). The Headless Ultra Pipeline executes continuously and polls the data source according to the frequency configured within the Snap, passing documents from the source to downstream Snaps.
Use Cases
Processing real-time data streams such as message queues.
High volume message or file processing patterns with concurrency.
Publish/Subscribe messaging patterns.
Best Practices
Deploy multiple instances of the Ultra Task for High Availability.
Decompose complex Pipelines into independent Pipelines using a Publish-Subscribe pattern.
Lower the dependency on the Control Plane by avoiding the use of expressions to declare queue names, account paths, etc.
Set the 'Maximum Failures' Ultra Task configuration threshold according to the desired tolerance for failure.
For long running Ultra Pipelines, set the 'Max In-Flight' option to a higher value within the Ultra Task configuration.
When slow performing endpoints are observed within the Pipeline, use the Pipeline Execute Snap with Reuse mode enabled and the Pool Size field set to > 1 to create concurrency across multiple requests to the endpoint.
Additional reference: Ultra Tasks
Low Latency API Ultra
Low Latency API Ultra is a high-performance API execution mode designed for real-time, low-latency data integration and processing. The Pipeline invoked by the Ultra Task is characterized by having an open input view for the first Snap used in the Pipeline (typically an HTTP Router or Mapper Snap). Requests made to the API are brokered through a 'FeedMaster Node', guaranteeing at-least-once message delivery.
Use Cases
High frequency & high throughput request-response use cases.
Sub-second response time requirements.
Best Practices
Deploy multiple FeedMasters for High Availability.
Deploy multiple instances of the Ultra Task for High Availability running within the same Snaplex.
Leverage the 'Alias' setting within the Ultra Task configuration to support multi-Snaplex High Availability.
To support unpredictable high volume API workloads, leverage the 'Autoscale based on Feedmaster queue' instance setting in the Ultra Task configuration.
When slow performing endpoints are observed within the Pipeline, use the Pipeline Execute Snap with the Reuse mode enabled and the Pool Size field set to > 1 to create concurrency across multiple requests to the endpoint.
Use the HTTP Router Snap to handle supported & unsupported HTTP methods implemented by the Pipeline.
Handle errors that may occur during the execution of the Pipeline and return the appropriate HTTP status code within the API response. This can be done either by using the Mapper, JSON Formatter or the XML Formatter Snap.
Reference request query parameters using the $query object.
Set the 'Maximum Failures' Ultra Task configuration setting according to the desired tolerance for failure.
For long running Ultra Pipelines, set a higher 'Max In-Flight' setting within the Ultra Task configuration.
Triggered Tasks
Definition and Characteristics
Triggered Tasks offer a method of invoking a Pipeline using an API endpoint when the consumption pattern of the API is infrequent and/or does not require low latency response times.
Use Cases
When a batch operation is required within the Pipeline, e.g. Join, Group By, Sort etc.
Integrations that need to be initiated on-demand.
Non-real time data ingestion.
File ingestion and processing.
Bulk data export APIs.
Best Practices
Avoid deep nesting of large child Pipelines.
Use the Snaplex URL to execute Triggered Tasks for reduced latency response times (see the invocation sketch below).
Handle errors that may occur during the execution of the Pipeline and return the appropriate HTTP status code within the API response. This can be done either by using the Mapper, JSON Formatter or the XML Formatter Snap.
Use the HTTP Router Snap to handle supported & unsupported HTTP methods implemented by the Pipeline.
Parallelize large data loads using the "Pipeline Execute" Snap with Pool Size > 1.
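As a minimal illustration of the Snaplex URL recommendation above, a Triggered Task can be invoked with any HTTP client, such as curl. The URL and bearer token below are hypothetical placeholders; copy the actual Snaplex (Ground-triggered) URL and token from the Task details page in Manager:

# Placeholder values; substitute the URL and bearer token shown for your Triggered Task
TASK_URL="https://my-snaplex-lb.example.com/api/1/rest/feed/MyOrg/MyProjectSpace/MyProject/MyTriggeredTask"
TASK_TOKEN="<bearer token from the Task details page>"

# Invoke the task; a JSON body is passed to the Pipeline's open input view, if one is defined
curl -s -X POST "$TASK_URL" \
  -H "Authorization: Bearer $TASK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"run_date": "2024-01-01"}'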
Snaplex Capacity Tuning Guide
Introduction
This document serves as a comprehensive best practice guide for developing efficient and robust Pipelines within the SnapLogic Platform. It offers guidelines that aim to optimize performance, enhance maintainability, reusability, and provide a basis for understanding common integration scenarios and how best to approach them. The best practices encompass various aspects of Pipeline design, including Pipeline behavior, performance optimization and governance guidelines. By adhering to these best practices, SnapLogic developers can create high-quality Pipelines that yield optimal results while promoting maintainability and reuse. The content within this document is intended for the SnapLogic Developer Community or an Architect, in addition to any individuals who may have an influence on the design, development or deployment of Pipelines within the SnapLogic platform.
Authors: SnapLogic Enterprise Architecture team
Snaplex Planning
Snaplexes are groupings of co-located nodes which are treated as a single logical entity for the purpose of Pipeline execution. The SnapLogic Control plane automatically performs load balancing of Pipeline workload within a Snaplex. Nodes in Snaplexes should be homogeneous, with the same CPU/memory/disk sizing and network configurations per node type (i.e. JCC / FeedMaster). The JCC and FeedMaster nodes in a Snaplex can be of different sizes.
Examples of recommended configurations:
Snaplex configuration 1: JCC node count - 4, JCC node size for each node - Large, FeedMaster node count - 2, FeedMaster node size for each node - Medium.
Snaplex configuration 2: JCC node count - 4, JCC node size for each node - X-Large, FeedMaster node count - 2, FeedMaster node size for each node - Large.
Object | Definition
Node | A Node is a JVM (Java Virtual Machine) process which is installed on a server such as Windows or Linux.
JCC Node | The JCC node is responsible for: Preparation, validation, and execution of Pipelines. Sending a heartbeat to the SnapLogic Control plane indicating the health of the node.
FeedMaster Node | The FeedMaster node acts as an interface between the JCC nodes and the client. The main functions of a FeedMaster node are: Managing message queues. Sending a heartbeat to the SnapLogic Control plane indicating the health of the node.
When setting up Snaplexes, it is recommended to plan out the number of Snaplexes to configure along with the usage criteria to achieve isolation across workloads. Snaplexes can be organized in various ways, such as:
Pipeline Workload - Organize Snaplexes by workload type: Batch, Low latency, and On-demand.
Business Unit - Organize Snaplexes by business units.
Geographical location - Organize Snaplexes by data center or geographic location.
The recommendation is to use a combination of the above to optimize resource usage and achieve workload isolation.
Snaplex Network Requirements
Snaplexes should have the below network characteristics:
Within a Snaplex: Less than 10 ms round trip latency between Snaplex nodes. Greater than 40 MB/sec throughput between Snaplex nodes.
Snaplex to Control Plane: Less than 50 ms round trip latency to the SnapLogic Control plane. Greater than 20 MB/sec throughput to the SnapLogic Control plane.
Pipeline Execute
For Pipeline executions that use the Pipeline Execute Snap, nodes communicate with each other using HTTPS on port 8081. There is some resiliency to network failures, and HTTPS requests are retried in the case of failures. Even though requests are retried, high network latency and dropped connections can result in Pipeline execution failures.
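A quick way to sanity-check these network characteristics and the inter-node port is with standard operating system tools. A rough sketch for a Linux node, assuming a hypothetical peer host name of node2.example.com and that netcat and curl are installed (replace elastic.snaplogic.com with your control plane host if your org uses a different one):

# Round trip latency between Snaplex nodes (target: less than 10 ms within a Snaplex)
ping -c 5 node2.example.com

# Confirm the peer node accepts connections on the inter-node HTTPS port used by Pipeline Execute
nc -zv node2.example.com 8081

# Rough TCP connect time to the control plane (target: less than 50 ms round trip)
curl -s -o /dev/null -w 'connect: %{time_connect}s\n' https://elastic.snaplogic.com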
Regular Pipeline executions run within a node, requiring no communication with other nodes in the Snaplex. When a Pipeline Execute Snap is used to run child Pipelines, there are three options:
Option | Comments
LOCAL_NODE | This option is recommended when the child Pipeline is being used for Pipeline structuring and reuse rather than Pipeline workload distribution. Use this option for most regular child Pipeline executions.
LOCAL_SNAPLEX | The network communication is optimized for streaming data processing since the child Pipeline is on the local Snaplex. Use this option only when workload distribution within the Snaplex is required.
SNAPLEX_WITH_PATH | This has a high dependency on the network. The network communication is optimized for batch data processing since the child Pipeline is on a remote Snaplex. Use this option only when the child Pipeline has to run on a different Snaplex, either because of endpoint connectivity restrictions or for workload distribution.
Ultra Pipelines
The JCC nodes communicate with the FeedMaster nodes over TCP with SSL on port 8084 when executing Ultra Pipelines. The communication between nodes is based on a message queue. This communication is not resilient to network failure, so a reliable network is required between the Snaplex nodes for Ultra Pipeline processing. In case of any network failures, the currently processing Ultra requests will be retried or, in some instances, fail with errors. If there is a communication failure between the JCC and FeedMaster nodes, then the request will be retried up to five times. This is controlled by the ultra.max_redelivery_count Snaplex configuration. There is an overall 15-minute timeout for an Ultra request to the FeedMaster that is configurable at the request level using the X-SL-RequestTimeout HTTP request header or at the Snaplex level by using the llfeed.request_timeout config setting. Note that both ultra.max_redelivery_count and llfeed.request_timeout are configured under Node Properties -> Global Properties for Groundplexes. You can submit a support request to configure these properties for your Cloudplexes.
Pipeline Load Balancing
The Control plane performs load balancing for Pipeline execution requests on a Snaplex. The following table lists the configurations that are involved:
Property / Threshold | Where configured | Default value | Comments
Maximum Slots | Node properties tab of the Snaplex | 4000 | One slot = One Snap = One active thread on the node. A percentage of slots (configurable with the Reserved slot % property) are reserved for interactive Pipeline executions and validations through the Designer tool. Pipelines will be queued if the threshold is reached. Some Snaps such as Pipeline Execute, Bulk loaders, and Snaps performing input/output can use a higher number of threads compared to other Snaps.
Maximum memory % | Node properties tab of the Snaplex | 85 (%) | Threshold at which no more Pipelines will be assigned to a node
Snaplex node resources (CPU, FDs, Memory) | Node server configurations | Configurable | If the Control plane detects that there are not enough resources available on the Snaplex, then the Pipeline execution requests will be queued up on the control plane, and resume when resources are available. The Control plane dispatches the Pipeline to the node which has the most available capacity in terms of CPU/memory and file descriptors. For child Pipeline executions using the Pipeline Execute Snap, there is a preference given for running the child on the local node to avoid the network transfer penalty.
Table 1.0 Configurations for Pipeline load balancing
Snaplex Resource Management
Capacity Planning
This section provides some guidelines for Snaplex capacity planning and tuning.
Configuration / Use-case | Comments
Workload isolation | Isolate workloads across Snaplexes based on workload type, geographic location, and business unit.
Node sizing | Size the node (CPU, RAM, disk space) in a Snaplex based on Pipeline workload type. Batch data processing needs larger nodes while Streaming/API processing can use smaller nodes.
Maximum Slots | One slot = One Snap = One active thread on the node. A percentage of slots (configurable with the Reserved slot % property) are reserved for interactive Pipeline executions and validations through the Designer tool. Pipelines will be queued if the threshold is reached. Some Snaps such as Pipeline Execute, Bulk loaders, and Snaps performing input/output can use a higher number of threads compared to other Snaps. The general recommendation is to configure this property based on the node memory configuration. Example: 8 GB - 2000 Slots, 16 GB - 4000 Slots.
API Workloads | For API workloads, the rule of thumb is to have 100 active ultra API calls per 8 GB of RAM, or 20 active triggered API calls per 8 GB of RAM. So a 16 GB node can have 200 active ultra API calls or 40 active triggered API calls.
Node sizing | The number of nodes in a Snaplex can be estimated based on the count of batch and streaming Pipelines. The number of FeedMaster nodes can be half of the JCC node count, with a minimum of two recommended for high availability. For active Pipeline count estimates, error Pipelines can be excluded from the count since they do not consume resources under the normal workload.
Table 1.1 Configurations for Snaplex capacity planning
Capacity Tuning
Below are some best practices for Snaplex capacity tuning:
Configuration / Use-case | Comments
Slot counts | The Maximum slot count can be tuned based on the alerts and dashboard events. It is not required to restart the nodes for this configuration to take effect. Queued Pipelines - Increase slot count by 25%. Busy nodes - Reduce slot count by 25%. The slot count should not be set to more than 50% above the recommended value for the node configuration. e.g. The recommended slot count on a node with 16 GB RAM is 4000; setting it to higher than 6000 is not advisable. If you observe high CPU / memory consumption on the node despite lowering the slot count by 25%, then consider allocating additional resources to the Snaplex nodes.
Workloads | Batch Workloads: Expand the node memory up to 64 GB, and deploy additional nodes for increased capacity. API Workloads: Deploy additional nodes instead of expanding the memory on the current node.
Active Pipelines | As a general rule, it is suggested to maintain fewer than 500 active Pipeline instances on a single node. Exceeding this threshold can lead to communication bottlenecks with the Control plane. If the number of active Pipeline instances exceeds 500, then the advisable course of action is to consider the addition of more nodes.
CPU | CPU consumption can be optimized by setting the Pool size and Batch size options on Pipeline Execute Snaps.
Memory | See Table 3.0 below. Additional Reference: Optimizations for Swap Memory
Table 2.0 Configurations for Snaplex capacity tuning
Memory Configuration thresholds
Property / Threshold | Where configured | Default value | Comments
Maximum memory % | Node properties tab of the Snaplex | 85 (%) | Threshold at which no more Pipelines will be assigned to a node
Pipeline termination threshold | Internal (Can be configured by setting the feature flag at the org level com.snaplogic.cc.snap.common.SnapThreadStatsPoller.MEMORY_HIGH_WATERMARK_PERCENT) | 95 (%) | Threshold at which the active Pipeline management feature kicks in and terminates Pipelines when the node memory consumption exceeds the threshold. Ideal range: 75-99
Pipeline restart delay interval | Internal (Can be configured by setting the feature flag at the org level com.snaplogic.cc.snap.common.SnapThreadStatsPoller.PIPELINE_RESTART_DELAY_SECS) | 30 (seconds) | One Pipeline is terminated every 30 seconds until the node memory goes below the threshold (i.e. goes below 95%)
Table 3.0 Snaplex node memory configurations
The above thresholds can be optimized to minimize Pipeline terminations due to Out-of-Memory exceptions. Note that the memory thresholds are based on the Physical memory on the node, and not the Virtual / Swap memory.
Snaplex Alerts
SnapLogic supports alerts and notifications through email and Slack channels. These can be configured in the Manager interface under Settings. The recommended alerts are listed in the table below.
Alert type | Comments
Snaplex status alerts | Status alerts can be created at the org level or the Snaplex level (in the Snaplex properties). These allow notifications to be sent when the Snaplex node is unable to communicate with the SnapLogic control plane or there are other issues detected with the Snaplex.
Snaplex Resource usage alerts | Set up alerts for these event types: Snaplex congestion, Snaplex load average, Snaplex node memory usage, Snaplex node disk usage.
Table 4.0 Recommended Snaplex Alerts
Reference: Alerts, Slack notifications
Automated Deployment (CICD) of SnapLogic assets with GitHub
Introduction This guide is a reference document for the deployment of SnapLogic assets to a GitHub repository. It also includes sample YAML code for a GitHub Actions workflow which can be used to automate the deployment of assets across orgs (Dev -> Stg / Stg -> Prod, etc.) This guide is targeted towards SnapLogic Environment Administrators (Org Administrators) and users who are responsible for the deployment of SnapLogic assets / Release management operations. Section B covers automated deployment with GitHub Actions, and Section A illustrates a manual deployment flow using the Manager interface. Author: Ram Bysani SnapLogic Enterprise Architecture team SnapLogic Git Integration Git Integration allows you to track, update, and manage versions of SnapLogic assets using the graphical interface or the public APIs. The following asset types can be tracked in a GitHub repository: Accounts Files Pipelines Tasks Git model A) Asset deployment across environments - an example The example in this document illustrates a sample deployment of SnapLogic assets from the Dev environment (org) to the Prod environment. A similar methodology can be adopted to deploy assets from Dev -> Stg -> Prod environments. The environments should be configured for Git integration with GitHub. Please refer to the steps in the documentation. Git Integration Git operations The assets in this example are tracked at a project space level, i.e. one Project Space in Dev is associated with a single branch in the GitHub repository. A single GitHub repository is used to maintain the branches for Dev, Stg, Prod, etc. Repository branches can also be deleted and re-created for specific deployment needs. New / Modified Assets in the Dev Org Project Space: Dev_Integration_Space with the below project folders having SnapLogic assets. Integration_Project_1, Integration_Project_2, share Prod Environment We have already defined an empty project space named Prod_GH_Integration in the Prod org. This step can also be done by using the SnapLogic public API. Project APIs. Define branches in the GitHub repository Create individual branches in the GitHub repository for the Dev and Prod project space assets. You can choose the main branch as the default branch while creating Dev_GH_Space. Choose the Dev_GH_Space branch as the source when creating the Prod_GH_Space branch. Each branch in the GitHub repository corresponds to a Project Space in SnapLogic. e.g.: Dev_GH_Space, Prod_GH_Space Commit Dev assets to GitHub Connect to the Dev (source) environment in the SnapLogic Manager interface, and navigate to the project space named Dev_GH_Integration_Space. Right click and select Git Repository Checkout. Choose the Git repository branch Dev_GH_Space. You can see that the Git status has changed to Tracked for all assets under the child projects. Note that some assets appear with status Untracked as these were already existing in the main branch. These assets would not be committed to the Git repository. Notice the tracking message with the branch name and commit id next to the project space name: Tracked with Git repository: byaniram/RB_Snaprepo/heads/Dev_GH_Space, commit: 9a22ac8 Connect to the GitHub repository and verify the commit status for the branch Dev_GH_Space. Create Pull Request in GitHub At this step, you would need to create a Pull Request in GitHub. Choose Prod_GH_Space as the base branch, and Dev_GH_Space as the compare branch, and create the Pull request. 
This action would merge the assets contained in the Dev_GH_Space branch into the Prod_GH_Space branch. Connect to the GitHub repository and verify the commit status for the branch Prod_GH_Space. The assets have now been committed to the Prod environment and are tracked in the GitHub repository under the branch Prod_GH_Space.
It is also possible to merge and pull from additional branch(es) into a single Prod_GH_Space if you have a need for it. You would need to repeat the Pull / Merge process as above with the base branch being Prod_GH_Space, and the compare branch being one of Dev_GH_Space, Dev_GH_Space_1, or Dev_GH_Space_2.
Pulling / Committing assets into the Prod Org
Connect to the Prod (target) environment in the SnapLogic Manager interface, and navigate to the project space named Prod_GH_Integration_Space. Right click and select Git Repository Checkout. Choose the Git repository branch Prod_GH_Space. Choose Git Pull to pull the assets into the Project space. The assets from the Dev_Integration_Space project space of the Dev environment are deployed to the Prod_Integration_Space project space of the Prod environment. Notice the tracking message with the branch name and commit id next to the project space name: Tracked with Git repository: byaniram/RB_Snaprepo/heads/Prod_GH_Space, commit: ce0c368
For subsequent deployments of changed assets, you would first do a Commit to Git for the project space in the SnapLogic Dev environment, followed by the above steps. Changed assets would be visible with a Git status of 'Tracked, Modified locally' in the SnapLogic Manager.
B) Deployment Automation using a GitHub Actions Workflow
Actions workflow YAML sample
A GitHub Actions workflow can be used to automate the deployment of assets across SnapLogic environments (such as Dev to Stg, Stg to Prod, etc.). A workflow is a configurable automated process made up of one or more jobs. You must create a YAML file to define your workflow configuration. Here's a complete YAML file for the Dev -> Prod deployment example described in Section A above. The complete YAML file is attached for your reference. Please create a new Workflow from the Actions tab, and paste the contents of the file in the workflow editor and commit changes.

# Actions workflow for automated deployment of SnapLogic assets
name: SnapLogic CICD Sample

on:
  push:
    branches:
      - Dev_GH_Space
  # Uncomment the below line if you need to execute the workflow manually.
  # workflow_dispatch:

jobs:
  pull_merge_branches:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Merge Dev to Prod
        uses: devmasx/merge-branch@master
        with:
          type: now
          from_branch: Dev_GH_Space
          target_branch: Prod_GH_Space
          github_token: ${{ secrets.ACTIONS_TOKEN }}

      - name: Checkout project assets to Prod project space
        run: |
          curl -s -X POST \
          ${{vars.SNAP_URL}}/api/1/rest/public/project/pull/${{vars.SNAP_ORG}}/${{vars.PROJECT_SPACE}} \
          -H "Content-Type:application/json" -H "Authorization:Basic ${{secrets.BASE64_TOKEN}}" \
          -d '{"use_theirs":"true"}'

Please refer to the GitHub documentation for information related to Workflow usage and syntax: GitHub Workflows, Workflow syntax
The following table provides clarification on certain aspects of the sample workflow for better understanding.
Section | Comments
runs-on: ubuntu-latest | runs-on defines the runner (type of machine) to use to run the job. ubuntu-latest specifies a GitHub hosted runner image.
GitHub hosted runners
uses: actions/checkout@v4 | checkout is an action which is available in the GitHub marketplace. This action checks out the repository for use. v4 is the version number of the action. https://github.com/marketplace/actions/checkout
uses: devmasx/merge-branch@master | merge-branch is an action from the GitHub marketplace. This action runs a Git merge operation. https://github.com/marketplace/actions/merge-branch It also requires you to define a personal access token (classic) under Developer Settings -> Personal access tokens. Select both the repo and workflow checkboxes.
curl -s -X POST ${{vars.SNAP_URL}}/api/1/rest/public/project/pull/${{vars.SNAP_ORG}}/${{vars.PROJECT_SPACE}} -H "Content-Type:application/json" -H "Authorization:Basic ${{secrets.BASE64_TOKEN}}" -d '{"use_theirs":"true"}' | This is a curl command that executes the SnapLogic public API to pull the latest project files from Git. See Pull the latest project files from Git. The referenced variables are defined on the GitHub repository under Settings -> Secrets and variables -> Actions. The vars context is used to reference those variables (e.g. SNAP_ORG, PROJECT_SPACE). You can also define encrypted Secrets for sensitive data and reference them using the secrets context as in the example (e.g. BASE64_TOKEN has the base64 encoded string for username and password). Workflow Variables
Table 1.0 - Workflow Actions
Workflow execution
The above Actions workflow will be automatically executed whenever there is a "Push" / "Git Commit" operation to the Dev_GH_Space branch, i.e. whenever a commit is done from the Dev SnapLogic environment project space. The workflow will execute the pull-merge operation to the Prod_GH_Space branch, and pull the latest project assets into the Prod SnapLogic environment.
The YAML file must be created under the .github/workflows folder of the Dev_GH_Space branch in the GitHub repository. The workflow run status will be visible under the Actions tab.
Note: If you wish to manually execute the pull-merge post code review, then you can uncomment the workflow_dispatch line in the script and execute the Actions workflow manually from the Actions tab on GitHub.
# Uncomment the below line if you need to execute the workflow manually.
# workflow_dispatch:
You can edit and modify the YAML file as per your requirements. Subsequent commits and deployments from Dev -> Prod can be automated similarly.
Action | Comments
Developer commits new code or updates assets in the Dev org to the GitHub repository | SnapLogic Dev org Manager Interface: Asset -> Add to repository. Ensure status shows Tracked. Project Space -> Commit to Git.
Create and merge Pull Request | Create a new Pull Request on GitHub, and merge the newly committed assets by choosing the Prod branch as the base, and the Dev branch as the compare branch.
Pull the updated assets into the Prod org | SnapLogic Prod org Manager Interface: Project Space -> Git Pull.
Table 2.0 - Steps for subsequent / future asset deployment
Deployment flow (Dev->Test->Prod)
Note: Future versions of this document will cover additional deployment scenarios. Please post your comments on the article.
Platform Administration Reference guide v3
Introduction This document is a reference manual for common administrative and management tasks on the SnapLogic platform. It has been revised to include the new Admin Manager and Monitor functionality, which replace the Classic Manager and Dashboard interfaces respectively. This document is for SnapLogic Environment Administrators (Org Administrators) and users involved in supporting or managing the platform components. Author: Ram Bysani SnapLogic Enterprise Architecture team Environment Administrator (known as Org Admin in the Classic Manager) permissions There are two reserved groups in SnapLogic: admins: Users in this group have full access to all projects in the Org. members: Users in this group have access to projects that they create, or to which they are granted access. Users are automatically added to this group when you create them, and they must be a part of the members group to have any privileges within that Org. There are two user roles: Environment admins: Org users who can manage the Org. Environment admins are part of the admins group, and this role is named “Org Admin” in the classic Manager. Basic user: All non-admin users. Within an Org, basic users can create projects and work with assets in the Project spaces to which they have been granted permission. To gain Org administrator privileges, a Basic user can be added to the admins group. The below table lists the various tasks under the different categories that an Environment admin user can perform: Task Comments USER MANAGEMENT Create and delete users. Update user profiles. Create and delete groups. Add users to a group. Configure password expiration policies. Enable users’ access to applications (AutoSync, IIP) When a user is removed from an Org, the administrator that removes the user becomes the owner of that user's assets. Reference: User Management MANAGER Create and manage Project Spaces. Update permissions (R, W, X) on an individual Project space and projects. Delete a Project space. Restore Project spaces, projects, and assets from the Recycle bin. Permanently delete Project spaces, projects, and assets from the Recycle bin. Configure Git integration and integration with tools such as Azure Repos, GitLab, and GHES. View Account Statistics, and generate reports for accounts, projects, and pipelines within the project that use an account. Upgrade/downgrade Snap Pack versions. ALERTS and NOTIFICATIONS Set up alerts and notifications. Set up Slack channels and recipients for notifications. Reference: Alerts SNAPLEX and ORG Create Groundplexes. Manage Snaplex versions. Update Snaplex settings. Update or revert a Snaplex version. APIM Publish, unpublish, and deprecate APIs on the Developer portal. Configure the Developer portal. Approve API subscriptions and manage/approve user accounts. Reference: API Management AutoSync Configure AutoSync user permissions. Configure connections for data pipeline endpoints. Create user groups to share connection configuration. View information on all data pipelines in the Org. Reference: AutoSync Administration Table 1.0 Org Admin Tasks SnapLogic Monitoring Dashboards The enhanced Monitor interface can be launched from the Apps (Waffle) menu located on the top right corner of the page. The enhanced Monitor Interface enables you to observe integration executions, activities, events, and infrastructure health in your SnapLogic environment. 
The Monitor pages are categorized under three main groups: Analyze Observe Review Reference: Move_from_Dashboard_to_Monitor The following table lists some common administrative and monitoring tasks for which the Monitor interface can be used. Task Monitor App page Integration Catalog to fetch and display metadata for all integrations in the environment. Monitor -> Analyze -> Integration Catalog Reference: Integration Catalog View of the environment over a time period. Monitor -> Analyze -> Insights Reference: Insights View pipeline and task executions along with statistics, logs, and other details. Stop executions. Download execution details. Monitor -> Analyze -> Execution Reference: Execution Monitor and manage Snaplex services and nodes with graph views for a time period. Monitor -> Analyze -> Infrastructure Reference: Infrastructure View and download metrics for Snaplex nodes for a time period. Monitor -> Analyze -> Metrics Monitor -> Observe -> API Metrics Reference: Metrics, API-Metrics Review Alert history and Activity logs. Monitor -> Review Reference: Alert History, Activity Log Troubleshooting Snaplex / Node / Pipeline issues. Reference: Troubleshooting Table 2.0 Monitor App features Metrics for monitoring CPU Consumption CPU consumption can be high (and exceed 90% at times) when pipelines are executing. A high CPU consumption percentage when no pipelines are executing could indicate a high CPU usage by other processes on the Snaplex node. Review CPU Metrics under the Monitor -> Metrics, and Monitor -> Infrastructure tabs. Reference: CPU utilization metrics System load average (For Unix based systems) Load average is a measure of the number of processes that are either actively running on the CPU or waiting in line to be processed by the CPU. e.g. in a system with 4 virtual CPUs: A load average value of 4.0 means average full use of all CPUs without any idle time or queue. A load average value of >4.0 suggests that processes are waiting for CPU time. A load average value of <4.0 indicates underutilization. System load. Monitor -> Metrics tab. Heap Memory Heap memory is used by the SnapLogic application to dynamically allocate memory at runtime to perform memory intensive operations. The JVM can crash with an Out-of-Memory exception if the heap memory limit is reached. High heap memory usage can also impact other application functions such as pipeline execution, metrics collection, etc. The key heap metrics are listed in the table below: Metric Comments Heap Size Amount of heap memory reserved by the OS This value can grow or shrink depending on usage. Used heap Portion of heap memory in use by the application’s Java objects This value changes constantly with usage. Max heap size Upper heap memory limit This value is constant and does not change. It can be configured by setting the jcc.heap.max_size property in the global.properties file or as a node property. Heap memory. Monitor -> Metrics tab. Non-heap memory consumption The JVM reserves additional native memory that is not part of the heap memory. This memory area is called Metaspace, and is used to store class metadata. Metaspace can grow dynamically based on the application’s needs. Non-heap memory metrics are similar to heap memory metrics however there is no limit on the size of the non-heap memory. In a Snaplex, non-heap size tends to stay somewhat flat or grow slowly over longer periods of time. Non-heap size values larger than 1 GiB should be investigated with help from SnapLogic support. 
Note that all memory values are displayed in GiB (Gibibytes). Non-Heap memory. Monitor -> Analyze -> Metrics (Node) Swap memory Swap memory or swap space is a portion of disk used by the operating system to extend the virtual memory beyond the physical RAM. This allows multiple processes to share the computer’s memory by “swapping out” some of the RAM used by less active processes to the disk, making more RAM available for the more active processes. Swap space is entirely managed by the operating system, and not by individual processes such as the SnapLogic Snaplex. Note that swap space is not “extra” memory that can compensate for low heap memory. Refer to this document for information about auto, and custom heap settings. Reference: Custom heap setting. High swap utilization is an indicator of contention between processes, and may suggest a need for higher RAM. Additional Metrics Select the node from Monitor -> Analyze, and navigate to the Metrics tab. Review the following metrics. Active Pipelines Monitor the Average and Max active pipeline counts for specific time periods. Consider adding nodes for load balancing and platform stability if these counts are consistently high. Active Pipelines. Monitor -> Analyze -> Metrics (Node) Active Threads Active threads. Monitor -> Analyze -> Metrics (Node) Every Snap in an active pipeline consumes at least one thread. Some Snaps such as Pipeline Execute, Bulk loaders, and Snaps performing input/output can use a higher number of threads compared to other Snaps. Refer to this Sigma document on community.snaplogic.com: Snaplex Capacity Tuning Guide for additional configuration details. Disk Utilization It is important to monitor disk utilization as the lack of free disk space can lead to blocking threads, and can potentially impact essential Snaplex functions such as heartbeats to the Control Plane. Disk utilization. Monitor -> Analyze -> Metrics (Node) Additional Reference: Analyze Metrics. Download data in csv format for the individual Metrics graphs. Enabling Notifications for Snaplex node events Event Notifications can be created on the Manager (Currently in the Classic Manager) under Settings -> Notifications. The notification rule can be set up to send an alert about a tracked event to multiple email addresses. The alerts can also be viewed on the Manager under the Alerts tab. Reference: Notification Events Snaplex Node notifications Telemetry Integration with third-party observability tools using OpenTelemetry (OTEL) The SnapLogic platform uses OpenTelemetry (OTEL) to support telemetry data integration with third-party observability tools. Please contact your CSM to enable the Open Telemetry feature. Reference: Open Telemetry Integration Node diagnostics details The Node diagnostics table includes diagnostic data that can be useful for troubleshooting. For configurable settings, the table displays the Maximum, Minimum, Recommended, and Current values in GiB (Gibibytes) where applicable. The values in red indicate settings outside of the recommended range. Navigate to the Monitor -> infrastructure -> (Node) -> Additional Details tab. Example: Node diagnostics table Identifying pipelines that contribute to a node crash / termination Monitor Page Comments Monitor -> Activity logs Filter by category = Snaplex. Make note of the node crash events for a specific time period Event name text: Node crash event is reported Reference: Activity Logs Monitor -> Execution Select the execution window in the Calendar. 
Filter executions by setting these Filter conditions: Status: Failed Node name: <Enter node name from the crash event> Reference: Execution Sort on the Documents column to identify the pipeline executions processing the most number of documents. Click anywhere on the row to view the execution statistics. You can also view the active pipelines for that time period from the Monitor -> Metrics -> Active pipelines view. Table 3.0 Pipeline execution review Additional configurations to mitigate pipeline terminations The below thresholds can be optimized to minimize pipeline terminations due to Out-of-Memory exceptions. Note that the memory thresholds are based on the physical memory on the node, and not the Virtual / Swap memory. Maximum Memory % Pipeline termination threshold Pipeline restart delay interval Refer to the table Table 3.0 Snaplex node memory configurations in this Sigma document for additional details and recommended values: Snaplex Capacity Tuning Pipeline Quality Check API The Linter public API for pipeline quality provides additional rules to provide complete reports for all standard checks, including message levels (Critical / Warning / Info), with actionable message descriptions for pipeline quality. Reference: Pipeline Quality Check By applying the quality checks, it is possible to optimize pipelines, and improve maintainability. You can also use SnapGPT to analyze pipelines, identify issues, and suggest best practices to improve your pipelines. (SnapGPT_Analyze_Pipelines) Other third party profiling tools Third party profiling tools such as VisualVM can be used to monitor local memory, CPU, and other metrics. This document will be updated in a later version to include the VisualVM configurations for the SnapLogic application running on a Groundplex. Java Component Container (jcc) command line utility (for Groundplexes) The jcc script is a command-line tool that provides a set of commands to manage the Snaplex nodes. This utility is installed in the /opt/snaplogic/bin directory of the Groundplex node. The below table lists the commonly used arguments for the jcc script (jcc.sh on Linux and jcc.bat on Windows). Note that the command would list other arguments (for example, try-restart). However, those are mainly included for backward compatibility and not frequently used. $SNAPLOGIC refers to the /opt/snaplogic directory on Linux or the <Windows drive>:\opt\snaplogic directory on Windows servers. Run these commands as the root user on Linux and as an Administrator on Windows. Example: sudo /opt/snaplogic/bin/jcc.sh restart or c:\snaplogic\bin\jcc.bat restart Argument Description Comments status Returns the Snaplex status. The response string would indicate if the Snaplex Java process is running. start Starts the Snaplex process on the node. stop Stops the Snaplex process on the node. restart Stops and restarts the Snaplex process on the node. Restarts both the monitor and the Snaplex processes. diagnostic Generates the diagnostic report for the Snaplex node. The HTML output file is generated in the $SNAPLOGIC/run/log directory. Resolve any warnings from the report to ensure normal operations. clearcache Clears the cache files from the node. This command must be executed when the JCC is stopped. addDataKey Generates a new key pair and appends it to the keystore in the /etc/snaplogic folder with the specified alias. This command is used to rotate the private keys for Enhanced Account Encryption. 
Doc reference: Enhanced Account Encryption The following options are available for a Groundplex on Windows server. install_service remove_service The jcc.bat install_service command installs the Snaplex as a Windows service. The jcc.bat remove_service command removes the installed Windows service. Run these commands as an Administrator user. Table 4.0 jcc script arguments Example of custom log configuration for a Snaplex node (Groundplex) Custom log file configuration is occasionally required due to internal logging specifications or to troubleshoot problems with specific Snaps. In the following example, we illustrate the steps to configure the log level of ‘Debug’ for the Azure SQL Snap pack. The log level can be customized for each node of the Groundplex where the related pipelines are executed, and will be effective for all pipelines that use any of the Azure SQL Snaps (for example, Azure SQL - Execute, Azure SQL - Update, etc.). Note that Debug logging can affect pipeline performance, so this configuration must only be used for debugging purposes. Configuration Steps a. Follow steps 1 and 2 from this document: Custom log configuration Note: You can perform Step 2 by adding the property key and value under the Global Properties section. Example: Key: jcc.jvm_options Value: -Dlog4j.configurationFile=/opt/snaplogic/logconfig/log4j2-jcc.xml The Snaplex node must be restarted for the change to take effect. Refer to the commands in Table 4.0. b. Edit the log4j2-jcc.xml file configured in Step a. c. Add a new RollingRandomAccessFile element under <Appenders>. In this example, the element is referenced with a unique name JCC_AZURE. It also has a log size and rollover policy defined. The policy would enable generation of up to 10 log files of 1 MB each. These values can be adjusted depending on your requirements.

<RollingRandomAccessFile name="JCC_AZURE"
    fileName="${env:SL_ROOT}/run/log/${sys:log.file_prefix}jcc_azure.json"
    immediateFlush="true" append="true"
    filePattern="${env:SL_ROOT}/run/log/jcc_azure-log-%d{yyyy-MM-dd-HH-mm}.json"
    ignoreExceptions="false">
  <JsonLogLayout properties="true"/>
  <Policies>
    <SizeBasedTriggeringPolicy size="1 MB"/>
  </Policies>
  <DefaultRolloverStrategy max="10"/>
</RollingRandomAccessFile>
…
</Appenders>

d. The next step is to configure a Logger that references the Appender defined in step c. This is done by adding a new <Logger> element. In this example, the Logger is defined with log level = Debug.

<Logger name="com.snaplogic.snaps.azuresql" level="debug" includeLocation="true" additivity="false">
  <AppenderRef ref="JCC_AZURE" />
</Logger>
…
<Root>
…
</Root>
</Loggers>
</Configuration>

The value for the name attribute is derived from the Class FQID value of the associated Snap. The changes to log4j2-jcc.xml are shown in steps c and d. The complete XML file is also attached for reference. You can refer to the Log4j documentation for more details on the attributes or for additional customization. Log4j reference Debug log messages and log files Additional debug log messages will be printed to the pipeline execution logs for any pipeline with Azure SQL Snaps. These logs can be retrieved from Dashboard.
Example:
{"ts": "2023-11-30T20:21:33.490Z", "lvl": "DEBUG", "fi": "JdbcDataSourceRegistryImpl.java:369", "msg": "JDBC URL: jdbc:sqlserver://sltapdb.database.windows.net:1433;database=SL.TAP;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;authentication=sqlPassword;loginTimeout=30;connectRetryCount=3;connectRetryInterval=5;applicationName=SnapLogic (main23721) - pid-113e3955-1969-4541-9c9c-e3e0c897cccd, database server: Microsoft SQL Server(12.00.2531), driver: Microsoft JDBC Driver 11.2 for SQL Server(11.2.0.0)", "snlb": "Azure+SQL+-+Update", "snrd": "5c06e157-81c7-497f-babb-edc7274fa4f6", "plrd": "5410a1bdc8c71346894494a2_f319696c-6053-46af-9251-b50a8a874ff9", "prc": "Azure SQL - …"}
The updated log configuration would also write the custom JCC logs (for all pipelines that have executed the Azure SQL Snaps) to disk under the /opt/snaplogic/run/log directory. The file size for each log file and the number of files would depend on the configuration in the log4j2-jcc.xml file. The changes to log4j2-jcc.xml can be reverted if the additional custom logging is no longer required. Log level configuration for a Snaplex in Production Orgs The default log level for a new Snaplex is ‘Debug.’ This value can be updated to ‘Info’ in Production Orgs as a best practice. The available values are: Trace: Records details of all events associated with the Snaplex. Debug: Records all events associated with the Snaplex. Info: Records messages that outline the status of the Snaplex and the completed Tasks. Warning: Records all warning messages associated with the Snaplex. Error: Records all error messages associated with the Snaplex. Reference: Snaplex logging PlexFS File Storage considerations PlexFS, also known as suggest space, is a storage location on the local disk of the JCC node. The /opt/snaplogic/run/fs folder is commonly designated for this purpose. It is used as a data store to temporarily store preview data during pipeline validation, as well as to maintain the state data for Resumable pipelines. Disk volumes To prevent disk-full errors and to ensure the stable operation of the Groundplex, you need to have separate mounts on Groundplex nodes. Follow the steps suggested below to create two separate disk volumes on the JCC nodes. Reference: Disk Volumes The /opt/snaplogic/run/fs folder location is used for the PlexFS operations. mount --bind /workspace/fs /opt/snaplogic/run/fs Folder Structure: The folders under PlexFS are created with this path structure: /opt/snaplogic/run/fs/<Environment>/<ProjectSpace>/<Project>/__suggest__/<Asset_ID> Example: /opt/snaplogic/run/fs/Org1/Proj_Space_1/Project1/__suggest__/aaa5010bc The files in the sub-folders are created with these extensions: *.jsonl *.dat PlexFS File Creation The files in /opt/snaplogic/run/fs are generated when a user performs pipeline validation. The amount of data in a .dat file is based on the “Preview Document Count” user setting. For Snaps with binary output (such as File Reader), the Snap will stop writing to PlexFS when the next downstream Snap has generated its limit of Preview data. PlexFS File Deletion The files for a specific pipeline are deleted when the user clicks ‘Retry’ to perform validation. New data files are generated. Files for a specific user session are deleted when the user logs out of SnapLogic. All PlexFS files are deleted when the Snaplex is restarted. Files in PlexFS are generated with an expiration date.
The default expiration date is two days. The files are cleaned up periodically based on the expiration date. It is possible to set a feature flag to override the expiration time, and delete the files sooner. Recommendations The temp files are cleaned up periodically based on the default expiration date; however, you might occasionally encounter disk space availability issues due to excessive Preview data being written to the PlexFS file storage. The mount directory location can be configured with additional disk space or shared file storage (e.g. Amazon EFS). Contact SnapLogic support for details on the feature flag configuration to update the expiration time to a shorter duration for faster file clean up. The value for this feature flag is set in seconds.
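Because excessive Preview data can fill the PlexFS volume faster than the periodic cleanup runs, a quick spot check of the mount can help decide whether additional disk space or the support-provided feature flag is needed. The following minimal Python sketch is illustrative only (it is not a SnapLogic utility, and the two-day value simply reflects the documented default expiration); it reports free space on the PlexFS location and lists .dat and .jsonl files older than the expiration, leaving any deletion to the platform itself.

import os
import shutil
import time

PLEXFS_ROOT = "/opt/snaplogic/run/fs"   # default PlexFS location on the JCC node
EXPIRY_SECONDS = 2 * 24 * 3600          # default expiration of two days

def report_plexfs(root=PLEXFS_ROOT):
    # Free space on the volume backing the PlexFS mount.
    total, used, free = shutil.disk_usage(root)
    print(f"PlexFS volume: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
    cutoff = time.time() - EXPIRY_SECONDS
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith((".dat", ".jsonl")):
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    # Report only; removal is handled by the platform's cleanup.
                    print("Older than expiration:", path)

if __name__ == "__main__":
    report_plexfs()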
Sigma Framework for Operational Excellence Operationalizing innovation to maximize business value Modern Digital Platforms such as SnapLogic empower enterprises to unlock the value held in existing systems and data, allowing new services and strategies to be achieved. To support these best practices, governance and a culture of innovation are needed. Enterprises are adopting or evolving the concept of a Center of Excellence to foster and support Digital Transformation. This concept is far from new, but trends such as Agile, DevOps, API strategy, AI, and the rise of the Citizen Developer have also forced these Operating Models to change, or risk being considered outdated and relegated as not fit for purpose to support Digital Transformation. The Sigma Framework for Operational Excellence is a set of best practices distilled from the experience of SnapLogic architects — customers, partners and employees. SnapLogic is dedicated to continuously increasing the value our technology delivers to customers over time. Meanwhile, SnapLogic customer architects require a formalized operational framework to guide them from an initial tactical project to true strategic value realization through adoption across multiple roles and departments. The Sigma Framework delivers a standardized, integrated, holistic, and opinionated set of cross-disciplinary best practices, ready for adoption by customers to support sophisticated enterprise-scale deployments. Different assets in this section are designed for the various phases of a customer journey, addressing different personas and stages of maturity. By adopting the Sigma Framework for Operational Excellence, users will be able to extract maximum value from their investment in SnapLogic — both the technology, and the time commitment to develop required skills and competence. Scope of the Framework The Sigma Framework for Operational Excellence is structured to reflect the way your enterprise works. We coordinate, collaborate, and consolidate expertise between lines of business, development teams, and your enterprise architecture group, offering leadership and advice. Let us help you operationalize and scale data-driven, 360-degree innovation throughout your organization. The Sigma Framework for Operational Excellence encompasses several domains, including strategy, technology, process, and the culture of the users of the SnapLogic Platform. The aim of the Sigma Framework is to facilitate the adoption of SnapLogic at scale across the customer organization to achieve the required positive business outcomes. This framework will be shared with the SnapLogic community and maintained and developed as new technical capabilities and implementation best practices emerge over time. Notes 12/11/2023: Diagram updated for style. PDF now in brand formatting. No content changes.
SnapGPT - Security and Data Handling Protocols
Authors: Aaron Kesler, Jump Thanawut, Scott Monteith Security and Data Handling Protocols for SnapGPT SnapLogic acknowledges and respects the data concerns of our customers. The purpose of this document is to present our data handling and global data protection standards for SnapGPT. Overview & SnapLogic’s Approach to AI / LLM: SnapLogic utilizes high-quality Enterprise Language Learning Models (LLMs), selecting the most appropriate one for each specific task. Current support includes Azure OpenAI GPT, Anthropic Claude on Amazon Bedrock, and Google Vertex PaLM. Product & Data: Product Features & Scope: SnapGPT offers a range of features, each designed to enhance user experience and productivity in various aspects of pipeline and SQL query generation: Input Prompts: This feature allows customers to interact directly with the LLM by providing input prompts. These prompts are the primary method through which users can specify their requirements or ask questions to the LLM. Describe Pipeline: This skill enables users to obtain a comprehensive description of an existing pipeline. It helps in understanding and documenting the pipeline's structure and functionality. Analyze Pipeline: This feature ingests the entire pipeline configuration and analyzes it to make suggestions for optimization and improvement. It assists users in enhancing the efficiency and effectiveness of their pipelines. Mapper Configuration: Facilitates the configuration of the mapper snap by generating expressions to simplify the process of mapping input to output. Pipeline Generation: Users can create prototype pipelines using simple input prompts. This feature is geared towards streamlining the pipeline creation process, making it more accessible and less time-consuming. SQL Generation without Schema: Tailored for situations where the schema information is not available or cannot be shared, this feature generates SQL queries based solely on the customer's prompt, offering flexibility and convenience. SQL Generation with Schema (coming feb 2024): This advanced feature generates SQL queries by taking into account the schema of the input database. It is particularly useful for creating contextually accurate and efficient SQL queries. Data Usage & Opt-Out Options: At SnapLogic, we recognize the importance of data security and user privacy in the rapidly evolving Generative AI space. SnapGPT has been designed with these principles at its core, ensuring that customers can leverage the power of AI and machine learning while maintaining control over their data. Our approach prioritizes transparency, giving users the ability to opt-out of data sharing, and aligning with industry best practices for data handling. This commitment reflects our dedication to not only providing advanced AI solutions but also ensuring that these solutions align with the highest standards of privacy and data protection. Data Usage in SnapGPT: SnapGPT is designed to handle customer data with the utmost care and precision, ensuring that data usage is aligned with the functionality of each feature: Customer Input and Interaction: Customer inputs, such as prompts or pipeline configurations, are key to the functionality of SnapGPT. This data is used solely for the purpose of processing specific requests and generating responses or suggestions relevant to the user's query. No data is retained for model training purposes. Feature-Specific Data Handling: Each feature/skill of SnapGPT, like pipeline analysis or SQL generation, uses customer data differently. 
See the table below for details on each skill. Skill Name Description of the Skill Data Transferred to LLM Input Prompts Direct input prompts from customers are transferred to the LLM and tracked by SnapLogic analytics. Prompt details only; these are not stored or used for training by the LLM. Describe & Analyze Pipeline Allows customers to describe a pipeline, with the entire pipeline configuration relayed to the LLM. Entire pipeline configuration excluding account credential information. Mapper Configuration Enables sending input schema information within the prompt to the LLM for the “Mapper configuration” feature. Input schema information without account credential information. Pipeline Generation Uses input prompts to create pipeline prototypes by transmitting them to the LLM. Input prompts only; not stored or used for training by the LLM. SQL Generation W/out Schema Generates SQL queries based only on the customer's prompt in situations where schema information cannot be shared. Only the customer's prompt; no schema information is used. SQL Generation W/ Schema (Feb 2024) Generates accurate SQL queries by considering the schema of the input database. Schema of the input database excluding any account credentials, enhancing query accuracy. Future Adaptations: In the near future, we intend to offer customers opt-out options. Choosing to opt-out of including any environment-specific data in SnapGPT prompts can impact the quality of response from SnapGPT as it will lack additional context. As of the current version, usage of SnapGPT will include sending the data from the features listed above to the LLMs. We recommend that customers who are not comfortable with the described data transfers to wait for the opt-out option to become available. Impact of Opting Out: Choosing to opt-out of data sharing may impact the functionality and effectiveness of SnapGPT. For example, opting out of schema retrieval in SQL Generation may lead to less precise query outputs. Users are advised to consider these impacts when setting their data sharing preferences. Data Processing: Architecture: Data Flow: Data Retention & Residency: SnapLogic is committed to ensuring the secure handling and appropriate residency of customer data. Our data retention policies are designed to respect customer privacy while providing the necessary functionality of SnapGPT: Data Retention: No Retention for Model Training: SnapGPT is designed to prioritize user privacy. Therefore, no customer data processed by SnapGPT is retained for the purpose of model training. This ensures that user data is not used in any way to train or refine the underlying AI models. Storing Usage Data for Adoption Tracking: While we do not retain data for model training, SnapLogic stores usage data related to SnapGPT in Heap Analytics. This is strictly for the purpose of tracking product adoption and usage patterns. The collection of usage data helps us understand how our customers interact with SnapGPT, enabling us to continuously improve the product and tailor it to user needs. Data Residency: Location-Based Data Storage: Our control planes in the United States and the EMEA region adhere to the specific data residency policies of these locations. We ensure compliance with regional data protection and privacy laws, offering customers the assurance that their data is managed in accordance with local regulations. 
Controls – Admin, Groups, Users: SnapLogic provides robust control mechanisms for administrators, while ensuring that group and user-level controls align with organizational policies: Administrators have granular control over the use of SnapGPT within their organization. They can determine what data is shared with the LLM and have the ability to opt out of data sharing to meet specific data retention and sharing policies. Additionally, admins can control user access to various features and skills, ensuring alignment with organizational needs and security policies. Group Controls: Currently, groups do not have specific controls over SnapGPT. Group-level policies are managed by administrators to ensure consistency and security across the organization. User Controls: Users can access and utilize the features and skills of SnapGPT to which they are entitled. User entitlements are managed by administrators, ensuring that each user has access to the necessary tools for their role while maintaining data security and compliance. Guidelines for Secure and Compliant use of SnapGPT At SnapLogic, we understand the critical importance of data security and compliance in today’s digital landscape. As such, we are dedicated to providing our customers with the tools and knowledge necessary to utilize SnapGPT in a way that aligns with their internal information security (InfoSec) and privacy policies. This section offers guidelines to help ensure that your interaction with SnapGPT is both secure and compliant with your organizational standards. Customer Data Control: Customers are encouraged to actively manage and control the data they share with SnapGPT. By understanding and utilizing the available admin and user controls, customers can ensure that their use of SnapGPT aligns with their internal InfoSec and privacy policies. Best Practices for Data Sharing: We recommend that customers review and follow best practices for data sharing, especially when working with sensitive or confidential information. This includes using anonymization or pseudonymization techniques where appropriate, and sharing only the data in prompts and pipelines that is necessary for the task at hand. Integrating with Internal Policies: Customers should integrate their use of SnapGPT with their existing InfoSec and privacy frameworks. This integration ensures that data handling through SnapGPT remains consistent with the organization’s overall data protection strategy. Regular Review and Adjustment: Customers are advised to regularly review their data sharing settings and practices with SnapGPT, adjusting them as necessary to remain aligned with evolving InfoSec and privacy requirements. Training and Awareness: We also suggest that customers provide regular training and awareness programs to their users about the responsible and secure use of AI tools like SnapGPT, emphasizing the importance of data privacy and protection. Compliance: For detailed information on SnapLogic’s commitment to compliance with various regulatory standards and data security measures, please visit our comprehensive overview at SnapLogic Security & Compliance (https://www.snaplogic.com/security-standards). This resource provides an in-depth look at how we adhere to global data protection regulations, manage data security, and ensure the highest standards of compliance across all our products, including SnapGPT. 
For specific compliance inquiries or more information on how we handle compliance in relation to SnapGPT, please contact the SnapLogic Compliance Team at Security@snaplogic.com. For further details or inquiries regarding SnapGPT or any other SnapLogic AI services, please contact our SnapLogic AI Services Team (ai-services@snaplogic.com). For more information on SnapLogic Security and Compliance: https://www.snaplogic.com/security-standards
Project Structures and Team Rights Guide
Overview This document outlines the recommended organizational structure for Project Spaces, Projects and team rights within your SnapLogic org. Authors: SnapLogic Enterprise Architecture team Integration Storage Hierarchy In the SnapLogic platform, integrations (pipelines) are managed within the following hierarchy: Organization - Multiple customer organizations are configured based on the development environment, e.g. DEV, QA, STAGING, UAT, and PROD. Project Space - A team in a Business Unit or Business Group with access to a workspace to collaborate and implement integrations in one central location. Project - Groups related pipelines by the integration, aggregation, and reporting function they perform. Pipeline - A specific implementation of an integration. For example, the path “/MyDevOrg/SLEntArch/SamplePipelines/WorkdayToSnowflake” is broken down into the following components: Organization - MyDevOrg Project Space - SLEntArch Project - SamplePipelines Pipeline - WorkdayToSnowflake Recommended Hierarchy Naming Conventions Clarity is one of the most important factors when naming project spaces, projects, and pipelines. Naming each level of the hierarchy must make sense to all developers and administrators so that integrations can be found quickly and easily. If using Triggered Tasks, it is important to ensure that no characters are used that will violate HTTP URI naming conventions. Further, you may wish to avoid characters that require URL encoding, such as spaces and commas. Below is an example naming convention for your project spaces, projects, and pipelines: Project spaces are named based on business unit or project team: Sales_and_Marketing EnterpriseBI ProfessionalServices Projects are named based on business target endpoint Sales_and_Marketing/MarketingForecast EnterpriseBI/SalesDashboardReporting ProfessionalServices/CommunityExamples Integrations (pipelines) are named based on business function Sales_and_Marketing/MarketingForecast/Fall_Marketing_New_Customers_To_SalesForce EnterpriseBI/SalesDashboardReporting/Supplier_Invoices_To_Workday ProfessionalServices/CommunityExamples/Community_27986_Example_JSON_MultiArray_Join Keep in mind the shallow hierarchy (project space/project) when considering your naming scheme for project spaces and projects. In most orgs, it is acceptable to assign a project space to each business unit and allow that business unit to create projects within their project space based on integration function or target. However, if you expect a very large number of pipelines to be created by a single business unit, you might want to consider an allowance for multiple project spaces for a given business unit. Shared Folders Root Shared A special project named “shared” is added to each SnapLogic Organization (org). Using the org name in the above example, this would be /MyDevOrg/shared. This is commonly referred to as the “root shared” folder. This folder will always exist and is automatically assigned Full Access (read, write, execute) to all members of the “admins” group of the org and Read-Execute access to all other users. As a best practice, the root shared folder should only contain objects (accounts, files, pipelines, and tasks) that all SnapLogic users in your org should have access to use.
Some examples may include: SMTP account used for the Email Sender snap Readonly database account used to access common, public tables Shared expression libraries that contain global static variables or user defined functions for common string/date manipulation Shared pipelines such as error handlers or globally re-usable code Project Space Shared Another special project named “shared” is added to each project space in the org. Using the example path above, this would be /MyDevOrg/SLEntArch/shared. This folder will always exist under each project space and inherits the permissions assigned to the project space. As a best practice, the project space shared folder should only contain objects (accounts, files, pipelines, and tasks) that all SnapLogic users with access to the Project Space should have access to use. Some examples may include: Database accounts used to access common tables within your business unit Shared expression libraries that contain static variables and user defined functions common to your business unit Shared/reusable pipelines common to your business unit User Groups We recommend that you create the following Groups in all of your SnapLogic orgs: Operators - this group contains the users that may need to manually execute a pipeline but do not require Full Access to the projects Migrators - this group contains the users that will perform object migrations in your orgs but do not need to Execute pipelines You should also create Developer groups specific to each Project Space and/or Project within your org. Using the example project spaces and projects listed in the Naming Conventions section of this document, you may want to add the following Groups: Groups specific to Project Space Sales_and_Marketing_Developers EnterpriseBI_Developers ProfessionalServices_Developers Groups specific to Project MarketingForecast_Developers SalesDashboardReporting_Developers CommunityExamples_Developers You may choose to enable access for developers to only see objects within the project they are working in, or you could allow read-only access to all projects within their project space to allow for some cross-project design examples. Typically, the Developer groups will have Full Access in your development org for the Projects that they are working in, with Read-Execute access to the Project Space “shared” folder and Read-Execute access to the root “shared” folder. Developer groups will also have Read-Only access in all non-development orgs for the same Project Space “shared” and Projects that they can access in your development org. If you have a larger SnapLogic development community in your organization, you may wish to distribute the administration of Projects and create Admin groups for each Project Space who will be assigned ownership of the Project Space, which allows them to create new Projects and maintain all permissions within the Project Space. Default Users We recommend that the following service accounts be added to your org(s): snaplogic_admin@<yourdomain> - this user should own the root SnapLogic shared folder, and all/most SnapLogic Project Spaces in your org(s); add this user to the “admins” group snaplogic_service@<yourdomain> - this user should own all of your SnapLogic tasks and have permissions related to executing tasks for all Projects. Note that Read-Execute Access is required as a minimum; Full Access is required if any files are written back to the SLDB of the Project during processing. 
Add this user to the “Operators” group Note that during migration of tasks to your non-development org(s), you should either use the snaplogic_service@<yourdomain> user to perform the migration, or use the Update Asset Owner API to change the owner of the task after migration. Tasks are owned by the user who creates them, so if a user in the Migrators group performs the migration, they will be assigned as the owner and may not have permission to successfully execute the task in the target org(s). Hierarchy Permissions Recommended access to the root “shared” project: admin@snaplogic.com - Owner “admins” group - Full Access “members” group - Read/Execute Access “Operators” group - Read/Execute Access “Migrators” group - Full Access “Support” group - Read/Execute Access You may wish to limit Execute Access to only certain teams. If so, change the “members” group to Read Only Access and grant Read/Execute Access to your desired team groups. If you perform migrations only within specific day/time windows, you can add/remove users from the Migrators group using a scheduled task that calls the Groups API to replace all members of the Migrators group and either remove all users from the group (close the migration window) or restore users to the group (open the migration window), as sketched at the end of this article. Recommended access to the Project Space “shared” project: admin@snaplogic.com - Owner “admins” group - Full Access “members” group - Read-Only Access (optional) “Operators” group - Read/Execute Access “Migrators” group - Full Access “<Project>_Admins” group(s) - Full access in development “<Project>_Developers” group(s) - Read/Execute Access in development You may choose to grant Read-Only access to your <Project>_Admins and <Project>_Developers groups in non-development environments depending on your support team structure Recommended access to the Projects: admin@snaplogic.com - Owner “admins” group - Full Access “members” group - Read-Only Access (optional) “Operators” group - Read/Execute Access “Migrators” group - Full Access “<Project>_Admins” group(s) - Full Access (only in development) “<Project>_Developers” group(s) - Full Access (only in development) You may choose to grant Read-Only access to your <Project>_Admins and <Project>_Developers groups in non-development environments depending on your support team structure.
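The migration-window automation mentioned above can be driven through the SnapLogic public APIs. The sketch below is only an outline of the approach in Python: the control-plane URL, the Groups API path, the payload shape, and the credentials are all assumptions made for illustration and must be verified against the SnapLogic public API documentation for your org before use.

import requests

# Assumed values for illustration only -- confirm the control-plane URL, API
# path, payload, and authentication method in the SnapLogic API documentation.
BASE_URL = "https://elastic.snaplogic.com/api/1/rest/public"  # assumed pod URL
ORG = "MyDevOrg"
AUTH = ("snaplogic_admin@yourdomain.com", "<password-or-token>")  # placeholder

def set_migrators(members):
    """Replace the membership of the Migrators group (assumed endpoint and payload)."""
    url = f"{BASE_URL}/groups/{ORG}/Migrators"
    response = requests.put(url, json={"members": members}, auth=AUTH, timeout=30)
    response.raise_for_status()
    return response.json()

# Close the migration window by emptying the group...
set_migrators([])
# ...and reopen it later by restoring the approved migrator accounts.
set_migrators(["migrator1@yourdomain.com", "migrator2@yourdomain.com"])

A scheduled SnapLogic pipeline could achieve a similar effect natively with a REST Snap pointed at the same endpoint, keeping the automation inside the platform.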
Collaborative Development Best Practices Sharing knowledge to ensure repeatable project success Executive Summary This guide is designed for organizations wishing to take full advantage of the potential value of the SnapLogic integration platform through adoption of field-tested best practices for collaborative development. These policies and advice will help users develop smoothly from the technical success of an initial project, to repeatable success of follow-on projects, enabling true digital transformation. The SnapLogic platform offers a number of capabilities that support the goal of achieving greater organizational maturity and increased collaboration. This document aims to highlight some of these and place them in the correct context for users to be able to determine the precise steps that are most relevant in their own environment and organization. By implementing these best practices, SnapLogic users in both IT and line-of-business teams will be able to optimize their efforts and ensure the delivery of maximum value from the original investment in SnapLogic. This document is not an exhaustive how-to guide; the SnapLogic online product documentation remains the authoritative source for specific configurations. In addition, SnapLogic offers a number of consultative programs for customers who are looking to accelerate their digital transformation. Please consult your account manager to find out more about which of these resources may be right for your situation. Introduction The graphical low-code/no-code nature of the SnapLogic development environment makes it easy to get started and build useful automation that solves a user’s pressing business problem, lets them answer an urgent question about their data, or enables them to integrate previously disconnected systems via API. The introduction of generative integration through SnapGPT further lowers the barrier to entry, capturing users’ intent in natural language and generating the corresponding pipeline or query automatically. This rapid prototyping and development is key to enabling users to achieve their goals quickly, which in turn ensures rapid return on investment (ROI) for the adoption of SnapLogic technology. SnapLogic’s integration platform as a service (iPaaS) is a full-fledged development environment, but it can sometimes be underestimated, precisely because of how easy it is to use. IT teams may not think to put in place the sorts of best practices that they would for other, more traditional development tools. In turn, users outside of IT may not have the specific experience to expect such best practices to be implemented. The risk is that multiple projects may start in parallel and operate in isolation, resulting in duplication of effort and un-optimized use of infrastructure. SnapLogic architects — employees, partners, and customers — have developed best practices that should be followed to ensure the success of collaborative development projects. These general principles are distilled from hundreds of engagements where users were able to transition from tactical, piecemeal automation to true enterprise integration, enjoying the benefits of that deep architecture work. These recommendations are ready to be adopted and adapted to specific environments and existing policies and standards. Shared Center of Excellence The first important recommendation is that a central point of contact be defined for all activities relating to SnapLogic.
This group should not be seen as the “owners” of the platform in the sense that they might be with more traditional development environments, developing components in response to requests from users in line-of-business teams. Instead, they should act as expert consultants, combining a close understanding of the wider business priorities with a specific understanding of the local SnapLogic deployment. The Center of Excellence (CoE) should administer the SnapLogic platform, deploying, scaling, securing, and retiring components as necessary. Beyond these inward-facing tasks, the CoE should also ensure the integration of the SnapLogic platform into existing operational management systems, making logs and events from the SnapLogic infrastructure and pipelines visible and accessible to central SRE teams. Another useful point of integration is version control and backup, ensuring that business-critical integrations developed inside SnapLogic are managed with the same level of care and attention as those built with more traditional development tools. While much of the routine maintenance work can be carried out entirely within an IT group, certain actions may require consultation with users. Examples of technical actions to support end-user needs include: Scaling the SnapLogic infrastructure to match planned business growth Deploying Snaplexes in a distributed fashion to satisfy business requirements (regional access, performance, regulatory or policy compliance) Managing credentials and access required for integration with external systems, ensuring those integrations are available to users while also enforcing security mandates Reviewing new or proposed developments against best practices Developing shared components to avoid duplication of effort and facilitate standardization The CoE should be staffed primarily by employees of the customer company or long-term consultants who have a deep understanding of the business context within which the SnapLogic platform has been adopted and is expected to operate. Strong relationships with users in line-of-business (LoB) teams will facilitate an ongoing consultative engagement between the CoE and end-users. Customers who desire specific support may choose to augment their own CoE with a Resident Architect (RA), an expert SnapLogic employee who is assigned to support a specific customer organization. The RA will share best practices and guide the CoE personnel in the most effective deployment and use of SnapLogic technology, ensuring accelerated return on the investment in SnapLogic technology. Shared Component Library — Design for Reuse In traditional, text-oriented development, it has been best practice almost since the beginning of programming to establish shared libraries of functionality. With increasing levels of abstraction in the design of programming languages themselves, these shared libraries moved from straightforward re-use of code through copying, to becoming external resources that can be called upon as easily as internal components of a program. In the same way, mature organizations will ensure that pipelines are created with the expectation of reuse built into their design. These reusable pipelines should be made available as part of a shared component library. SnapLogic offers a number of Pattern Pipelines for this purpose as part of the Cloud Pattern Catalog. Meanwhile, Pattern Pipelines created within the customer organization will be listed under Projects.
In both cases, the Pattern Pipelines are available for immediate use, but will generally require users to supply credentials and potentially some minor environment customization. This organization-level shared library should be owned by the CoE in order to ensure that expected standards are followed and all required documentation is provided. The CoE may choose to develop components directly, harvest components that have been created for specific purposes and develop them into reusable components, or some combination of the two. SnapLogic allows for the same mechanisms to be implemented as in any other development suite. The main principles that apply are the following: Modularity: break complex workflows up into smaller pipelines, reusing as appropriate Loose coupling: avoid dependencies between pipelines, or on specific environmental concerns High cohesion: grouping related functionality in a pipeline helps ensure robustness, reliability, reusability, and understandability Information hiding (encapsulation): design pipelines so that areas expected to change regularly are grouped together to avoid regular wide-ranging changes Separation of concerns: concentrate interactions with particular third-party systems, in order to facilitate any future changes which might be required (e.g. breaking upgrades or migrations) Pipeline Nesting The simplest form of component reuse is nesting pipelines, that is, invoking one pipeline from another. This action relies on the Pipeline Execute Snap, which allows for one pipeline to call another, ensuring that complex or commonly-required functionality can be built once and made available wherever it is needed. The called pipeline can also execute on a different Snaplex than the parent pipeline, which enables further architectural flexibility. For instance, a pipeline running on less-secure infrastructure such as a DMZ could access functionality that it would not otherwise have direct connectivity with, by invoking a pipeline that executes on secured infrastructure. The benefit here is twofold: the sensitive functionality is only accessed by a single pipeline that can be exhaustively secured and debugged, and network segmentation can be used to limit access to that single point. Configurable Pipelines Users can also create an expression library file that contains the details of the environment and import that configuration into pipelines as required. For example, individual teams within the wider organization might have their own specific configuration library with the appropriate settings for the environment. Pipelines can then be moved around from one environment to another without needing to be updated, further facilitating sharing. The same mechanism should be used to facilitate promotion between development, test, and production environments, without needing to reconfigure the pipelines themselves. These two goals can support one another, further stimulating adoption of best practices. Error Handling A particular type of pipeline that may well be overlooked by users without specific development experience is the Error Pipeline. The baseline requirement of this type of pipeline is to handle errors that may occur during execution of the business logic in the main pipeline. However, in a larger and more articulated organization, especially where there is extensive code reuse, error pipelines will also need to be standardized. 
Standardization of error pipelines satisfies two requirements: Handling of common error conditions – It is expected that many pipelines in an organization will interact with common systems and therefore will encounter the same or similar error conditions. These situations should be handled in a uniform fashion, both to avoid duplicate work to handle the same error, and to ensure that the error condition is indeed handled correctly every time it occurs, regardless of which pipeline has triggered it. Easier troubleshooting – when users are presented with an error message, rapid resolution requires that it be easy to understand and that it be obvious which actions can be taken (and by whom) to resolve the issue. One of the best ways to deliver on this goal of reducing mean time to resolution (MTTR) is to ensure that the same situation will always trigger the same clear and comprehensible error message. An Error Pipeline is simply a pipeline that is created specifically for handling errors; it processes error documents produced by other Pipelines in the environment. The Error Pipeline runs even if errors are not encountered in the Snaps from the main Pipeline. Error Pipelines also capture the data payload that triggered the error, which can be vitally important to error resolution. Since the SnapLogic platform will not store customer data, the error pipeline is the most standardized way to capture the actual data which was being processed by the pipeline when the error occurred. Architecture Best Practices for Collaboration Environment Separation It is recommended to create separate environments, at a minimum for development and production, and ideally with an explicit testing area as well. This practice will help to avoid production outages, since errors can be found and corrected in the pre-production environments, before the pipeline is promoted to production where errors would have greater impact. Different permissions and access controls should be set on the different environments. The development environments should be as open as possible, to enable the maximum number of users to take advantage of the ease of use of the SnapLogic platform to deliver business value. Promotion to testing and production environments on the other hand should be tightly restricted and controlled. SnapLogic offers a powerful hierarchical model of assets and containers, with individual permissions to enable the desired security level to be set for each item. The Configurable Pipelines feature helps to facilitate promotion of pipelines between environments without the need for manual reconfiguration, which itself would risk introducing errors. By configuring the pipelines in this way, the promotion process can be made entirely automatic. This automation in turn makes it possible to include automated testing and rollback as a routine part of the promotion process. Operational Best Practices Versioning and Backup As the SnapLogic platform is used and new automations are developed, significant business value will be captured within the pipelines that are created. The impact of the loss of any of these developments would be correspondingly large, whether due to human error or technical failure. The built-in versioning capabilities allows users to replace an existing pipeline with a newer one or to rollback to a previous version in the event that something went wrong. 
Each version is tagged automatically with information about its creator, as well as the time and date of creation, but it is recommended that users also document the state and purpose of the pipeline in the notes. It is recommended that users take advantage of these features to safeguard the valuable automations and integrations that they develop over time, and to avoid the negative consequences for the organization that would occur if those were to be lost. The CoE should work to make sure that users without a background in development are made aware of the importance of versioning their work correctly. In addition to these built-in capabilities, the SnapLogic iPaaS offers out-of-the-box integration with industry-standard Git version control, and specifically the following implementations (as of Fall 2023): GitHub GitHub Enterprise Server (GHES) GitLab.com GitLab (self-managed) Azure Repos Through Git integration, users can track, update, and manage versions of SnapLogic project files, pipelines, tasks, and accounts, using either the SnapLogic graphical interface or the SnapLogic Public APIs. These operations include checkout, add, remove, pull, and commit, as well as creating branches and tags. Git integration supports both Gitflow and trunk-based development; however, SnapLogic recommends trunk-based development. Simultaneous Edits It is important to point out that SnapLogic does not currently support simultaneous collaboration in the same pipeline. If one user is editing a pipeline and a different user makes a change to the same pipeline, the original user will be prompted that a new version of the pipeline exists. This is another reason to adopt explicit version management strategies, in order to avoid confusing users who may be used to simultaneous collaborative editing in other tools such as Google Docs. Secrets Management Secrets Management is an established best practice in traditional development environments, enabling organizations to use a third-party secrets manager to store endpoint credentials. Instead of entering credentials directly in SnapLogic accounts and relying on SnapLogic to encrypt them, the accounts contain only the information necessary to retrieve the secrets. During validation and execution, pipelines obtain the credentials directly from the secrets manager. With Secrets Management, the configured secrets are never stored by the SnapLogic control plane or by Snaplex nodes, facilitating security hardening and simplifying compliance with internal or external policies. Currently, SnapLogic supports integration with several third-party secrets managers (as of Fall 2023). Monitoring and Observability Integrating the SnapLogic platform with third-party monitoring and observability tools allows organizations to: Avoid relying on the tool to monitor itself Reduce Total Cost of Ownership (TCO) by managing monitoring, alerts, and notification systems in one place Manage the historical retention of data for audit needs Correlate data from different tools in order to understand correctly the cause and impact of an issue, avoiding duplication of troubleshooting efforts The SnapLogic platform uses OpenTelemetry to support telemetry data integration with third-party Observability tools (as of Fall 2023, this feature is available as a beta release, enabling monitoring of Pipeline execution runtime logs in Datadog). Conclusion The SnapLogic iPaaS is a full-fledged development environment, and should be considered as such when it comes to both management of assets developed within the platform, and industry-standard integration with third-party tools.
The ease of use of SnapLogic naturally lends itself to wide adoption by users outside of IT, and this trend is only expected to accelerate further with the introduction of generative integration via SnapGPT. The speed with which these users can achieve their goals and deliver business value may, however, be overlooked by centralized IT teams, leading them to underestimate the potential business value that can be achieved through implementation of industry-standard best practices for development environments. Fortunately, most of these well-understood best practices — collaborative development, modular architectures, staged infrastructure, and integration with operational management tools — are available out of the box with SnapLogic. All that is required is active engagement between core IT, the CoE team responsible for the SnapLogic platform, and the end-users themselves, who are after all the ones closest to the business value that is being delivered by the platform.
Disaster Recovery
How SnapLogic will recover the platform and data in the event of a disaster; and a forward look at SnapLogic’s disaster recovery strategy. Introduction The purpose of this whitepaper is to present SnapLogic’s method and means to recover from a disruptive event to ensure workloads run as expected and remain durable. Background Businesses create and manage large volumes of data. As companies extend their use of artificial intelligence into daily operations, data becomes mission critical. The impact of data loss or corruption can be significant. Preparing plans to continue business in response to an event is not only good management, but essential in today’s business climate. At SnapLogic, we are revising our view of resiliency and taking a broader approach. Beyond the standard disaster recovery and cybersecurity requirements, real resiliency should include being able to handle a full range of responses a company needs to keep its business going, whatever happens. SnapLogic’s Disaster Recovery Model SnapLogic’s infrastructure is deployed on Amazon Web Services. When users access the application, their request is filtered through a load balancer, a cloud networking device that helps distribute large volumes of internet traffic. Requests are passed from the load balancer to a cluster of web servers, which present the application(s) to the user. The application servers, sometimes referred to as the Control Plane, facilitate and manage pipeline creation, updates, and monitoring. The data execution servers, sometimes referred to as Cloudplexes, Groundplexes, or the Data Plane, are the heart of pipeline execution. Database servers store pipeline metadata and log data each time a pipeline is run. The last component is Snaps, sometimes referred to as connectors or APIs; they are small computational units that process data. These assets are monitored for performance 24x7x365. If performance drops below a certain threshold, an automated process notifies the appropriate team(s) to investigate and, if necessary, recover or replace an infrastructure asset. Assets are managed using a combination of automated and manual processes. The presentation, data execution, and database layers would be restored manually. The application and Snaps layers leverage fully automated processes. In the event of a disaster, SnapLogic’s end-to-end recovery time for the entire infrastructure is within 48 hours of being hard down (inoperable). The amount of data loss would be 2 hours or less. The data execution servers, sometimes referred to as Cloudplexes, Groundplexes, or the Data Plane, have a shared responsibility for resiliency. Groundplex resiliency is solely the responsibility of the customer. Cloudplex resiliency is shared. SnapLogic is responsible for the resiliency of the infrastructure; once recovered, the customer is responsible for recovering / restarting their pipeline(s). This outlines SnapLogic’s disaster recovery model. If you have any questions about this whitepaper or would like to discuss resiliency in more detail, please contact your SnapLogic representative who can schedule a deeper discussion.