Multi Pipeline Function Generator - Simplifies Agent Worker Pipeline
This article introduces a new Snap called the "Multi Pipeline Function Generator". The Multi Pipeline Function Generator is designed to take existing pipelines in your SnapLogic project and turn their configurations into function definitions for LLM-based tool calling. It achieves the following:

- It replaces the existing chain of function generators, thereby reducing the length of the worker pipeline.
- Combined with our updates to the tool calling Snaps, it allows multiple tool calling branches to be merged into a single branch, simplifying the pipeline structure.
- Users can directly select the desired pipeline to be used as a tool from a dropdown menu. The Snap automatically retrieves the tool name, purpose, and parameters from the pipeline properties and generates a function definition in the required format.

Problem Statement

Currently, the complexity of the agent worker pipeline grows linearly with the number of tools it uses. The image below shows a worker pipeline with three tools: it requires three function generators and has three tool calling branches to execute the different tools. This becomes problematic when the number of tools is large, as the pipeline grows very long both horizontally and vertically.

Current Agent Worker Pipeline With Three Tools

Solution Overview

One Multi Pipeline Function Generator Snap can replace multiple function generators (as long as the tool is a pipeline; it is not applicable if the tool is of another type, such as an OpenAPI or APIM service).

New Agent Worker Pipeline Using "Multi Pipeline Function Generator"

Additionally, each tool definition it outputs includes the corresponding pipeline's path. This allows downstream components (the Pipeline Execute Snap) to call the respective tool pipeline directly using that path, as shown below.

The Multi Pipeline Function Generator Snap allows users to select multiple tool pipelines at once through dropdown menus. It reads the data needed to generate the function definitions from the pipeline properties. Of course, this requires that the data has been set up in the pipeline properties beforehand (explained later). The image below shows the settings for this Snap.

Snap Settings

How to Use the Snap

To use this Snap, you need to:

1. Fill in the information needed to generate the function definition in the properties of your tool pipeline.
   - The pipeline's name becomes the function name.
   - The information under Info -> Purpose becomes the function description.
   - Each key in your OpenAPI specification is treated as a parameter, so you will also need to add the expected input parameters to the list of pipeline parameters. Note that in the current design, the pipeline parameters specified here are used solely for generating the function definition. When using the parameters within the pipeline, you do not need to retrieve their values through pipeline parameters; instead, you can access the argument values, as determined by the model from the function definition, directly from the input document.
2. Select the pipeline as a tool from the dropdown menu in the Multi Pipeline Function Generator Snap.
3. In the second output of the tool calling Snap, keep only one branch. In the Pipeline Execute Snap, use the expression $sl_tool_metadata.path to dynamically retrieve the path of the tool pipeline being called. See the image below.

Below is an example of the pipeline properties for the tool 'CRM_insight' for your reference.
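The properties themselves appear as a screenshot in the original post. To make the mapping concrete, here is a rough, purely illustrative sketch of the kind of tool definition the Snap could derive from such properties: the pipeline name becomes the function name, Info -> Purpose becomes the description, and the pipeline parameters become the function parameters. The path, parameter names, and exact output field names below are assumptions for illustration, not taken from the actual CRM_insight pipeline or the Snap's documentation.

```python
# Illustrative sketch only: an approximation of a function definition that a
# Multi Pipeline Function Generator might emit for a tool pipeline.
# All paths, parameters, and field names here are hypothetical.
crm_insight_tool = {
    "sl_tool_metadata": {
        # Pipeline path, later consumed by Pipeline Execute via $sl_tool_metadata.path
        "path": "/MyOrg/MyProject/CRM_insight",  # hypothetical project path
    },
    "function": {
        "name": "CRM_insight",  # taken from the pipeline's name
        "description": "Look up CRM records and summarize recent customer activity.",  # from Info -> Purpose
        "parameters": {  # derived from the pipeline parameters
            "type": "object",
            "properties": {
                "customer_name": {"type": "string"},  # hypothetical parameter
            },
            "required": ["customer_name"],
        },
    },
}
```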
Below is the settings page of the original function generator Snap for comparison. As you can see, the required information is the same; the difference is that we now fill this information directly into the pipeline's properties.

Step 3 - reduce the number of branches

More Design Details

The tool calling Snap has also been updated to support $sl_tool_metadata.path, since the model's initial response does not include the pipeline path, which is needed. After the tool calling Snap receives the tools the model wants to call, it adds the sl_tool_metadata containing the pipeline path to the model's response and outputs it to the Snap's second output view. This allows us to use it in the Pipeline Execute Snap later. This feature is supported for tool calling with the Amazon Bedrock, OpenAI, Azure OpenAI, and Google GenAI Snap packs. The pipeline path can accept either a string or a list as input.

By turning on the 'Aggregate input' mode, multiple input documents can be combined into a single function definition document for output, similar to a Gate Snap. This can be useful in a scenario like the following: you use a SnapLogic List Snap to enumerate all pipelines within a project, then use a Filter Snap to select the desired tool pipelines, and finally use the Multi Pipeline Function Generator to convert this series of pipelines into function definitions.

Example Pipelines

Download here.

Conclusion

In summary, the Multi Pipeline Function Generator Snap streamlines the creation of function definitions for pipelines used as tools in agent worker pipelines. This significantly reduces pipeline length in scenarios with numerous tools, and by associating tool information directly with the pipeline, it enhances overall manageability. Furthermore, its applicability extends across various providers.

Embeddings and Vector Databases
What are embeddings?

Embeddings are numerical representations of real-world objects, such as text, images, or audio. They are generated by machine learning models as vectors (arrays of numbers), where the distance between vectors can be seen as the degree of similarity between objects. While an embedding model may assign its own meaning to each dimension, there is no guarantee that different embedding models interpret those dimensions in the same way.

For example, the words "cat", "dog", and "apple" might be embedded into the following vectors:

cat -> (1, -1, 2)
dog -> (1.5, -1.5, 1.8)
apple -> (-1, 2, 0)

These vectors are made up for the sake of a simple example; real vectors are much larger (see the Dimension section for details). Visualizing these vectors as points in a 3D space, we can see that "cat" and "dog" are closer together, while "apple" is positioned further away.

Figure 1. Vectors as points in a 3D space

By embedding words and contexts into vectors, we enable systems to assess how related two embedded items are to each other via vector comparison.

Dimension of embeddings

The dimension of an embedding refers to the length of the vector representing the object. In the previous example, we embedded each word into a 3-dimensional vector. However, a 3-dimensional embedding inevitably leads to a massive loss of information. In reality, word embeddings typically require hundreds or thousands of dimensions to capture the nuances of language. For example:

- OpenAI's text-embedding-ada-002 model outputs a 1536-dimensional vector.
- Google Gemini's text-embedding-004 model outputs a 768-dimensional vector.
- Amazon Titan's amazon.titan-embed-text-v2:0 model outputs a 1024-dimensional vector by default.

Figure 2. Using text-embedding-ada-002 to embed the sentence "I have a calico cat."

In short, an embedding is a vector that represents a real-world object, and the distance between these vectors indicates the similarity between the objects.

Limitation of embedding models

Embedding models are subject to a crucial limitation: the token limit, where a token can be a word, punctuation mark, or subword part. This constraint defines the maximum amount of text a model can process in a single input. For instance, the Amazon Titan Text Embeddings models can handle up to 8,192 tokens. When input text exceeds the limit, the model typically truncates it, discarding the remaining information. This can lead to a loss of context and diminished embedding quality, as crucial details might be omitted. Several strategies can help mitigate this impact:

- Text summarization or chunking: Long texts can be summarized or divided into smaller, manageable chunks before embedding.
- Model selection: Different embedding models have different token limits. Choosing a model with a higher limit can accommodate longer inputs.

What is a Vector Database

Vector databases are optimized for storing embeddings, enabling fast retrieval and similarity search. By calculating the similarity between the query vector and the other vectors in the database, the system returns the vectors with the highest similarity, indicating the most relevant content. The following diagram illustrates a vector database search: a query vector, 'favorite sport', is compared to a set of stored vectors, each representing a text phrase. The nearest neighbor, 'I like football', is returned as the top result.

Figure 3. Vector Query Example
Figure 4. Store Vectors into Database
Figure 5. Retrieve Vectors from Database
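To make the nearest-neighbor idea behind such a search concrete, here is a minimal brute-force sketch in Python using the toy cat/dog/apple vectors from earlier. A real vector database relies on optimized index structures rather than scanning every vector, but the ranking principle (score by similarity, return the best matches) is the same.

```python
import numpy as np

# Toy 3-dimensional embeddings from the example above
vectors = {
    "cat":   np.array([1.0, -1.0, 2.0]),
    "dog":   np.array([1.5, -1.5, 1.8]),
    "apple": np.array([-1.0, 2.0, 0.0]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (closer to 1 = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, k: int = 2):
    """Brute-force search: score every stored vector against the query."""
    scores = {name: cosine_similarity(query, v) for name, v in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Querying with the "cat" vector ranks "dog" well above "apple"
print(top_k(vectors["cat"]))
```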
When working with vector databases, two key parameters come into play: Top K and the similarity measure (or distance function).

Top K

When querying a vector database, the goal is often to retrieve the items most similar to a given query vector. This is where the Top K concept comes into play: Top K refers to retrieving the K most similar items based on a similarity metric. For instance, if you are building a product recommendation system, you might want to find the top 10 products similar to the one a user is currently viewing. In this case, K would be 10, and the vector database would return the 10 product vectors closest to the query product's vector.

Similarity Measures

To determine the similarity between vectors, various distance metrics are employed, including:

- Cosine similarity: Measures the cosine of the angle between two vectors. It is often used for text-based applications because it captures semantic similarity well. A value closer to 1 indicates higher similarity.
- Euclidean distance: Calculates the straight-line distance between two points in Euclidean space. It is sensitive to magnitude differences between vectors.
- Manhattan distance: Also known as L1 distance, it calculates the sum of the absolute differences between corresponding elements of two vectors. It is less sensitive to outliers than Euclidean distance.

Figure 6. Similarity Measures

There are many other similarity measures not listed here. The choice of distance metric depends on the specific application and the nature of the data; it is recommended to experiment with different similarity metrics to see which one produces better results.

What embedders are supported in SnapLogic

As of October 2024, SnapLogic supports embedders for major models and continues to expand its support. Supported embedders include:

- Amazon Titan Embedder
- OpenAI Embedder
- Azure OpenAI Embedder
- Google Gemini Embedder

What vector databases are supported in SnapLogic

- Pinecone
- OpenSearch
- MongoDB
- Snowflake
- Postgres
- AlloyDB

Pipeline examples

Embed a text file

1. Read the file using the File Reader Snap.
2. Convert the binary input to a document format using the Binary to Document Snap, as all embedders require document input.
3. Embed the document using your chosen embedder Snap.

Figure 7. Embed a File
Figure 8. Output of the Embedder Snap

Store a Vector

1. Use the JSON Generator Snap to simulate a document as input, containing the original text to be stored in the vector database.
2. Vectorize the original text using the embedder Snap.
3. Use a Mapper Snap to shape the structure into the format required by Pinecone: the vector field is named "values", and the original text and other relevant data are placed in the "metadata" field.
4. Store the data in the vector database using the vector database's upsert/insert Snap.

Figure 9. Store a Vector into Database
Figure 10. A Vector in the Pinecone Database

Retrieve Vectors

1. Use the JSON Generator Snap to simulate the text to be queried.
2. Vectorize the original text using the embedder Snap.
3. Use a Mapper Snap to shape the structure into the format required by Pinecone, naming the query vector "vector".
4. Retrieve the top 1 vector, which is the nearest neighbor.

Figure 11. Retrieve Vectors from a Database

[ { "content": "favorite sport" } ]

Figure 12. Query Text
Figure 13. All Vectors in the Database

{
  "matches": [
    {
      "id": "db873b4d-81d9-421c-9718-5a2c2bd9e720",
      "score": 0.547461033,
      "values": [],
      "metadata": {
        "content": "I like football."
      }
    }
  ]
}

Figure 14. Pipeline Output: the Closest Neighbor to the Query
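For readers who want to see the same store-and-retrieve flow outside SnapLogic, the sketch below is a rough Python equivalent of the pipelines above, assuming an OpenAI embedder (text-embedding-ada-002) and the Pinecone client. The index name, IDs, and API keys are placeholders, and the SnapLogic Snaps handle these steps for you; this is only meant to show what the upsert and query calls do conceptually.

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                        # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder key
index = pc.Index("demo-index")                  # hypothetical 1536-dimension index

def embed(text: str) -> list[float]:
    """Embed text with text-embedding-ada-002 (1536 dimensions)."""
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

# Store: the vector goes in "values", the original text is kept in "metadata"
index.upsert(vectors=[{
    "id": "doc-1",
    "values": embed("I like football."),
    "metadata": {"content": "I like football."},
}])

# Retrieve: embed the query text and ask for the top 1 nearest neighbor
result = index.query(vector=embed("favorite sport"), top_k=1, include_metadata=True)
print(result.matches[0].metadata["content"])  # expected: "I like football."
```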
Embedders and vector databases are widely used in applications such as Retrieval-Augmented Generation (RAG) and chat assistants.

Multimodal Embeddings

While the focus thus far has been on text embeddings, the concept extends beyond words and sentences. Multimodal embeddings represent a powerful advancement, enabling the representation of various data types, such as images, audio, and video, within a unified vector space. By projecting different modalities into a shared semantic space, complex relationships and interactions between these data types can be explored. For instance, an image of a cat and the word "cat" might be positioned close together in a multimodal embedding space, reflecting their semantic similarity. This capability opens up a vast array of possibilities, including image search with text queries, video content understanding, and advanced recommendation systems that consider multiple data modalities.
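As a small illustration of the shared-space idea, the sketch below compares an image embedding against two text embeddings using the open-source sentence-transformers library and its CLIP checkpoint. This is not a SnapLogic feature, just one way to experiment with multimodal similarity locally; the image path is a placeholder.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps images and text into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

image_embedding = model.encode(Image.open("cat.jpg"))  # placeholder image file
text_embeddings = model.encode(["a photo of a cat", "a photo of an apple"])

# The cat photo should score noticeably higher against "a photo of a cat"
print(util.cos_sim(image_embedding, text_embeddings))
```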