
Multimodal processing in Generative AI represents a transformative leap in how AI systems extract and synthesize information from multiple data types (such as text, images, audio, and video) simultaneously. Unlike traditional single-modality AI models, which focus on one type of data, multimodal systems integrate and process diverse data streams in parallel, creating a holistic understanding of complex scenarios. This integrated approach is critical for applications that require not just isolated insights from one modality, but a coherent synthesis across different data sources, leading to outputs that are contextually richer and more accurate.

Generative AI with multimodal processing is redefining text extraction, surpassing traditional OCR by interpreting text within its visual and contextual environment. Unlike OCR, which only converts images to text, generative AI analyzes the surrounding image context, layout, and meaning, enhancing both accuracy and depth. For instance, in complex documents it can differentiate between headings, body text, and annotations, structuring information more intelligently. It also handles low-quality or multilingual text well, making it invaluable in industries that require precision and nuanced interpretation.

In video analysis, a generative AI equipped with multimodal processing can simultaneously interpret the visual elements of a scene, the audio (such as dialogue or background sounds), and any associated text (like subtitles or metadata). This allows the AI to produce a description or summary of the scene that is far more nuanced than what could be achieved by analyzing the video or audio alone. The interplay between these modalities ensures that the generated description reflects not only the visual and auditory content but also the deeper context and meaning derived from their combination.

In tasks such as image captioning, multimodal AI systems go beyond simply recognizing objects in a photo. They can interpret the semantic relationship between the image and accompanying text, enhancing the relevance and specificity of the generated captions. This capability is particularly useful in fields where the context provided by one modality significantly influences the interpretation of another, such as in journalism, where images and written reports must align meaningfully, or in education, where visual aids are integrated with instructional text.

In highly specialized applications like medical diagnostics, multimodal processing enables AI to synthesize medical images (such as X-rays or MRIs) with patient history, clinical notes, and even live doctor-patient interactions. This comprehensive analysis allows the AI to provide more accurate diagnoses and treatment recommendations, addressing the complex interplay of symptoms, historical data, and visual diagnostics. Similarly, in customer service, multimodal AI systems can improve communication quality by analyzing both the textual content of a customer's inquiry and the tone and sentiment of their voice, leading to more empathetic and effective responses.

Beyond individual use cases, multimodal processing plays a crucial role in improving the learning and generalization capabilities of AI models. By training on a broader spectrum of data types, AI systems develop more robust, flexible models that can adapt to a wider variety of tasks and scenarios. This is especially important in real-world environments where data is often heterogeneous and requires cross-modal understanding to interpret fully.

As multimodal processing technologies continue to advance, they promise to unlock new capabilities across diverse sectors. In entertainment, multimodal AI could enhance interactive media experiences by seamlessly integrating voice, visuals, and narrative elements. In education, it could revolutionize personalized learning by adapting content delivery to different sensory inputs. In healthcare, the fusion of multimodal data could lead to breakthroughs in precision medicine. Ultimately, the ability to understand and generate contextually rich, multimodal content positions Generative AI as a cornerstone technology in the next wave of AI-driven innovation.

Multimodal Content Generator Snap

The Multimodal Content Generator Snap encodes file or document inputs into the Snap's multimodal content format, preparing them for downstream use. The output from this Snap must be connected to the Prompt Generator Snap, which completes and formats the message payload for further processing. This streamlined setup enables efficient multimodal content handling within the Snap ecosystem.

The Snap Properties

BankTanapat_0-1731514731205.png

Type - Select the type of multimodal content.
Content Type - Define the specific content type for the data transmitted to the LLM.
Content - Specify the path to the multimodal content data to be processed.
Document Name - Name the document for reference and identification purposes.
Aggregate Input - Enable this option to combine all inputs into a single content document.
Encode Base64 - Enable this option to convert text input into Base64 encoding.

Note:

  • The Content property appears only if the input view is of the document type. The value assigned to Content must be in Base64 format for document inputs; for binary input views, the Snap automatically uses the binary data as the content.
  • The Document Name can be set specifically for multimodal document types.
  • The Encode Base64 property encodes text input into Base64 by default. If unchecked, the content is passed through without encoding.
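
To make the Base64 handling concrete, here is a minimal sketch (plain Python, outside SnapLogic) of what preparing a file as Base64 content conceptually looks like. The file name is only an example; when Encode Base64 is enabled, the Snap performs this encoding for you.

```python
import base64

# Hypothetical local file standing in for the Snap's input.
with open("Japan_flowers.jpg", "rb") as f:
    raw_bytes = f.read()

# Base64-encode the bytes, which is the format the Content property expects for document inputs.
content_b64 = base64.b64encode(raw_bytes).decode("utf-8")
print(content_b64[:60], "...")
```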

Designing a Multimodal Prompt Workflow

BankTanapat_1-1731514964664.png

In this process, we will integrate multiple Snaps to create a seamless workflow for multimodal content generation and prompt delivery. By connecting the Multimodal Content Generator Snap to the Prompt Generator Snap, we configure the pipeline to handle multimodal content. The finalized message payload is then sent to Claude via the Anthropic Claude on AWS Messages Snap.

Steps:

1. Add the File Reader Snap:
BankTanapat_6-1731515403070.png

    1. Drag and drop the File Reader Snap onto the designer canvas.
    2. Configure the File Reader Snap by accessing its settings panel, then select a file containing images (e.g., an image or PDF file). Download the sample image files at the bottom of this post if you have not already.

Sample image file (Japan_flowers.jpg)

BankTanapat_7-1731515403051.jpeg

2. Add the Multimodal Content Generator Snap:
BankTanapat_8-1731515425896.png

    1. Drag and drop the Multimodal Content Generator Snap onto the designer and connect it to the File Reader Snap.
    2. Open its settings panel, select the file type, and specify the appropriate content type.
    3. The output attributes from the Multimodal Content Generator Snap are described below (an illustrative example document follows this step):
      BankTanapat_9-1731515425898.png
      1. sl_content: Contains the actual content encoded in Base64 format.
      2. sl_contentType: Indicates the content type of the data. This is either taken from the Snap configuration or, for binary inputs, extracted from the content type in the binary header.
      3. sl_type: Specifies the content type as defined in the Snap settings; in this case, it will display "image."
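
As an illustration only (not actual Snap output), a single output document from the Multimodal Content Generator Snap would carry these three attributes; the values below are placeholders.

```python
# Illustrative only: the shape of one output document, with placeholder values.
multimodal_content = {
    "sl_content": "iVBORw0KGgoAAAANSUhEUg...",  # Base64-encoded file bytes (truncated)
    "sl_contentType": "image/jpeg",              # from the Snap settings or the binary header
    "sl_type": "image",                          # the Type selected in the Snap settings
}
```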

3. Add the Prompt Generator Snap:
BankTanapat_10-1731515521556.png

    1. Add the Prompt Generator Snap to the designer and link it to the Multimodal Content Generator Snap.
    2. In the settings panel, enable the Advanced Prompt Output checkbox and configure the Content property to use the input from the Multimodal Content Generator Snap.
    3. Click “Edit Prompt” and enter your instructions (a sketch of the resulting message payload appears below).
      BankTanapat_11-1731515521516.png
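
For orientation, the advanced prompt output is ultimately assembled into a multimodal message for Claude. The sketch below shows the general shape of such a message in the Anthropic Messages format; the values are placeholders, and the exact payload produced by the Prompt Generator Snap may differ.

```python
# Sketch of a multimodal message in the Anthropic Messages format (placeholder values).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": "iVBORw0KGgoAAAANSUhEUg...",  # Base64 image from the Multimodal Content Generator Snap
                },
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```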

4. Add and Configure the LLM Snap:
BankTanapat_12-1731515611335.png

  1. Add the Anthropic Claude on AWS Message API Snap as the LLM.
  2. Connect this Snap to the Prompt Generator Snap.
  3. In the settings, select a model that supports multimodal content.
  4. Enable the Use Message Payload checkbox and set the Message Payload field to the message payload produced by the Prompt Generator Snap (a boto3 sketch of the equivalent API call appears below).
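
Outside SnapLogic, the call this Snap makes can be approximated with boto3's Bedrock runtime client. This is only a sketch: it assumes a multimodal-capable Claude model is enabled in your AWS account, the model ID is an example, and the messages list is a stand-in for the payload sketched after step 3.

```python
import json
import boto3

# Sketch only: sending an Anthropic-format message payload to Claude on Bedrock.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

messages = [  # stand-in for the multimodal payload from the Prompt Generator Snap
    {"role": "user", "content": [{"type": "text", "text": "Describe the attached image."}]}
]

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "messages": messages,
}

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example multimodal-capable model
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```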

5. Verify the Result:
BankTanapat_13-1731515611336.png

    1. Review the output from the LLM Snap to ensure the multimodal content has been processed correctly.
    2. Validate that the generated response aligns with the expected content and format requirements.
    3. If adjustments are needed, revisit the settings in previous Snaps to refine the configuration.

 

Multimodal Models for Advanced Data Extraction

Multimodal models are redefining data extraction by advancing beyond traditional OCR capabilities. Unlike OCR, which primarily converts images to text, these models directly analyze and interpret content within PDFs and images, capturing complex contextual information such as layout, formatting, and semantic relationships that OCR alone cannot achieve. By understanding both textual and visual structures, multimodal AI can manage intricate documents, including tables, forms, and embedded graphics, without requiring separate OCR processes. This approach not only enhances accuracy but also optimizes workflows by reducing dependency on traditional OCR tools.

In today’s data-rich environment, information is often presented in varied formats, making the ability to analyze and derive insights from diverse data sources essential. Imagine managing a collection of invoices saved as PDFs or as photos from scanners and smartphones, where a streamlined approach is needed to interpret their contents. Multimodal large language models (LLMs) excel in these scenarios, enabling seamless extraction of information across file types. These models support tasks such as automatically identifying key details, generating comprehensive summaries, and analyzing trends within invoices, whether they come from scanned documents or images. Here’s a step-by-step guide to implementing this functionality within SnapLogic.

BankTanapat_14-1731515698177.jpeg

Sample invoice files (download the files at the bottom of this post if you have not already)
Invoice1.pdf

BankTanapat_15-1731515716873.jpeg

 

Invoice2.pdf

BankTanapat_16-1731515716867.jpeg

Invoice3.jpeg (Sometimes, the invoice image might be tilted)

BankTanapat_17-1731515748540.jpeg

Upload the invoice files
  1. Open the Manager page and go to the project that will be used to store the pipelines and related files.
    BankTanapat_18-1731515808136.jpeg
  2. Click the + (plus) sign and select File
    BankTanapat_19-1731515808141.jpeg
  3. The Upload File dialog pops up. Click “Choose Files” to select all the invoice files, in both PDF and image formats (download the sample invoice files at the bottom of this post if you have not already).
    BankTanapat_20-1731515868611.jpeg

     

  4. Click the Upload button and the uploaded files will be shown.
    BankTanapat_21-1731515868612.jpeg
Building the pipeline
  1. Add the JSON Generator Snap:
    1. Drag and drop the JSON Generator onto the designer canvas.
    2. Click on the Snap to open settings, then click the "Edit JSON" button
      BankTanapat_22-1731515903141.jpeg
    3. Highlight all the text from the template and delete it.
    4. Paste all the invoice filenames in the format shown below (an illustrative JSON sketch also appears after this step). The editor should look like this.
      BankTanapat_23-1731515903143.jpeg
    5. Click "OK" in the lower-right corner to save the JSON
    6. Save the settings and close the Snap
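
For reference, the JSON pasted into the editor is a list of documents, one per invoice file, each exposing a filename field that the File Reader Snap reads via $filename. The sketch below is hypothetical; the project path is a placeholder that should point to wherever you uploaded the files, and the format shown in the screenshot above is authoritative.

```python
# Hypothetical JSON Generator content (valid JSON; shown here as a Python literal).
# Replace the placeholder project path with the location of your uploaded invoice files.
[
    {"filename": "/<your_org>/projects/<your_project>/Invoice1.pdf"},
    {"filename": "/<your_org>/projects/<your_project>/Invoice2.pdf"},
    {"filename": "/<your_org>/projects/<your_project>/Invoice3.jpeg"},
]
```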
  2. Add the File Reader Snap:
    1. Drag and drop the File Reader Snap onto the designer canvas
    2. Click the Snap to open the configuration panel.
    3. Connect the Snap to the JSON Generator Snap by following these steps:
      1. Select Views tab
      2. Click plus(+) button on the Input pane to add the input view(input0)
        BankTanapat_29-1731516080761.jpeg
      3. Save the configuration
      4. The Snap on the canvas will now have an input view. Connect it to the JSON Generator Snap.
    4. In the configuration panel, select the Settings tab
    5. Enable expressions for the File field by clicking the equals sign in front of the text input, then set it to $filename to read all the files specified in the JSON Generator Snap.
      BankTanapat_30-1731516189900.jpeg
    6.  Validate the pipeline to see the File Reader output.

      Fields that will be used in the Multimodal Content Generator Snap:
      1. Content-type shows the file content type.
      2. Content-location shows the file path; it will be used for the document name.
        BankTanapat_31-1731516567079.jpeg
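
To make the next step easier to follow, here is an illustrative (not actual) example of the binary header fields for one of the files; the path and values are placeholders.

```python
# Illustrative binary header from the File Reader output (placeholder values).
binary_header = {
    "content-type": "application/pdf",  # file content type
    "content-location": "/<your_org>/projects/<your_project>/Invoice1.pdf",  # file path, later used for the document name
}
```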

         

  3. Add the Multimodal Content Generator Snap:
    1. Drag and drop the Multimodal Content Generator Snap onto the designer canvas and connect to the File Reader Snap
    2. Click the Snap to open the settings panel and configure the following fields:
      1. Type:
        • enable the expression
        • set the value to $['content-location'].endsWith('.pdf') ? 'document' : 'image'
      2. Document name
        • enable the expression
        • set the value to
          $['content-location'].snakeCase()
        • Use the snake-case version of the file path as the document name to identify each file and make it compatible with the Amazon Bedrock Converse API. In snake case, words are lowercase and separated by underscores (_).
      3. Aggregate input
        • check the checkbox
        • Use this option to combine all input files into a single document. 
        • The settings should now look like the following
          BankTanapat_34-1731516877655.jpeg
    3. Validate the pipeline to see the Multimodal Content Generator Snap output.
      The preview output should look like the image below. The sl_type will be document for the PDF files and image for the image file, and the name will be the snake-cased file path (a plain-Python sketch of the Type and Document name expressions appears below).
      BankTanapat_35-1731516934671.jpeg
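
Conceptually, the Type and Document name expressions configured above behave like the plain-Python sketch below. The snake_case helper is only an approximation of the expression language's snakeCase(), and the path is a placeholder.

```python
import re

def snake_case(text: str) -> str:
    # Rough approximation of snakeCase(): lowercase words separated by underscores.
    return "_".join(re.findall(r"[A-Za-z0-9]+", text)).lower()

content_location = "/<your_org>/projects/<your_project>/Invoice1.pdf"  # placeholder path

# Type: 'document' for PDFs, 'image' for everything else.
sl_type = "document" if content_location.endswith(".pdf") else "image"

# Document name: snake-cased file path, per the Snap configuration above.
document_name = snake_case(content_location)

print(sl_type, document_name)
```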

       

  4. Add the Prompt Generator Snap:
    1. Drag and drop the Prompt Generator Snap onto the designer canvas and connect to the Multimodal Content Generator Snap
    2. Click the Snap to open the settings panel and configure the following fields:
      BankTanapat_36-1731516995391.jpeg
      1. Enable the Advanced Prompt Output checkbox
      2. Set the Content to $content to use the content input from the Multimodal Content Generator Snap
      3. Click “Edit Prompt” and input your instructions. For example,
        Based on the total quantity across all invoices,
        which product has the highest and lowest purchase quantities,
        and in which invoices are these details found?
        BankTanapat_37-1731517041103.jpeg

         

  5. Add and Configure the LLM Snap:
    1. Add the Amazon Bedrock Converse API Snap as the LLM
    2. Connect this Snap to the Prompt Generator Snap
    3. Click the Snap to open the configuration panel
    4. Select the Account tab and select your account
    5. Select the Settings tab
      1. Select a model that supports multimodal content.
      2. Enable the Use Message Payload checkbox
      3. Set the Message Payload to $messages to use the message from the Prompt Generator Snap
        BankTanapat_38-1731517136595.jpeg
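
Outside SnapLogic, the request issued by the Amazon Bedrock Converse API Snap corresponds roughly to boto3's converse call with document and image content blocks. The sketch below uses placeholder file names and an example model ID, and assumes a multimodal-capable model is enabled in your account; in the pipeline, the Snap builds the real request from the $messages payload for you.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Sketch only: one user message mixing a PDF document block and an image block.
with open("Invoice1.pdf", "rb") as f:
    pdf_bytes = f.read()
with open("Invoice3.jpeg", "rb") as f:
    jpeg_bytes = f.read()

response = client.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # example multimodal-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"document": {"format": "pdf", "name": "invoice1_pdf", "source": {"bytes": pdf_bytes}}},
                {"image": {"format": "jpeg", "source": {"bytes": jpeg_bytes}}},
                {"text": "Which product has the highest and lowest purchase quantities across these invoices?"},
            ],
        }
    ],
)
print(response["output"]["message"]["content"][0]["text"])
```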
  6. Verify the result:
    Validate the pipeline and open the preview of the Amazon Bedrock Converse API Snap. The result should look like the following:
    BankTanapat_39-1731517159255.jpeg
    In this example, the LLM successfully processes invoices in both PDF and image formats, demonstrating its ability to handle diverse inputs in a single workflow. By extracting and analyzing data across these formats, the LLM provides accurate responses and insights, showcasing the efficiency and flexibility of multimodal processing. You can adjust the queries in the Prompt Generator Snap to explore different results.