Hi, we have a requirement in a RAG ingestion use case to ingest PDF files into Azure VectorDB. When a user prompts for some information, the response should include the answer along with the specific PDF page as an attachment, so the user can click on it and validate the response. How do we ingest these specific PDF page chunks into the VectorDB? Currently we ingest the sourceURL and PageNumber as metadata, but the requirement is to also add an additional metadata field to ingest only this particular page. I hope my question is clear. Your support would help us.
Hi Sharan, if your goal is simply to return a specific PDF page to the user, one option would be to physically split the document and store individual pages. However, this introduces additional complexity around document lifecycle management (updates, re-ingestion, monitoring, etc.) and is generally not necessary for your use case. Based on your description, the real requirement is to return the relevant page of an existing document that corresponds to a retrieved vector, not to ingest each page as a standalone document. For that scenario, your current approach is already sufficient. By storing the sourceURL and pageNumber as metadata during ingestion, you already have everything needed to present the correct page back to the user. Since you surface RAG responses via a Microsoft Teams app, a practical frontend solution is to render the PDF using something like PDF.js, passing the document URL and page number dynamically. For example:
<iframe
src="/pdfjs/web/viewer.html?file=/docs/manual.pdf#page=5"
width="100%"
height="100%"
style="border:none">
</iframe>

Both the document URL and the page number can be parameterized directly from your vector metadata. This allows the user to open the PDF at the exact page referenced by the answer, while still being able to scroll to adjacent pages for additional context, which reduces the need for follow-up queries. In summary: your RAG ingestion setup is already sound. There’s no need to ingest individual PDF pages as separate entities. The remaining work is primarily on the frontend: using the existing metadata to present the correct page to the user in a seamless and verifiable way. Hope this helps.
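To make the parameterization concrete, here is a minimal sketch of how the viewer link could be built from the metadata stored with each retrieved chunk. It assumes the metadata keys are named sourceURL and pageNumber as described above, and that a PDF.js viewer is hosted at /pdfjs/web/viewer.html as in the iframe example. The function names build_viewer_url and attach_page_links are made up for illustration and are not part of any library.

```python
from urllib.parse import quote

# Assumed location of the hosted PDF.js viewer (taken from the iframe example).
VIEWER_PATH = "/pdfjs/web/viewer.html"


def build_viewer_url(source_url: str, page_number: int,
                     viewer_path: str = VIEWER_PATH) -> str:
    """Return a PDF.js viewer URL that opens source_url at page_number."""
    if page_number < 1:
        raise ValueError("page_number is expected to be 1-based")
    # Percent-encode the document URL so any '?', '&' or '#' it contains
    # is not mistaken for part of the viewer's own query string or fragment.
    encoded = quote(source_url, safe="")
    return f"{viewer_path}?file={encoded}#page={page_number}"


def attach_page_links(hits: list[dict]) -> list[dict]:
    """Add a 'pageLink' entry to each retrieved chunk's metadata.

    Each hit is assumed to be a dict carrying the 'sourceURL' and
    'pageNumber' values that were stored at ingestion time.
    """
    return [
        {**hit, "pageLink": build_viewer_url(hit["sourceURL"], int(hit["pageNumber"]))}
        for hit in hits
    ]


if __name__ == "__main__":
    sample_hits = [{"sourceURL": "/docs/manual.pdf", "pageNumber": 5}]
    for hit in attach_page_links(sample_hits):
        print(hit["pageLink"])
```

The resulting pageLink can be used as the iframe src or as a clickable citation next to the answer. Encoding the file parameter matters mainly when the sourceURL carries its own query string, for example a signed storage URL.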
Thanks, I will propose this solution to our user and let you know about the feedback.
