Hi Sharan,
If your goal is simply to return a specific PDF page to the user, one option would be to physically split the document and store individual pages. However, this introduces additional complexity around document lifecycle management (updates, re-ingestion, monitoring, etc.) and is generally not necessary for your use case.
Based on your description, the real requirement is to return the relevant page of an existing document that corresponds to a retrieved vector, not to ingest each page as a standalone document. For that scenario, your current approach is already sufficient.
By storing the sourceURL and pageNumber as metadata during ingestion, you already have everything needed to present the correct page back to the user.
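As a rough illustration of the ingestion side, the sketch below attaches sourceURL and pageNumber to each page-level chunk before it is embedded. The `PageChunk` type, the `toChunks` helper, and the sample path are hypothetical names used only for illustration, not part of any specific SDK; the shape would need adapting to whatever your vector store expects.

```typescript
// Hypothetical shape of one record sent to the vector store.
interface PageChunk {
  text: string;
  metadata: {
    sourceURL: string;  // where the original PDF lives
    pageNumber: number; // 1-based page the text came from
  };
}

// Turn the per-page text of one PDF into chunk records with page metadata.
// Assumes pages[i] holds the extracted text of page i + 1.
function toChunks(sourceURL: string, pages: string[]): PageChunk[] {
  return pages
    // Map first so the page number reflects the original position.
    .map((text, index) => ({
      text,
      metadata: { sourceURL, pageNumber: index + 1 },
    }))
    // Skip pages with no extractable text.
    .filter((chunk) => chunk.text.trim().length > 0);
}

// The empty second page is dropped; the remaining chunks keep pages 1 and 3.
const chunks = toChunks("/docs/manual.pdf", ["Intro", "", "Setup steps"]);
console.log(chunks.map((c) => c.metadata.pageNumber));
```

The document itself stays a single file; only the metadata on each vector records which page the text came from.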
Since you surface RAG responses via a Microsoft Teams app, a practical frontend solution is to render the PDF using something like PDF.js, passing the document URL and page number dynamically. For example:
<iframe
src="/pdfjs/web/viewer.html?file=/docs/manual.pdf#page=5"
width="100%"
height="100%"
style="border:none">
</iframe>
Both the document URL and the page number can be parameterized directly from your vector metadata. This allows the user to open the PDF at the exact page referenced by the answer, while still being able to scroll to adjacent pages for additional context, which reduces the need for follow-up queries.
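A minimal sketch of that parameterization is shown here, assuming the PDF.js viewer is hosted at /pdfjs/web/viewer.html as in the snippet above. The `SourceMetadata` type and `buildViewerUrl` helper are hypothetical names, and the fallback to page 1 is an assumption about how you might want to handle missing metadata.

```typescript
// Hypothetical metadata stored alongside each vector.
interface SourceMetadata {
  sourceURL: string;
  pageNumber: number;
}

// Assumed location of the PDF.js viewer bundled with the app.
const VIEWER_PATH = "/pdfjs/web/viewer.html";

function buildViewerUrl(meta: SourceMetadata): string {
  // Fall back to page 1 if the stored page number is missing or invalid.
  const page =
    Number.isInteger(meta.pageNumber) && meta.pageNumber > 0
      ? meta.pageNumber
      : 1;
  // Encode the document URL so any "?", "&" or "#" inside it
  // does not break the viewer's own query string.
  const file = encodeURIComponent(meta.sourceURL);
  return `${VIEWER_PATH}?file=${file}#page=${page}`;
}

const viewerUrl = buildViewerUrl({ sourceURL: "/docs/manual.pdf", pageNumber: 5 });
console.log(viewerUrl);
// /pdfjs/web/viewer.html?file=%2Fdocs%2Fmanual.pdf#page=5
```

The resulting string can be assigned to the iframe's src. Note that the stock PDF.js viewer generally expects the PDF to be served from the same origin as the viewer, or with suitable CORS headers, so it is worth checking how your documents are hosted.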
In summary: your RAG ingestion setup is already sound. There’s no need to ingest individual PDF pages as separate entities. The remaining work is primarily on the frontend: using the existing metadata to present the correct page to the user in a seamless and verifiable way.
Hope this helps.