Creating a searchable database from uploaded PDFs

d0za · January 2, 2021, 5:13pm

What up folks,

I’m a self-taught coder who’s been using rw for a side project since I heard about it on the React Podcast. I’m currently stuck trying to implement the following, and thought I would ask the rw community, since everyone has been awesome on the Discord so far with my dumb noob questions.

I want an authenticated user to be able to search by word or document title through almost 100 uploaded PDFs that are in an AWS S3 bucket (these are reference docs, and all users will have read access to view all uploaded PDFs). I have already built the file upload to AWS by following a tutorial someone posted here (thanks [Tobbe] for the writeup!). If anyone has any guidance on OCR or lessons learned trying to implement something similar, I’m all ears. Thanks for reading!

dthyresson · January 2, 2021, 5:58pm

@d0za hi!

Your best bet to implement that search is to use Algolia.

https://www.algolia.com/

Searching PDFs isn’t like searching plain text, but Algolia has some tips on indexing and then searching long-form content:

And ways to extract the content from PDFs:

https://stories.algolia.com/indexing-pdf-or-other-file-contents-for-searching-b2499c23568f
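
As a rough illustration of the indexing side, here is a minimal TypeScript sketch of one common approach to long-form content: splitting each document's text into smaller, paragraph-sized records so a search hit points at a passage rather than at a whole PDF. It assumes the text has already been extracted from each PDF (via a text extraction library, or OCR for scanned documents). The `PdfRecord` shape, the `chunkPdfText` helper, and the 1000-character default are all hypothetical choices for this example, not anything Algolia prescribes, apart from each record needing a unique `objectID`.

```typescript
// Hypothetical record shape for one searchable chunk of a PDF.
// Field names other than objectID are illustrative assumptions.
interface PdfRecord {
  objectID: string; // unique per chunk
  title: string; // document title, so users can search by title too
  s3Key: string; // where the original PDF lives in the bucket
  chunkIndex: number; // position of this chunk within the document
  content: string; // the searchable text
}

// Split extracted PDF text into records of at most maxChars characters.
// Paragraphs (separated by blank lines) are packed together until the
// limit would be exceeded; a single oversized paragraph is hard-split.
function chunkPdfText(
  title: string,
  s3Key: string,
  text: string,
  maxChars: number = 1000
): PdfRecord[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.replace(/\s+/g, " ").trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    // Break a paragraph into pieces no longer than maxChars.
    const pieces: string[] = [];
    for (let i = 0; i < para.length; i += maxChars) {
      pieces.push(para.slice(i, i + maxChars));
    }

    for (const piece of pieces) {
      if (current.length > 0 && current.length + 1 + piece.length > maxChars) {
        // Adding this piece would overflow the chunk, so start a new one.
        chunks.push(current);
        current = piece;
      } else {
        current = current.length > 0 ? current + " " + piece : piece;
      }
    }
  }
  if (current.length > 0) {
    chunks.push(current);
  }

  return chunks.map((content, chunkIndex) => ({
    objectID: `${s3Key}#${chunkIndex}`,
    title,
    s3Key,
    chunkIndex,
    content,
  }));
}
```

The resulting records could then be pushed to a search index with the Algolia client (its save-objects call; check the current client docs for the exact signature). Keeping `title` and `s3Key` on every chunk means a match on either a word or a document title can link straight back to the original file.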