Exceeding the maximum size for serverless functions

tctrautman · October 21, 2020, 10:16pm

Hey all!

I’m trying to use puppeteer-core and chrome-aws-lambda to crawl public sites and collect relevant data for my users.

But it seems that, since chrome-aws-lambda includes an entire browser, adding these packages to the api workspace brings my graphql function to a whopping 85 MB compressed / 268 MB uncompressed, which is greater than the 66 MB compressed / 250 MB uncompressed limit imposed by Netlify / AWS.

So unless I’m overlooking something, it seems that I can’t use these packages on the api side. I’ve considered the following work-arounds and would greatly appreciate any thoughts you might have:

1. Move off of Netlify’s serverless hosting and onto a “server-full” hosting option like Heroku (I’m pretty sure I’ve seen this mentioned as an option in the forums, though I haven’t researched it closely).
- Pros: this should remove the limit and allow me to use puppeteer
- Cons: likely complicates the deployment and development process, moves further away from the vision of Redwood
2. Continue with the serverless hosting, but move all puppeteer code onto its own server, sharing the same Heroku database and running the puppeteer code on a cron job.
- Pros: development process doesn’t get complicated, puppeteer can run
- Cons: without building its own API, the puppeteer code couldn’t be triggered by user actions, which isn’t ideal
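
For the cron-job workaround, the "own API" mentioned in the cons could be quite small. Below is a rough sketch, not a tested setup, using only Node's built-in `http` module. The `/crawl` path, the `handleTrigger` and `startServer` names, and the in-memory `pendingCrawls` queue are all hypothetical; a real version would probably write jobs to the shared database and would need some form of authentication.

```typescript
import { createServer } from "node:http";

// Hypothetical in-memory queue; the cron job would drain this (or a DB table) later.
const pendingCrawls: string[] = [];

interface TriggerResult {
  status: number;
  message: string;
}

// Decision logic kept apart from the HTTP plumbing so it can be tested on its own.
function handleTrigger(method: string, path: string, rawBody: string): TriggerResult {
  if (method !== "POST" || path !== "/crawl") {
    return { status: 404, message: "not found" };
  }
  let targetUrl: unknown;
  try {
    targetUrl = JSON.parse(rawBody).url;
  } catch {
    return { status: 400, message: "body must be a JSON object" };
  }
  if (typeof targetUrl !== "string" || !/^https?:\/\//.test(targetUrl)) {
    return { status: 400, message: "expected a body like { \"url\": \"https://...\" }" };
  }
  pendingCrawls.push(targetUrl);
  return { status: 202, message: "queued" };
}

// Wires the logic above into an HTTP server; not started automatically.
function startServer(port: number) {
  return createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => {
      body += chunk;
    });
    req.on("end", () => {
      const result = handleTrigger(req.method ?? "", req.url ?? "", body);
      res.writeHead(result.status, { "Content-Type": "application/json" });
      res.end(JSON.stringify(result));
    });
  }).listen(port);
}
```

Calling `startServer(3000)` on the separate host would let a service on the Redwood api side POST `{ "url": "..." }` to `/crawl` when a user takes an action, while the puppeteer work itself still runs outside the size-limited function.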

Right now I’m leaning toward the second option, but I’m curious if anyone here (1) is aware of some way I can do this entirely within RW or (2) can think of a more elegant / less time-consuming workaround than one of the above.

Thank you in advance! And, yes, trying to run a browser within a lambda function is kind of ridiculous.

ajcwebdev · October 22, 2020, 1:15am

Having not hit this limit myself, I have not explored solutions to this problem, but you may want to keep an eye on things like Netlify Edge Handlers and Fastly Compute@Edge, which are going through beta testing. I think these are influenced by Lambda@Edge, which aims to essentially be Lambda on CloudFront, so your functions run on a CDN.

Netlify Edge Handlers have a limit of 256 megabytes, so you’d be right around the limit when uncompressed. I usually don’t see web scraping listed as a use case for these but it seems like the point is to be able to run whatever you want on these things.

tctrautman · October 22, 2020, 2:46am

Ah, I wasn’t aware of those – thank you @ajcwebdev !

Tobbe · October 22, 2020, 11:35am

Another option is to consider if you really need puppeteer at all. Could something like jsdom be enough?

https://www.twilio.com/blog/web-scraping-and-parsing-html-in-node-js-with-jsdom

That would let you do it 100% the RW way
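
For reference, a minimal sketch of what the jsdom approach might look like, assuming the target pages serve their content as static HTML (jsdom does not run page scripts by default, so client-rendered content would not show up). `extractPageData` and the sample markup are made up for illustration; in practice the HTML would come from an HTTP request, or `JSDOM.fromURL` could load the page directly.

```typescript
import { JSDOM } from "jsdom";

interface PageData {
  title: string;
  links: string[];
}

// Hypothetical helper: parse an HTML string and pull out the title and absolute link URLs.
function extractPageData(html: string, pageUrl: string): PageData {
  // Passing `url` lets jsdom resolve relative hrefs against the page address.
  const dom = new JSDOM(html, { url: pageUrl });
  const { document } = dom.window;
  const anchors = Array.from(
    document.querySelectorAll<HTMLAnchorElement>("a[href]")
  );
  return {
    title: document.title.trim(),
    links: anchors.map((a) => a.href),
  };
}

// Inline sample so the sketch runs without network access.
const sample = `<html><head><title>Example</title></head>
<body><a href="/about">About</a><a href="https://example.org/">Other</a></body></html>`;

console.log(extractPageData(sample, "https://example.com/"));
```

Since this only parses markup and never launches a browser, it avoids bundling Chromium into the function, which is what pushed the bundle over the size limit in the first place.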

tctrautman · October 23, 2020, 1:20am

Ah, thank you @Tobbe! I wasn’t familiar with jsdom – this looks like a great solution.