Unlocking Transcription and Natural Language Processing Without Breaking the Bank
Over the past few months I’ve been building out Tape Index – a radically simple organizational system designed for content creators. Indexing literally everything and anything that someone uploads, regardless of file format, is core to keeping things simple. To accomplish this, Tape Index needs to support transcription and natural language processing (NLP) right from the start. I am also self-funding Tape Index, so whatever solution I use can’t break the bank, specifically my wallet. Oh, did I also mention that data security is of utmost importance and places severe limitations on where any user data can be processed? I started to wonder what my options were for accomplishing transcription and other language processing with all of these constraints. After some quick research, there were a few options that could be viable for Tape Index:
Transcription/NLP SaaS products (e.g. OpenAI, AssemblyAI)
Serverless GPUs (e.g. Modal)
Cloud hosted GPU servers (e.g. Paperspace, Google Cloud, Linode/Akamai)
Transcription/NLP SaaS Products
One of the simplest ways to enable transcription and natural language processing is to take advantage of the many SaaS products out there. AssemblyAI, for example, makes things super simple by allowing a developer to transcribe audio from a URL and receive a webhook upon completion. While services like OpenAI and AssemblyAI unlock many powerful features quickly, they are not usually the most cost-effective solutions, especially when dealing with large amounts of content to process.
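To make that concrete, here’s a minimal sketch of that flow against AssemblyAI’s REST API. The API key, audio URL, and webhook endpoint are all placeholders, and the options shown are just the basics rather than anything Tape Index specifically uses:

```python
import requests

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder

# Submit a transcription job for a publicly reachable audio URL.
# AssemblyAI will POST the finished transcript to the webhook URL.
response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers={"authorization": API_KEY},
    json={
        "audio_url": "https://example.com/path/to/audio.mp3",       # placeholder audio URL
        "speaker_labels": True,                                      # enable diarization
        "webhook_url": "https://example.com/webhooks/assemblyai",   # placeholder endpoint
    },
)
response.raise_for_status()
print(response.json()["id"])  # job id to correlate with the webhook callback later
```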
Another downside to SaaS solutions is the possibility of having to send sensitive information to a 3rd party service for processing. While this can be done securely for certain types of data, not all data can be encrypted in a way that still yields relevant results. Sending data securely also assumes that the system being built knows what data is supposed to be considered “sensitive.” Since literally anything uploaded to Tape Index could be sensitive information, and Tape Index has no clue what would or wouldn’t be sensitive, using a 3rd party SaaS product is not a viable solution.
SaaS solutions typically charge per “token” for NLP and per hour of audio for transcription. AssemblyAI will transcribe and diarize for $0.65/hour and OpenAI can transcribe (no diarization) for $0.36/hour. While these costs won’t break the bank, they add up quickly, and there’s got to be a cheaper way.
Serverless GPUs
When you start diving down the rabbit hole of NLP, you quickly realize that regular CPUs and RAM will not cut it. GPUs, normally used for graphics processing, can be harnessed to provide significantly more computing power for NLP. To take advantage of that power, you need a hosting company that provides cloud hosted machines with GPUs. Unfortunately, GPU machines are very expensive to run, so it’s also important to find a hosting company that offers “serverless” infrastructure backed by GPUs. The best type of infrastructure can spin up a machine almost instantly, run a complex set of functions, and shut the machine down upon completion. This on-demand model reduces costs because you are only charged for actual execution time.
A huge benefit of using serverless GPUs over a SaaS product is unlocking the ability to control everything about the system and its security. Serverless GPUs are ephemeral and therefore offer a layer of security through lack of permanence. But more importantly, since the entire serverless GPU environment can be configured and controlled, it is possible to encrypt and decrypt all information coming in and out of the system. This ensures that all data is protected and processed on the server without having to be passed to a 3rd party pipeline, although data is still transmitted over public networks.
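As a rough illustration of that idea (not Tape Index’s actual pipeline), symmetric encryption with something like the cryptography library’s Fernet keeps the payload unreadable in transit, with the plaintext only ever existing inside the serverless function:

```python
from cryptography.fernet import Fernet

# In practice the key would live in a secrets manager shared with the
# serverless environment; generating it inline is just for illustration.
key = Fernet.generate_key()
fernet = Fernet(key)

audio_bytes = open("interview.mp3", "rb").read()   # placeholder file
encrypted = fernet.encrypt(audio_bytes)            # this is what travels over the network

# ...inside the serverless function, after receiving `encrypted`...
decrypted = fernet.decrypt(encrypted)              # plaintext only exists on the worker
```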
I have been experimenting with Modal and have been extremely impressed with their infrastructure. Their base machine costs around $0.70/hour to run, but they also give $30 worth of credits per month. Using Modal I’ve been able to build out a transcription and diarization pipeline capable of processing an hour’s worth of audio in a couple of minutes. Something that would have cost $0.65 on AssemblyAI now costs a couple of pennies on Modal. Shaving off tens of cents per action yields significant savings, especially at scale. For example, processing 10 hours of audio costs $6.50 on AssemblyAI versus $0.50 on Modal.
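To give a feel for what this looks like, here’s a stripped-down sketch of a GPU-backed Modal function running Whisper. The GPU type, model size, and file handling are arbitrary choices for illustration, and the real pipeline also layers in diarization:

```python
import modal

app = modal.App("transcription-sketch")

# Container image with ffmpeg and Whisper installed.
image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)

@app.function(gpu="A10G", image=image, timeout=600)
def transcribe(audio_bytes: bytes) -> str:
    import tempfile
    import whisper

    # Load a small Whisper model and transcribe the uploaded audio.
    model = whisper.load_model("base")
    with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp:
        tmp.write(audio_bytes)
        tmp.flush()
        result = model.transcribe(tmp.name)
    return result["text"]

@app.local_entrypoint()
def main():
    with open("episode.mp3", "rb") as f:   # placeholder local file
        print(transcribe.remote(f.read()))
```

The machine only exists for the few minutes the function actually runs, which is where the couple-of-pennies-per-hour-of-audio figure comes from.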
Cloud Hosted GPU Servers
While serverless GPUs are an extremely cost-effective way of harnessing the power of GPUs versus a SaaS product, there is a point at which dedicated cloud hosted GPUs become more cost-effective. The break-even point comes when serverless functions need to run for so many hours a month that they cost more than an equivalent cloud hosted GPU running 24 hours a day. For example, the cheapest cloud hosted GPU running an average of 730 hours a month at $0.59/hour would cost about $430/month. At $0.70/hour, the cheapest serverless GPU could only run for about 614 hours before matching that $430/month price.
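Spelled out with the rates quoted above (which will vary by provider and GPU type), the break-even math looks like this:

```python
DEDICATED_RATE = 0.59    # $/hr, cheapest cloud hosted GPU quoted above
SERVERLESS_RATE = 0.70   # $/hr, cheapest serverless GPU quoted above
HOURS_PER_MONTH = 730    # average hours in a month

dedicated_monthly = DEDICATED_RATE * HOURS_PER_MONTH     # ~$430/month, always on
break_even_hours = dedicated_monthly / SERVERLESS_RATE   # ~614 hours of serverless time

print(f"Dedicated GPU: ${dedicated_monthly:.0f}/month")
print(f"Serverless GPU reaches that cost after ~{break_even_hours:.0f} hours/month")
```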
Think of a cloud hosted GPU server like a serverless GPU, but with a lot more control over certain aspects of the server and infrastructure. With greater flexibility comes greater responsibility: worrying about things like uptime, scaling, networking, and package management is a hard-to-quantify cost that should not be neglected. On top of that upkeep, cloud hosted GPU servers are only cheaper than a serverless GPU when tasks need to be running nearly all the time.
While cloud hosted GPU servers can be more secure than a serverless GPU, there is also greater potential for security holes through human error if something is not set up correctly. Using a service like Paperspace can enable VPN-only access to the GPU servers, preventing data from being transmitted publicly and greatly reducing the system’s vulnerability footprint.
Verdict
It didn’t take me long to come to a decision on what I’d initially use for Tape Index. I needed better security than a 3rd party SaaS product could offer, and I wouldn’t come close to processing enough data to justify the cost of a cloud hosted GPU server. Because a serverless GPU solution gives me both the cost and the security I need, that’s the path I’ve been using to build out the transcription and NLP functionality for Tape Index. I’ve specifically been using Modal, as it has provided the easiest and most flexible solution. Modal also provides $30 worth of credits per month, which means I’ve been able to build out everything without spending a single dime.