How Tape Index Uses Modal for Its Natural Language Processing Flow - Part 1
If you’re familiar with serverless computing, especially functions as a service (FaaS) like AWS Lambda or DigitalOcean Functions, you might be aware of the limitations that these services have. Limitations usually revolve around compute power, runtime environments, and the additional packages you can install to get a function running. For Tape Index and its natural language processing (NLP) flow, I hit all of those limitations from the start. While searching for alternative solutions, I came across Modal. Modal is similar to serverless functions, but is closer to what I would consider a serverless application.
I will be writing a multi-part series outlining how I’ve set up a Modal application and how it handles the entire natural language processing flow for Tape Index. This first post will outline the high-level concepts, with subsequent posts looking at actual code and providing practical tips on how you could build something similar.
Why Natural Language Processing?
Within the Tape Index Dashboard, a user can upload any kind of file that they want. Theoretically, these files can be as large or as small as the user wants. Tape Index’s NLP flow needs to be able to handle this wide range of files and extract as much information from them as it possibly can.
For every file uploaded, all metainfo (dimensions, format, size, etc.) will be extracted. In addition to the metainfo, the system will attempt to create a textual representation of the file. If it’s an image, the system can identify visually represented words. For documents, text will be extracted directly. In the case of video or audio, the file will be diarized (speaker identification) and transcribed. All of this extracted information is imperative to how people will later find and use what they’ve uploaded in meaningful and powerful ways.
Going With the NLP Flow
Now that you know why Tape Index is using NLP, let’s look at the actual flow that is used in conjunction with Modal.
1. Upload - This first step in the whole process begins with a user uploading a file on Tape Index. Since I use DigitalOcean for hosting, the files are stored securely in the DigitalOcean Spaces product. Upon successful upload, a notification is sent to a Modal endpoint.
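One way to make such a notification trustworthy is to sign it with a shared secret so the Modal endpoint can verify it came from Tape Index. The sketch below is an illustration only: the signing key, payload fields, and function names are hypothetical, not Tape Index’s actual schema.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; in practice this would come from an
# environment variable or a secrets manager.
SIGNING_KEY = b"tape-index-demo-secret"

def build_upload_notification(bucket: str, key: str, size: int) -> dict:
    """Build a signed notification payload for a newly uploaded file."""
    body = json.dumps(
        {"bucket": bucket, "key": key, "size": size},
        sort_keys=True,
    )
    signature = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}

def verify_upload_notification(payload: dict) -> bool:
    """Recompute the HMAC on the receiving side and compare in constant time."""
    expected = hmac.new(
        SIGNING_KEY, payload["body"].encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, payload["signature"])
```

The constant-time comparison (`hmac.compare_digest`) matters here, since a naive `==` check can leak timing information about the expected signature.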
2. Modal Endpoint - The Modal web endpoint is where the magic begins. This endpoint receives a few data points from the upload notification. After ensuring that the notification is authenticated and allowed, it quickly spawns two asynchronous Modal functions: one to extract the metainfo of the file and one to download the file.
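The dispatch logic of such an endpoint might look like the sketch below. The Modal decorators are deliberately omitted (Part 2 covers the actual endpoint code) and `spawn` is a stub, so the control flow stands on its own; in Modal itself, `Function.spawn()` starts a function asynchronously without waiting for its result.

```python
spawned = []  # records what the endpoint kicked off, for illustration

def spawn(fn_name: str, **kwargs) -> None:
    """Stand-in for Modal's Function.spawn(), which starts a function
    running without blocking on its result."""
    spawned.append((fn_name, kwargs))

def handle_upload_notification(payload: dict, authenticated: bool) -> str:
    """Reject unauthenticated notifications, then fan out two async jobs."""
    if not authenticated:
        return "rejected"
    # One function inspects the file's metainfo; the other downloads it so
    # later, heavier functions can share the bytes.
    spawn("extract_metainfo", key=payload["key"])
    spawn("download_file", key=payload["key"])
    return "accepted"
```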
👉 Read Part 2 of this series for more on the Modal Web Endpoint
3. Metainfo Extraction - Metainfo extraction is straightforward, since it only looks at information about the file itself and any additional headers saved with the file. Once this function is done, it sends the information back to Tape Index. The whole process takes only a few seconds to complete.
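A simplified stand-in for this step, using only the Python standard library, could look like this; the real function would also pull format-specific details (image dimensions, audio duration, embedded headers) with dedicated libraries.

```python
import mimetypes

def extract_metainfo(key: str, data: bytes) -> dict:
    """Collect basic facts about an uploaded file.

    Illustrative only: size comes from the byte length and the MIME type
    is guessed from the file name's extension.
    """
    mime_type, _ = mimetypes.guess_type(key)
    return {
        "key": key,
        "size_bytes": len(data),
        "mime_type": mime_type or "application/octet-stream",
    }
```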
4. File Download - File downloading is its own function purely for speed and cost savings. Isolating the download allows me to use the cheapest Modal resources available while still being able to share the file later with much more resource-intensive functions. If the file is a video file, it will be transcoded on the fly to audio only. Once the file is downloaded, it triggers one of two separate paths depending on the file type.
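The routing decision and the video-to-audio transcode can both be sketched briefly. The transcode assumes ffmpeg (a common choice, though the post doesn’t name the tool used) is available in the Modal image; here the command is only constructed, not run.

```python
import mimetypes

AUDIO_VIDEO_PREFIXES = ("audio/", "video/")

def needs_transcription(key: str) -> bool:
    """Route a downloaded file: audio/video goes to diarization and
    transcription, everything else to direct text extraction."""
    mime_type, _ = mimetypes.guess_type(key)
    return bool(mime_type and mime_type.startswith(AUDIO_VIDEO_PREFIXES))

def build_transcode_command(src: str, dst: str) -> list:
    """Build an ffmpeg invocation that drops the video stream (-vn)
    and encodes audio only. Running it (e.g. with subprocess.run)
    requires ffmpeg to be installed in the container image."""
    return ["ffmpeg", "-i", src, "-vn", "-acodec", "libmp3lame", dst]
```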
If the file is not audio or video:
4a. Text Extraction - The text extraction process can take a variety of file types (images, PDFs, text files, etc.) and will do its best to pull out whatever text it can identify. Once it has extracted all the text it can, it sends the text along to be categorized.
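Dispatching extraction by file type might be sketched like this. Only the plain-text branch is real; the image and PDF branches are placeholders for the OCR engine and PDF parser the actual Modal function would call.

```python
def extract_text(key: str, data: bytes) -> str:
    """Dispatch text extraction by file extension.

    Illustrative only: a real implementation would hand images to an OCR
    library and PDFs to a PDF text extractor rather than returning "".
    """
    suffix = key.rsplit(".", 1)[-1].lower()
    if suffix in ("txt", "md", "csv"):
        return data.decode("utf-8", errors="replace")
    if suffix in ("png", "jpg", "jpeg"):
        return ""  # placeholder: OCR would run here
    if suffix == "pdf":
        return ""  # placeholder: PDF text extraction would run here
    return ""
```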
If the file is audio or video:
4b. Diarization - Of all the steps in this whole process, diarization is the slowest. Because diarization requires the entire audio file, it cannot be done in parallel. When diarization finishes, the audio file is broken up into small speaker segments, which are then sent to be transcribed.
4c. Transcription - Running in parallel, the transcription process takes all the smaller speaker segments and transcribes them. Once all segments have been transcribed, categorization can happen.
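The fan-out over speaker segments can be sketched with a thread pool and a stubbed transcriber; on Modal this shape would more likely use `Function.map()` to spread segments across containers. The segment fields and the fake transcriber are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_transcribe(segment: dict) -> dict:
    """Stand-in for a real speech-to-text call on one speaker segment."""
    text = "[transcript of %s-%s]" % (segment["start"], segment["end"])
    return {**segment, "text": text}

def transcribe_segments(segments: list) -> list:
    """Transcribe diarized segments in parallel, preserving their order
    so the final transcript reads in timeline order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fake_transcribe, segments))
```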
5. Categorization - Regardless of which path was taken to get the textual representation of the file, categorization is run on those results. Categorization can identify things like people, places, organizations, and topics, and can also produce a summary. Once categorization is done, all the data is sent back to Tape Index to be stored.
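As a toy stand-in for this step (the real version would use an NER/topic model, which the post doesn’t name), capitalized words can serve as candidate entities and the first sentence as a crude summary, just to show the shape of the output sent back.

```python
def categorize(text: str) -> dict:
    """Toy categorizer, illustrative only: real entity and topic
    extraction would come from a trained NLP model, not this heuristic."""
    words = text.replace(".", " ").split()
    # Treat capitalized words as candidate entities (people, places, orgs).
    entities = sorted({w for w in words if w[:1].isupper()})
    # Use the first sentence as a stand-in for a generated summary.
    summary = text.split(".")[0].strip()
    return {"entities": entities, "summary": summary}
```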
The End
With the data sent back to Tape Index, users can search, associate, and group their content in powerful ways to better understand it. The potential applications are plentiful and I can’t wait for the world to try it out!