Bumper: Podcast X-Ray, A Case Study
I quit my job and started freelancing in June of 2023. With any sudden massive move like that, it’s impossible to predict what kind of projects will show up, but I was super pleased when the opportunity to join Bumper as their fractional CTO came along. If you aren’t familiar with Bumper, they are a data-driven podcast growth agency helping organizations increase the success of their audio properties by providing growth consulting services, measurement solutions and podcast marketing assistance.
Starting Point: An Internal Tool
The team at Bumper had been using an internally built tool they call Podcast X-Ray. Podcast X-Ray helps them, and the organizations they work with, get a high-level view of a podcast. The tool had proven to be an extremely valuable internal resource, and the Bumper team wanted to share it with a broader audience.
Bumper’s end goals were as follows:
Take their internal tool used by a few people and make it available to a broad audience of users
Allow the new external-facing tool to scale with traffic growth and increased data needs
Create a codebase and infrastructure that was simple to maintain, scale, and expand upon for future iterations
Move the data behind “on-demand” rendering into a performant database for quick retrieval and analysis
Update and ingest all new data on a rolling 24-hour basis
Use the pre-existing codebase (Flask app, custom Python libraries, etc.) as the starting point for future development
Create a high-quality solution while meeting objectives and staying within a tight budget
Ignore the current look and feel (for now)
Creating the Solutions
To support Bumper’s end goals, three areas needed to be addressed:
Database
Codebase
Infrastructure
1. Database
Podcast X-Ray in its original form had no database and was loading all its data in real-time from various sources every single time a user visited the site. This caused response times to be slow and introduced a potential point of failure if any of those sources were to go down. Additionally, since no data was stored permanently, the Bumper team was unable to use the data for deeper analysis.
When thinking about how to set up the database, there were a few requirements to consider. At the highest level, the entire podcast ecosystem, consisting of millions of podcasts and tens of millions of episodes, had to be stored in the database. More specifically, every single change that occurred on a podcast needed to be tracked as well. Every known podcast also needed to be updated at least once every 24 hours. On top of that, any podcast that wasn’t yet in the system needed to be pulled in real-time, then stored in the database for subsequent requests.
To put this into perspective, Apple Podcasts’ US catalog contained over 2,500,000 podcasts and 84,000,000+ episodes. This meant the system needed to process more than 3,600,000 podcasts and episodes combined every single hour to keep everything in the database up to date. It should also be noted that the data being gathered originated from places other than an RSS feed and needed to be gathered every single day, indefinitely.
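For the curious, here is the back-of-the-envelope math behind that throughput number, using the catalog figures above:

```python
# Rough throughput required to refresh the entire catalog every 24 hours,
# based on the Apple Podcasts US catalog figures quoted above.
podcasts = 2_500_000
episodes = 84_000_000

items_per_day = podcasts + episodes        # everything is revisited daily
items_per_hour = items_per_day / 24
items_per_second = items_per_hour / 3600

print(f"{items_per_hour:,.0f} items/hour")      # ~3.6 million
print(f"{items_per_second:,.0f} items/second")  # ~1,000
```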
Storing the data was relatively straightforward, and PostgreSQL was chosen as the main storage and retrieval mechanism. Gathering that volume of data every day was a much more difficult problem, and it had to be solved with code.
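To make the storage and change-tracking requirements a little more concrete, here is a minimal sketch of the kind of schema that could support them. It assumes SQLAlchemy (a common pairing with Flask); the table and column names are illustrative, not Bumper’s actual models.

```python
# Illustrative schema sketch: podcasts, episodes, and a change log so that
# updates are tracked rather than silently overwritten.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Podcast(Base):
    __tablename__ = "podcasts"
    id = Column(Integer, primary_key=True)      # e.g. a directory ID
    title = Column(Text)
    feed_url = Column(Text)
    last_checked_at = Column(DateTime)          # drives the 24-hour refresh cycle

class Episode(Base):
    __tablename__ = "episodes"
    id = Column(Integer, primary_key=True)
    podcast_id = Column(Integer, ForeignKey("podcasts.id"), index=True)
    title = Column(Text)
    published_at = Column(DateTime)

class PodcastChange(Base):
    __tablename__ = "podcast_changes"           # one row per observed change
    id = Column(Integer, primary_key=True)
    podcast_id = Column(Integer, ForeignKey("podcasts.id"), index=True)
    field = Column(String(64))                  # which attribute changed
    old_value = Column(Text)
    new_value = Column(Text)
    observed_at = Column(DateTime, server_default=func.now())
```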
2. Codebase
The original codebase was a basic Python Flask app that fetched and displayed the appropriate podcast data. To support sourcing data from the database and to keep further development easily maintainable, multiple microservices were set up to handle different aspects of the application for both insertion and retrieval. These microservices could then be tuned individually to operate at maximum efficiency and scale as appropriate.
ID worker - The ID worker’s sole job is to gather the IDs of every podcast we care about as quickly as possible and store them in the database, with new podcasts being added daily. These IDs are then used by the podcast worker.
Podcast worker - The podcast worker scans the database for any new podcasts or podcasts that need to be updated. Once it finds podcasts to process, it loads all the data it can find for each one, compares it against what already exists, and either adds new records or tracks the changes that have been made (a simplified sketch of this loop appears after this list).
Episode worker - Similar to the podcast worker, the episode worker takes the episode data that the podcast worker loaded, hydrates it with all relevant episode details, and either adds new records or tracks the changes that have been made.
Web application - Purely a display layer for the end user, the web application loads and displays the podcast and episode information. When an “unknown” podcast is requested, the web application stores the ID in the database and queues it up for the workers to handle (see the second sketch below).
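Below is a simplified sketch of the podcast worker’s refresh loop. The helpers it relies on (db, fetch_podcast_data, record_change) are hypothetical stand-ins for the real data layer and ingestion code, not the production implementation.

```python
# Simplified podcast-worker loop: find podcasts that are new or haven't been
# checked in 24 hours, fetch fresh data, and record any differences.
import time
from datetime import datetime, timedelta

REFRESH_INTERVAL = timedelta(hours=24)  # every known podcast is revisited daily


def run_podcast_worker(db, fetch_podcast_data, record_change):
    while True:
        cutoff = datetime.utcnow() - REFRESH_INTERVAL
        stale = db.get_podcasts_not_checked_since(cutoff)  # new or out-of-date rows
        for podcast in stale:
            fresh = fetch_podcast_data(podcast.id)  # pull from the external sources
            for field, new_value in fresh.items():
                old_value = getattr(podcast, field, None)
                if old_value != new_value:
                    # Track the change rather than silently overwriting it.
                    record_change(podcast.id, field, old_value, new_value)
                    setattr(podcast, field, new_value)
            podcast.last_checked_at = datetime.utcnow()
            db.save(podcast)
        time.sleep(60)  # nothing left to process; check again shortly
```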
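And here is a hedged sketch of how the web application could handle the “unknown” podcast case. The route shape and the in-memory stand-ins are illustrative; in the real app the lookups would be backed by PostgreSQL.

```python
# Illustrative "unknown podcast" path: return what we know immediately, or
# queue the ID for the workers and respond with a pending status.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-ins for the PostgreSQL-backed data layer.
KNOWN_PODCASTS: dict[int, dict] = {}
PENDING_IDS: set[int] = set()


@app.route("/podcast/<int:podcast_id>")
def show_podcast(podcast_id: int):
    podcast = KNOWN_PODCASTS.get(podcast_id)
    if podcast is None:
        # Unknown podcast: store the ID and let the workers hydrate it later.
        PENDING_IDS.add(podcast_id)
        return jsonify({"status": "queued", "id": podcast_id}), 202
    return jsonify(podcast)
```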
3. Infrastructure
Infrastructure is usually the “gotcha” part of a project. It can be expensive, confusing, and hard to maintain. Finding the proper balance is different for every project, but for this one I chose DigitalOcean’s App Platform. This allowed for easy setup of PostgreSQL, Redis, workers, and servers, all with simple-to-configure redundancy, scalability, and monitoring. The cost, while not the cheapest, was a good balance between ease of use, quality, and power. DigitalOcean also integrates with GitHub, so I was able to create a fully automated, zero-downtime CI/CD flow with little to no effort. The final resources needed for the infrastructure are as follows:
2 worker servers (1 for the ID/Podcast workers, 1 for the Episode worker)
1 web application server
1 PostgreSQL database
1 Redis database for caching
Optimization & Launch 🚀
Once everything was built, deployed, and running correctly, the most difficult part was ensuring that all the various pieces met the needs and constraints of the project. While the PostgreSQL database required the most resources, everything else could be tuned to run efficiently on the smallest resources DigitalOcean provides, keeping costs low and leaving ample room to scale.
While there’s still a lot more to come – a new and better design, more analytics, and other fun surprises – there was nothing left to do but launch Podcast X-Ray as a publicly facing website! You can find it at https://podcastxray.com, and the Bumper team and I hope you enjoy it!