Looking back at 2025, forcing AI into everything was clearly a trend, and at Filestage we felt that pressure too. But we tried to be deliberate and didn't rush to add AI to our product. We thought carefully about the use cases where it would actually provide value to our customers and settled on AI Reviewers to save time when reviewing files.
Operating without dedicated ML engineers or GPU infrastructure forced disciplined trade-offs: every infrastructure choice had to justify its complexity. We didn't have to invent a novel foundation model; we could build on the AI models and tools that already exist. That's why we focused on specific use cases we were confident we could ship quickly and validate with real customers.
We prioritized getting feedback from our customers as soon as possible instead of building the perfect solution. By focusing on very specific use cases we could select the best AI models and approaches for each. Our first MVP was ready in less than a month, with reviews completing in under 30 seconds.
This was our first and most obvious attempt to leverage AI. LLM foundation models have gotten really good at text-based tasks, so we were confident that if we could extract the text from the files, we could provide solid review comments.
Although we built it in a way that makes it easy to create multiple AI Reviewers with different prompts, we started with the simple use case of spelling and grammar. Spelling checks have been around for decades and can be done without AI, but this use case allowed us to test the pipeline end to end. We knew we could later add different prompts to cover other cases, like flagging forbidden terms or enforcing a consistent tone of voice.
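To make that concrete, here is a minimal sketch of what prompt-driven reviewers can look like. The reviewer names, prompt texts, and the `build_prompt` helper are illustrative assumptions, not our production configuration.

```python
# Minimal sketch of prompt-driven reviewers. Names and prompts are illustrative,
# not our production configuration.
REVIEWERS = {
    "spelling_and_grammar": (
        "You are a proofreader. List every spelling or grammar issue in the text. "
        "For each issue, quote the exact offending text and suggest a correction."
    ),
    "forbidden_terms": (
        "Flag any occurrence of the following forbidden terms and quote the exact "
        "sentence they appear in: {forbidden_terms}."
    ),
    "tone_of_voice": (
        "Check that the text follows this tone of voice guide and quote any "
        "passage that deviates from it: {tone_guide}."
    ),
}

def build_prompt(reviewer_id: str, extracted_text: str, **params: str) -> str:
    """Combine a reviewer's instructions with the extracted document text."""
    instructions = REVIEWERS[reviewer_id].format(**params)
    return f"{instructions}\n\n---\nDocument text:\n{extracted_text}"
```

Adding a new reviewer then becomes a matter of adding a new prompt entry rather than building a new pipeline.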
At Filestage we allow users to upload many file types (websites, PDF, Microsoft Office, ...), and extracting text from each format comes with its own challenges. We started with PDFs because they usually contain the most text and have the most mature tooling. We also already convert many file types into PDFs to make them web compatible. After evaluating multiple tools we settled on AWS Textract. Although OCR has existed for a long time, the newer machine learning approaches are more accurate and deal better with structured data like the tables and forms inside documents.
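For readers who haven't used it, the asynchronous Textract flow for a PDF that already lives in S3 looks roughly like the sketch below (using boto3). The bucket, key, and polling loop are simplified placeholders; production code would typically rely on SNS notifications and proper error handling instead.

```python
import time
import boto3

# Sketch of asynchronous Textract text detection for a PDF stored in S3.
textract = boto3.client("textract")

def extract_pdf_text(bucket: str, key: str) -> str:
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job finishes (simplified; SNS notifications avoid polling).
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(2)
    if result["JobStatus"] == "FAILED":
        raise RuntimeError("Textract job failed")

    # Collect LINE blocks across all paginated result pages.
    lines = []
    while True:
        lines += [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        next_token = result.get("NextToken")
        if not next_token:
            break
        result = textract.get_document_text_detection(JobId=job_id, NextToken=next_token)
    return "\n".join(lines)
```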
But why extract the text in the first place? Just pick a multimodal LLM and problem solved. Well... today text-based performance is drastically better than multimodal performance, and it's faster and cheaper too. After testing many files, our accuracy with the text-based approach was much better, although the pipeline complexity was higher due to the extra steps involved.
Another obvious downside of the text-based approach is that once you extract the text, the coordinates of each piece of text in the original document are gone; you strip out everything to keep the text only. At Filestage, comments need to be placed at the right spot in the original document. We ended up processing the LLM output to identify the offending piece of text, then searching for it with our PDF processing library (we use Apryse) to obtain its coordinates.
This sounds straightforward, but several edge cases made it tricky: the same text can appear multiple times in a document, a text extract can be split across multiple pages, and the LLM occasionally hallucinates parts of the text, which makes finding the original source harder. We handled these with heuristics that worked well enough for the MVP.
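As an illustration, a fuzzy-matching heuristic along these lines can tolerate small hallucinations in the quoted text. The function below is a simplified sketch, not our exact implementation, and the coordinate lookup with Apryse is omitted.

```python
from difflib import SequenceMatcher

# Illustrative heuristic: locate a (possibly slightly hallucinated) LLM quote
# inside the per-page extracted text. The real pipeline then asks the PDF
# library for the coordinates of the matched span; that part is omitted here.
def find_quote(pages: list[str], quote: str, min_ratio: float = 0.85):
    """Return (page_index, start, end) of the best fuzzy match, or None."""
    best = None
    for page_index, page_text in enumerate(pages):
        matcher = SequenceMatcher(None, page_text, quote, autojunk=False)
        match = matcher.find_longest_match(0, len(page_text), 0, len(quote))
        if match.size == 0:
            continue
        # Score how much of the quote was found contiguously on this page.
        ratio = match.size / max(len(quote), 1)
        if ratio >= min_ratio and (best is None or ratio > best[0]):
            best = (ratio, page_index, match.a, match.a + match.size)
    if best is None:
        return None  # quote may be split across pages or too heavily altered
    _, page_index, start, end = best
    return page_index, start, end
```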
For our first visual use case we focused on checking whether particular images were present in designs. This is particularly helpful in packaging design: due to regulations, you need to make sure the recycle icon is present in every design. The icon always has more or less the same shape, but depending on the design it can have different colors, or be only outlined in some designs and filled in others, which is why it's better solved with AI. The goal of our AI reviewer was to automatically verify that the images uploaded by the customer were present in the design.
We evaluated the available models against sample test files and the icons we wanted to verify. We ultimately chose DINOv2 (an open-source vision model) instead of training our own. It was accurate enough on our customer data, which skipped months of work and let us validate with customers sooner.
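As a rough sketch, this is how an image can be embedded with DINOv2 on CPU using Hugging Face Transformers. The model size and the pooling choice here are illustrative assumptions, not necessarily what we run in production.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Load DINOv2 once at startup; model size is an illustrative choice.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")
model.eval()

def embed_image(path: str) -> list[float]:
    """Embed an image on CPU and return a plain list suitable for storage."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the CLS token of the last layer as the image embedding.
    embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)
    return embedding.tolist()
```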
This was the first time we ran our own AI workloads in production, so we had never needed GPUs in our backend servers before. Instead of optimizing too early, we tested CPU inference. Because our flow was asynchronous anyway and the volume was going to be low at the beginning, we decided to move forward with it. This allowed us to reuse our existing background job infrastructure.
This minimized the ops work necessary, though I still had to adapt our backend infrastructure to run Python workloads alongside our existing Node.js services and to efficiently download the model parameters (a few GBs) onto our instances.
Every time a user uploads a file, we generate embeddings for it; these are then used when processing the image to find matches. Setting up a new vector database would have delayed us and added a new moving part to our infrastructure, so we decided to test the recently released vector search in MongoDB Atlas, which is already our one and only backend database. No new infrastructure to operate.
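A query against Atlas Vector Search then looks roughly like the aggregation below, shown with PyMongo. The index, collection, and field names are placeholders, and the vector search index on the embedding field has to be created in Atlas beforehand.

```python
from pymongo import MongoClient

# Placeholder connection, database, and collection names.
client = MongoClient("mongodb+srv://...")  # connection string elided
collection = client["filestage"]["icon_embeddings"]

def find_similar_icons(query_embedding: list[float], limit: int = 5):
    """Run an Atlas Vector Search aggregation and return the closest icons."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "icon_embedding_index",  # must exist in Atlas
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": limit,
            }
        },
        {
            "$project": {
                "icon_id": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    return list(collection.aggregate(pipeline))
```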
Rather than training a classifier or using multiple weighted similarity measures, we used simple cosine similarity with a single threshold to find a match. If users needed different sensitivity levels, we could expose the threshold as a configuration option rather than building complex per-icon-type logic upfront.
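The matching rule itself stays as simple as the sketch below; the 0.8 threshold is an illustrative value, not our tuned one.

```python
import numpy as np

# Single-threshold matching rule. The 0.8 value is illustrative; in practice the
# threshold is tuned against sample customer files and could be exposed as a
# per-reviewer setting.
def is_match(icon_embedding: list[float], candidate_embedding: list[float],
             threshold: float = 0.8) -> bool:
    a, b = np.asarray(icon_embedding), np.asarray(candidate_embedding)
    similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold
```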
We constantly need to test and benchmark new model releases. Better models allow us to rework our current AI reviewers, automate more complex use cases, and maybe even handle other file types like video, which require much more processing.
Customer feedback is the other critical part. We need to closely monitor the accuracy of our AI reviewers and make sure the feedback loop with customers is tight so we can learn what new use cases we can implement and what the key challenges in adoption are.
Finally, we need to make sure costs and performance are up to our standards. If usage continues to grow, we will switch to GPU inference, and we may introduce caching to optimize the LLM API calls.
Gather real customer examples early. Having actual customer files with their expected review comments let us verify accuracy quickly across different approaches. Without this ground truth, we would've been guessing.
Software engineers can build AI pipelines, but it's important to learn deeply how LLMs and other models work behind the scenes to improve accuracy and make better tool choices. I found AI Engineering by Chip Huyen particularly helpful for building this understanding.
Customer feedback beats perfection. When testing novel approaches that need validation, keep pushing to deliver. The real edge cases only surface with real usage.
There is no doubt AI can empower users to do more with your product. I highly encourage you to take the time to do the thought experiment of how you would develop your product from scratch now that AI is readily available. At the same time, don't force AI into your product if you don't see a fit. Listen to your customers carefully, don't just follow trends blindly.