Going Big with Vertex AI: Lessons from the Edge
09 May 2024
A series of posts on Vertex AI, sharing lessons learned from serving machine learning models in production.

If you’ve spent any time deploying ML systems in production, you know that managed platforms can be a blessing and a curse. At Recursion, I started working with Vertex AI, Google Cloud’s fully managed AI platform. It promises to take care of the infrastructure so you can focus on models and data. Sounds good, right?
It mostly is. Until it’s not.
This isn’t a teardown post. I actually like a lot of what Vertex AI offers: out-of-the-box pipelines, flexible deployment options, decent integration with the rest of GCP. But once you move beyond tutorials and toy datasets, the platform’s limits start showing up fast.
The deeper I went, the more friction I hit.
Instead of a single massive post listing every issue I ran into, I’m breaking this into a series of focused posts, each tackling a specific area where Vertex AI falls short in real-world use.
🔖 In this series:
- Limitations – Online Predictions
  What happens when your deployed model doesn’t respond? How much control do you actually have over the prediction server? (Spoiler: not much.)
- Limitations – Batch Predictions
  Scaling batch jobs should be easy. But good luck if you need custom logic, retries, or visibility into what actually ran.
- Limitations – Model Registry
  Model versioning is essential for any ML workflow. Vertex AI has a registry, but it’s missing critical features for traceability and reproducibility.
Each post includes real-world errors, workarounds, and the kind of surprises you only discover after you’ve sunk hours into debugging.
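Before digging into the failure modes, it helps to see what the happy path looks like. Here’s a minimal sketch of the three areas this series covers, using the google-cloud-aiplatform Python SDK. Every project, bucket, endpoint, and model ID below is a placeholder, and the calls assume those resources already exist.

```python
# Happy-path sketch of the three Vertex AI surfaces this series covers.
# All project, bucket, endpoint, and model IDs are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# 1. Online prediction: call a model already deployed to an endpoint.
endpoint = aiplatform.Endpoint("1234567890")  # placeholder endpoint ID
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": 0.5}])
print(response.predictions)

# 2. Batch prediction: score a JSONL file in GCS against a registered model.
batch_job = aiplatform.BatchPredictionJob.create(
    job_display_name="demo-batch-scoring",
    model_name="projects/my-project/locations/us-central1/models/9876543210",
    gcs_source="gs://my-bucket/inputs.jsonl",
    gcs_destination_prefix="gs://my-bucket/outputs/",
    instances_format="jsonl",
    predictions_format="jsonl",
)  # blocks until the job finishes (create() is synchronous by default)
print(batch_job.state)

# 3. Model Registry: upload a new version of an existing model.
model_v2 = aiplatform.Model.upload(
    parent_model="projects/my-project/locations/us-central1/models/9876543210",
    display_name="demo-model",
    artifact_uri="gs://my-bucket/models/demo-model/v2/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
print(model_v2.version_id)
```

In a demo, all three of these calls just work. The posts in this series are about what happens when they don’t: endpoints that stop responding, batch jobs you can’t inspect or retry on your own terms, and model versions you can’t trace back to what produced them.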