LLM Deployment Best Practices for Production Teams

The spotlight in AI has recently shifted towards foundation models (FMs) and their subset – large language models (LLMs). Despite this trend, the cornerstone of machine learning success remains in deployment – turning these sophisticated algorithms into practical, operational tools. In systematizing the knowledge about the newly developed FMOps infrastructure, we want to highlight that whether it's a traditional ML model or an advanced LLM, the deployment process shares many similarities. There are some additional considerations, of course, but a lot of what you heard might just be hype ;)

The earlier you deploy the model, the earlier you become aware of production issues. As a result, you can tackle them before you are set to deploy the final version of your model. It is estimated that about 80% of the ML models never make it to production! Today, we'll cut through the complexities, highlighting the key parallels and unique challenges in deploying both traditional and advanced models. This streamlined approach demystifies deployment, equipping you with best practices that you can follow even before training an ML model to make it easier and faster to deploy.

In today’s Token:

How to choose the right model?
Where can I store my embeddings?
Are feature stores any useful?
How can we get the best performance from a chosen model for our use case?
Now that I know how to choose a given model, how do I choose the infrastructure to deploy it?
I have deployed my model. Can I sit back and relax?
Bonus resources
Conclusion

Let’s get started with how you can choose the right model.

How to Choose an ML Model for Production

The short answer: use the simplest model that can get the job done. For instance, if you are working on a binary classification problem, start with logistic regression. There are two benefits to using the simplest model:

Simpler models are explainable
Simpler models can be trained quickly, and the team can focus on other phases of the ML lifecycle. This reduces the time to deployment.
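To make the "start with logistic regression" advice concrete, here is a minimal sketch of such a baseline using only the Python standard library. The toy dataset, the feature meanings, and the hyperparameters (lr, epochs) are made up for illustration; in practice you would more likely reach for an existing library implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            pred = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = pred - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Average the gradients over the batch before taking a step.
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict(weights, bias, features) -> int:
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return int(sigmoid(score) >= 0.5)

# Made-up data: [hours_of_usage, support_tickets] -> churned (1) or not (0).
X = [[0.5, 3.0], [1.0, 2.5], [4.0, 0.5], [5.0, 0.0], [0.8, 2.0], [4.5, 1.0]]
y = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, row) for row in X]
```

The learned weights can be read directly (a negative weight on usage, a positive one on tickets), which is the explainability benefit mentioned above.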

For LLMs, this translates to using good enough models and prompt engineering instead of fine-tuning or training a custom LLM. For tasks like information extraction, where the required answer is present in the prompt itself, prompt engineering is very effective and less time-consuming than fine-tuning or training a custom model.
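As a rough illustration of prompt engineering for information extraction, the sketch below builds a prompt that asks the model to pull named fields out of a text. The function name, the field names, and the sample invoice are hypothetical, and the wording of the instructions is only one of many reasonable choices.

```python
def build_extraction_prompt(document: str, fields: list[str]) -> str:
    """Assemble a prompt that asks the model to extract fields from the given text only."""
    field_list = "\n".join(f"- {name}" for name in fields)
    return (
        "Extract the following fields from the text below.\n"
        "Use only information that appears in the text; "
        'answer "unknown" for anything that is missing.\n\n'
        f"Fields:\n{field_list}\n\n"
        f"Text:\n{document}\n\n"
        "Answer as JSON."
    )

# Made-up example input.
prompt = build_extraction_prompt(
    "Invoice 1042 was issued to Acme Corp on 3 March for 1,200 USD.",
    ["invoice_number", "customer", "amount"],
)
```

The resulting string is what you would send to whichever LLM you use; no fine-tuning or training step is involved.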

❝

Choosing between open-source and closed-source foundation models involves weighing operational, financial, and strategic factors.

Closed-source models offer ease of use and robust support but come at a higher cost and with limitations in customization and potential provider dependency. They are ideal for those who prioritize reliability and simplicity over technical control.

Open-source models, while more affordable upfront, demand greater technical expertise and can incur hidden operational costs. They offer customization and innovation opportunities, but require significant investment in infrastructure and expertise management.

We covered it in detail in Token 1.9: Open- vs Closed-Source AI Models: Which is the Better Choice for Your Business?

Where can I store my embeddings?

Neural network models like CNNs and RNNs compute their embeddings internally. For language models, by contrast, it is possible to compute the embeddings separately and provide them to the model. Models like Stable Diffusion expect an embedding, or a vector representation, of the actual prompt to compute the output. But can we store these embeddings like a normal text or image file?

Vector databases for embeddings

Embeddings can be stored in a plain text file or a CSV file, but that is not the optimal place. A vector database, a specialized database built to store embeddings, makes operations like querying, scaling, similarity search, and filtering much easier to perform on top of embeddings. Hence, it is better to store embeddings in a vector database.

Another advantage of using a vector database is efficient search. Milvus, an open-source vector database, can help you save time and money via hardware-efficient indexing algorithms that can boost retrieval speed by up to 10 times.
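To show what "similarity search" over embeddings means, here is a minimal brute-force sketch in plain Python. It is not how Milvus or any other vector database works internally; it is the naive linear scan that their indexing algorithms are designed to avoid. The document ids and the 3-dimensional vectors are made up (real embeddings typically have hundreds of dimensions).

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, store, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings keyed by document id.
store = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
```

Every query here touches every stored vector, so the cost grows linearly with the collection size. That is acceptable for a few thousand vectors and is the motivation for specialized indexes beyond that.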

Feature stores: when you need them in deployment

In conventional ML projects, feature stores help store and share features and facilitate collaboration among the data scientists and engineers involved in the project. How can LLMs take advantage of the feature store?
In numerous applications, the input to a language model consists of features like user attributes, product attributes, and history in addition to the textual prompt and context. For example, to recommend the most suitable book, the language model will most probably need the attributes of the book, the user's attributes, and the user's history in addition to the abstract of the book. Thus, feature stores are still helpful for foundation models. In this case, you need a separate feature store for the embeddings and the user/product attributes.
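The book recommendation example can be sketched as follows. The two dictionaries are an in-memory stand-in for a feature store, and every id, attribute name, and value in them is made up; a real feature store would expose its own lookup API.

```python
# In-memory stand-in for a feature store (hypothetical ids and attributes).
USER_FEATURES = {"user_42": {"favorite_genre": "science fiction", "books_read": 17}}
BOOK_FEATURES = {"book_7": {"title": "Example Book", "genre": "science fiction"}}

def get_features(table: dict, entity_id: str) -> dict:
    """Look up the features of an entity; unknown ids yield an empty dict."""
    return table.get(entity_id, {})

def build_recommendation_prompt(user_id: str, book_id: str, abstract: str) -> str:
    """Combine stored features with the book abstract into one prompt."""
    user = get_features(USER_FEATURES, user_id)
    book = get_features(BOOK_FEATURES, book_id)
    user_lines = "\n".join(f"{key}: {value}" for key, value in user.items())
    book_lines = "\n".join(f"{key}: {value}" for key, value in book.items())
    return (
        f"User attributes:\n{user_lines}\n\n"
        f"Book attributes:\n{book_lines}\n\n"
        f"Book abstract:\n{abstract}\n\n"
        "Would this user enjoy this book? Answer yes or no with a short reason."
    )

prompt = build_recommendation_prompt("user_42", "book_7", "A made-up abstract.")
```

The point of the sketch is the separation of concerns: the features are maintained and served independently, and the prompt is assembled from them at request time.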


LLM Performance Optimization for Your Use Case

Getting the optimal performance from the model depends on the underlying task you are planning to use the model for. For a simple task, where the context in the prompt has the answer to the problem, you can get decent performance with prompt engineering.

When the task involves finding answers from a knowledge base, it is better to use retrieval augmented generation (RAG), an approach that combines retrieval-based methods with generative models. It takes a pre-trained language model like GPT and retrieves relevant information from a knowledge base to augment the generation process. This helps the model provide accurate and relevant responses, especially when the task involves retrieving specific knowledge or answers from a large corpus of information.

Image Credit: RAG for Knowledge-Intensive NLP Tasks
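The retrieve-then-generate flow of RAG can be sketched in a few lines. This is a toy version under clear assumptions: retrieval uses simple word overlap instead of embedding search, the knowledge base is three made-up sentences, and generate is a placeholder where a real LLM call would go.

```python
def tokenize(text: str) -> set[str]:
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(question: str, knowledge_base: list[str], k: int = 1) -> list[str]:
    """Rank passages by word overlap with the question (stand-in for embedding search)."""
    q_tokens = tokenize(question)
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(q_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Augment the question with the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def generate(prompt: str) -> str:
    """Placeholder for the LLM call; swap in your provider's client here."""
    return f"[model response to a prompt of {len(prompt)} characters]"

# Made-up knowledge base.
knowledge_base = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes 5 to 7 business days.",
]
question = "How long do refunds take?"
passages = retrieve(question, knowledge_base, k=1)
answer = generate(build_rag_prompt(question, passages))
```

In a production setup, retrieve would query a vector database with the embedded question, but the shape of the pipeline stays the same.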

ML Model Deployment Infrastructure: Cloud vs. On-Premise

For the deployment of ML models, you have three major options:

Deploy on your own server
Deploy in the public cloud
Hybrid deployment (combination of private resources and public cloud)

If you choose the first option, you will be responsible for setting up the infrastructure, allocating the resources, and maintaining the server. This also requires a skilled system administrator on the team and some heavy upfront costs. If you have the resources to deploy the ML model on your own server, you will enjoy benefits like:

Fine-grained control over the resources
Enhanced security and data privacy (policies in some organizations and domains such as medicine may require that the data doesn’t leave your server)
Long-term cost savings

If you don’t want to go through the hassle of setting up your own infrastructure, you can choose to deploy the model in public clouds like AWS and Azure. The pros to deploying your model in the public cloud include:

Low upfront cost and easy scalability (public clouds provide auto-scaling which allows automatic instantiation and termination of hardware resources depending upon the usage)
Easy deployment options for popular ML models and Language models.
Services like SageMaker bundle end-to-end machine learning tooling (including storage, data analysis tools, feature stores, CI/CD, etc.) in a single platform, so you don't have to integrate separate products yourself.
Services offered by these cloud providers are updated regularly to satisfy local regulations such as HIPAA. Furthermore, the data can be encrypted which adds another layer of security.

Finally, we can also combine our private resources with public cloud services. For example, if you have a strict requirement that the user data cannot leave your servers, then you can connect your local data center to public clouds. Users will then send requests to your data center, the data center will then preprocess the request (such as normalizing the inputs, extracting embeddings, etc.) and send the processed input to a public cloud (over an encrypted network) for final computation. The output from the cloud is then sent to the user via the local data center. This approach where a machine learning model or system is deployed using a combination of on-premises and cloud-based resources is called hybrid deployment.
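The hybrid request flow described above can be sketched as three small functions. Everything here is illustrative: the function names are hypothetical, the "cloud model" is a local stub with a made-up rule, and a real deployment would replace call_cloud_model with a request to the cloud endpoint over an encrypted connection.

```python
def preprocess_on_premises(request: dict) -> dict:
    """Runs in the local data center: keep only the derived inputs the model needs."""
    return {"normalized_text": request["text"].strip().lower()}

def call_cloud_model(payload: dict) -> dict:
    """Stub for the cloud-hosted model; the keyword rule is a made-up placeholder."""
    label = "positive" if "great" in payload["normalized_text"] else "neutral"
    return {"label": label}

def handle_request(request: dict) -> dict:
    """Entry point in the local data center: preprocess, call the cloud, respond."""
    payload = preprocess_on_premises(request)
    if "user_id" in payload:
        raise ValueError("raw user identifiers must not leave the data center")
    result = call_cloud_model(payload)
    # The user-facing response is assembled locally, where the user id is known.
    return {"user_id": request["user_id"], "label": result["label"]}

response = handle_request({"user_id": "user_42", "text": "  Great service!  "})
```

The check before the cloud call is the important design choice: the boundary between private and public resources is enforced in code rather than left to convention.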

👀 Today, we have multiple hardware options to train and deploy our models: TPUs, GPUs, and CPUs. TPUs are accelerators designed specifically for handling deep learning workloads. GPUs are massively parallel and hence well suited to training neural networks. CPUs offer far less parallelism than GPUs or TPUs, so training on a CPU can take a lot more time; they are, however, often sufficient for inference. The general trend is to train the model on a GPU or TPU and then use a CPU for inference. If the inference/prediction steps involve batch processing, you can use a GPU to improve the inference speed.

Regardless of the hardware used for deployment, there are a few techniques that can help you improve the inference speed:

Quantization: Quantization reduces the precision of model parameters (weights and activations). This minimizes the model's memory footprint and speeds up inference. A common example of quantization is the conversion of a 32-bit (FP32) model to a 16-bit (FP16) model.
Pruning: Pruning reduces the connections in the network with minimal compromise in the performance.
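Both techniques can be illustrated on a plain list of weights. The sketch below is a simplification under stated assumptions: it converts values from FP32-style floats to FP16 by round-tripping them through Python's half-precision struct format, and it uses magnitude pruning (zeroing the smallest weights), which is only one of several pruning strategies. The weight values are made up, and real frameworks apply these steps to whole tensors with their own tooling.

```python
import struct

def to_fp16(value: float) -> float:
    """Round-trip a value through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def quantize_fp16(weights):
    """Return the weights as they would be represented in 16-bit precision."""
    return [to_fp16(w) for w in weights]

def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

# Made-up weights.
weights = [0.1234567, -0.0004321, 0.75, -1.5, 0.02, 0.0009]

fp16_weights = quantize_fp16(weights)
max_error = max(abs(a - b) for a, b in zip(weights, fp16_weights))
pruned = prune_by_magnitude(weights, fraction=0.5)
```

Comparing weights with fp16_weights shows the small rounding error that quantization introduces, while pruned keeps only the three largest-magnitude weights. Both are the source of the accuracy trade-off these techniques carry.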

ML Model Monitoring in Production

Machine learning is an iterative process and hence doesn’t finish after deploying a model. For instance, when first released, GPT-4 was trained with data till September 2021. It did not know about the events that happened after September 2021. Later, multiple revisions of the model were rolled out and the latest one is trained with data till April 2023. ChatGPT has a similar case. Hence, deployment is not the finish line for an ML project (no matter if it’s a traditional model or a foundation model like LLaMA). In addition to continuous training, you will need to look out for issues like:

Data and Concept Drift:
Data drift refers to the change in data distribution with time. As time passes, people’s preferences change. For instance, a user who likes to watch romantic movies today may prefer watching action or suspense after a year or so. As seasons change, people’s choices for their drinks change (beer in summer and whisky or rum in winter). If the input data is different from what the model is trained on, the model will very likely fail.
Concept drift refers to a change over time in the relationship between the inputs and the dependent variable (the outcome being predicted), so that the same input no longer leads to the same outcome. A common example of concept drift is the behavior of people during the COVID lockdown. Before the lockdown, for a restaurant or its mobile app, weekends would be the right time to send offers, but during COVID the behavior changed and people wouldn’t respond to those offers.

Since language models are used in conjunction with user/product data (customer support bots, personalization/recommendation services), they are equally impacted by data and concept drift.
These issues can be solved by continuously training the model with newly acquired data.
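Before retraining, you first need to notice that the data has shifted. One simple way to check a single numeric feature for data drift is the Population Stability Index (PSI), sketched below. This is an illustrative choice, not the only method; the samples are made up, the bin count is arbitrary, and it says nothing about concept drift, which requires comparing predictions with actual outcomes.

```python
import math

def psi(expected, actual, n_bins=5, eps=1e-6):
    """Population Stability Index between a reference sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Made-up user ages: the training sample and two later production samples.
training_ages = [22, 25, 27, 30, 31, 33, 35, 38, 40, 45]
same_population = [23, 24, 28, 29, 30, 32, 34, 37, 39, 43]
older_population = [48, 50, 52, 55, 57, 60, 61, 63, 65, 70]

psi_stable = psi(training_ages, same_population)
psi_drifted = psi(training_ages, older_population)
```

A PSI near zero means the recent sample looks like the training data; a large value, as for older_population here, is a signal to investigate and possibly retrain. Any alert threshold is a rule of thumb that you should tune for your own data.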
Resource usage, errors, and bugs:
Once the model is deployed, apart from issues related to the accuracy of the machine learning model, it may suffer from issues like disproportionate resource usage, faults and errors, fluctuating user requests, network congestion, etc. Only what gets measured gets managed: we can tackle a problem only after identifying it. You can use loggers to log these issues and tools like Grafana to visualize metrics like response time, number of user requests, and error rate. Once you identify a problem, you can tackle it with the right approach, e.g., load balancers to distribute the load evenly across resources, or auto-scaling to automatically scale the active resources up or down.
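As a small illustration of "what gets measured gets managed", the sketch below wraps a request handler to record latency and errors and then summarizes them. The class and the fake_model handler are hypothetical; in a real system you would export these numbers to your metrics backend instead of keeping them in a list.

```python
import statistics
import time

class RequestMonitor:
    """Collect per-request latencies and error counts for later reporting."""

    def __init__(self):
        self.latencies_ms = []
        self.errors = 0

    def observe(self, handler, *args):
        """Run handler(*args), recording its latency and whether it raised."""
        start = time.perf_counter()
        try:
            return handler(*args)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies_ms.append((time.perf_counter() - start) * 1000.0)

    def summary(self):
        if not self.latencies_ms:
            return {"requests": 0, "errors": self.errors, "p50_ms": None, "p95_ms": None}
        ordered = sorted(self.latencies_ms)
        # Nearest-rank 95th percentile, computed with integer arithmetic.
        rank_95 = (95 * len(ordered) + 99) // 100
        return {
            "requests": len(ordered),
            "errors": self.errors,
            "p50_ms": statistics.median(ordered),
            "p95_ms": ordered[rank_95 - 1],
        }

def fake_model(prompt):
    """Hypothetical handler standing in for a model call."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()

monitor = RequestMonitor()
for prompt in ["hello", "world", ""]:
    try:
        monitor.observe(fake_model, prompt)
    except ValueError:
        pass  # the error is already counted by the monitor
report = monitor.summary()
```

Tracking a high percentile such as p95 alongside the median is a deliberate choice: averages hide the slow requests that users actually complain about.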
Compliance and Regulations:
Depending upon the quality of the input data the model is trained on, the model may generate outcomes that are morally incorrect or fail to comply with regulations. For instance, a few years back, Amazon built an AI tool to find the best candidate for a job. Later, upon inspection, it was found that the AI was not gender-neutral: it penalized resumes that contained the word “women”. Even the most advanced language models aren’t immune to such problems; GPT-3 was found to capture anti-Muslim bias.
One solution to this issue is to use a guardrail model. During training, the guardrail model can be used to identify and eliminate biased, hurtful, and socially unacceptable text from the training data. During deployment, the same guardrail model can be used to filter out biased information from the output.
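The two uses of a guardrail can be sketched with the same scoring function. Note the big simplification: guardrail_score below is a toy blocklist lookup standing in for an actual guardrail model, and the blocked terms, threshold, and fallback message are placeholders.

```python
# Placeholder tokens standing in for terms a real guardrail model would flag.
BLOCKED_TERMS = {"blockedterm1", "blockedterm2"}

def guardrail_score(text: str) -> float:
    """Toy stand-in for a guardrail model: the fraction of words on the blocklist."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    if not words:
        return 0.0
    return sum(w in BLOCKED_TERMS for w in words) / len(words)

def filter_training_corpus(corpus, threshold=0.0):
    """Training-time use: drop examples the guardrail flags."""
    return [text for text in corpus if guardrail_score(text) <= threshold]

def safe_response(model_output, threshold=0.0, fallback="Sorry, I can't help with that."):
    """Deployment-time use: replace flagged outputs with a fallback message."""
    return model_output if guardrail_score(model_output) <= threshold else fallback

corpus = ["A harmless sentence.", "This contains blockedterm1 sadly."]
clean_corpus = filter_training_corpus(corpus)
reply = safe_response("Here is blockedterm2 for you.")
```

Using one scoring function in both places keeps the training data and the served outputs held to the same standard, which is the point made in the paragraph above.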

Open Source Model Deployment Tools

TensorFlow Serving (github)
A flexible, high-performance serving system for machine learning models, designed for production environments. It's part of the TensorFlow Extended (TFX) ecosystem.
Not specifically for foundation models, but can handle various types of TensorFlow models.
TorchServe (github)
Developed by AWS and Facebook, it's a model-serving framework for PyTorch models that simplifies the deployment process.
Ideal for PyTorch models, including large models, but not exclusively for foundation models.
MLflow (github)
An open-source platform for managing the end-to-end machine learning lifecycle. It includes capabilities for deploying models in diverse environments.
General purpose, not limited to any specific model type.
Kubeflow (github)
A Kubernetes-based platform to deploy, monitor, and manage machine learning models at scale.
Not specific to foundation models but highly scalable for various model types.
Seldon Core (github)
An open-source platform that enables deployment, scaling, and monitoring of machine learning models in Kubernetes.
Versatile for different model types, including large models.
13 Open-Source Tools for Foundation Model Deployment

Conclusion

Deploying an ML model is an iterative and convoluted process, but following best practices can save time, effort, and future legal battles. For the sake of convenience and explainability, it is better to use the simplest model that can get the job done. Furthermore, for language models, a vector database makes it efficient to work with embeddings, and, if needed, a separate feature store can be used to store additional attributes. Once you are done with training, depending on the maturity of the organization, the available resources, and the spread of users, you may choose to deploy your model on your own infrastructure, in the public cloud, or in a combination of both (the hybrid approach). Techniques like quantization and pruning can improve the inference speed of your models (with some compromise on accuracy). Finally, continuous training can save you from an accuracy drop due to data and concept drift, continuous monitoring can help you find network and infrastructure-related issues, and guardrail models can help you avoid responding to users with biased and socially unacceptable output.


Thank you for reading, please feel free to share with your friends and colleagues. In the next couple of weeks, we are announcing our referral program 🤍

Previously in the FM/LLM series:

Token 1.17: LLM Deployment Best Practices for Production