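This document shows you how to deploy your own finetuned HuggingFace Command-R model using Amazon SageMaker. More specifically, assuming you already have the adapter weights or merged weights from your own finetuned Command model, we will show you how to merge the adapter weights into the base model weights (if you have not done so already), export the merged weights to the TensorRT-LLM inference engine, create an endpoint that serves the exported engine, and run real-time inference against that endpoint.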
You can also find a companion notebook with working code samples.
The IAM role you use should have the AmazonSageMakerFullAccess policy attached, along with the AWS Marketplace permissions aws-marketplace:ViewSubscriptions, aws-marketplace:Unsubscribe, and aws-marketplace:Subscribe.
NOTE: If you’re running the companion notebook, be aware that it contains elements that render correctly in the Jupyter interface, so you should open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
To subscribe to the algorithm:
First, let’s install the Python packages and import them.
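As a minimal sketch, the check below reports which packages still need installing before you import them. The package names cohere and sagemaker are assumptions inferred from the steps that follow, and missing_packages is a hypothetical helper, not part of either library.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package names; adjust them to match the companion notebook.
required = ["cohere", "sagemaker"]
missing = missing_packages(required)
if missing:
    print("Install first, e.g.: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```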
Make sure you have access to the resources in your AWS account. For example, you can configure an AWS profile with the command aws configure sso and then set the environment variable AWS_PROFILE to your profile name.
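In a notebook or script, setting the variable from Python could look like the sketch below. The profile name is a placeholder, not a real profile.

```python
import os

# Placeholder: replace with the profile name you configured via `aws configure sso`.
aws_profile = "<aws_profile>"

# AWS SDKs and the AWS CLI read AWS_PROFILE to pick the credentials profile.
os.environ["AWS_PROFILE"] = aws_profile
```

Set the variable before creating any AWS client or session, since the profile is read when the session is created.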
Finally, set all of the following variables using your own information. Do not add a trailing slash to the paths, because some steps may not work correctly if you do. You can use either ml.p4de.24xlarge or ml.p5.48xlarge as the instance_type for Cohere Bring Your Own Fine-tuning, but the instance_type used for export and for inference (endpoint creation) must be identical.
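One way to guard against both pitfalls is a small validation step, sketched below. Apart from s3_output_dir, export_name, and instance_type, the variable names and all values are placeholders assumed for illustration, and normalize_s3_path and check_instance_types are hypothetical helpers.

```python
ALLOWED_INSTANCE_TYPES = {"ml.p4de.24xlarge", "ml.p5.48xlarge"}


def normalize_s3_path(path: str) -> str:
    """Strip trailing slashes so later joins like f"{path}/{name}" stay clean."""
    return path.rstrip("/")


def check_instance_types(export_instance_type: str, inference_instance_type: str) -> str:
    """Validate the instance type and require export and inference to match."""
    if export_instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(f"Unsupported instance type: {export_instance_type}")
    if export_instance_type != inference_instance_type:
        raise ValueError("Export and inference must use the same instance type")
    return export_instance_type


# Placeholder values; replace them with your own.
s3_checkpoint_dir = normalize_s3_path("s3://my-bucket/checkpoints/")  # assumed variable name
s3_output_dir = normalize_s3_path("s3://my-bucket/exports/")
export_name = "my-command-export"
instance_type = check_instance_types("ml.p4de.24xlarge", "ml.p4de.24xlarge")
```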
If you used HuggingFace’s PEFT to finetune Cohere Command and have the adapter weights, merge them into the base model weights to obtain the merged weights. Skip this step if you already have the merged weights.
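A minimal sketch of the merge, assuming the transformers and peft packages are installed and that the base checkpoint loads through AutoModelForCausalLM. The function name and its arguments are illustrative, and the heavy imports sit inside the function so that defining it does not load the libraries.

```python
def merge_adapter_weights(base_model_name_or_path: str, adapter_dir: str, output_dir: str) -> None:
    """Merge PEFT (e.g. LoRA) adapter weights into the base model and save the result."""
    # Imported lazily: these packages are large and only needed when merging.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model = AutoModelForCausalLM.from_pretrained(base_model_name_or_path)
    # Attach the adapter to the base model, then fold it into the base weights.
    peft_model = PeftModel.from_pretrained(base_model, adapter_dir)
    merged_model = peft_model.merge_and_unload()

    # Save the merged weights as safetensors, plus the tokenizer files
    # (saving the tokenizer alongside is an assumption about what later steps expect).
    merged_model.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_model_name_or_path).save_pretrained(output_dir)


# Usage (placeholder paths):
#   merge_adapter_weights("<base_model_dir>", "<adapter_weights_dir>", "<merged_weights_dir>")
```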
Create a Cohere client and use it to export the merged weights to the TensorRT-LLM inference engine. The exported TensorRT-LLM engine will be stored in S3 as the tar file {s3_output_dir}/{export_name}.tar.gz, so the file name matches export_name.
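The naming rule can be sketched as a one-line helper. exported_engine_uri is a hypothetical name and the bucket and export name are placeholders; the sketch only computes where the archive is expected to land and does not call the Cohere client.

```python
def exported_engine_uri(s3_output_dir: str, export_name: str) -> str:
    """Return the expected S3 URI of the exported TensorRT-LLM engine archive."""
    # Dropping a trailing slash avoids a double slash in the resulting URI.
    return f"{s3_output_dir.rstrip('/')}/{export_name}.tar.gz"


print(exported_engine_uri("s3://my-bucket/exports", "my-command-export"))
# prints: s3://my-bucket/exports/my-command-export.tar.gz
```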
The Cohere client provides a built-in method to create an endpoint for inference, which will automatically deploy the model from the TensorRT-LLM engine you just exported.
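To confirm that a deployment has finished, you can also query the endpoint status through the AWS SDK. The sketch below uses the standard SageMaker describe_endpoint call via boto3 rather than any method of the Cohere client; endpoint_status is a hypothetical helper and the endpoint name is a placeholder.

```python
def endpoint_status(sagemaker_client, endpoint_name: str) -> str:
    """Return the SageMaker status string for an endpoint, e.g. "InService"."""
    response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   status = endpoint_status(boto3.client("sagemaker"), "<endpoint_name>")
#   print(status)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no AWS credentials are available.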
Now, you can perform real-time inference by calling the endpoint you just deployed.
You can also evaluate your finetuned model using an evaluation dataset. The following is an example with the ScienceQA evaluation using these data:
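Scoring such an evaluation can be sketched as a plain exact-match accuracy. The record fields question and answer, the toy records, and the stand-in predictor are all assumptions made for illustration; they are not taken from ScienceQA, and in practice predict would send the question to your endpoint.

```python
from typing import Callable, Iterable, Mapping


def accuracy(examples: Iterable[Mapping[str, str]], predict: Callable[[str], str]) -> float:
    """Fraction of examples whose predicted answer exactly matches the reference."""
    total = 0
    correct = 0
    for example in examples:
        # Compare case-insensitively and ignore surrounding whitespace.
        prediction = predict(example["question"]).strip().lower()
        reference = example["answer"].strip().lower()
        correct += int(prediction == reference)
        total += 1
    return correct / total if total else 0.0


# Toy records and a stand-in predictor that always answers "B".
toy_examples = [
    {"question": "Which is a mammal? (A) trout (B) whale", "answer": "B"},
    {"question": "Which is a solid? (A) ice (B) steam", "answer": "A"},
]
print(accuracy(toy_examples, lambda question: "B"))  # prints 0.5
```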
After you have successfully performed inference, you can delete the deployed endpoint to avoid ongoing charges.
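One way to clean up directly through the AWS SDK is sketched below. It uses the standard SageMaker calls describe_endpoint, delete_endpoint, and delete_endpoint_config via boto3 rather than any method of the Cohere client; the helper name and endpoint name are placeholders.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name: str, delete_config: bool = True) -> None:
    """Delete an endpoint and, optionally, the endpoint config it was created from."""
    # Look up the config name before the endpoint disappears.
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    if delete_config:
        sagemaker_client.delete_endpoint_config(
            EndpointConfigName=description["EndpointConfigName"]
        )


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   delete_endpoint_resources(boto3.client("sagemaker"), "<endpoint_name>")
```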
If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any deployable models created from the model package or using the algorithm.
Note: You can find this information by looking at the container name associated with the model.
Here’s how you do that: