Deploy your finetuned model on AWS Marketplace
Deploy Your Own Finetuned Command-R-0824 Model from AWS Marketplace
This sample notebook shows you how to deploy your own finetuned HuggingFace Command-R model (CohereForAI/c4ai-command-r-08-2024) using Amazon SageMaker. More specifically, assuming you already have the adapter weights or merged weights from your own finetuning of CohereForAI/c4ai-command-r-08-2024, we will show you how to:
- Merge the adapter weights into the base model weights, if you bring only the adapter weights
- Export the merged weights to the TensorRT-LLM inference engine using Amazon SageMaker
- Deploy the engine as a SageMaker endpoint to serve your business use cases
Note: This is a reference notebook, and it cannot run unless you make the changes suggested in the notebook.
Pre-requisites:
- Note: This notebook contains elements which render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
- Ensure that the IAM role used has AmazonSageMakerFullAccess
- To deploy this ML model successfully, ensure that:
- Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used:
- aws-marketplace:ViewSubscriptions
- aws-marketplace:Unsubscribe
- aws-marketplace:Subscribe
- or your AWS account has a subscription to the packages for Cohere Bring Your Own Fine-tuning. If so, skip the step Subscribe to the bring your own finetuning algorithm
Contents:
- Subscribe to the bring your own finetuning algorithm
- Preliminary setup
- Get the merged weights
- Upload the merged weights to S3
- Export the merged weights to the TensorRT-LLM inference engine
- Create an endpoint for inference from the exported engine
- Perform real-time inference by calling the endpoint
- Delete the endpoint (optional)
- Unsubscribe from the listing (optional)
Usage instructions:
You can run this notebook one cell at a time (by pressing Shift+Enter to run a cell).
1. Subscribe to the bring your own finetuning algorithm
To subscribe to the algorithm:
- Open the algorithm listing page Cohere Bring Your Own Fine-tuning.
- On the AWS Marketplace listing, click on the Continue to Subscribe button.
- On the Subscribe to this software page, review and click on “Accept Offer” if you and your organization agree with the EULA, pricing, and support terms. On the “Configure and launch” page, make sure the ARN displayed for your region matches the ARN you will use below.
2. Preliminary setup
Install the Python packages you will use below and import them. For example, you can run the command below to install cohere if you haven’t done so.
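For reference, a minimal setup cell might look like the following. The package list is illustrative: peft, transformers, and torch are only needed if you merge the adapter weights yourself.

```python
# Install the packages used in this notebook (uncomment on first run).
# %pip install cohere boto3 peft transformers torch

import os

import boto3
import cohere
```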
Make sure you have access to the resources in your AWS account. For example, you can configure an AWS profile by the command aws configure sso (see here) and run the command below to set the environment variable AWS_PROFILE as your profile name.
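For example, assuming you created a profile with aws configure sso (the profile name below is a placeholder):

```python
import os

# Use the AWS profile you configured with `aws configure sso`.
# "my-profile" is a placeholder; replace it with your own profile name.
os.environ["AWS_PROFILE"] = "my-profile"
```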
Finally, you need to set all the following variables using your own information. In general, do not add a trailing slash to these paths (otherwise some parts won’t work). You can use either ml.p4de.24xlarge or ml.p5.48xlarge as the instance_type for Cohere Bring Your Own Fine-tuning, but the instance_type used for export and inference (endpoint creation) must be identical.
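A sketch of such a variable cell; every value below is a placeholder you must replace with your own information:

```python
# All values below are placeholders; replace them with your own information.
# Do not add trailing slashes to the S3 paths.
region = "us-east-1"

# ARN of the Cohere Bring Your Own Fine-tuning algorithm, as displayed on the
# "Configure and launch" page for your region.
arn = "arn:aws:sagemaker:us-east-1:111111111111:algorithm/cohere-byof-example"

# Local directory holding the merged weights, and the S3 locations used below.
merged_weights_dir = "merged_weights"
s3_checkpoint_dir = "s3://your-bucket/checkpoint"
s3_output_dir = "s3://your-bucket/output"

export_name = "my-command-r-finetune"
endpoint_name = "my-command-r-endpoint"

# Must be ml.p4de.24xlarge or ml.p5.48xlarge, and identical for export
# and inference (endpoint creation).
instance_type = "ml.p4de.24xlarge"
```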
3. Get the merged weights
Assuming you use HuggingFace’s PEFT to finetune CohereForAI/c4ai-command-r-08-2024 and obtain the adapter weights, you can then merge the adapter weights into the base model weights to get the merged weights, as shown below. Skip this step if you already have the merged weights.
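A minimal sketch using PEFT’s merge_and_unload; the adapter path is a placeholder, and merging requires enough memory to hold the full model:

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_dir = "path/to/your/adapter"  # placeholder: your PEFT adapter weights

# Load the base model together with the adapter, then fold the adapter
# weights into the base weights.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype=torch.float16)
merged_model = model.merge_and_unload()

# Save the merged weights (and tokenizer) for upload to S3.
# merged_weights_dir was defined in the setup cell above.
merged_model.save_pretrained(merged_weights_dir)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-08-2024")
tokenizer.save_pretrained(merged_weights_dir)
```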
4. Upload the merged weights to S3
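Upload the merged weights to the S3 checkpoint location you set above. For example, with boto3 (bucket name and key prefix are placeholders; the AWS CLI’s aws s3 sync works equally well):

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"  # placeholder: must match s3_checkpoint_dir above
prefix = "checkpoint"   # placeholder: must match s3_checkpoint_dir above

# Upload every file in the merged-weights directory, preserving relative paths.
for path in Path(merged_weights_dir).rglob("*"):
    if path.is_file():
        key = f"{prefix}/{path.relative_to(merged_weights_dir).as_posix()}"
        s3.upload_file(str(path), bucket, key)
```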
5. Export the merged weights to the TensorRT-LLM inference engine
Create a Cohere client and use it to export the merged weights to the TensorRT-LLM inference engine. The exported TensorRT-LLM engine will be stored as a tar file {s3_output_dir}/{export_name}.tar.gz in S3, where the file name is the same as the export_name.
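A sketch following the pattern of Cohere’s SageMaker examples; the client and method names below are assumptions based on the cohere SDK’s SagemakerClient and may differ in your SDK version:

```python
import cohere

co = cohere.SagemakerClient(aws_region=region)

# Export the merged weights in s3_checkpoint_dir to a TensorRT-LLM engine.
# The result is stored as {s3_output_dir}/{export_name}.tar.gz.
co.sagemaker_finetuning.export_finetune(
    arn=arn,
    name=export_name,
    s3_checkpoint_dir=s3_checkpoint_dir,
    s3_output_dir=s3_output_dir,
    instance_type=instance_type,
    role="ServiceRoleSagemaker",  # placeholder: an IAM role SageMaker can assume
)
```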
6. Create an endpoint for inference from the exported engine
The Cohere client provides a built-in method to create an endpoint for inference, which will automatically deploy the model from the TensorRT-LLM engine you just exported.
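Continuing the sketch above (again, method names and arguments are assumptions based on Cohere’s SageMaker examples; check your installed SDK version):

```python
# Deploy the exported engine as a SageMaker endpoint, reusing the
# Cohere client `co` created in the export step.
co.sagemaker_finetuning.create_endpoint(
    arn=arn,
    endpoint_name=endpoint_name,
    s3_models_dir=s3_output_dir,
    instance_type=instance_type,
    recreate=True,
    role="ServiceRoleSagemaker",  # placeholder IAM role
)

# Attach the client to the endpoint you just created.
co.sagemaker_finetuning.connect_to_endpoint(endpoint_name=endpoint_name)
```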
7. Perform real-time inference by calling the endpoint
Now, you can perform real-time inference by calling the endpoint you just deployed.
You can also evaluate your finetuned model using an evaluation dataset. The following is an example with the ScienceQA evaluation data, available here.
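A minimal chat call against the connected endpoint; the prompt is illustrative, and the chat method and response shape are assumptions based on Cohere’s SageMaker examples:

```python
# The message below is illustrative; replace it with a prompt from your use case.
message = "What is the capital of France?"

# `chat` follows the pattern in Cohere's SageMaker examples (an assumption;
# check your installed SDK version for the exact method and response shape).
response = co.sagemaker_finetuning.chat(message=message)
print(response)
```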
8. Delete the endpoint (optional)
After you have successfully performed inference, you can delete the deployed endpoint to avoid being charged continuously.
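For example, with plain boto3 (which avoids depending on a specific Cohere SDK helper):

```python
import boto3

sm = boto3.client("sagemaker", region_name=region)

# Deleting the endpoint stops the billed instance. The associated endpoint
# config and model can also be removed from the SageMaker console if desired.
sm.delete_endpoint(EndpointName=endpoint_name)
```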
9. Unsubscribe from the listing (optional)
If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any deployable models created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.
Steps to unsubscribe from the product on AWS Marketplace:
- Navigate to the Machine Learning tab on the Your Software subscriptions page
- Locate the listing that you want to cancel the subscription for, and then choose Cancel Subscription