Deploy Finetuned Command Models from AWS Marketplace

This document shows you how to deploy your own finetuned HuggingFace Command-R model using Amazon SageMaker. More specifically, assuming you already have the adapter weights or merged weights from your own finetuned Command model, we will show you how to:

  • Merge the adapter weights with the weights of the base model if you only bring the adapter weights;
  • Export the merged weights to the TensorRT-LLM inference engine using Amazon SageMaker;
  • Deploy the engine as a SageMaker endpoint to serve your business use cases.

You can also find a companion notebook with working code samples.

Prerequisites

  • Ensure that the IAM role you use has the AmazonSageMakerFullAccess policy attached.
  • To deploy your model successfully, ensure that either:
    • Your IAM role has these three permissions, and you have authority to make AWS Marketplace subscriptions in the relevant AWS account:
      • aws-marketplace:ViewSubscriptions
      • aws-marketplace:Unsubscribe
      • aws-marketplace:Subscribe
    • Or, your AWS account has a subscription to the packages for Cohere Bring Your Own Fine-tuning. If so, you can skip the “subscribe to the bring your own finetuning algorithm” step below.
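
The three Marketplace permissions listed above could be expressed as an IAM policy document along the following lines. This is only a sketch: the single Allow statement, the "Resource": "*" scope, and the variable names are assumptions made for illustration, and your organization may require a narrower policy.

```python
import json

# The three AWS Marketplace actions named in the prerequisites.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]

# Assumption: one Allow statement covering all resources.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": MARKETPLACE_ACTIONS,
            "Resource": "*",
        }
    ],
}

# Serialize the policy so it can be pasted into the IAM console or a template.
policy_json = json.dumps(marketplace_policy, indent=2)
print(policy_json)
```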

NOTE: If you’re running the companion notebook, note that it contains elements that render correctly only in the Jupyter interface, so you should open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.

Step 1: Subscribe to the bring your own finetuning algorithm

To subscribe to the algorithm:

  • Open the algorithm listing page for Cohere Bring Your Own Fine-tuning.
  • On the AWS Marketplace listing, click on the Continue to Subscribe button.
  • On the Subscribe to this software page, review the terms and click on Accept Offer if you and your organization agree with the EULA, pricing, and support terms. On the Configure and launch page, make sure the ARN displayed for your region matches the ARN you will use below.

Step 2: Preliminary setup

First, let’s install the Python packages and import them.

pip install "cohere>=5.11.0"
PYTHON
import cohere
import os
import sagemaker as sage

from sagemaker.s3 import S3Uploader

Make sure you have access to the resources in your AWS account. For example, you can configure an AWS profile with the command aws configure sso (see here), and then run the code below to set the environment variable AWS_PROFILE to your profile name.

PYTHON
# Change "<aws_profile>" to your own AWS profile name
os.environ["AWS_PROFILE"] = "<aws_profile>"

Finally, you need to set all the following variables using your own information. It’s best not to add a trailing slash to these paths, as that could mean some parts won’t work correctly. You can use either ml.p4de.24xlarge or ml.p5.48xlarge as the instance_type for Cohere Bring Your Own Fine-tuning, but the instance_type used for export and inference (endpoint creation) must be identical.

PYTHON
# The AWS region
region = "<region>"

# Get the ARN of the bring your own finetuning algorithm by region
cohere_package = "cohere-command-r-v2-byoft-8370167e649c32a1a5f00267cd334c2c"
algorithm_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:algorithm/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:algorithm/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:algorithm/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:algorithm/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:algorithm/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:algorithm/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:algorithm/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:algorithm/{cohere_package}",
}
if region not in algorithm_map:
    raise Exception(f"Current region {region} is not supported.")
arn = algorithm_map[region]

# The local directory of your adapter weights. No need to specify this if you bring your own merged weights
adapter_weights_dir = "<adapter_weights_dir>"

# The local directory where you want to save the merged weights, or the local directory of your own merged weights if you bring them
merged_weights_dir = "<merged_weights_dir>"

# The S3 directory where you want to save the merged weights
s3_checkpoint_dir = "<s3_checkpoint_dir>"

# The S3 directory where you want to save the exported TensorRT-LLM engine. Make sure you do not reuse the same S3 directory across multiple runs
s3_output_dir = "<s3_output_dir>"

# The name of the export
export_name = "<export_name>"

# The name of the SageMaker endpoint
endpoint_name = "<endpoint_name>"

# The instance type for export and inference. Currently "ml.p4de.24xlarge" and "ml.p5.48xlarge" are supported
instance_type = "<instance_type>"
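
Because a trailing slash or an unsupported instance type may only surface as a failure later in the export, it can help to check the settings up front. The following is a minimal sketch; normalize_path and check_instance_type are hypothetical helpers introduced here, not part of the Cohere or SageMaker SDKs, and the example values are placeholders.

```python
# Instance types the text above lists as supported for export and inference.
SUPPORTED_INSTANCE_TYPES = {"ml.p4de.24xlarge", "ml.p5.48xlarge"}


def normalize_path(path: str) -> str:
    """Return the path with any trailing slashes removed (hypothetical helper)."""
    return path.rstrip("/")


def check_instance_type(instance_type: str) -> str:
    """Return the instance type unchanged, or raise if it is not in the supported set."""
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        supported = ", ".join(sorted(SUPPORTED_INSTANCE_TYPES))
        raise ValueError(
            f"Unsupported instance type {instance_type!r}; expected one of: {supported}"
        )
    return instance_type


# Example with placeholder values.
print(normalize_path("s3://my-bucket/checkpoints/"))
print(check_instance_type("ml.p5.48xlarge"))
```

Applying normalize_path to s3_checkpoint_dir, s3_output_dir, and the local directories before they are used keeps the later steps consistent with the no-trailing-slash advice.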

Step 3: Get the merged weights

Assuming you used HuggingFace’s PEFT to finetune Cohere Command and obtained adapter weights, you can merge the adapter weights into the base model weights to get the merged weights, as shown below. Skip this step if you already have the merged weights.

PYTHON
import os

import torch

from peft import PeftModel
from transformers import CohereForCausalLM


def load_and_merge_model(base_model_name_or_path: str, adapter_weights_dir: str):
    """
    Load the base model and the model finetuned by PEFT, and merge the adapter weights into the base weights to get a model with merged weights
    """
    base_model = CohereForCausalLM.from_pretrained(base_model_name_or_path)
    peft_model = PeftModel.from_pretrained(base_model, adapter_weights_dir)
    merged_model = peft_model.merge_and_unload()
    return merged_model


def save_hf_model(output_dir: str, model, tokenizer=None, args=None):
    """
    Save a HuggingFace model (and optionally tokenizer as well as additional args) to a local directory
    """
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir, state_dict=None, safe_serialization=True)
    if tokenizer is not None:
        tokenizer.save_pretrained(output_dir)
    if args is not None:
        torch.save(args, os.path.join(output_dir, "training_args.bin"))


# Get the merged model from adapter weights
merged_model = load_and_merge_model("CohereForAI/c4ai-command-r-08-2024", adapter_weights_dir)

# Save the merged weights to your local directory
save_hf_model(merged_weights_dir, merged_model)
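
Before uploading, a quick sanity check on the merged weights directory can catch an empty or incomplete save. The sketch below assumes the directory was written by save_pretrained with safe_serialization=True as above, so it should contain a config.json and at least one .safetensors file. looks_like_hf_checkpoint is a hypothetical helper, and the demonstration uses empty placeholder files rather than real weights.

```python
import tempfile
from pathlib import Path


def looks_like_hf_checkpoint(weights_dir: str) -> bool:
    """Return True if the directory has a config.json and at least one .safetensors file."""
    root = Path(weights_dir)
    has_config = (root / "config.json").is_file()
    has_weights = any(root.glob("*.safetensors"))
    return has_config and has_weights


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = looks_like_hf_checkpoint(tmp)
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"")
    filled_result = looks_like_hf_checkpoint(tmp)

print(empty_result, filled_result)
```

In practice you would call looks_like_hf_checkpoint(merged_weights_dir) and stop if it returns False. The check only looks at file names, so it does not validate the weights themselves.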

Step 4: Upload the merged weights to S3

PYTHON
sess = sage.Session()
merged_weights = S3Uploader.upload(merged_weights_dir, s3_checkpoint_dir, sagemaker_session=sess)
print("merged_weights", merged_weights)

Step 5: Export the merged weights to the TensorRT-LLM inference engine

Create a Cohere client and use it to export the merged weights to the TensorRT-LLM inference engine. The exported TensorRT-LLM engine will be stored in S3 as the tar file {s3_output_dir}/{export_name}.tar.gz, so the file name matches the export_name.

PYTHON
co = cohere.SagemakerClient(aws_region=region)
co.sagemaker_finetuning.export_finetune(
    arn=arn,
    name=export_name,
    s3_checkpoint_dir=s3_checkpoint_dir,
    s3_output_dir=s3_output_dir,
    instance_type=instance_type,
    role="ServiceRoleSagemaker",
)
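
To confirm where the export should land, the expected S3 location can be computed from the two settings, following the {s3_output_dir}/{export_name}.tar.gz pattern described above. This is a small sketch: expected_engine_uri is a hypothetical helper, and the bucket and export names are placeholders.

```python
def expected_engine_uri(s3_output_dir: str, export_name: str) -> str:
    """Build the S3 URI of the exported engine tarball from the output directory and export name."""
    # Strip trailing slashes so the result never contains a double slash.
    return f"{s3_output_dir.rstrip('/')}/{export_name}.tar.gz"


# Example with placeholder values.
engine_uri = expected_engine_uri("s3://my-bucket/exports", "my-export")
print(engine_uri)
```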

Step 6: Create an endpoint for inference from the exported engine

The Cohere client provides a built-in method to create an endpoint for inference, which will automatically deploy the model from the TensorRT-LLM engine you just exported.

PYTHON
co.sagemaker_finetuning.create_endpoint(
    arn=arn,
    endpoint_name=endpoint_name,
    s3_models_dir=s3_output_dir,
    recreate=True,
    instance_type=instance_type,
    role="ServiceRoleSagemaker",
)

Step 7: Perform real-time inference by calling the endpoint

Now, you can perform real-time inference by calling the endpoint you just deployed.

PYTHON
# If the endpoint is already deployed, you can directly connect to it
co.sagemaker_finetuning.connect_to_endpoint(endpoint_name=endpoint_name)

message = "Classify the following text as either very negative, negative, neutral, positive or very positive: mr. deeds is , as comedy goes , very silly -- and in the best way."
result = co.sagemaker_finetuning.chat(message=message)
print(result)

You can also evaluate your finetuned model using an evaluation dataset. The following is an example with the ScienceQA evaluation using these data:

PYTHON
import json
from tqdm import tqdm

eval_data_path = "<path_to_scienceQA_eval.jsonl>"

# Read all evaluation examples up front so the file is closed before inference starts
with open(eval_data_path) as f:
    lines = f.readlines()

total = 0
correct = 0
for line in tqdm(lines):
    total += 1
    question_answer_json = json.loads(line)
    # Each record holds the question as the first message and the reference answer as the second
    question = question_answer_json["messages"][0]["content"]
    answer = question_answer_json["messages"][1]["content"]
    model_ans = co.sagemaker_finetuning.chat(message=question, temperature=0).text
    if model_ans == answer:
        correct += 1

print(f"Accuracy of finetuned model is {correct / total:.3f}")
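
The loop above counts an answer as correct only when the model output equals the reference string exactly, so differences in case or stray whitespace count as errors. If that is too strict for your data, one possible variation is to normalize both strings before comparing. The sketch below is an assumption about what a looser comparison might look like, not part of the original evaluation. normalize_answer and is_match are hypothetical helpers, and using them changes the reported accuracy.

```python
def normalize_answer(text: str) -> str:
    """Trim the text, collapse runs of whitespace to single spaces, and casefold it."""
    return " ".join(text.split()).casefold()


def is_match(model_ans: str, answer: str) -> bool:
    """Compare two answers after normalization instead of by exact string equality."""
    return normalize_answer(model_ans) == normalize_answer(answer)


# Example with made-up strings.
print(is_match("  The answer is  B.\n", "the answer is B."))
print(is_match("The answer is A.", "The answer is B."))
```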

Step 8: Delete the endpoint (optional)

After you have finished running inference, you can delete the deployed endpoint to avoid ongoing charges.

PYTHON
co.sagemaker_finetuning.delete_endpoint()
co.sagemaker_finetuning.close()
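
If an error during inference would otherwise leave the endpoint running, one option is to wrap the work in a context manager that always runs the two cleanup calls shown above. This is a sketch under stated assumptions. endpoint_cleanup is a hypothetical helper that relies only on the object having delete_endpoint() and close() methods, and FakeClient is a stand-in so the example runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def endpoint_cleanup(finetuning_client):
    """Yield the client, then delete the endpoint and close the client even if an error occurs."""
    try:
        yield finetuning_client
    finally:
        try:
            finetuning_client.delete_endpoint()
        finally:
            # Close the client even if deleting the endpoint raised.
            finetuning_client.close()


class FakeClient:
    """Stand-in that records calls; in real use you would pass co.sagemaker_finetuning."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")

    def close(self):
        self.calls.append("close")


fake = FakeClient()
with endpoint_cleanup(fake) as client:
    pass  # run inference with the client here

print(fake.calls)
```

Only use this pattern when you really want the endpoint removed at the end of the block, since the cleanup runs unconditionally.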

Step 9: Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any deployable models created from the model package or using the algorithm.

Note: You can find this information by looking at the container name associated with the model.

Here’s how you do that:

  • Navigate to the Machine Learning tab on the Your Software subscriptions page;
  • Locate the listing whose subscription you want to cancel, and then choose Cancel Subscription.