Use `ane_transformers` as a reference PyTorch implementation if you are considering deploying your Transformer models on Apple devices with an A14 or newer or an M1 or newer chip, to achieve up to 10 times faster inference and up to 14 times lower peak memory consumption compared to baseline implementations.
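Assuming the package is published on PyPI under the same name, it can be installed with `pip install ane_transformers`.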
`ane_transformers.reference` comprises a standalone reference implementation, and `ane_transformers.huggingface` comprises optimized versions of Hugging Face model classes such as `distilbert`, demonstrating how the optimization principles laid out in our research article apply to existing third-party implementations.
Please check out our research article for a detailed explanation of the optimizations, as well as interactive figures for exploring the latency and peak memory consumption data from our case study: deploying the Hugging Face distilbert model on various devices and operating system versions. The figures below are non-interactive snapshots from the research article for an iPhone 13 running iOS 16.0:
Tutorial: Optimized Deployment of Hugging Face distilbert
This tutorial is a step-by-step guide to the model deployment process from the case study in our research article. The same code is used to generate the Hugging Face distilbert performance data in the figures above.
In order to begin the optimizations, we initialize the baseline model as follows:
```python
import transformers

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

baseline_model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name,
    return_dict=False,
    torchscript=True,
).eval()
```
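As a quick sanity check before any optimization (a sketch added here, not part of the original tutorial), the baseline model can be run directly; with `torchscript=True` and `return_dict=False` it returns a tuple whose first element is the logits:

```python
import torch

# Hedged example: run the baseline sentiment classifier on one sentence.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
sample = tokenizer("Sample input text", return_tensors="pt")

with torch.no_grad():
    logits = baseline_model(sample["input_ids"], sample["attention_mask"])[0]

# For this SST-2 checkpoint, column 0 is NEGATIVE and column 1 is POSITIVE.
print(logits.softmax(dim=-1))
```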
Then we initialize the mathematically equivalent but optimized model, and we restore its parameters using those of the baseline model:
```python
from ane_transformers.huggingface import distilbert as ane_distilbert

optimized_model = ane_distilbert.DistilBertForSequenceClassification(
    baseline_model.config).eval()
optimized_model.load_state_dict(baseline_model.state_dict())
```
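Since the two models are mathematically equivalent, it is worth confirming numerically that the copied weights reproduce the baseline outputs. A minimal sketch, assuming both models accept `(input_ids, attention_mask)` positionally and that any difference in output layout is absorbed by flattening:

```python
import torch

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
sample = tokenizer("Sample input text", return_tensors="pt")

with torch.no_grad():
    ref = baseline_model(sample["input_ids"], sample["attention_mask"])[0]
    opt = optimized_model(sample["input_ids"], sample["attention_mask"])[0]

# The optimized model may use a different tensor layout, so compare flattened values.
max_diff = (ref.flatten() - opt.flatten()).abs().max().item()
print(f"max abs diff: {max_diff:.2e}")  # expect only small numerical error
```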
Next we create sample inputs for the model:
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

tokenized = tokenizer(
    ["Sample input text to trace the model"],
    return_tensors="pt",
    max_length=128,  # token sequence length
    padding="max_length",
)
```
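The tokenizer returns a dict-like object; with the settings above, each entry is a `(1, 128)` integer tensor. A quick way to confirm the names and shapes that the conversion step below relies on:

```python
# Inspect the tokenizer outputs that will become the Core ML model inputs.
for name, tensor in tokenized.items():
    print(name, tuple(tensor.shape), tensor.dtype)
# Expected (assuming default tokenizer outputs):
# input_ids (1, 128) torch.int64
# attention_mask (1, 128) torch.int64
```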
We then trace the optimized model to obtain TorchScript, the input format expected by the coremltools conversion tool:
```python
import torch

traced_optimized_model = torch.jit.trace(
    optimized_model,
    (tokenized["input_ids"], tokenized["attention_mask"]),
)
```
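As an optional check (not in the original tutorial), the traced module can be invoked like the eager model to confirm that tracing preserved the outputs:

```python
with torch.no_grad():
    eager_out = optimized_model(tokenized["input_ids"], tokenized["attention_mask"])[0]
    traced_out = traced_optimized_model(tokenized["input_ids"], tokenized["attention_mask"])[0]

print(torch.allclose(eager_out, traced_out))  # tracing should not change the outputs
```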
Finally, we use coremltools to generate the Core ML model package file and save it.
```python
import coremltools as ct
import numpy as np

ane_mlpackage_obj = ct.convert(
    traced_optimized_model,
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(
            f"input_{name}",
            shape=tensor.shape,
            dtype=np.int32,
        ) for name, tensor in tokenized.items()
    ],
    compute_units=ct.ComputeUnit.ALL,
)

out_path = "HuggingFace_ane_transformers_distilbert_seqLen128_batchSize1.mlpackage"
ane_mlpackage_obj.save(out_path)
```
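Before moving to Xcode, the converted package can also be exercised from Python on macOS. A sketch, assuming the input names follow the `input_{name}` pattern used during conversion above:

```python
# Hedged example: run a Core ML prediction from Python (requires macOS).
mlmodel = ct.models.MLModel(out_path)

coreml_out = mlmodel.predict({
    f"input_{name}": tensor.numpy().astype(np.int32)
    for name, tensor in tokenized.items()
})
print(coreml_out.keys())
```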
To verify performance, developers can now launch Xcode and simply add this model package file as a resource in their projects.