TensorFlow Lite for Mobile AI Agents: On-Device Intelligence
Master TensorFlow Lite for deploying AI agent models on Android and iOS devices, including model conversion, quantization strategies, and real-world integration patterns.
Why TensorFlow Lite for Mobile Agents
TensorFlow Lite (TFLite) is Google's framework for running machine learning models on mobile and embedded devices. It provides a much smaller binary footprint than full TensorFlow, plus hardware-accelerated inference via the GPU delegate, the NNAPI delegate on Android, and the Core ML delegate on iOS.
For mobile AI agents, TFLite is the practical choice when you need to run intent classification, entity extraction, text embedding, or small generative models directly on the phone — without network calls and with full offline capability.
Converting a Model to TFLite
Start with a trained Keras model and convert it to the TFLite flatbuffer format:
```python
import tensorflow as tf

# Load or create your model
model = tf.keras.models.load_model("intent_model.keras")

# Basic conversion
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("intent_model.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Model size: {len(tflite_model) / 1024 / 1024:.2f} MB")
```
Quantization Strategies
Quantization reduces model size and increases inference speed by converting 32-bit floating point weights to lower precision. TFLite offers three levels:
Dynamic Range Quantization
The simplest approach — quantizes weights to 8-bit integers at conversion time:
```python
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

# Typically reduces model size by ~4x
with open("intent_model_dynamic.tflite", "wb") as f:
    f.write(quantized_model)
```
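Under the hood, this maps each float32 weight tensor to int8 values plus a scale factor. A simplified NumPy sketch of symmetric per-tensor quantization (the converter actually quantizes per-channel for many ops, but the arithmetic is the same):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 64).astype(np.float32)
q, scale = quantize_int8(w)

print(f"fp32 bytes: {w.nbytes}, int8 bytes: {q.nbytes}")  # 4x smaller
print(f"max abs error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```

The maximum reconstruction error is bounded by half a quantization step (`scale / 2`), which is where the small accuracy drop in the table below comes from.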
Full Integer Quantization
Both weights and activations are quantized. Requires a representative dataset:
```python
import numpy as np

def representative_dataset():
    """Yield samples that represent typical inference inputs.

    For real calibration, yield tokenized utterances from your
    training data rather than random IDs.
    """
    for _ in range(100):
        sample = np.random.randint(0, 30000, size=(1, 64)).astype(np.int32)
        yield [sample]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8,
]
# For models with float inputs/outputs you can additionally force them to int8:
#   converter.inference_input_type = tf.int8
#   converter.inference_output_type = tf.int8
# Integer token-ID inputs (like this model's int32 IDs) keep their dtype.
full_int_model = converter.convert()
```
Float16 Quantization
A middle ground — smaller than FP32, more accurate than INT8:
```python
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()
```
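The trade-off is easy to see directly in NumPy: casting weights to float16 halves storage while keeping roughly three significant decimal digits of precision, which is why the accuracy impact is usually negligible:

```python
import numpy as np

w = np.random.randn(1000).astype(np.float32)
w16 = w.astype(np.float16)

print(f"fp32 bytes: {w.nbytes}, fp16 bytes: {w16.nbytes}")  # 2x smaller

# Round-trip error relative to the largest weight magnitude
rel_err = np.abs(w - w16.astype(np.float32)) / np.abs(w).max()
print(f"max relative error: {rel_err.max():.2e}")  # on the order of 1e-4
```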
Size and Speed Comparison
For a DistilBERT-based intent classifier:
| Method | Size | Latency (Pixel 8) | Accuracy Drop |
|---|---|---|---|
| FP32 (no quant) | 256 MB | 45 ms | Baseline |
| Dynamic INT8 | 64 MB | 28 ms | < 0.5% |
| Full INT8 | 64 MB | 18 ms | 1 - 2% |
| Float16 | 128 MB | 32 ms | < 0.1% |
Running Inference in Python
Use the TFLite interpreter for testing before mobile deployment:
```python
import numpy as np
import tensorflow as tf

class TFLiteAgentClassifier:
    INTENTS = ["greeting", "booking", "cancellation", "inquiry", "complaint"]

    def __init__(self, model_path: str):
        self.interpreter = tf.lite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()

    def classify(self, token_ids: np.ndarray) -> dict:
        # Ensure correct shape and type
        input_data = token_ids.astype(self.input_details[0]["dtype"])
        self.interpreter.set_tensor(self.input_details[0]["index"], input_data)
        self.interpreter.invoke()
        output = self.interpreter.get_tensor(self.output_details[0]["index"])
        probs = self._softmax(output[0])
        top_idx = int(np.argmax(probs))
        return {
            "intent": self.INTENTS[top_idx],
            "confidence": float(probs[top_idx]),
        }

    @staticmethod
    def _softmax(x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

# Usage
classifier = TFLiteAgentClassifier("intent_model_dynamic.tflite")
tokens = np.array([[101, 2054, 2003, 1996, 3452, 102] + [0] * 58], dtype=np.int32)
result = classifier.classify(tokens)
print(result)  # e.g. {'intent': 'inquiry', 'confidence': 0.91}
```
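The hand-written token array above matches the model's fixed 64-token input. In practice you would pad or truncate whatever your tokenizer produces to that length; a small helper sketch (the name `to_model_input`, the length 64, and pad ID 0 are assumptions matching the example above):

```python
import numpy as np

def to_model_input(token_ids, seq_len: int = 64, pad_id: int = 0) -> np.ndarray:
    """Pad or truncate token IDs to a fixed-length (1, seq_len) int32 batch."""
    ids = list(token_ids)[:seq_len]
    ids += [pad_id] * (seq_len - len(ids))
    return np.array([ids], dtype=np.int32)

tokens = to_model_input([101, 2054, 2003, 1996, 3452, 102])
print(tokens.shape)  # (1, 64)
```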
Android Integration (Kotlin)
```kotlin
// build.gradle
// implementation("org.tensorflow:tensorflow-lite:2.16.1")
// implementation("org.tensorflow:tensorflow-lite-support:0.4.4")

import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer
import java.nio.ByteOrder

// In your agent service; loadModelFile is your helper that
// loads (typically memory-maps) the model from app assets
val model = Interpreter(loadModelFile("intent_model.tflite"))

// 64 int32 token IDs at 4 bytes each
val inputBuffer = ByteBuffer.allocateDirect(4 * 64).order(ByteOrder.nativeOrder())
// ... fill buffer with tokenized input

val output = Array(1) { FloatArray(5) }  // 5 intent classes
model.run(inputBuffer, output)
```
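The `4 * 64` buffer size in the Kotlin snippet follows from the input layout: 64 int32 token IDs at 4 bytes each, in native byte order. A quick Python sanity check of that layout using the standard `struct` module:

```python
import struct

# 64 int32 token IDs, zero-padded -- the same layout
# the Android ByteBuffer above is filled with.
token_ids = [101, 2054, 2003, 1996, 3452, 102] + [0] * 58
packed = struct.pack(f"{len(token_ids)}i", *token_ids)  # native order

print(len(packed))  # 256 bytes = 4 * 64
```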
iOS Integration (Swift)
```swift
import TensorFlowLite

let interpreter = try Interpreter(modelPath: "intent_model.tflite")
try interpreter.allocateTensors()

// Copy tokenized input to input tensor
let inputData = Data(/* tokenized bytes */)
try interpreter.copy(inputData, toInputAt: 0)
try interpreter.invoke()

let outputTensor = try interpreter.output(at: 0)
// Parse output probabilities from outputTensor.data
```
FAQ
How do I choose between dynamic and full integer quantization?
Use dynamic range quantization as your default — it gives 4x size reduction with minimal accuracy loss and requires no calibration data. Switch to full integer quantization only when you need maximum speed on hardware with dedicated INT8 accelerators (like the Hexagon DSP on Qualcomm chips or the Edge TPU), and you can afford the 1 to 2 percent accuracy drop.
Can TFLite run transformer models on mobile?
Yes, but with constraints. Models up to about 100 million parameters (like DistilBERT or MobileBERT) run well on modern phones. Larger models require aggressive quantization or distillation. Google's Gemma 2B can run on high-end phones using TFLite with 4-bit quantization, but inference takes 200 to 500 milliseconds per token.
What is the minimum Android and iOS version for TFLite?
TFLite supports Android API 21 (Android 5.0) and iOS 12 or later. GPU delegate requires Android API 26 and iOS 12 with Metal support. For NNAPI delegate (Android only), API 27 is the minimum, though API 30 or later provides the best hardware acceleration coverage.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.