
TensorFlow Lite for Mobile AI Agents: On-Device Intelligence

Master TensorFlow Lite for deploying AI agent models on Android and iOS devices, including model conversion, quantization strategies, and real-world integration patterns.

Why TensorFlow Lite for Mobile Agents

TensorFlow Lite (TFLite) is Google's framework for running machine learning models on mobile and embedded devices. It provides a much smaller binary footprint than full TensorFlow, plus hardware-accelerated inference via the GPU delegate and NNAPI on Android and the Core ML delegate on iOS.

For mobile AI agents, TFLite is the practical choice when you need to run intent classification, entity extraction, text embedding, or small generative models directly on the phone — without network calls and with full offline capability.

Converting a Model to TFLite

Start with a trained Keras model and convert it to the TFLite flatbuffer format:

import tensorflow as tf

# Load or create your model
model = tf.keras.models.load_model("intent_model.keras")

# Basic conversion
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("intent_model.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Model size: {len(tflite_model) / 1024 / 1024:.2f} MB")
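After conversion, it is worth confirming the flatbuffer's input and output signatures before shipping it. The sketch below uses a tiny stand-in Keras model (hypothetical shapes, not the article's intent model) since any converted model can be inspected the same way:

```python
import tensorflow as tf

# Tiny stand-in model (hypothetical); substitute your real agent model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(5),
])
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Load the flatbuffer from memory and inspect its tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
for detail in interpreter.get_input_details():
    print("input:", detail["shape"], detail["dtype"])
for detail in interpreter.get_output_details():
    print("output:", detail["shape"], detail["dtype"])
```

Catching a shape or dtype mismatch here is much cheaper than debugging it inside an Android or iOS build.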

Quantization Strategies

Quantization reduces model size and increases inference speed by converting 32-bit floating point weights to lower precision. TFLite offers three levels:

Dynamic Range Quantization

The simplest approach — quantizes weights to 8-bit integers at conversion time:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

# Typically reduces model size by 4x
with open("intent_model_dynamic.tflite", "wb") as f:
    f.write(quantized_model)

Full Integer Quantization

Both weights and activations are quantized. Requires a representative dataset:


import numpy as np

def representative_dataset():
    """Yield samples that represent typical inference inputs."""
    # Random token IDs are shown here for illustration only; in practice,
    # yield real tokenized utterances so calibration matches production data.
    for _ in range(100):
        sample = np.random.randint(0, 30000, size=(1, 64)).astype(np.int32)
        yield [sample]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8,
]
# Note: these settings only affect float input/output tensors; integer
# token-ID inputs remain int32 after conversion.
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

full_int_model = converter.convert()

Float16 Quantization

A middle ground — smaller than FP32, more accurate than INT8:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
fp16_model = converter.convert()

Size and Speed Comparison

For a DistilBERT-based intent classifier:

Method          | Size   | Latency (Pixel 8) | Accuracy Drop
FP32 (no quant) | 256 MB | 45 ms             | baseline
Dynamic INT8    | 64 MB  | 28 ms             | < 0.5%
Full INT8       | 64 MB  | 18 ms             | 1-2%
Float16         | 128 MB | 32 ms             | < 0.1%
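Numbers like these are straightforward to reproduce for your own model. The sketch below (using a tiny stand-in Keras model with hypothetical shapes) converts it with and without dynamic range quantization, then compares flatbuffer size and mean interpreter latency; exact figures will depend on your model and hardware:

```python
import time
import numpy as np
import tensorflow as tf

# Tiny stand-in model (hypothetical); substitute your real agent model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(5),
])

def convert(optimize: bool) -> bytes:
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    if optimize:
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()

def mean_latency_ms(tflite_bytes: bytes, runs: int = 50) -> float:
    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    interpreter.set_tensor(inp["index"], np.random.rand(1, 64).astype(np.float32))
    interpreter.invoke()  # warm-up before timing
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000

fp32 = convert(optimize=False)
int8 = convert(optimize=True)
print(f"FP32: {len(fp32) / 1024:.1f} KB, {mean_latency_ms(fp32):.2f} ms/call")
print(f"INT8: {len(int8) / 1024:.1f} KB, {mean_latency_ms(int8):.2f} ms/call")
```

For real benchmarking on-device rather than on your workstation, measurements should be taken on the target phone, since desktop CPUs behave very differently from mobile SoCs.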

Running Inference in Python

Use the TFLite interpreter for testing before mobile deployment:

import numpy as np
import tensorflow as tf

class TFLiteAgentClassifier:
    INTENTS = ["greeting", "booking", "cancellation", "inquiry", "complaint"]

    def __init__(self, model_path: str):
        self.interpreter = tf.lite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()

    def classify(self, token_ids: np.ndarray) -> dict:
        # Ensure correct shape and type
        input_data = token_ids.astype(self.input_details[0]["dtype"])
        self.interpreter.set_tensor(self.input_details[0]["index"], input_data)
        self.interpreter.invoke()

        output = self.interpreter.get_tensor(self.output_details[0]["index"])
        probs = self._softmax(output[0])
        top_idx = int(np.argmax(probs))

        return {
            "intent": self.INTENTS[top_idx],
            "confidence": float(probs[top_idx]),
        }

    @staticmethod
    def _softmax(x):
        e_x = np.exp(x - np.max(x))
        return e_x / e_x.sum()

# Usage
classifier = TFLiteAgentClassifier("intent_model_dynamic.tflite")
tokens = np.array([[101, 2054, 2003, 1996, 3452, 102] + [0] * 58], dtype=np.int32)
result = classifier.classify(tokens)
print(result)  # e.g. {"intent": "inquiry", "confidence": 0.91}

Android Integration (Kotlin)

// build.gradle
// implementation("org.tensorflow:tensorflow-lite:2.16.1")
// implementation("org.tensorflow:tensorflow-lite-support:0.4.4")

import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.support.common.FileUtil
import java.nio.ByteBuffer
import java.nio.ByteOrder

// In your agent service; FileUtil (from the support library) memory-maps
// the model from the app's assets folder, given a Context.
val model = Interpreter(FileUtil.loadMappedFile(context, "intent_model.tflite"))

// 64 int32 token IDs at 4 bytes each
val inputBuffer = ByteBuffer.allocateDirect(4 * 64).order(ByteOrder.nativeOrder())
// ... fill buffer with tokenized input
val output = Array(1) { FloatArray(5) }  // 5 intent classes
model.run(inputBuffer, output)

iOS Integration (Swift)

import TensorFlowLite

let interpreter = try Interpreter(modelPath: "intent_model.tflite")
try interpreter.allocateTensors()

// Copy tokenized input to input tensor
var inputData = Data(/* tokenized bytes */)
try interpreter.copy(inputData, toInputAt: 0)
try interpreter.invoke()

let outputTensor = try interpreter.output(at: 0)
// Parse output probabilities

FAQ

How do I choose between dynamic and full integer quantization?

Use dynamic range quantization as your default — it gives 4x size reduction with minimal accuracy loss and requires no calibration data. Switch to full integer quantization only when you need maximum speed on hardware with dedicated INT8 accelerators (like the Hexagon DSP on Qualcomm chips or the Edge TPU), and you can afford the 1 to 2 percent accuracy drop.

Can TFLite run transformer models on mobile?

Yes, but with constraints. Models up to about 100 million parameters (like DistilBERT or MobileBERT) run well on modern phones. Larger models require aggressive quantization or distillation. Google's Gemma 2B can run on high-end phones using TFLite with 4-bit quantization, but inference takes 200 to 500 milliseconds per token.

What is the minimum Android and iOS version for TFLite?

TFLite supports Android API 21 (Android 5.0) and iOS 12 or later. GPU delegate requires Android API 26 and iOS 12 with Metal support. For NNAPI delegate (Android only), API 27 is the minimum, though API 30 or later provides the best hardware acceleration coverage.


#TensorFlowLite #MobileAI #OnDeviceAI #Quantization #Android #IOS #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
