Delivering AI at the Edge: Why TensorFlow Lite Micro on ESP32 Matters
Running machine learning models on $5 microcontrollers sounds impossible until you’ve done it. TensorFlow Lite Micro (TFLite Micro) is a runtime designed from the ground up for memory-constrained devices—it runs in as little as 16–256 KB of RAM and executes small neural networks in tens of milliseconds. The ESP32-S3, with its dual-core 240 MHz processor and optional AI accelerator instructions, is the de facto platform for production edge AI.
Architecture at a glance





The economics are compelling: a fully trained audio classifier running on an ESP32-S3 with a microphone and BLE radio costs under $5 in volume, consumes <1 mW in sleep mode, works offline, and latency is driven by physics (audio capture), not cloud round-trips. This guide walks you through the entire pipeline—from a trained TensorFlow model to an inference loop running live on hardware.
TL;DR
- Quantize your trained TensorFlow model to int8 post-training using tflite_convert with a representative dataset.
- Embed the quantized .tflite file in ESP32 flash as a C array using xxd.
- Set up ESP-IDF with TFLite Micro as a component; configure the tensor arena size (start at 60% of available SRAM).
- Preprocess sensor input (MFCC for audio, bilinear resize for vision) in a ring buffer to match model input shape.
- Run the inference loop in a FreeRTOS task with DMA paused; return to deep sleep after classification.
- Benchmark latency vs. accuracy on your target device; use ESP-NN vector intrinsics for 4× speedup on ESP32-S3.
- Monitor tensor arena overflow and adjust the allocator with tflite::MicroAllocator if needed.
Contents
- Key Concepts
- Architecture Overview: TFLite Micro on ESP32-S3
- Step 1–3: Model Preparation and Quantization
- Step 4–6: Integrating TFLite Micro into ESP-IDF
- Step 7–9: Sensor Input and Preprocessing
- Step 10–12: Inference Loop and Power Management
- Benchmarks: Model Size vs. Latency vs. Accuracy
- Edge Cases & Gotchas
- Full Implementation Code
- FAQ
- Where TinyML on MCUs Is Heading
- References
- Related Posts
Key Concepts
TensorFlow Lite Micro is a C++ inference engine designed for microcontrollers. Unlike the standard TensorFlow Lite (which requires 2–5 MB of runtime), TFLite Micro has zero dynamic allocation; all memory is pre-allocated in a fixed arena. This is essential for deterministic performance and crash prevention on devices with 256 KB of total SRAM.
Quantization (int8 or int16) shrinks model size by 4–10× and speeds up computation because integer math is cheaper than floating-point. Think of quantization as mapping a float range (e.g., 0.0–1.0) to an integer range (0–255); the mapping is learned during training or calibration on a representative dataset. int8 quantization loses 0.5–2% accuracy on average but is almost always worth the trade-off on MCUs.
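The mapping above can be sketched in a few lines of plain Python. This is a toy illustration of affine int8 quantization, not TensorFlow's actual implementation; the scale and zero-point values are example choices:

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Affine quantization: q = round(x / scale) + zero_point, clamped to int8."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

# Map the float range [0.0, 1.0] onto int8: scale = 1/255, zero_point = -128
scale, zp = 1.0 / 255, -128
q = quantize(0.5, scale, zp)
print(q, dequantize(q, scale, zp))  # round-trip error is at most scale/2
```

Calibration on a representative dataset is what picks scale and zero_point so that the real range of activations fits this mapping with minimal clipping.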
Tensor Arena is a pre-allocated memory buffer (typically 64–256 KB on ESP32) where all intermediate tensors live. The allocator assigns space within this arena; if you overflow, inference silently corrupts memory. You must calculate arena size by profiling your model: arena_size = peak_memory_usage_during_inference × 1.3.
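A quick back-of-envelope helper for that sizing rule (plain Python; the 41,200-byte peak is a made-up example, and the 16-byte alignment is a conservative assumption):

```python
def arena_size(peak_bytes: int, headroom: float = 1.3) -> int:
    """Apply the ~1.3x headroom rule and round up to a 16-byte boundary."""
    size = int(peak_bytes * headroom)
    return (size + 15) // 16 * 16

# Example: a profiled peak of 41,200 bytes during inference
print(arena_size(41200))  # 53568 bytes; round up to a comfortable 64 KB in practice
```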
Op Resolver maps operation names (e.g., “CONV_2D”, “FULLY_CONNECTED”) to kernel implementations. TFLite Micro provides built-in resolvers; you can add custom ops if needed.
Flatbuffer is the binary format for TensorFlow Lite models. It is zero-copy: the runtime reads directly from the buffer without deserializing, critical for RAM-constrained devices.
ESP-NN is a library of optimized kernels for ESP32 and ESP32-S3 using vector SIMD instructions. Swapping standard TFLite Micro kernels for ESP-NN versions can yield 2–4× speedup with no accuracy loss.
ESP-IDF is Espressif’s software development framework. You’ll organize TFLite Micro as a component (a reusable module) and link it into your main application.
Micro Allocator is TFLite Micro’s memory manager. You can inspect its state to debug arena overflow and optimize allocation.
Architecture Overview: TFLite Micro on ESP32-S3

The data flow is simple:
- Input from sensor (microphone or camera) arrives in a ring buffer.
- Preprocessing normalizes input to the model’s expected range and shape.
- Model file is loaded from flash into the tensor arena.
- Op resolver maps each operation to a kernel.
- ESP-NN SIMD accelerates convolution and fully-connected layers (if enabled).
- Output tensor is decoded (e.g., softmax for classification) and delivered to the application.
On ESP32-S3, this entire loop—from sensor sample to classification result—takes 40–100 ms for typical audio models and 30–80 ms for vision models, depending on model complexity and preprocessing overhead.
Step 1–3: Model Preparation and Quantization
Overview: Training-to-Deployment Pipeline

Step 1: Train Your Model in TensorFlow or Keras
Start with a standard Keras model. No special tricks here—the model should be small by design. Aim for:
– Audio: Conv1D/Conv2D + LSTM or attention, ~100 K parameters for keyword spotting.
– Vision: MobileNetV2 backbone, ~250 K parameters for image classification.
– Anomaly: Dense 3-layer network, ~10 K parameters for sensor anomaly detection.
Example training snippet (audio):
import tensorflow as tf
from tensorflow import keras

num_classes = 4  # e.g., "yes", "no", "silence", "unknown"

# Build a simple 1D CNN for keyword spotting
model = keras.Sequential([
    keras.layers.Input(shape=(40, 49)),  # 40 MFCC features × 49 frames
    keras.layers.Conv1D(32, 3, activation='relu', padding='same'),
    keras.layers.MaxPooling1D(2),
    keras.layers.Conv1D(64, 3, activation='relu', padding='same'),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(num_classes, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, validation_split=0.2, epochs=20)
Step 2: Prepare a Representative Dataset and Quantize Post-Training
Quantization maps float weights and activations to int8 using a representative dataset. Collect 100–500 examples of real or synthetic data—this teaches the quantizer the actual range of values your model will see.
import tensorflow as tf

# Load your trained model
model = tf.keras.models.load_model('model.h5')

# Prepare representative dataset (numpy arrays)
def representative_dataset():
    for i in range(500):
        # Load or generate a sample (e.g., MFCC features)
        sample = load_mfcc_sample(i)
        yield [tf.cast(sample, tf.float32)]

# Create converter with quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_quantized = converter.convert()

# Save the quantized model
with open('model_quantized.tflite', 'wb') as f:
    f.write(tflite_quantized)
print(f"Quantized model size: {len(tflite_quantized) / 1024:.1f} KB")
At this step, verify accuracy hasn’t degraded too much:
# Evaluate the quantized model on a test set
interpreter = tf.lite.Interpreter(model_path='model_quantized.tflite')
interpreter.allocate_tensors()
# ... run test samples through interpreter.invoke() and compare predictions ...
# Expect <2% accuracy drop for typical audio/vision tasks
Step 3: Convert and Embed the Model as a C Array
Use the xxd utility to generate a C header file:
xxd -i model_quantized.tflite > model.h
This produces:
// model.h (auto-generated)
const unsigned char model_quantized_tflite[] = {
0x20, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
0x0c, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00, 0x00,
// ... thousands of bytes ...
};
unsigned int model_quantized_tflite_len = 98304;
Now embed this in your ESP32 project’s flash.
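If xxd isn't available (common on Windows), a few lines of Python produce equivalent output. The to_c_array helper and the file names in the usage comment are illustrative, not part of any official tool:

```python
def to_c_array(data: bytes, name: str) -> str:
    """Emit a C header equivalent to `xxd -i` output for the given bytes."""
    rows = [", ".join(f"0x{b:02x}" for b in data[i:i + 12])
            for i in range(0, len(data), 12)]
    return ("const unsigned char " + name + "[] = {\n  "
            + ",\n  ".join(rows)
            + "\n};\nunsigned int " + name + f"_len = {len(data)};\n")

# Usage (file names are examples):
#   with open("model_quantized.tflite", "rb") as f:
#       header = to_c_array(f.read(), "model_quantized_tflite")
#   with open("model.h", "w") as f:
#       f.write(header)
print(to_c_array(b"\x20\x00\x00\x00TFL3", "model_quantized_tflite"))
```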
Step 4–6: Integrating TFLite Micro into ESP-IDF
Component Architecture

Step 4: Set Up the TFLite Micro Component
Create a directory structure in your ESP-IDF project:
my_esp32_tflite_project/
├── CMakeLists.txt
├── components/
│ └── tflite-micro/
│ ├── CMakeLists.txt
│ ├── src/
│ │ ├── interpreter.cc
│ │ └── model.h # The xxd-generated file
│ └── idf_component.yml
└── main/
├── CMakeLists.txt
└── main.cpp
Create components/tflite-micro/CMakeLists.txt:
idf_component_register(
    SRCS
        "src/interpreter.cc"
    INCLUDE_DIRS
        "src"
    REQUIRES
        esp_common
)
Create components/tflite-micro/idf_component.yml:
version: "1.0.0"
description: "TensorFlow Lite Micro runtime for ESP32"
Step 5: Create the Op Resolver and Tensor Arena
Create components/tflite-micro/src/interpreter.cc:
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/system_setup.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "model.h"

namespace tinyml {

// Pre-allocate tensor arena (adjust size based on your model)
constexpr int kTensorArenaSize = 64 * 1024;  // 64 KB for ESP32-S3
uint8_t tensor_arena[kTensorArenaSize];

// Global error reporter
tflite::MicroErrorReporter micro_error_reporter;

// Initialize the interpreter
bool init_interpreter(
    tflite::MicroInterpreter** interpreter_ptr,
    tflite::AllOpsResolver* resolver
) {
    // Get the model embedded in flash
    const tflite::Model* model = tflite::GetModel(model_quantized_tflite);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        TF_LITE_REPORT_ERROR(&micro_error_reporter,
            "Model provided is schema version %d not equal to supported "
            "version %d.", model->version(), TFLITE_SCHEMA_VERSION);
        return false;
    }

    // The interpreter creates its MicroAllocator internally from the arena
    static tflite::MicroInterpreter static_interpreter(
        model, *resolver, tensor_arena, kTensorArenaSize,
        &micro_error_reporter);
    *interpreter_ptr = &static_interpreter;

    // Allocate tensors
    TfLiteStatus allocate_status = (*interpreter_ptr)->AllocateTensors();
    if (allocate_status != kTfLiteOk) {
        TF_LITE_REPORT_ERROR(&micro_error_reporter,
            "AllocateTensors() failed: %d", allocate_status);
        return false;
    }
    return true;
}

}  // namespace tinyml
Step 6: Configure the Main Application CMakeLists.txt
Create main/CMakeLists.txt:
idf_component_register(
    SRCS
        "main.cpp"
    INCLUDE_DIRS
        "."
    REQUIRES
        freertos
        tflite-micro
        esp_common
        esp_timer
        driver    # legacy I2S driver (driver/i2s.h)
)
And the top-level CMakeLists.txt:
cmake_minimum_required(VERSION 3.20)
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(tflite_esp32_demo)
Step 7–9: Sensor Input and Preprocessing
Pipeline Overview

Step 7: Implement Ring Buffer for Sensor Data
A ring buffer holds recent sensor samples. For audio (16 kHz, 16-bit mono), store 1–2 seconds of data; for vision (camera frames), 1–5 frames is typical.
#include <cstring>

class RingBuffer {
 public:
  explicit RingBuffer(int capacity) : capacity_(capacity), head_(0), size_(0) {
    buffer_ = new int16_t[capacity];
  }
  ~RingBuffer() { delete[] buffer_; }

  void push(int16_t sample) {
    buffer_[head_] = sample;
    head_ = (head_ + 1) % capacity_;
    if (size_ < capacity_) {
      size_++;
    }
  }

  // Copy the oldest `count` samples out contiguously, in FIFO order
  void read_oldest(int16_t* out, int count) {
    int read_pos = (head_ - size_ + capacity_) % capacity_;
    for (int i = 0; i < count; i++) {
      out[i] = buffer_[(read_pos + i) % capacity_];
    }
  }

  int size() const { return size_; }
  bool is_full() const { return size_ == capacity_; }

 private:
  int16_t* buffer_;
  int capacity_, head_, size_;
};
Step 8: Preprocess Audio (MFCC) or Vision (Resize + Normalize)
Audio preprocessing—MFCC (Mel-Frequency Cepstral Coefficients):
For keyword spotting, extract MFCC features from raw audio. Use ESP’s DSP library or TensorFlow’s built-in MFCC op.
// In practice, compute MFCCs with ESP-DSP or TFLite Micro's audio frontend.
void compute_mfcc(
    const int16_t* audio,   // 16 kHz mono, 16-bit
    int num_samples,        // e.g., 16000 for 1 second
    float* mfcc_output,     // [40, 100] for 40 banks × 100 frames
    int n_mfcc = 40,
    int n_fft = 512
) {
    // Real implementation outline:
    // 1. Compute Hamming-windowed FFT of overlapping frames
    // 2. Mel-scale the FFT bins
    // 3. Compute log of mel-scaled power
    // 4. DCT to yield MFCCs
    // Simplified stub: normalize and copy raw samples instead
    for (int i = 0; i < num_samples && i < n_mfcc * 100; i++) {
        mfcc_output[i] = (float)audio[i] / 32768.0f;
    }
}
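As a shape sanity check, standard MFCC framing explains where input dimensions like the 49 frames in Step 1 come from. A sketch in plain Python, assuming MicroSpeech-style parameters (30 ms window, 20 ms hop):

```python
def num_frames(num_samples: int, sample_rate: int = 16000,
               win_ms: int = 30, hop_ms: int = 20) -> int:
    """Number of analysis frames when sliding a window over the clip."""
    win = sample_rate * win_ms // 1000   # 480 samples per window
    hop = sample_rate * hop_ms // 1000   # 320 samples per hop
    return 1 + (num_samples - win) // hop

print(num_frames(16000))  # 49 frames -> matches the (40, 49) model input in Step 1
```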
Vision preprocessing—bilinear resize + normalize:
#include <algorithm>
#include <cstdint>

// Convert one RGB565 pixel to 8-bit grayscale using luma weights
static inline float rgb565_to_gray(uint16_t p) {
    float r = ((p >> 11) & 0x1F) * (255.0f / 31.0f);
    float g = ((p >> 5) & 0x3F) * (255.0f / 63.0f);
    float b = (p & 0x1F) * (255.0f / 31.0f);
    return 0.299f * r + 0.587f * g + 0.114f * b;
}

void preprocess_image(
    const uint16_t* rgb565_in,  // 320×240 RGB565 from OV7670
    int in_width, int in_height,
    uint8_t* out,               // 128×128 grayscale
    int out_width, int out_height
) {
    for (int y = 0; y < out_height; y++) {
        for (int x = 0; x < out_width; x++) {
            // Bilinear interpolation
            float src_x = (float)x * in_width / out_width;
            float src_y = (float)y * in_height / out_height;
            int x0 = (int)src_x;
            int y0 = (int)src_y;
            int x1 = std::min(x0 + 1, in_width - 1);
            int y1 = std::min(y0 + 1, in_height - 1);
            float fx = src_x - x0;
            float fy = src_y - y0;
            uint16_t p00 = rgb565_in[y0 * in_width + x0];
            uint16_t p10 = rgb565_in[y0 * in_width + x1];
            uint16_t p01 = rgb565_in[y1 * in_width + x0];
            uint16_t p11 = rgb565_in[y1 * in_width + x1];
            float gray = (1 - fx) * (1 - fy) * rgb565_to_gray(p00)
                       + fx * (1 - fy) * rgb565_to_gray(p10)
                       + (1 - fx) * fy * rgb565_to_gray(p01)
                       + fx * fy * rgb565_to_gray(p11);
            // Result is already in [0, 255]
            out[y * out_width + x] = (uint8_t)gray;
        }
    }
}
Step 9: Feed Preprocessed Input into Model
#include <algorithm>
#include "tensorflow/lite/micro/micro_interpreter.h"

void fill_input_tensor(
    tflite::MicroInterpreter* interpreter,
    const float* preprocessed_data,
    int data_size
) {
    TfLiteTensor* input = interpreter->input(0);
    if (input->type == kTfLiteInt8) {
        // Convert float to int8 using the tensor's quantization params
        float scale = input->params.scale;
        int32_t zero_point = input->params.zero_point;
        int8_t* input_data = tflite::GetTensorData<int8_t>(input);
        for (int i = 0; i < data_size; i++) {
            float quant_val = preprocessed_data[i] / scale + zero_point;
            input_data[i] = (int8_t)std::clamp(quant_val, -128.f, 127.f);
        }
    }
}
Step 10–12: Inference Loop and Power Management
Power-Aware State Machine

Step 10: Create the Inference Loop as a FreeRTOS Task
Inference should run in its own task so it doesn’t block other I/O:
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_log.h"
#include "esp_sleep.h"
#include "esp_timer.h"

static tflite::MicroInterpreter* g_interpreter = nullptr;
static RingBuffer* g_audio_buffer = nullptr;

void inference_task(void* arg) {
    const int kNumMfccBanks = 40;
    const int kNumFrames = 100;
    // static: 16 KB + 32 KB of buffers would overflow a typical task stack
    static float mfcc_buffer[kNumMfccBanks * kNumFrames];
    static int16_t audio_frame[16000];  // 1 second at 16 kHz

    while (true) {
        // Wait for ring buffer to fill (e.g., 1 second of audio)
        while (!g_audio_buffer->is_full()) {
            vTaskDelay(pdMS_TO_TICKS(10));
        }

        // Extract MFCC from ring buffer
        g_audio_buffer->read_oldest(audio_frame, 16000);
        compute_mfcc(audio_frame, 16000, mfcc_buffer, kNumMfccBanks);

        // Fill input tensor
        fill_input_tensor(g_interpreter, mfcc_buffer, kNumMfccBanks * kNumFrames);

        // Run inference
        int64_t start_us = esp_timer_get_time();
        TfLiteStatus invoke_status = g_interpreter->Invoke();
        int64_t elapsed_us = esp_timer_get_time() - start_us;
        if (invoke_status != kTfLiteOk) {
            ESP_LOGE("TINYML", "Invoke failed: %d", invoke_status);
            continue;
        }

        // Read output
        TfLiteTensor* output = g_interpreter->output(0);
        const int8_t* output_data = tflite::GetTensorData<int8_t>(output);
        int num_classes = output->dims->data[1];

        // Dequantize scores and find the top class
        float max_score = -1e6f;
        int max_class = -1;
        float scores[10];  // Assume ≤10 classes
        for (int i = 0; i < num_classes; i++) {
            float dequant = (output_data[i] - output->params.zero_point)
                          * output->params.scale;
            scores[i] = dequant;
            if (dequant > max_score) {
                max_score = dequant;
                max_class = i;
            }
        }
        ESP_LOGI("TINYML", "Class: %d, Score: %.3f, Latency: %lld µs",
                 max_class, max_score, elapsed_us);

        // If high confidence, trigger action
        if (max_score > 0.7f) {
            ESP_LOGI("TINYML", "Detection confirmed!");
            // Send BLE notification, GPIO pulse, etc.
        }
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}
Step 11: Optimize with ESP-NN SIMD (Optional but Recommended)
ESP-NN provides hand-tuned kernels for ESP32-S3’s vector extensions. To enable:
- Clone or download ESP-NN from GitHub, or use Espressif's esp-tflite-micro component, which bundles it.
- The optimized kernels replace the reference implementations for CONV_2D, FULLY_CONNECTED, etc. at build time (toggled via menuconfig); no explicit op registration is needed in application code.
With ESP-NN, expect 2–4× latency improvement on convolutional layers.
Step 12: Power Management—Deep Sleep and Wake-on-Event
Maximize battery life by sleeping between inference runs:
void setup_sleep_and_wake() {
    // Wake on ext0: GPIO0 pulled low (e.g., a button press)
    esp_sleep_enable_ext0_wakeup(GPIO_NUM_0, 0);  // 0 = wake on low level
    // Or wake on timer (e.g., every 10 seconds)
    esp_sleep_enable_timer_wakeup(10 * 1000000);  // 10 seconds in µs
    // Keep RTC peripherals powered so the wake sources keep working
    esp_sleep_pd_config(ESP_PD_DOMAIN_RTC_PERIPH, ESP_PD_OPTION_ON);
}

void go_to_sleep() {
    ESP_LOGI("SLEEP", "Entering deep sleep (tens of µA)");
    esp_deep_sleep_start();  // Does not return
}
The power profile:
– Deep Sleep: 10–100 µA (RTC on, all else off).
– Preprocessing: 10–50 mA (I2S/DMA active).
– Inference: 80–200 mA (CPU at full speed).
– Idle/Wait: 20–30 mA (CPU clock gated).
For a battery-powered device with a 10% duty cycle (10 ms active per 100 ms), average current is ~20 mA, which drains a 500 mAh battery in roughly a day (500 mAh / 20 mA ≈ 25 h). Multi-week runtimes require pushing the duty cycle down toward 1% or below.
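The duty-cycle arithmetic is worth writing out explicitly (plain Python; the 200 mA active and 100 µA sleep figures come from the profile above):

```python
def avg_current_ma(active_ma: float, sleep_ua: float, duty: float) -> float:
    """Time-weighted average current for a duty-cycled workload."""
    return active_ma * duty + (sleep_ua / 1000.0) * (1 - duty)

def runtime_hours(capacity_mah: float, avg_ma: float) -> float:
    return capacity_mah / avg_ma

avg = avg_current_ma(active_ma=200, sleep_ua=100, duty=0.10)
print(f"{avg:.2f} mA average, {runtime_hours(500, avg):.0f} h on 500 mAh")
```

At 10% duty the sleep current is negligible; dropping the duty cycle to 1% brings the average near 2 mA and stretches the same battery to roughly ten days.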
Benchmarks: Model Size vs. Latency vs. Accuracy
| Model | Task | Model Size (KB) | RAM (KB) | Latency (ms) | Accuracy (%) |
|---|---|---|---|---|---|
| MicroSpeech | Keyword spotting | 18 | 18 | 60 | 92 |
| Person Detection | Vision (image) | 250 | 120 | 80 | 88 |
| MagicWand | Motion (accel/gyro) | 36 | 24 | 5 | 94 |
| KWS-2D | Advanced KWS | 80 | 60 | 120 | 96 |
| VAD | Voice activity detection | 24 | 16 | 40 | 95 |
| Anomaly Detector | Sensor anomaly | 12 | 8 | 3 | 89 |
Benchmarks run on ESP32-S3 @ 240 MHz, int8 quantization, no ESP-NN. Latency includes preprocessing. Add 10–20 ms for ring buffer fill time in real applications.
Edge Cases & Gotchas
Tensor Arena Overflow
If your model consumes more than the arena size, the allocator silently corrupts memory. Detect this by:
- Enabling TFLite Micro’s debug output:
  #define TF_LITE_REPORT_ERROR(...) printf("[ERROR] " __VA_ARGS__)
- Instrumenting the interpreter:
  ESP_LOGI("ARENA", "Used: %u / %d bytes",
           interpreter->arena_used_bytes(), kTensorArenaSize);
- Empirically: if inference produces garbage (e.g., random class outputs), the arena is likely overflowing.
Fix: Increase kTensorArenaSize or reduce model complexity.
Operation Not Supported
TFLite Micro doesn’t implement every TensorFlow Lite op, and a MicroMutableOpResolver only knows the ops you explicitly register. If the model contains an op the resolver can’t find, AllocateTensors() fails with an error naming the missing op.
Fix: Either (a) re-convert with converter.optimizations = [tf.lite.Optimize.DEFAULT] so redundant ops get fused away, or (b) implement and register a custom kernel.
Flash Wear and Wear-Leveling
Embedding the model directly in flash as a const array is fine; it is read-only and causes no wear. If your application also writes OTA updates or logs to flash, you must account for wear: ESP-IDF provides a wear-levelling library for writable data partitions, so use it.
Thermal Throttling
At maximum load, ESP32-S3 draws 200+ mA and can throttle CPU if temperature exceeds ~120°C. In passive enclosures (without heatsinks), sustained inference at full speed can trigger throttling. Mitigation:
- Duty-cycle inference (keep active time below ~50%).
- Add thermal design (heatsink, airflow).
- Monitor the on-chip temperature sensor (ESP-IDF's temperature_sensor driver) and throttle inference if needed.
Quantization Accuracy Loss on Edge Cases
int8 quantization works well for typical audio/vision tasks but can fail on:
– High-dynamic-range data (e.g., pressure sensor values spanning 0–1000 Pa).
– Bimodal distributions (two separate clusters of values).
Fix: Use int16 quantization for higher precision, though it doubles model size and halves speed. Or use per-channel quantization (advanced, requires TensorFlow 2.5+).
Full Implementation Code
Here’s a minimal but complete example: a keyword spotting system on ESP32-S3 with microphone input.
main/main.cpp (~100 lines):
#include <cstdio>
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/system_setup.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "esp_log.h"
#include "esp_timer.h"
#include "driver/i2s.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "model.h"
static const char* TAG = "TINYML_KWS";
constexpr int kTensorArenaSize = 64 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
tflite::MicroErrorReporter micro_error_reporter;
tflite::AllOpsResolver resolver;
tflite::MicroInterpreter* interpreter = nullptr;
void init_model() {
    const tflite::Model* model = tflite::GetModel(model_quantized_tflite);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        ESP_LOGE(TAG, "Schema version mismatch");
        return;
    }
    static tflite::MicroInterpreter static_interpreter(
        model, resolver, tensor_arena, kTensorArenaSize,
        &micro_error_reporter);
    interpreter = &static_interpreter;
    if (interpreter->AllocateTensors() != kTfLiteOk) {
        ESP_LOGE(TAG, "AllocateTensors failed");
        return;
    }
    ESP_LOGI(TAG, "Model loaded, arena: %u / %d bytes",
             interpreter->arena_used_bytes(), kTensorArenaSize);
}
void i2s_init() {
    // Configure I2S for microphone input (16 kHz, 16-bit mono)
    i2s_config_t i2s_cfg = {
        .mode = (i2s_mode_t)(I2S_MODE_MASTER | I2S_MODE_RX),
        .sample_rate = 16000,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = ESP_INTR_FLAG_LEVEL1,
        .dma_buf_count = 4,
        .dma_buf_len = 512,
    };
    i2s_driver_install(I2S_NUM_0, &i2s_cfg, 0, nullptr);
    // Example pin mapping; adjust for your board's microphone wiring
    i2s_pin_config_t pin_cfg = {
        .bck_io_num = GPIO_NUM_4,
        .ws_io_num = GPIO_NUM_5,
        .data_out_num = I2S_PIN_NO_CHANGE,
        .data_in_num = GPIO_NUM_6,
    };
    i2s_set_pin(I2S_NUM_0, &pin_cfg);
}
void inference_task(void* arg) {
    // static: a 32 KB buffer would overflow the 4 KB task stack
    static int16_t audio_buffer[16000];  // 1 second at 16 kHz
    size_t bytes_read;
    while (true) {
        // Read 1 second of audio
        i2s_read(I2S_NUM_0, audio_buffer, sizeof(audio_buffer), &bytes_read, portMAX_DELAY);
        // TODO: Compute MFCC and fill input tensor
        // For brevity, fill with dummy data
        TfLiteTensor* input = interpreter->input(0);
        int8_t* input_data = tflite::GetTensorData<int8_t>(input);
        for (size_t i = 0; i < input->bytes; i++) {
            input_data[i] = (int8_t)(audio_buffer[i % 16000] / 256);
        }
        // Run inference
        int64_t start = esp_timer_get_time();
        if (interpreter->Invoke() != kTfLiteOk) {
            ESP_LOGE(TAG, "Invoke failed");
            continue;
        }
        int64_t elapsed = esp_timer_get_time() - start;
        // Read output
        TfLiteTensor* output = interpreter->output(0);
        const int8_t* output_data = tflite::GetTensorData<int8_t>(output);
        float scale = output->params.scale;
        int32_t zero = output->params.zero_point;
        int8_t max_val = output_data[0];
        int max_idx = 0;
        for (size_t i = 1; i < output->bytes; i++) {
            if (output_data[i] > max_val) {
                max_val = output_data[i];
                max_idx = (int)i;
            }
        }
        float confidence = (max_val - zero) * scale;
        ESP_LOGI(TAG, "Class: %d, Conf: %.3f, Latency: %lld µs",
                 max_idx, confidence, elapsed);
        if (confidence > 0.7f) {
            ESP_LOGI(TAG, "*** DETECTION: Class %d ***", max_idx);
        }
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}
extern "C" void app_main(void) {
    ESP_LOGI(TAG, "Starting TinyML KWS");
    init_model();
    i2s_init();
    xTaskCreate(inference_task, "inference", 4096, nullptr, 5, nullptr);
}
Compile and Flash:
idf.py set-target esp32s3
idf.py build
idf.py flash monitor
FAQ
Q1: ESP32 vs. ESP32-S3 vs. ESP32-P4 for TinyML—which should I choose?
ESP32: Entry-level, $3–4. Dual-core 240 MHz Xtensa LX6 (single-core variants exist), no vector extensions. Latency 2–3× slower than ESP32-S3. Good for simple tasks (MicroSpeech, basic VAD).
ESP32-S3: Sweet spot, $5–6. Dual-core 240 MHz Xtensa LX7 with vector extensions (used by ESP-NN), typically 8 MB flash, 512 KB SRAM plus optional PSRAM. Latency 40–100 ms for typical models. Recommended for production.
ESP32-P4: Newest, dual-core RISC-V at up to 400 MHz, but more expensive (~$10–15) and overkill for TinyML unless you need multiple models or very large models (>5 MB). Its community ecosystem is also thinner (fewer examples and libraries).
Recommendation: Start with ESP32-S3 unless you have strict cost constraints (then ESP32) or require multi-model inference (then ESP32-P4).
Q2: Is int8 quantization accurate enough for production?
In practice, yes. For audio (keyword spotting, voice activity detection), int8 typically loses 0.5–1.5% accuracy. For vision (image classification, object detection), 1–3% is typical. Anomaly detection and time-series analysis are more sensitive; some datasets see 5–10% loss.
Rule of thumb: If your float model achieves 95%+ accuracy on a test set, int8 will likely meet requirements. If it’s borderline (88–92%), you’ll need to validate carefully.
Mitigation: Use per-channel quantization (TensorFlow 2.5+) or mixed-precision (int8 for most layers, int16 for critical layers).
Q3: Can I use PyTorch instead of TensorFlow for TinyML?
Not directly. TFLite Micro is TensorFlow-specific. However:
– PyTorch → ONNX → TFLite: Export PyTorch to ONNX, then use a converter to TFLite. This is fragile and not officially supported.
– ExecuTorch: Meta’s newer framework for mobile/edge inference of PyTorch models. Still early as of 2026, but maturing quickly; community ports can target ESP32 via custom kernels.
– TVM (Apache): Compiler for multiple frameworks; can target ESP32. Smaller models only.
Best practice: Use TensorFlow + Keras for TinyML work. The ecosystem is mature and well-documented.
Q4: Can I run BLE + ML inference simultaneously?
Yes, but carefully. Both BLE (Bluetooth Low Energy) and inference are CPU-intensive. On a dual-core ESP32-S3:
– Core 0: FreeRTOS scheduler, BLE stack, I/O interrupts.
– Core 1: Your inference task.
Keep inference task’s priority lower than BLE tasks to avoid audio dropouts or BLE connection loss. Example:
xTaskCreatePinnedToCore(inference_task, "inf", 4096, nullptr, 3, nullptr, 1); // Core 1, priority 3
// BLE task runs at priority 5 on Core 0 by default
In practice, simultaneous BLE + 100 ms inference is stable. Very small models with sub-10 ms inference are fine even when run continuously.
Q5: How do I budget flash and RAM?
Flash budget:
– Model file: 20–300 KB (post-quantization).
– TFLite Micro runtime: ~100 KB.
– Application code: ~50–100 KB.
– Total for 1 model: 200–500 KB.
ESP32-S3 has 8 MB flash; easily accommodates 10+ models or firmware + multiple models.
RAM budget:
– Tensor arena: 64–256 KB (your biggest allocation).
– Ring buffer (audio): 32 KB (1 second at 16 kHz, 16-bit).
– FreeRTOS stacks: ~20 KB (default).
– Heap: ~50 KB (general allocation).
– Total: ~170 KB—fits comfortably in the ESP32-S3’s 512 KB of SRAM.
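Tallying that RAM budget (numbers copied from the list above; the arena entry varies per model):

```python
ram_budget_kb = {
    "tensor_arena": 64,        # the biggest single allocation
    "audio_ring_buffer": 32,   # 1 second at 16 kHz, 16-bit
    "freertos_stacks": 20,
    "heap": 50,
}
total = sum(ram_budget_kb.values())
print(f"{total} KB used of the ESP32-S3's 512 KB SRAM")  # ~170 KB total
```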
Use idf.py size-components to audit your binary and heap_caps_get_free_size() at runtime to monitor heap fragmentation.
Where TinyML on MCUs Is Heading
ExecuTorch for MCUs (2026+)
Meta’s ExecuTorch project is maturing for mobile and embedded inference. Unlike TFLite Micro (which is TensorFlow-specific), ExecuTorch targets PyTorch models, making it attractive to data scientists already in that ecosystem. Expect:
– Official ESP32 backend by mid-2026.
– Better quantization tools (4-bit, mixed-precision).
– Easier model export from PyTorch.
Streaming and Segmented Models
Current TinyML models fit entirely in flash. As model sizes grow (e.g., 5–10 MB for multi-task models), expect:
– Flash streaming: Load chunks of the model into SRAM as needed, reducing SRAM arena size.
– Sparse models: Models where 70–80% of weights are zero, accelerated with sparse GEMM kernels.
– Knowledge distillation: Smaller “teacher” models that can run on MCUs while approximating larger models.
Hardware Accelerators
ESP32 doesn’t have a dedicated ML accelerator, but competitors (ARM Cortex-M85 with Helium, STM32 with M7, nRF5340 with Arm DSP) are adding SIMD. Expect:
– Easier speedups via compiler autovectorization.
– Standard benchmarks (MLPerf Tiny) pushing optimization.
– More efficient ops (8-bit GEMM at 10 GOPS on mid-range MCUs).
Federated Learning on Edge
As privacy concerns grow, expect on-device learning (fine-tuning pre-trained models) to reach MCUs. With federated learning, devices like the ESP32 can contribute model updates to a shared model without ever uploading raw sensor data.
References
- TensorFlow Lite Micro GitHub
- ESP-NN Optimized Kernels
- Espressif AI Documentation
- MLPerf Tiny Benchmark
- TFLite Micro Best Practices
- ESP-DL (Espressif Deep Learning Library)
- Arduino TensorFlow Lite Examples
Related Posts
- Evaluating Edge AI Inference: NVIDIA Jetson, Intel Movidius, and ARM NPU Trade-Offs
- Multimodal AI Architecture: Fusing Vision, Language, and Audio in Real-Time Systems
- Major IoT Protocols Compared: Choosing Your IoT Protocol for Industrial IoT
- Bluetooth Low Energy (BLE): Technical Specifications and Real-World Applications
Last Updated: April 18, 2026
