How CMOS Image Sensors Actually Work: Photons to Pixels (2026)

A 2026 flagship phone has a 200-megapixel sensor that ships you a 12.5 MP photo, and almost every word of that sentence is doing work. Understanding how CMOS image sensors work means tracing a single photon from the lens through silicon, into a buried photodiode, out as charge, through a four-transistor pixel, into a column ADC, and finally into a Bayer mosaic the ISP has to guess back into color. Skip any link in that chain and the explanation collapses into marketing. This post rebuilds the whole pipeline from physics up: the photoelectric effect inside silicon, the 4T pixel and correlated double sampling, rolling vs global shutter, front-side vs back-side illumination, stacked sensors with DRAM, Quad-Bayer binning, dual-pixel autofocus, staggered HDR, noise budgets, and why CCDs lost. What this post covers: every layer you need to read a sensor datasheet — Sony IMX9xx, Samsung HP9, OmniVision OV — without faking it.

Architecture at a glance

Architecture diagram: How CMOS Image Sensors Actually Work: Photons to Pixels (2026)

Why CMOS image sensors deserve a real explanation

A CMOS image sensor turns photons into a digital frame by freeing electrons inside silicon, collecting that charge in a photodiode well, sampling it through a per-pixel amplifier, and digitizing it in a column ADC, all on the same chip as the readout logic. CCDs rely on the same physics but shift charge across the die before digitizing. The CMOS architecture is why your phone shoots 4K60 at 0.1 W.

For about two decades after Willard Boyle and George Smith demonstrated the charge-coupled device at Bell Labs in 1969, CCDs ran the imaging world. Then Eric Fossum’s group at NASA JPL published the active-pixel sensor paper in 1993, putting an amplifier inside every pixel. That single architectural move broke CCD’s monopoly: each pixel could be addressed individually like DRAM, the readout circuit could be fabricated in the same CMOS process as the rest of the SoC, and power dropped by an order of magnitude.

By 2008 Sony shipped the first BSI consumer sensor (IMX050) in the Cyber-shot DSC-WX1, and by 2012 the stacked Exmor RS (IMX135) put a separate logic wafer underneath the pixel array. The 2026 lineage — Sony’s LYTIA LYT-900, Samsung’s ISOCELL HP9 200 MP — is what happens when you keep iterating that design for 18 years. Same physics, three orders of magnitude more transistors.

The physics: photons, silicon, and where pixels actually come from

A photon striking silicon lifts an electron from the valence band into the conduction band if its energy exceeds the 1.12 eV silicon bandgap. This is the internal photoelectric effect: the electron stays inside the crystal rather than being ejected from it. The freed electron leaves a hole behind, the field in the depletion region of a reverse-biased photodiode separates the pair, and the photodiode integrates the collected electrons for the exposure time. The pixel’s job is to count those electrons.
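The bandgap condition translates directly into a cutoff wavelength. A minimal sketch, assuming the standard relation E = hc/λ with hc ≈ 1239.84 eV·nm; the function names are illustrative and not taken from any library:

```python
# Photon energy compared with the silicon bandgap (1.12 eV, from the text).
HC_EV_NM = 1239.84          # Planck constant times speed of light, in eV*nm
SILICON_BANDGAP_EV = 1.12


def photon_energy_ev(wavelength_nm: float) -> float:
    """Energy of a photon of the given wavelength, in electron-volts."""
    return HC_EV_NM / wavelength_nm


def cutoff_wavelength_nm(bandgap_ev: float) -> float:
    """Longest wavelength whose photons can still bridge the bandgap."""
    return HC_EV_NM / bandgap_ev


for nm in (450, 650, 940, 1200):
    energy = photon_energy_ev(nm)
    verdict = "can" if energy > SILICON_BANDGAP_EV else "cannot"
    print(f"{nm} nm: {energy:.2f} eV, {verdict} create an electron-hole pair")

print(f"cutoff: {cutoff_wavelength_nm(SILICON_BANDGAP_EV):.0f} nm")
```

Under these assumptions the cutoff lands near 1107 nm, which is why silicon still responds to 940 nm near-IR but not to longer infrared wavelengths.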

CMOS image sensor photon-to-pixel pipeline from lens through photodiode to ISP

Silicon has a wavelength-dependent absorption coefficient. Blue light (450 nm) is absorbed within ~0.4 µm of the surface; red light (650 nm) penetrates 3–5 µm; near-IR (940 nm) goes 50 µm deep. That’s why phone sensor pixels are typically 3–6 µm thick in the photoactive region, and why an IR-cut filter sits in the optical path ahead of the CFA: without it, every “red” pixel would also collect invisible near-infrared light from the scene.
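To see what those depths mean for a thin pixel, here is a rough Beer-Lambert sketch. It assumes the depths quoted above can be treated as 1/e absorption depths and ignores reflections and charge-collection losses; the 3 µm thickness and the 4.0 µm midpoint for red are illustrative choices:

```python
import math

# Approximate 1/e absorption depths in micrometres, taken from the figures
# quoted in the text (red uses the midpoint of the 3-5 um range).
ABSORPTION_DEPTH_UM = {
    "blue 450 nm": 0.4,
    "red 650 nm": 4.0,
    "near-IR 940 nm": 50.0,
}


def absorbed_fraction(thickness_um: float, depth_um: float) -> float:
    """Beer-Lambert: fraction of photons absorbed within thickness_um."""
    return 1.0 - math.exp(-thickness_um / depth_um)


for label, depth in ABSORPTION_DEPTH_UM.items():
    frac = absorbed_fraction(3.0, depth)
    print(f"{label}: {frac:.1%} absorbed in 3 um of silicon")
```

With these numbers almost all blue light is captured, roughly half of the red, and only a few percent of 940 nm light, which matches the intuition that thin pixels are weakest at long wavelengths.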

Quantum efficiency and full-well capacity

Quantum efficiency (QE) is the probability that an incoming photon produces a collected electron. Modern BSI mobile sensors hit 80–90% peak QE around 550 nm; the Sony IMX989 in the Xiaomi 13 Ultra measured ~84% peak per Image Sensors World benchmarks. DSLR-class sensors with deeper pixels and better anti-reflection stacks exceed 90%.

Full-well capacity (FWC) is the maximum number of electrons the photodiode can hold before saturating. A 1.0 µm phone pixel holds about 6,000 e-; a ~4.2 µm Sony Alpha 1 II pixel holds around 80,000 e-. Dynamic range, in stops, is roughly log2(FWC / read_noise). With FWC = 80,000 e- and read noise = 1.5 e-, you get ~15.7 stops of engineering DR. Phones with FWC = 6,000 e- and read noise = 1 e- land near 12.5 stops single-exposure, which is why staggered HDR exists.
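The dynamic-range estimate is easy to reproduce. A small sketch of the log2(FWC / read_noise) rule of thumb, using the two example pixels from the paragraph above (the function name is illustrative):

```python
import math


def dynamic_range_stops(full_well_e: float, read_noise_e: float) -> float:
    """Engineering dynamic range in stops: log2 of full well over read noise."""
    return math.log2(full_well_e / read_noise_e)


examples = {
    "large full-frame pixel": (80_000, 1.5),
    "1.0 um phone pixel": (6_000, 1.0),
}

for name, (fwc, noise) in examples.items():
    print(f"{name}: {dynamic_range_stops(fwc, noise):.2f} stops")
```

This is the same calculation to run on any datasheet that lists full-well capacity and input-referred read noise.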

The signal chain in electrons

A typical bright-daylight exposure on a phone might deliver 1,000 e- per pixel. Read noise floor is 1–2 e-. Photon shot noise is sqrt(1000) ≈ 32 e-. So signal-to-noise is dominated by shot noise (a physical, irreducible limit) at ~32:1 — about 30 dB. In dim indoor light you might capture 50 e-, where shot noise drops to sqrt(50) ≈ 7 e- and read noise becomes a non-trivial fraction of the signal. This is the equation that drives every architectural decision downstream.
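The two lighting cases can be checked numerically. A minimal sketch, assuming shot noise and read noise add in quadrature and using 1.5 e- as a representative read noise from the 1–2 e- range quoted above:

```python
import math


def snr(signal_e: float, read_noise_e: float) -> float:
    """Signal-to-noise ratio with shot noise and read noise in quadrature."""
    shot_noise = math.sqrt(signal_e)
    total_noise = math.sqrt(shot_noise**2 + read_noise_e**2)
    return signal_e / total_noise


def to_db(ratio: float) -> float:
    """Express an amplitude ratio in decibels."""
    return 20.0 * math.log10(ratio)


for label, electrons in (("daylight", 1000), ("dim indoor", 50)):
    ratio = snr(electrons, read_noise_e=1.5)
    print(f"{label}: {electrons} e- -> SNR {ratio:.1f}:1 ({to_db(ratio):.1f} dB)")
```

In the daylight case the read-noise term barely moves the result; in the 50 e- case it already shaves a few percent off the shot-limited SNR.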

The 4T pixel and correlated double sampling

The dominant pixel architecture since the early 2000s is the four-transistor pinned-photodiode pixel (4T PPD), which combines a buried photodiode with a transfer gate, a reset transistor, a source-follower amplifier, and a row-select switch. It enables correlated double sampling (CDS), which subtracts kTC reset noise and pushes read noise below 2 e- in modern sensors.

4T CMOS pixel circuit schematic with pinned photodiode transfer gate reset and source-follower

The four transistors do four things:

  • TX (transfer gate) moves accumulated charge from the pinned photodiode (PPD) to the floating diffusion (FD) node.
  • RST (reset) ties FD to VDD to clear it before each readout.
  • SF (source-follower) is a near-unity-gain buffer (in practice slightly below 1): FD voltage in, column bit-line voltage out.
  • RS (row-select) connects this pixel’s source-follower to the column bus when the row driver picks it.

Correlated double sampling, the noise-killer

The reset transistor introduces kTC noise — thermal noise on the FD capacitance that varies sample to sample. CDS defeats it. The column circuit reads FD twice: once right after reset (the “reset level,” carrying kTC noise), then again after charge is transferred through TX (the “signal level,” carrying the same kTC noise plus the photo-signal). Subtract the two and kTC cancels. That’s how a sensor with a ~50 µV/e- conversion gain and a 14-bit ADC manages 0.7–1.0 e- input-referred read noise.
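A toy Monte Carlo makes the cancellation concrete. This is a sketch under simplifying assumptions, not a circuit model: everything is expressed in electrons, the reset offset is treated as perfectly frozen between the two samples, and the 0.7 e- per-sample noise and 100 e- signal are illustrative values:

```python
import math
import random
import statistics

K_BOLTZMANN = 1.380649e-23      # J/K
Q_ELECTRON = 1.602176634e-19    # C


def ktc_noise_electrons(conversion_gain_v_per_e: float, temperature_k: float = 300.0) -> float:
    """RMS reset (kTC) noise in electrons for an FD node with this conversion gain."""
    c_fd = Q_ELECTRON / conversion_gain_v_per_e
    return math.sqrt(K_BOLTZMANN * temperature_k * c_fd) / Q_ELECTRON


def simulate(signal_e, ktc_sigma_e, sample_sigma_e, trials=20_000, seed=0):
    """Return (std of single sample, std after CDS, mean after CDS) in electrons."""
    rng = random.Random(seed)
    single, cds = [], []
    for _ in range(trials):
        reset_offset = rng.gauss(0.0, ktc_sigma_e)  # frozen on FD at reset
        reset_level = reset_offset + rng.gauss(0.0, sample_sigma_e)
        signal_level = reset_offset + signal_e + rng.gauss(0.0, sample_sigma_e)
        single.append(signal_level)
        cds.append(signal_level - reset_level)      # the common offset cancels
    return statistics.pstdev(single), statistics.pstdev(cds), statistics.mean(cds)


ktc = ktc_noise_electrons(50e-6)  # ~50 uV/e- conversion gain, as in the text
single_std, cds_std, cds_mean = simulate(100.0, ktc, sample_sigma_e=0.7)
print(f"kTC noise: {ktc:.1f} e- rms")
print(f"single sample: {single_std:.1f} e- rms, after CDS: {cds_std:.2f} e- rms")
print(f"recovered signal: {cds_mean:.1f} e-")
```

Note that the subtraction removes the shared reset offset but adds the two uncorrelated sample noises in quadrature, so the CDS result is about √2 times the per-sample noise rather than zero.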

The pinned photodiode itself is the second piece of the noise story. Burying the photodiode under a p+ pinning layer keeps the depletion region away from the surface dangling bonds, which would otherwise generate dark current. Dark current in modern phone sensors at 25 °C is typically 1–5 e-/s/pixel; at 60 °C it can climb to 30–100 e-/s as the silicon thermal-generation rate roughly doubles every 8 °C.
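The doubling rule gives a quick way to scale a datasheet dark-current figure to another temperature. A sketch assuming a clean doubling every 8 °C, which is only a rule of thumb; real sensors deviate from it:

```python
def dark_current_e_per_s(rate_at_ref: float, temp_c: float,
                         ref_temp_c: float = 25.0, doubling_c: float = 8.0) -> float:
    """Scale a dark-current rate using a 'doubles every N degrees' rule of thumb."""
    return rate_at_ref * 2.0 ** ((temp_c - ref_temp_c) / doubling_c)


for base in (1.0, 5.0):
    hot = dark_current_e_per_s(base, 60.0)
    print(f"{base:.0f} e-/s at 25 C -> {hot:.0f} e-/s at 60 C")
```

Going from 25 °C to 60 °C is a little over four doublings, or roughly a 20× increase, which is consistent with the range quoted above.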

Conversion gain and dual-gain HDR

Conversion gain (CG) = q / C_FD, where q is the electron charge and C_FD is the FD capacitance. Lower FD capacitance means higher CG, which means more voltage per electron and better low-light SNR. Sony’s dual conversion gain (DCG) sensors (IMX766, IMX989, LYT-900) add a switch that connects an extra capacitor to the FD node. With the switch open the pixel reads at high CG for shadows; with it closed, at low CG for highlights, and both readings come from one exposure. That’s why the Pixel 8 Pro and OnePlus 12 advertise “single-frame HDR”: they’re not lying, the silicon really has two gain modes per pixel.
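The CG formula can be evaluated for both gain modes. The capacitance values below are hypothetical, chosen only to give round numbers; they are not taken from any datasheet:

```python
Q_ELECTRON = 1.602176634e-19  # C


def conversion_gain_uv_per_e(c_fd_farads: float) -> float:
    """Conversion gain q / C_FD, expressed in microvolts per electron."""
    return Q_ELECTRON / c_fd_farads * 1e6


C_FD = 1.6e-15     # hypothetical floating-diffusion capacitance (1.6 fF)
C_EXTRA = 4.8e-15  # hypothetical capacitor switched in for low-gain mode

high_cg = conversion_gain_uv_per_e(C_FD)
low_cg = conversion_gain_uv_per_e(C_FD + C_EXTRA)

print(f"high CG: {high_cg:.1f} uV/e-")
print(f"low CG:  {low_cg:.1f} uV/e-")

# Voltage swing on FD for a full 6,000 e- well in each mode.
for name, cg in (("high", high_cg), ("low", low_cg)):
    print(f"{name} CG swing at 6000 e-: {6000 * cg / 1e6:.2f} V")
```

The swing numbers show the trade: high CG gives each electron a larger voltage step, but a full well then needs a voltage range the pixel may not have, which is what the low-CG mode is for.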

Rolling shutter, global shutter, and the jello-cam problem

Almost every consumer CMOS sensor reads out one row at a time. That’s rolling shutter, and it’s why a propeller in a passing video frame looks like a curved noodle. Global shutter sensors expose all pixels simultaneously by adding a per-pixel storage node, but that node costs dynamic range and ~30–40% more silicon area per pixel.

Rolling shutter vs global shutter timing diagram for CMOS image sensors

In a 4K phone sensor running 30 fps, the rolling-shutter line time is ~8 µs per row. A 4000-row readout takes ~32 ms — almost a full frame interval. Anything moving across the frame during those 32 ms gets sheared: top of a guitar string is in one position, the bottom in another. This shows up as:

  • Skew on horizontal pans (vertical lines tilt).
  • Wobble / jello when the camera vibrates (every row sees a different camera pose).
  • Partial exposure on flash photography (flash fires faster than readout completes, so only part of the frame is lit).
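The line-time arithmetic above, and the skew it produces, can be sketched in a few lines. The object speed is a hypothetical figure for illustration:

```python
def readout_time_s(rows: int, line_time_s: float) -> float:
    """Time between reading the first and last row of a frame."""
    return rows * line_time_s


def skew_pixels(speed_px_per_s: float, rows: int, line_time_s: float) -> float:
    """Horizontal offset between top and bottom rows for an object moving sideways."""
    return speed_px_per_s * readout_time_s(rows, line_time_s)


ROWS = 4000
LINE_TIME_S = 8e-6  # ~8 us per row, as quoted in the text

print(f"readout: {readout_time_s(ROWS, LINE_TIME_S) * 1e3:.0f} ms")
# Hypothetical object crossing the frame at 1000 pixels per second.
print(f"skew: {skew_pixels(1000.0, ROWS, LINE_TIME_S):.0f} px top-to-bottom")
```

Halving the line time halves both the readout window and the skew, which is why scan time is the number to look for when comparing sensors for video.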

Global shutter sensors solve this by adding a memory node per pixel. Sony’s Pregius S IMX661 family is the canonical industrial example, and Sony brought a stacked full-frame global-shutter sensor to a consumer camera with the Alpha 9 III in 2023. Charge is transferred to that node simultaneously across the array at end-of-exposure, then read out sequentially without temporal distortion. The cost: extra transistors per pixel, a smaller photodiode, and historically worse low-light performance. Industrial machine-vision cameras pay this cost because they need it. Phones don’t, yet.

Why rolling shutter survived

Three reasons. First, area: rolling-shutter pixels can shrink to 1.0 µm and below in 2026 phone sensors because they only need 4–6 transistors. Adding a global storage node bloats them past 2.0 µm. Second, dynamic range: the storage node leaks charge in light (parasitic light sensitivity, PLS), which limits useful HDR. Third, computational fixes are good enough: nearly every phone now uses gyro-aware electronic image stabilization (EIS) that estimates per-row pose and warps the frame to undo rolling skew in software.

Color, Bayer mosaics, and the demosaicing problem

Silicon doesn’t see color — it’s a monochrome electron counter. To make a color image, the sensor sits under a color filter array (CFA), typically the Bayer pattern Bryce Bayer patented at Kodak in 1976: a 2×2 tile of R, G, G, B filters repeated across the array. Half the pixels see green (where the human eye is most sensitive), a quarter each see red and blue.

That means at every pixel position the sensor only knows one of three color channels. Demosaicing is the ISP’s guess-the-other-two algorithm. Naive bilinear interpolation produces visible color fringes (“zippering”) on edges. Modern demosaicers, including adaptive homogeneity-directed (AHD), Hamilton-Adams, and the residual-interpolation family, use local gradients to interpolate along edges rather than across them, which keeps edges sharp. Apple, Google, and Qualcomm ISPs run learned demosaicers as part of their RAW-to-RGB pipeline.
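To show where the fringes come from, here is a deliberately naive bilinear demosaic in pure Python. It assumes an RGGB tile layout and plain nested lists for images; it is the baseline that gradient-aware methods improve on, not what any production ISP runs:

```python
def bayer_channel(row: int, col: int) -> str:
    """Which color an RGGB Bayer site samples."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def mosaic(rgb):
    """Keep only the channel each site's filter passes; rgb[r][c] is (R, G, B)."""
    index = {"R": 0, "G": 1, "B": 2}
    return [[px[index[bayer_channel(r, c)]] for c, px in enumerate(row)]
            for r, row in enumerate(rgb)]


def demosaic_bilinear(raw):
    """Fill missing channels with the mean of same-color samples in the 3x3 window."""
    h, w = len(raw), len(raw[0])
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            sums = {"R": 0.0, "G": 0.0, "B": 0.0}
            counts = {"R": 0, "G": 0, "B": 0}
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        ch = bayer_channel(rr, cc)
                        sums[ch] += raw[rr][cc]
                        counts[ch] += 1
            own = bayer_channel(r, c)
            sums[own], counts[own] = raw[r][c], 1  # keep the measured sample
            out_row.append(tuple(sums[ch] / counts[ch] for ch in "RGB"))
        out.append(out_row)
    return out


# A gray scene with a vertical step edge: dark on the left, bright on the right.
gray_edge = [[(0, 0, 0) if c < 2 else (100, 100, 100) for c in range(4)]
             for _ in range(4)]
result = demosaic_bilinear(mosaic(gray_edge))
print(result[1][1])  # a pixel next to the edge
```

The scene is pure gray, yet the pixel beside the edge comes back with three different channel values. That false color is the fringe that edge-aware demosaicers are designed to suppress.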

Quad-Bayer and the “200 MP that’s really 12.5 MP” trick

Modern phone sensors don’t use a plain Bayer mosaic. Samsung’s Tetracell/Tetrapixel and Nonacell and Sony’s Quad-Bayer arrange 2×2 (or 3×3) like-color tiles: four R-pixels, four G-pixels, four G-pixels, four B-pixels in each Bayer cell.

Quad-Bayer pixel binning and remosaic data flow for 200 MP CMOS sensors

In default mode the sensor sums (bins) the four same-color sub-pixels on-chip before readout. That delivers a 50 MP frame from a 200 MP array with 4× the per-bin signal and only 2× the shot noise (sqrt of 4), so SNR improves by 2×. The HP9 can bin further (4×4 = 16-in-1) in low light to give a 12.5 MP frame with ~16× signal aggregation and roughly 4× the SNR. In “detail mode” the ISP runs a remosaic step that converts the Quad-Bayer raw to a standard Bayer pattern, then demosaics at full 200 MP, which is useful in bright light when you want to crop or print large.
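The binning arithmetic is worth checking by hand, because output resolution, effective pitch, and SNR gain all follow from the tile size. A sketch assuming shot-noise-limited pixels and ideal charge summing:

```python
import math


def binned_megapixels(native_mp: float, tile_side: int) -> float:
    """Output resolution when each tile_side x tile_side group becomes one pixel."""
    return native_mp / tile_side**2


def binned_pitch_um(native_pitch_um: float, tile_side: int) -> float:
    """Effective pitch of the binned super-pixel."""
    return native_pitch_um * tile_side


def shot_limited_snr_gain(tile_side: int) -> float:
    """Signal grows with n pixels, shot noise with sqrt(n), so SNR grows with sqrt(n)."""
    return math.sqrt(tile_side**2)


for side in (2, 3, 4):
    print(f"{side}x{side}: {binned_megapixels(200, side):.1f} MP, "
          f"{binned_pitch_um(0.6, side):.1f} um pitch, "
          f"SNR x{shot_limited_snr_gain(side):.0f}")
```

On a 200 MP array with 0.6 µm native pitch, only the 4×4 case produces a 12.5 MP frame with a 2.4 µm super-pixel; a 3×3 bin would give about 22 MP.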

This is why “megapixels” on a 2026 phone are mostly a marketing number for the bright-light edge case. Everyday photos come out of a binned 12.5 MP virtual pixel that’s optically ~2.4 µm wide on a 0.6 µm native pitch.

Front-side vs back-side illumination, and stacked sensors

FSI sensors put the metal interconnect on top of the silicon — light has to thread through gaps in the metal stack to reach the photodiode, costing ~50% of the photons. BSI flips the wafer and thins the back side to ~3 µm so light hits the silicon directly. Stacked sensors add a second (and often third) wafer underneath bonded via through-silicon vias (TSVs), separating the pixel array from the logic and DRAM.

FSI vs BSI vs stacked CMOS image sensor cross-section comparison

The empirical numbers from Sony’s sensor generations tell the story: moving from FSI to BSI approximately doubled low-light QE in the green channel. Stacked BSI (the IMX135 generation from 2012, then parts such as the IMX260 in the Galaxy S7, 2016) freed area on the pixel die for larger photodiodes by relocating ADCs and timing logic to the bottom wafer.

The three-wafer stack: pixel + DRAM + logic

The 2017 Sony IMX400 was the first commercial sensor with three TSV-bonded wafers: pixel array on top, DRAM in the middle, logic at the bottom. The DRAM layer is what enables 960 fps slow-motion video on phones: the sensor can dump full frames into on-package DRAM at 960 fps for ~0.2 s, then trickle them out at 30 fps to the host SoC over MIPI. Without the DRAM layer, the CSI-2 bus simply cannot accept that much data.
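The burst-then-drain behaviour is simple arithmetic on the figures above. A sketch that treats the 0.2 s burst length and 30 fps drain rate as exact, which real sensors will not match precisely:

```python
def burst_frames(capture_fps: float, burst_s: float) -> int:
    """Frames captured into on-sensor DRAM during the burst."""
    return round(capture_fps * burst_s)


def drain_time_s(frames: int, link_fps: float) -> float:
    """Time to send the buffered frames to the host at the link frame rate."""
    return frames / link_fps


frames = burst_frames(960, 0.2)
print(f"buffered frames: {frames}")
print(f"drain time at 30 fps: {drain_time_s(frames, 30):.1f} s")
print(f"slow-down factor: {960 / 30:.0f}x")
```

A 0.2 s burst therefore fills the buffer with 192 frames and plays back as 6.4 s of video, so the DRAM capacity, not the pixel array, sets the burst length.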

The 2022 IMX989 and 2024 LYT-900 push this further: full readout in <8 ms (under 1/120 s), enabling staggered HDR (read short, mid, long exposures within one frame interval) and electronic stabilization headroom because the sensor can deliver more pixels per second than the screen needs to display.

Dual-pixel and quad-pixel autofocus

Canon shipped Dual Pixel CMOS AF in the EOS 70D (2013): every imaging pixel is split into two left/right photodiodes under one microlens. The left-photodiode image and the right-photodiode image have a parallax offset that’s zero only when focus is correct, so the sensor itself becomes a phase-detect autofocus (PDAF) array. Modern Sony and Samsung sensors do 2×2 quad-pixel PDAF (4 photodiodes per microlens) — used in Sony LYT-900 and Samsung GN3 — giving both horizontal and vertical phase detection at every pixel position.
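The phase measurement boils down to finding the shift that best aligns the left and right sub-images. A one-dimensional sketch using a sum-of-absolute-differences search; the sample rows are made up, and real AF pipelines work on 2-D windows with sub-pixel interpolation:

```python
def phase_shift(left, right, max_shift: int) -> int:
    """Integer shift s that minimizes the mean |left[i] - right[i + s]|."""
    n = len(left)
    best_shift, best_cost = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(left[i], right[i + s]) for i in range(n) if 0 <= i + s < n]
        cost = sum(abs(a - b) for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best_shift, best_cost = s, cost
    return best_shift


# Hypothetical row from the left photodiodes: one bright feature.
left_row = [0, 0, 0, 10, 50, 10, 0, 0, 0, 0, 0, 0]
# Same feature as seen by the right photodiodes, displaced by two pixels.
right_row = [0, 0] + left_row[:-2]

print("defocused:", phase_shift(left_row, right_row, max_shift=4))
print("in focus: ", phase_shift(left_row, left_row, max_shift=4))
```

The sign of the shift tells the lens which way to move and its magnitude how far, which is what lets phase detection avoid the back-and-forth hunting of contrast AF.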

HDR techniques on a single sensor

Single-exposure dynamic range is capped by FWC/read-noise. To exceed it, modern sensors combine multiple exposures inside the sensor, before the host SoC ever sees the data. Three techniques dominate.

Staggered HDR (the name Samsung and Qualcomm use) captures two or three sub-frames at different exposure times within one frame interval, then merges them in the on-sensor logic or the ISP. This avoids most of the ghosting of frame-sequential HDR because the long and short exposures are interleaved row-by-row with only ~1 ms separation, not 33 ms.

DOL-HDR (digital overlap) is Sony’s name for its line-interleaved variant: a row is read at one exposure, then re-exposed and read at the other, with the two streams overlapping in the sensor output. Because rows can be merged as they arrive, the scheme needs line buffers rather than a full frame of memory.

Dual conversion gain (DCG) is the single-exposure version, described earlier. The same exposure is sampled twice through two different conversion-gain paths. Samsung markets its implementation as Smart-ISO Pro.

In 2026, flagship phones combine staggered HDR with DCG: each of the staggered sub-frames is itself a dual-gain readout, yielding a 4-way HDR fusion before demosaicing. The published dynamic range numbers — Sony LYT-900 quotes 100 dB (~16.6 stops) in DOL-HDR mode — are the result of this stacking, not a single-photodiode property.
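Datasheets quote dynamic range in decibels while photographers think in stops. The conversion is a fixed factor, since one stop is a doubling and 20·log10(2) ≈ 6.02 dB. A small sketch:

```python
import math

DB_PER_STOP = 20.0 * math.log10(2.0)  # about 6.02 dB per doubling


def db_to_stops(db: float) -> float:
    """Convert a dynamic range in decibels to photographic stops."""
    return db / DB_PER_STOP


def stops_to_db(stops: float) -> float:
    """Convert photographic stops to decibels."""
    return stops * DB_PER_STOP


print(f"100 dB = {db_to_stops(100):.1f} stops")
print(f"12.5 stops = {stops_to_db(12.5):.0f} dB")
```

The second line puts the single-exposure phone figure on the same scale: roughly 75 dB from one photodiode reading, against 100 dB once the exposures and gain paths are fused.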

Noise: read noise, shot noise, dark current, and what ISO really means

Sensor noise has three roughly independent sources. Shot noise is the photon-statistical sqrt(N) of the signal itself — irreducible. Read noise is the floor added by the readout chain (FD reset, source-follower, ADC quantization) and is dominated by 1/f noise in the source-follower. Dark current is thermally generated charge that accumulates even without light.

A useful mental model: total noise² = shot² + read² + dark². At bright-light signals (N>1000 e-), shot dominates and the sensor is “photon-limited.” At very low signals, once N falls toward read² (a handful of electrons for 1–2 e- read noise), read noise dominates and the sensor is “read-limited.” This is the regime smartphone night mode operates in.
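The quadrature model can be turned into a small helper that reports which term dominates. It assumes the shot and dark terms are Poisson, so their variance equals their mean electron count; the function names and sample values are illustrative:

```python
import math


def total_noise_e(signal_e: float, read_noise_e: float, dark_e: float = 0.0) -> float:
    """Quadrature sum of shot, read, and dark noise, in electrons rms.

    signal_e and dark_e are mean electron counts (Poisson, variance = mean);
    read_noise_e is already an rms value.
    """
    return math.sqrt(signal_e + read_noise_e**2 + dark_e)


def dominant_source(signal_e: float, read_noise_e: float, dark_e: float = 0.0) -> str:
    """Name of the noise term with the largest variance."""
    variances = {"shot": signal_e, "read": read_noise_e**2, "dark": dark_e}
    return max(variances, key=variances.get)


for n in (1000, 50, 2):
    print(f"N={n:>4} e-: total {total_noise_e(n, 2.0):.2f} e-, "
          f"dominated by {dominant_source(n, 2.0)} noise")
```

With 2 e- read noise the crossover sits at N = 4 e-, which shows how far into the shadows a modern pixel stays photon-limited.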

ISO is not a sensor property in the analog domain; it’s mostly a gain setting on the column PGA between CDS and ADC. Raising ISO multiplies signal and noise equally, so it doesn’t make the sensor more sensitive, it just trades headroom for visibility. The exception is dual-gain sensors, where switching from low-CG to high-CG mode at ~ISO 800 actually does lower input-referred read noise: higher conversion gain gives more volts per electron at the FD node, so the fixed noise of the source-follower and ADC corresponds to fewer electrons.

Read-noise budgets in 2026 hardware

Published and reverse-engineered numbers as of early 2026:

  • Sony LYT-900 (1″ 50 MP, smartphone): ~0.9 e- input-referred read noise at high CG.
  • Samsung ISOCELL HP9 (1/1.4″ 200 MP): ~1.2 e- per binned super-pixel.
  • Sony IMX989 (1″ 50 MP, in Xiaomi 13 Ultra): ~1.0 e- per Photons to Photos benchmarks.
  • Sony A1 II full-frame: ~1.5 e- at base ISO, dropping to ~0.9 e- at ISO 640 (dual-gain breakpoint).

Why CMOS won, and what CCD still does

CCD sensors are still better than CMOS at one thing: charge-transfer uniformity. Because charge is physically shifted across the die before any amplifier sees it, every pixel passes through the same single amplifier — there’s no pixel-to-pixel gain mismatch. That’s why scientific imaging in the 1990s and 2000s stayed CCD.

CMOS won everywhere else for four structural reasons. Power: a CMOS sensor only activates the row being read; CCDs clock the entire array each frame, burning watts. Speed: column-parallel ADCs scale linearly with array width, while CCDs serialize through one (or a few) output amplifiers. Integration: CMOS sensors can co-fab logic, ADCs, and even compute on the same wafer; CCDs are a separate fab line. Cost: CMOS uses the same fabs that make every other digital chip on the planet.

Sony shut down its last consumer CCD line in 2017. Specialty CCDs (Teledyne, Hamamatsu) still exist for astronomy and electron-microscope detection, where single-amplifier uniformity matters and frame rate doesn’t.

Trade-offs and where the model breaks

Pixel-size compromise. Shrinking pixels from 1.4 µm to 0.6 µm to fit 200 MP in a phone-sized die loses FWC and QE per native pixel. Quad/Nona-Bayer binning recovers some of it, but the binned-pixel optical performance is still worse than a same-die-size native 12 MP design. This is the architectural cost of marketing megapixels.

Rolling shutter is everywhere. On consumer hardware, global shutter remains a niche. Drone gimbal cameras, smartphone “action mode,” and pro-mirrorless electronic shutters all post-process rolling-shutter artifacts in software with gyro data — good enough for casual use, not good enough for cinema-grade dollies.

Quad-Bayer remosaic is not lossless. The 200 MP “detail” mode interpolates a Bayer-like pattern from Quad-Bayer raw. Real per-pixel sharpness is bounded by the native 0.6 µm pitch — diffraction at f/1.7 limits resolvable detail to ~2 µm anyway, so much of the “200 MP” claim is below the Airy disk.
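The diffraction claim can be checked with the Airy-disk formula, diameter ≈ 2.44·λ·N. A sketch assuming green light at 550 nm and an ideal, aberration-free lens:

```python
def airy_disk_diameter_um(wavelength_um: float, f_number: float) -> float:
    """Diameter of the Airy disk (to the first dark ring) for an ideal lens."""
    return 2.44 * wavelength_um * f_number


PIXEL_PITCH_UM = 0.6  # native pitch quoted in the text

diameter = airy_disk_diameter_um(0.55, 1.7)
print(f"Airy disk at f/1.7, 550 nm: {diameter:.2f} um")
print(f"native pixels spanned: {diameter / PIXEL_PITCH_UM:.1f}")
```

Under these assumptions the blur spot covers nearly four native pixels across, so adjacent 0.6 µm pixels are sampling the same diffraction-limited spot.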

Stacked DRAM is fragile to heat. 960 fps modes run the sensor and on-package DRAM at high duty cycle. Phones throttle slow-motion to 0.2–0.5 s bursts because sustained high-fps readout would push the sensor past its thermal envelope, and the extra heat raises dark current and its shot noise.

PDAF pixels are still pixels. Dual-pixel and quad-pixel AF arrays sacrifice some signal because the split photodiodes have additional surface area and crosstalk paths. The ISP has to correct for systematic PDAF-pixel response curves in the RAW pipeline. On-pixel masks (used on some legacy PDAF designs) leave permanent dead-pixel patterns — modern dual-pixel avoids this but at the cost of doubled FD nodes per pixel.

Practical recommendations

For phone-camera evaluation, ignore megapixel count and look at three things: pixel size in the binned configuration (≥1.2 µm super-pixel matters more than 0.6 µm native), single-exposure dynamic range in DxOMark or DPReview measurements (12+ stops single-frame indicates dual-gain HDR), and rolling-shutter scan time (under 10 ms is good, under 6 ms is excellent for video).

For sensor selection in embedded / industrial / robotics designs, the priorities flip:

  • For fast-moving scenes: global shutter (Sony Pregius S, onsemi AR0234, ams Mira series).
  • For low-light or scientific imaging: BSI 4T with high CG mode and FWC > 10,000 e- (Sony IMX585 in security cameras).
  • For lowest cost: standard rolling shutter; FSI is acceptable above ~1.4 µm pixels for daylight.
  • For machine vision with stable lighting: 12-bit ADC is enough; 14-bit only matters with HDR scenes.
  • For photometric (intensity-only) measurement: monochrome (no CFA). You roughly double the effective QE and skip demosaicing entirely.

Quick checklist before buying any image sensor for a product:

  • Verify QE curve, not just peak QE.
  • Read the FWC and the read noise — compute DR yourself, don’t trust marketing.
  • Check the MIPI CSI-2 lane count vs your SoC’s input lanes.
  • Confirm if the sensor needs an external clock or has a PLL.
  • Make sure the lens MTF holds up at the Nyquist frequency set by the pixel pitch, and that the focal length gives the field of view you need.
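The lane-count item in the checklist can be turned into a quick feasibility check. This is a back-of-envelope sketch: the 20% overhead for blanking and protocol, the 10-bit raw format, and the 2.5 Gbps per-lane rate are assumptions to replace with figures from your sensor and SoC datasheets:

```python
def required_gbps(width: int, height: int, bits_per_pixel: int,
                  fps: float, overhead: float = 0.2) -> float:
    """Raw pixel data rate in Gbit/s, padded by an assumed overhead fraction."""
    return width * height * bits_per_pixel * fps * (1.0 + overhead) / 1e9


def link_capacity_gbps(lanes: int, gbps_per_lane: float) -> float:
    """Aggregate capacity of the CSI-2 link."""
    return lanes * gbps_per_lane


def fits(width, height, bits_per_pixel, fps, lanes, gbps_per_lane) -> bool:
    """True if the padded data rate fits within the link capacity."""
    return required_gbps(width, height, bits_per_pixel, fps) <= link_capacity_gbps(
        lanes, gbps_per_lane)


# 12 MP binned stream vs. a hypothetical full 200 MP stream, both 10-bit at 30 fps.
for name, (w, h) in (("12 MP", (4000, 3000)), ("200 MP", (16000, 12500))):
    need = required_gbps(w, h, 10, 30)
    ok = fits(w, h, 10, 30, lanes=4, gbps_per_lane=2.5)
    print(f"{name}: needs {need:.1f} Gbit/s, fits 4 lanes x 2.5 Gbit/s: {ok}")
```

If the check fails, the usual options are more lanes, a faster PHY, a lower frame rate, or on-sensor binning before the data leaves the chip.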

For deeper context on related sensing and measurement physics, see how atomic clocks reach 10⁻¹⁸ stability, the thermal runaway physics of lithium-ion batteries (same silicon-temperature dependence as sensor dark current), and the electromagnetic physics behind maglev levitation.

FAQ

What’s the difference between CMOS and CCD image sensors in 2026?

CMOS sensors put an amplifier inside every pixel and digitize per column, so they consume far less power, run faster, and integrate with logic on the same die. CCDs shift charge serially through a single output amplifier, giving better pixel-uniformity but at 5–10× the power and a fraction of the frame rate. Sony, Samsung, OmniVision and onsemi all build CMOS exclusively for consumer products now; CCDs survive only in scientific imaging where photometric uniformity beats speed.

Why does my phone advertise 200 MP but save 12.5 MP photos?

The sensor groups same-color pixels in 4×4 tiles and bins sixteen native sub-pixels into one virtual super-pixel before saving. Sixteen same-color photodiodes summed together capture 16× the signal but only 4× the shot noise, so SNR improves about 4×, which is critical in indoor light. Switching to “200 MP detail mode” remosaics the array to standard Bayer and outputs the full resolution, useful for bright daylight or large prints.

What causes the jello effect in smartphone video?

Rolling shutter. The CMOS sensor reads pixels one row at a time, taking 8–33 ms from top to bottom of the frame. If the camera moves during readout — your hand shaking, a car driving past — each row sees a different camera pose, so vertical lines tilt and the frame appears to wobble. Modern phones use gyroscope data and electronic image stabilization to estimate per-row motion and warp the frame back into geometric correctness.

What is BSI and why does it matter for low light?

Back-side illumination flips the silicon wafer so light reaches the photodiode without first passing through the metal interconnect stack on top. FSI (front-side) sensors lose roughly half the incoming photons to metal occlusion; BSI roughly doubles quantum efficiency in low light. Nearly every flagship phone sensor since around 2012 has been BSI, and since 2016 most flagship sensors have been stacked BSI: the pixel array on top, logic and DRAM on wafers underneath, bonded with through-silicon vias.

Why is read noise measured in electrons, not in bits or volts?

Because the photo-signal itself is in electrons, and the only meaningful noise comparison is signal-to-noise at the sensor input — before any gain or quantization changes the units. A 14-bit ADC and a 12-bit ADC can have identical input-referred read noise if the analog gain compensates. Quoting noise in electrons lets you directly compute dynamic range (log2(FWC / read_noise)), compare sensors across vendors, and predict SNR at any signal level.

How do dual-pixel autofocus sensors actually focus?

Every imaging pixel is physically split into two photodiodes under one microlens. The left half sees light from the right side of the lens aperture; the right half sees light from the left side. If the image is in focus, both half-images are identical. If it’s out of focus, the two half-images shift relative to each other — phase difference. The ISP measures that shift across the frame and tells the lens motor exactly how far and which direction to move. It’s an entire phase-detect AF system embedded in the imaging array.

Further reading

For practical pillar context, the IoT, Digital Twin and PLM landing page maps how sensors plug into device telemetry and manufacturing data flows. Related deep technical posts include the atomic clock and GPS timing explainer, the lithium-ion battery thermal physics, and the maglev electromagnetic propulsion deep dive.

For canonical external references, see Eric Fossum’s archive of CMOS APS papers, Albert Theuwissen’s Image Sensors and Signal Processing for Digital Still Cameras for the textbook physics, Sony Semiconductor Solutions product pages for current IMX/LYT datasheets, and the Samsung ISOCELL technical briefs for HP-series Tetra/Nonacell architecture.
