pgvector vs Dedicated Vector Database: The 2026 ADR
Most teams shipping retrieval-augmented generation in 2026 reach the same fork in the road: keep embeddings inside the Postgres they already run, or stand up a purpose-built engine. The honest answer to the pgvector vs dedicated vector database question is that the decision is rarely about raw nearest-neighbour speed (the index math converged years ago) and almost always about filtering behaviour, write throughput at scale, operational surface, and who carries the on-call pager at 3 a.m. Choosing wrong is expensive in both directions: a premature migration to a specialist engine buys you a second datastore to keep consistent, while staying in Postgres past your scaling envelope buys you index builds that lock tables and recall that quietly degrades. This article is written as an Architecture Decision Record, so the recommendation comes with its context, the options, and the consequences you inherit either way.
What this covers: pgvector’s HNSW and IVFFlat indexes and their parameters, recall and latency trade-offs, metadata pre/post-filtering, hybrid search, the real scaling ceiling, operational and cost comparisons against Pinecone, Qdrant, Weaviate and Milvus, and a concrete decision tree for when to migrate.
Context and Background
Vector search became a mainstream database workload the moment retrieval-augmented generation (RAG) turned embeddings into a primary access path rather than a research curiosity. An embedding is a fixed-length array of floating-point numbers — typically 384 to 3,072 dimensions — produced by a model so that semantically similar inputs land near each other in space. “Search” means finding the nearest neighbours to a query vector under a distance metric like cosine or L2. Doing this exactly is O(n) per query, so production systems use Approximate Nearest Neighbour (ANN) indexes that trade a little recall for orders of magnitude less latency.
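To make the O(n) baseline concrete, here is a minimal sketch of exact nearest-neighbour search in plain Python. The function names and the three-vector toy corpus are illustrative assumptions, not part of pgvector; the point is only that every stored vector is compared against the query, which is the cost ANN indexes exist to avoid.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Compare the query against every stored vector (O(n)) and keep the k closest ids."""
    scored = [(cosine_distance(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort()
    return [vec_id for _, vec_id in scored[:k]]


# Hypothetical 2-dimensional corpus; real embeddings have hundreds of dimensions.
corpus = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}
nearest = exact_knn([1.0, 0.0], corpus, k=2)
print(nearest)
```

The same exhaustive scan doubles as the ground truth when you later measure the recall of an approximate index.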
Two camps emerged. pgvector is an open-source extension that adds a vector column type and ANN indexes directly to PostgreSQL; current stable is 0.8.2 (February 2026), and it now ships in managed Postgres on AWS, Google Cloud, Azure, Supabase and Neon. The dedicated camp — Pinecone, Qdrant, Weaviate and Milvus — builds the entire engine around vectors, with distributed sharding, payload indexes, and storage tiers designed for billions of rows. For background on choosing the embeddings that feed either store, see our embedding models benchmark. The pgvector project’s own documentation is the authoritative reference for index parameters and version history, and is the primary source for the numbers in this post.
The framing has shifted in 2026. pgvector 0.7.0 added halfvec (2-byte float) and binary vector types plus expression-index quantization, and 0.8.0 added iterative index scans that fixed its most notorious filtering failure mode. Those two releases closed much of the gap that used to make the “just use a real vector DB” argument automatic.
It is worth naming the bias that this framing corrects, because it is the most common way teams reason themselves into the wrong architecture. The dedicated-engine vendors compete on a benchmark axis — raw queries per second and recall at scale — because that is the axis where a purpose-built system can win, and because it is legible and demo-able. That competition has produced genuinely excellent engines, but it has also trained the market to evaluate the decision on the vendors’ chosen axis rather than on the axis that actually dominates most production systems, which is operational and consistency cost. An ADR exists precisely to resist that pull: to insist that the decision be made on the consequences you will live with for years, not the metric that looks best in a launch blog. The rest of this document is structured to keep returning attention to the axes — filtering behaviour, write and maintenance load, consistency, total cost of ownership — that a pure speed benchmark renders invisible.
The Decision: Start in Postgres, Migrate on a Trigger
For the overwhelming majority of teams the correct first move is pgvector inside your existing Postgres, and you migrate to a dedicated vector database only when a specific, measurable trigger fires — a corpus past roughly 10–50 million vectors per table, filtered-query latency that iterative scans cannot rescue, or a write/index-rebuild rate that disrupts your primary transactional workload. Co-locating vectors with relational data eliminates a whole class of consistency and operational problems; a specialist engine earns its keep only once those problems are smaller than the ones scale creates.

Figure 1: The top-level decision path — default to pgvector and branch to a dedicated engine only when a concrete scaling, filtering, or write trigger fires.
The diagram above encodes the ADR as a flow. You enter at the top with a corpus-size and workload question. If you are already on Postgres and under the scale ceiling, you stay; if you cross a trigger threshold, you evaluate which dedicated engine matches the dominant constraint — filtering, scale, or operational simplicity. The branches are deliberately few because most real decisions collapse to “stay” or “move because of one named bottleneck,” not a feature-by-feature spreadsheet.
The “migrate on a trigger, not a hunch” discipline is the single most valuable thing in this ADR, so it is worth defending. The opposite approach — choosing the dedicated engine up front “to be safe” or “because we’ll need it eventually” — is a premature optimization that pays a certain, immediate cost (a second datastore, a sync pipeline, a new operational surface) to insure against an uncertain, future problem (scale you may never reach). Most RAG corpora never approach the tens-of-millions-of-vectors regime where the dedicated engines pull decisively ahead; teams routinely over-estimate their eventual scale because the planning conversation happens when ambition is highest and data is smallest. A named trigger inverts the risk: you pay the migration cost only when a specific, measured condition proves you need to, and until then you bank the operational simplicity. The triggers must be concrete and measurable precisely so the decision is forced by evidence rather than by anxiety or by a vendor’s roadmap deck.
Why co-location usually wins first
The strongest argument for pgvector has nothing to do with vectors. It is that your embeddings live in the same transactional database as the rows they describe. A document, its chunks, its embeddings, its access-control metadata, and its audit log all sit in one place, behind one set of JOINs and one set of BEGIN/COMMIT semantics. When you insert a document and its vector in a single transaction, either both land or neither does — there is no dual-write window where your vector store and your system of record disagree.
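The both-or-neither property can be sketched with Python's built-in sqlite3 module standing in for Postgres, purely so the example runs without a server. The table names, the JSON-encoded vec column, and insert_document are hypothetical; in a real deployment the second table would hold a pgvector column and the same pattern would run through your Postgres driver.

```python
import json
import sqlite3


def make_db():
    """Create an in-memory database with a document table and an embedding table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute("CREATE TABLE embeddings (doc_id TEXT PRIMARY KEY, vec TEXT NOT NULL)")
    return conn


def insert_document(conn, doc_id, body, embedding):
    """Write the row and its vector in one transaction."""
    vec_json = None if embedding is None else json.dumps(embedding)
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO documents (id, body) VALUES (?, ?)", (doc_id, body))
        # A missing embedding violates NOT NULL and aborts the whole transaction.
        conn.execute("INSERT INTO embeddings (doc_id, vec) VALUES (?, ?)", (doc_id, vec_json))


conn = make_db()
insert_document(conn, "doc-1", "hello", [0.1, 0.2])
try:
    insert_document(conn, "doc-2", "embedding failed upstream", None)
except sqlite3.IntegrityError:
    pass  # the document row written just before the failure is rolled back too

doc_count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
vec_count = conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
print(doc_count, vec_count)
```

With two separate stores there is no equivalent of that rollback: the first write can succeed while the second fails, and something has to notice and repair the difference.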
A dedicated engine, by contrast, is a second datastore. You now own a replication or change-data-capture pipeline to keep it synchronised, a reconciliation job for when it drifts, and a failure mode where a deleted row still returns from search because the delete never propagated. None of that is insurmountable, but it is real engineering you do not need until scale forces it.
The co-location advantage compounds in a way that is easy to underrate: it extends to querying, not just writing. Because the vector lives beside its relational metadata, a single SQL statement can join the nearest-neighbour result to the document’s owner, its access-control list, its publication status, and its parent record, filtering and enriching in one round trip under one transaction’s isolation. With a dedicated engine, the vector search returns IDs, and your application must then fetch the corresponding rows from the system of record, reconcile any that have changed or been deleted since the index was last synced, and assemble the result — more code, more latency, and a fresh opportunity for the two stores to disagree about reality. The version of this that bites hardest is access control: in Postgres you filter on the permission column in the same query that ranks by similarity, so an unauthorized document is never even a candidate; with a split store, you must re-check permissions in the application after retrieval, and any lag in propagation is a window in which search can surface a document a user should no longer see.
Why “vector search is the bottleneck” is usually false
Teams over-index on ANN query latency because it is the number benchmarks publish. In practice, a tuned pgvector HNSW index returns nearest neighbours in single-digit milliseconds; published 2026 comparisons put it in the 5–8 ms range at production sizes (treat exact figures as illustrative — they depend on dimension, dataset, and hardware). The dominant latency in a RAG request is almost always the embedding API call and the language-model generation, each of which is tens to hundreds of milliseconds. Shaving 5 ms off a 300 ms request by switching databases is the wrong optimisation.
The Amdahl’s-law intuition here is worth making explicit, because it reframes the whole speed debate. If the vector query is 5 ms of a 300 ms request, it is under 2% of the latency budget, and even reducing it to literally zero would be imperceptible to a user. The components that actually dominate — generating the query embedding, retrieving and assembling context, and above all the language model’s token generation — are an order of magnitude or two larger and are untouched by the choice of vector store. This is why a team that migrates to a faster engine and then measures end-to-end latency is so often baffled to find no improvement: they optimized the part of the system that was never the constraint. The corollary is that if your RAG latency is genuinely a problem, the productive places to look are caching embeddings, reducing the context you stuff into the prompt, streaming the generation, or choosing a faster model — not swapping a 5 ms vector query for a 3 ms one.
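The same intuition in numbers, as a small sketch. The 300 ms and 5 ms figures are the illustrative ones from the paragraph above, not measurements, and end_to_end_speedup is a hypothetical helper name.

```python
def end_to_end_speedup(total_ms, component_ms, improved_component_ms):
    """Overall speedup when only one component of a request gets faster."""
    new_total = total_ms - component_ms + improved_component_ms
    return total_ms / new_total


share = 5 / 300                              # vector query's share of the request
best_case = end_to_end_speedup(300, 5, 0)    # vector query made free
swap = end_to_end_speedup(300, 5, 3)         # 5 ms engine swapped for a 3 ms one

print(f"share of budget: {share:.1%}")
print(f"speedup if the query cost nothing: {best_case:.3f}x")
print(f"speedup from 5 ms to 3 ms: {swap:.3f}x")
```

Even the impossible best case improves the request by less than 2%, which is why measuring the whole request comes before any database decision.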
What actually changes at scale
The thing that genuinely degrades in single-node Postgres is not query speed but write and maintenance behaviour. HNSW index builds are CPU- and memory-intensive; on a large table a non-concurrent build can hold locks long enough to disrupt your application. Re-indexing after a model swap means rebuilding millions of graph edges. A dedicated engine isolates that work onto nodes that do nothing else, which is the real value proposition once your vector workload competes with your transactional workload for the same machine.
The memory arithmetic makes this concrete. A 1,536-dimension OpenAI-style embedding stored as 4-byte floats is roughly 6 KB per row before index overhead. An HNSW graph adds edge storage on top — with m=16 the graph holds up to 32 links per node at the base layer, so the index itself can rival or exceed the raw vector storage. Ten million such vectors is on the order of 60 GB of raw vector data plus a comparable index, and you want the working set in RAM for fast traversal because HNSW’s random-access graph walk is brutal on disk. That is the real ceiling: not a hard limit, but the point where the index no longer fits comfortably in the memory of a single reasonable instance, and where every rebuild ties up that memory for the duration. halfvec halves the vector footprint, and binary quantization cuts it by a further order of magnitude at a recall cost, which is exactly why those 0.7.0 features moved the ceiling up rather than being a curiosity.
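The arithmetic above is easy to reproduce. This sketch counts only raw vector payload, assuming 4 bytes per dimension for vector, 2 for halfvec, and 1 bit per dimension for binary-quantized values; it deliberately ignores per-row headers and the HNSW graph itself, so treat the output as a lower bound rather than a sizing tool.

```python
# Bytes per dimension for each storage type (payload only, an approximation).
BYTES_PER_DIM = {"vector": 4.0, "halfvec": 2.0, "bit": 1 / 8}


def raw_vector_bytes(rows, dims, column_type="vector"):
    """Raw payload size of `rows` embeddings, excluding headers and index overhead."""
    return rows * dims * BYTES_PER_DIM[column_type]


ROWS, DIMS = 10_000_000, 1536
for column_type in BYTES_PER_DIM:
    gigabytes = raw_vector_bytes(ROWS, DIMS, column_type) / 10**9
    print(f"{column_type:8s} {gigabytes:6.2f} GB")
```

Because the HNSW index holds its own copy of the indexed values plus the graph links, the memory you need to keep hot is noticeably larger than these figures.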
There is also a vacuum dimension that single-node operators forget. pgvector indexes participate in PostgreSQL’s MVCC machinery, so heavy update or delete churn on embedding rows produces dead tuples that autovacuum must reclaim, and a bloated HNSW index degrades both recall and speed until it is rebuilt. A workload that constantly re-embeds and replaces documents — common in RAG pipelines that re-chunk on every ingest — stresses this path far harder than a write-once corpus. Dedicated engines sidestep MVCC entirely with their own storage formats, which is a genuine architectural advantage for high-churn vector workloads that has nothing to do with query latency.
The reason this matters more than the raw numbers suggest is that the maintenance cost scales with churn, not just size, and the two are easy to conflate when planning. A 50-million-vector corpus that is written once and read forever is comfortably within reach of a well-provisioned single node, because the expensive operations — index builds, vacuums — happen rarely. The same 50 million vectors under a pipeline that re-embeds 5% of the corpus every day is a different animal: it generates a continuous stream of dead tuples, keeps autovacuum perpetually busy, and pushes the HNSW index toward the bloated state where recall silently slips. When you size a pgvector deployment, model the daily write-and-delete volume explicitly, not just the steady-state row count, because it is the churn that determines whether maintenance windows stay invisible or start colliding with your transactional traffic — and it is the churn, far more often than the size, that ends up being the trigger that justifies a dedicated engine.
Deeper Analysis: Indexes, Filtering, and the Decision Matrix
pgvector offers two ANN index types, and the choice between them is the first tuning decision you make. Understanding their mechanics is what lets you predict whether Postgres will hold up or whether you genuinely need a specialist.

Figure 2: HNSW navigates a multi-layer proximity graph from sparse top layers down to dense base layers, while IVFFlat partitions vectors into lists and probes only the nearest few.
HNSW vs IVFFlat: the core index choice
HNSW (Hierarchical Navigable Small World) builds a layered graph where each vector links to its near neighbours. Search starts at a sparse top layer, greedily hops toward the query, and descends through denser layers until it reaches the base. Two build parameters govern quality: m (edges per node, default 16) and ef_construction (candidate-list size during build, default 64). At query time, hnsw.ef_search (default 40) sets how many candidates to explore — raise it for higher recall at the cost of latency. HNSW gives the best recall-latency curve and, crucially, supports incremental inserts without a full rebuild, which is why it is the default choice for most workloads.
IVFFlat (Inverted File with Flat compression) instead clusters vectors into lists (Voronoi cells) and, at query time, scans only the cells nearest the query, with the number of cells set by the probes parameter (ivfflat.probes). It builds far faster and uses less memory than HNSW, but recall is more sensitive to data distribution, and the index should be built after you have representative data because the cluster centroids are learned from whatever rows exist at build time. As a rule: prefer HNSW for read-heavy, recall-sensitive workloads; reach for IVFFlat when build time and memory are the binding constraint and you can tolerate more tuning. HNSW vs IVFFlat is therefore less “which is better” and more “which constraint dominates.”
The parameter interactions matter more than the defaults. For IVFFlat the standard heuristic is lists = rows / 1000 up to about a million rows and sqrt(rows) beyond that, with probes tuned at query time to trade recall for latency — too few lists and each list is huge so you scan too much, too many and you must probe more lists to find your neighbours. For HNSW, raising m and ef_construction builds a denser, higher-recall graph but costs build time and memory permanently, whereas ef_search is a per-query knob you can raise dynamically when a particular query needs better recall. A practical pattern is to keep ef_construction moderate, set a sane default ef_search, and let latency-tolerant batch jobs raise ef_search per session. None of these knobs exist in the same form on a fully managed engine like Pinecone, which is both a simplification and a loss of control — you cannot tune your way out of a recall problem you do not own.
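The IVFFlat heuristic translates directly into a helper. The function names are made up for illustration; the lists rule is the one quoted above, and the probes starting point of sqrt(lists) is the suggestion in the pgvector README rather than something this article measured. Both are starting points to tune against measured recall.

```python
import math


def ivfflat_lists(rows):
    """Starting value for `lists`: rows / 1000 up to 1M rows, sqrt(rows) beyond."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return round(math.sqrt(rows))


def ivfflat_probes(lists):
    """Starting value for `probes`: sqrt(lists), then tune against recall."""
    return max(1, round(math.sqrt(lists)))


for rows in (500_000, 1_000_000, 20_000_000):
    lists = ivfflat_lists(rows)
    print(rows, lists, ivfflat_probes(lists))
```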
A useful way to internalize the distinction is that HNSW spends memory and build time to buy a flatter recall-latency curve, while IVFFlat spends query-time work (probing more lists) to buy recall on a cheaper-to-build index. That framing predicts when each wins. A read-heavy semantic-search service with a stable corpus amortizes HNSW’s expensive build over millions of cheap, high-recall queries — HNSW is the obvious choice. A pipeline that must rebuild its index frequently, or that runs on a memory-constrained instance, may find IVFFlat’s fast, light build worth the per-query probing cost, especially if the queries are latency-tolerant batch jobs rather than interactive ones. The data distribution also tips the balance: IVFFlat’s clustering assumes the data forms reasonably separable groups, so on a corpus where embeddings are smeared uniformly through the space its recall degrades and you must probe many lists to compensate, eroding its build-time advantage. When IVFFlat disappoints despite generous probes, an unclusterable distribution is the usual culprit, and HNSW — which makes no such assumption — is the fix.
Both indexes cap the indexable vector type at 2,000 dimensions because each index tuple must fit inside PostgreSQL’s 8 KB page. Since 0.7.0 you sidestep this: switch the column to halfvec (2-byte floats) to index up to 4,000 dimensions at half the bytes per dimension, or binary-quantize to the bit type for up to 64,000 dimensions. The same release reported up to a 67x HNSW build speedup by combining halfvec with binary quantization, and both index types support CREATE INDEX CONCURRENTLY to avoid blocking writes during a build.
Filtering: the failure mode that defined the old debate
The classic pgvector complaint was “I asked for ten results and got three.” It happened because the ANN index pre-fetched a fixed number of candidates before your WHERE clause was applied, so a selective filter could discard most of them and leave you short of your LIMIT. This is the pre-filter versus post-filter problem: a true pre-filter restricts the candidate set first; naive post-filtering ranks then filters and can return too few rows.

Figure 3: Post-filtering ranks first and can starve the result set, a true pre-filter restricts candidates first, and pgvector iterative scans keep producing candidates until the LIMIT is satisfied.
pgvector 0.8.0 introduced iterative index scans (hnsw.iterative_scan, ivfflat.iterative_scan) that fix this directly: when a filter starves the result set, the index keeps producing more candidates until your LIMIT is met or a scan budget (hnsw.max_scan_tuples, ivfflat.max_probes) is exhausted. For HNSW you choose strict_order (results come back in exact distance order) or relaxed_order (results may be slightly out of distance order, in exchange for better recall); IVFFlat offers relaxed_order only. This is the single most important reason the 2026 pgvector filtering story is far better than the 2024 one. Dedicated engines like Qdrant still hold an edge here: their payload indexes are built specifically for filtered vector search and remain the fastest at complex predicates like price < 50 AND category = 'electronics' AND in_stock = true. Even so, pgvector with iterative scans is now adequate for the large middle of use cases.
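A toy simulation shows why a fixed candidate window starves and how an iterative scan recovers. This is not pgvector's implementation: the ranked list stands in for index output already sorted by distance, and the batch and max_scan parameters are simplified assumptions loosely modelled on the idea of a scan budget.

```python
def post_filter(ranked, matches, limit, ef_search=40):
    """Take a fixed window of candidates, then filter: can return fewer than `limit`."""
    window = ranked[:ef_search]
    return [doc for doc in window if matches(doc)][:limit]


def iterative_scan(ranked, matches, limit, batch=40, max_scan=20_000):
    """Keep pulling batches of candidates until `limit` is met or the budget runs out."""
    results = []
    scanned = 0
    budget = min(len(ranked), max_scan)
    while len(results) < limit and scanned < budget:
        chunk = ranked[scanned:min(scanned + batch, budget)]
        for doc in chunk:
            if matches(doc):
                results.append(doc)
                if len(results) == limit:
                    break
        scanned += len(chunk)
    return results, scanned


ranked = list(range(1000))                  # ids already ordered by distance to the query
is_match = lambda doc: doc % 50 == 0        # a filter that keeps 2% of rows

short = post_filter(ranked, is_match, limit=10)
full, scanned = iterative_scan(ranked, is_match, limit=10)
print(len(short), len(full), scanned)
```

The second number is the point of the feature, and the third is its price: the result set is complete, but the scan touched twelve batches instead of one.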
The remaining gap is most visible at high filter selectivity, and it is worth understanding why, because it is the condition under which the “use a real vector DB” argument still has teeth. When a filter eliminates 99.9% of the corpus (say, documents belonging to one small tenant in a large multi-tenant index), an HNSW graph walk is fighting the index’s own structure: the graph is built around global proximity, not your filter, so it must traverse and discard enormous numbers of non-matching candidates to accumulate enough survivors to satisfy the LIMIT. Iterative scans make this correct (you get your ten rows, provided the scan budget allows) but not necessarily fast (it may scan a large fraction of the index to find them). Qdrant and similar engines attack this by maintaining payload indexes that let the engine restrict to the matching subset first and search within it, which is structurally faster for highly selective filters. The pragmatic pgvector mitigations are an ordinary B-tree index on the filter column, which lets the planner fetch the small matching subset and rank it exactly without touching the ANN index, plus a partial HNSW index when the filter takes only a few distinct values or table partitioning when it takes many. If your dominant query pattern is high-selectivity filtered vector search at scale, though, this specific gap is a legitimate, named trigger to evaluate a dedicated engine.
Hybrid search
Hybrid search blends dense vector similarity with sparse keyword relevance, usually via Reciprocal Rank Fusion (RRF): each retriever returns a ranked list, every document scores 1 / (k + rank) from each list it appears in, the scores sum, and the merged list re-sorts. In Postgres you get this without a sidecar — run a tsvector full-text query and a pgvector ANN query, then fuse the ranks in SQL or in your application. Weaviate ships hybrid search as a first-class single-query feature, which is a genuine convenience; pgvector makes you assemble it, but keeps everything in one database and one transaction.
The reason hybrid search is worth the assembly effort, rather than relying on dense vectors alone, is that the two retrievers fail in complementary ways. Dense vector search excels at semantic similarity — it finds the paragraph about “automobile insurance premiums” when you ask about “car coverage costs” — but it can miss exact-match needles like a specific error code, a product SKU, or a rare proper noun, because those tokens get blended into a holistic embedding that does not privilege them. Sparse keyword search is the mirror image: it nails the exact token and is helpless at paraphrase. Fusing the two with RRF gives you a result set that is robust to both query styles, which is why hybrid retrieval consistently outperforms either retriever alone on heterogeneous real-world queries. The practical point for this ADR is that you do not need a dedicated engine to get this benefit — Postgres already ships a mature full-text search engine in tsvector, so the dense and sparse retrievers live in the same database, and the only thing you assemble is the rank fusion, which is a dozen lines of SQL or application code.
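The fusion step itself is short. Below is a minimal application-side sketch of Reciprocal Rank Fusion; the document ids and the two ranked lists are invented, and k=60 is a commonly used constant rather than a value this article prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Sum 1 / (k + rank) for each document across all ranked lists, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["d1", "d2", "d3"]    # hypothetical ANN results, best first
sparse = ["d3", "d1", "d4"]   # hypothetical full-text results, best first
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that appear in both lists rise to the top, while a document that only one retriever found still survives in the merged result.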
The decision matrix
The table below is the core comparison. Treat latency figures as directional — they come from published 2026 third-party benchmarks and vary with dimension, dataset, and hardware.
| Dimension | pgvector (in Postgres) | Pinecone | Qdrant | Weaviate | Milvus |
|---|---|---|---|---|---|
| Best-fit scale | < ~10–50M vectors/table | Any (managed) | 10M–1B+ | 10M–1B+ | 100M–10B+ |
| Filtered search | Good with iterative scans | Good | Best-in-class | Good | Good |
| Hybrid search | DIY (tsvector + RRF) | Limited | Supported | First-class single query | Supported |
| Operational model | Zero new system | Fully managed, no ops | Self-host or cloud | Self-host or cloud | Heavy self-host or cloud |
| Transactions with app data | Native ACID JOINs | None | None | None | None |
| Sharding / distribution | Manual / Citus | Automatic | Built-in | Built-in | Most mature |
| Cost driver | Existing DB capacity | Per-pod / usage | Compute you run | Compute you run | Compute you run |
| Migration cost to adopt | Near zero | Low | Medium | Medium | High |
The pattern is consistent: pgvector wins on operational simplicity and data integrity and loses on extreme scale and the most demanding filtered/distributed workloads. For deeper context on whether retrieval is even the right architecture versus alternatives, see our breakdown of fine-tuning vs RAG vs long context.
Cost, consistency, and the total bill
Cost comparisons mislead when they only price the database. With pgvector the marginal cost is usually the incremental RAM and CPU your existing Postgres needs to hold the index — often a single instance-size bump rather than a new line item. With a managed engine you pay for the service plus the integration plumbing: the change-data-capture pipeline, the reconciliation job, the extra monitoring, and the engineering time to keep two systems consistent. That hidden integration cost is frequently larger than the database invoice, and it recurs forever. The honest total-cost-of-ownership question is not “which database is cheaper per million vectors” but “what does it cost to keep this consistent and observable for two years.”
Consistency is the other axis the matrix understates. In Postgres, a write to a row and its vector is one ACID transaction — Atomicity, Consistency, Isolation, Durability — so reads never see a half-updated state, and a rollback unwinds both. Every dedicated engine is eventually consistent relative to your system of record: there is a window after a write where search results lag reality. For a recommendation feed that window is harmless; for a system that must never surface a just-deleted or just-revoked document — think access-controlled enterprise RAG — it is a correctness bug you have to engineer around with tombstones and propagation checks. That requirement alone keeps many regulated workloads on pgvector regardless of scale.
Trade-offs, Gotchas, and What Goes Wrong
The most common pgvector failure is silent recall decay. HNSW recall depends on ef_search at query time; if you tuned it on a 1M-row table and grew to 20M without revisiting it, your top-k results quietly get worse and no error is thrown. Recall is invisible in production unless you measure it against an exact-search ground truth on a sample. Build a recall regression test and run it as the corpus grows.
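A recall regression test needs very little code. In this sketch the ids are invented, the approximate lists stand in for what the ANN index returns, the exact lists stand in for a brute-force scan of the same queries, and RECALL_FLOOR is a hypothetical threshold you would set for your own workload.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("ground truth is empty")
    return len(truth.intersection(approx_ids[:k])) / len(truth)


def mean_recall(samples, k):
    """Average recall@k over (approximate, exact) result pairs for sample queries."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in samples) / len(samples)


samples = [
    ([1, 2, 3, 9], [1, 2, 3, 4]),   # one true neighbour missed
    ([5, 6, 7, 8], [5, 6, 7, 8]),   # perfect match
]
score = mean_recall(samples, k=4)

RECALL_FLOOR = 0.85  # hypothetical minimum acceptable recall
print(score, score >= RECALL_FLOOR)
```

Run it on a fixed query sample whenever the corpus grows materially or an index parameter changes, and fail the check when the score drops below the floor.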
The second is index builds that hurt the primary workload. A non-concurrent HNSW build on a large table takes locks and burns CPU your transactional queries need. Always use CREATE INDEX CONCURRENTLY, raise maintenance_work_mem for the build, and schedule rebuilds off-peak. On a busy single node this is the operational pain that eventually justifies a separate engine.
The third is forgetting iterative scans exist. Teams still hit the “got fewer rows than my LIMIT” problem in 2026 simply because hnsw.iterative_scan is off by default. If you filter alongside vector search, turn it on, and know that very selective filters combined with large LIMITs can still scan a lot of the index or stop short at the scan budget, so a B-tree index on the filter column, which lets the planner handle small matching subsets without the ANN index, still matters.
The fourth is a stale CVE on the cluster. pgvector 0.8.2 is a security release fixing a buffer-overflow in parallel HNSW index builds (CVE-2026-3172) that could leak data from other relations or crash the backend. If you run parallel builds, patch to 0.8.2 or later. On the dedicated side, the mirror-image gotchas are dual-write drift, eventual-consistency surprises where a deleted document still returns, and per-pod or per-node pricing that scales faster than your traffic if you over-provision replicas.
A fifth, subtler trap is benchmarking the wrong thing during evaluation. Teams run a vendor’s published benchmark, see a dedicated engine win on raw queries-per-second, and conclude they must migrate — without measuring recall at matched latency, which is the only fair comparison for ANN systems. Any engine can be fast at low recall; the honest metric is recall@10 at a fixed p95 latency on your data distribution and your filter selectivity. A second evaluation mistake is testing on a tiny corpus where everything is fast, then being surprised when behaviour diverges at production size. Always benchmark on a corpus within an order of magnitude of your real one, with your real filter predicates, or the result tells you nothing about the decision you are actually making.
A sixth trap is the abstraction tax going the wrong way. The right retrieval interface keeps stores swappable, but a leaky one that exposes pgvector-specific SQL or a particular engine’s filter DSL across your codebase quietly welds you to that store. The fix is disciplined: the interface speaks in vectors, k, and a neutral filter object, and every store-specific dialect lives behind it. Get this wrong and the “cheap migration” you planned for becomes the rewrite you were trying to avoid.
The final trap deserves a mention because it inverts the usual failure: migrating and keeping both stores forever. Teams that move to a dedicated engine under real scale pressure often run the dual-write phase, with pgvector and the new engine in parallel, as a safety measure, then never decommission the pgvector index because “it’s working and we’re busy.” Now they pay for two stores, maintain two indexes, and carry the very dual-write consistency risk the migration was supposed to be worth. The dual-write phase is a transition, not a destination: it exists to let you validate recall against pgvector as ground truth and cut reads over behind a flag, after which the old index should be dropped on a defined timeline. A migration that never completes is, in total-cost terms, frequently worse than either endpoint alone.
Practical Recommendations
Default to pgvector. If you already run Postgres and your corpus is under roughly ten million vectors per table, adding the extension is the lowest-risk, lowest-cost path, and it keeps your embeddings transactionally consistent with the data they describe. Use HNSW unless build time or memory forces IVFFlat, enable iterative scans the moment you filter, and instrument recall before you trust the results.
Migrate to a dedicated engine on a trigger, not a hunch. The named triggers are: a corpus that pushes a single table past tens of millions of vectors, filtered-query latency that iterative scans plus B-tree indexes cannot bring under your SLO, or index-build and write load that visibly degrades your transactional traffic. When a trigger fires, pick the engine by the dominant constraint — Qdrant for the hardest filtering, Milvus for billion-scale, Pinecone for zero-ops, Weaviate for built-in hybrid. The LLM gateway architecture post covers how to keep the retrieval layer swappable so this migration is a config change, not a rewrite.
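Written down as code, the trigger discipline is a handful of comparisons. The Workload fields, the function names, and the 50-million default ceiling are assumptions chosen to match the ranges discussed in this article; substitute the thresholds and SLOs you have actually measured.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    vectors_per_table: int
    filtered_p95_ms: float      # measured with iterative scans and filter indexes in place
    filtered_slo_ms: float
    builds_degrade_oltp: bool   # index builds or writes visibly hurt transactional traffic


# Engine suggestions by dominant constraint, as named in the recommendation above.
ENGINE_BY_CONSTRAINT = {
    "filtering": "Qdrant",
    "billion_scale": "Milvus",
    "zero_ops": "Pinecone",
    "hybrid": "Weaviate",
}


def fired_triggers(w, scale_ceiling=50_000_000):
    """Return the named migration triggers this workload has actually crossed."""
    triggers = []
    if w.vectors_per_table > scale_ceiling:
        triggers.append("scale")
    if w.filtered_p95_ms > w.filtered_slo_ms:
        triggers.append("filtering")
    if w.builds_degrade_oltp:
        triggers.append("write_load")
    return triggers


def decide(w):
    triggers = fired_triggers(w)
    return "evaluate migration: " + ", ".join(triggers) if triggers else "stay on pgvector"


print(decide(Workload(5_000_000, 40.0, 100.0, False)))
print(decide(Workload(80_000_000, 250.0, 100.0, False)))
```

The value of the exercise is less the function than the inputs: each field has to come from a measurement, which is what keeps the decision from being made on a hunch.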
Checklist before you decide:
- [ ] Measured corpus size and 12-month growth projection per table.
- [ ] Recall test against exact search on a held-out sample.
- [ ] p95 latency of the whole RAG request, not just the vector query.
- [ ] Filter selectivity profile and whether iterative scans meet the SLO.
- [ ] Index-build impact on transactional workload measured under load.
- [ ] Patched to pgvector 0.8.2+ if using parallel builds.
- [ ] A retrieval abstraction so swapping stores is a config change.

Figure 4: When a trigger fires, dual-write into both stores, backfill and validate recall, cut reads over behind a flag, then decommission the pgvector index.
Figure 4 lays out the migration as a reversible, validated sequence rather than a big-bang cutover, and the ordering is deliberate. You dual-write first so the new engine accumulates current data while the old one still serves all reads — no user-visible change, full rollback available. You then backfill historical rows and, critically, validate recall against pgvector as ground truth, because the new engine’s defaults will not match your tuned pgvector behaviour and an unvalidated cutover can silently degrade result quality. Only then do you move reads over behind a feature flag, ramping a percentage of traffic so a regression surfaces on 5% of requests rather than 100%. The final, often-skipped step is to decommission the pgvector index once the new engine is proven, closing out the dual-write cost and consistency risk. Each arrow in the diagram is a checkpoint you can stop at and reverse, which is exactly the property you want when migrating the retrieval layer of a production RAG system.
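The percentage ramp in the read cutover can be sketched as a deterministic bucket check. The function name and the choice of a request or user key are assumptions; any feature-flag system that offers stable percentage rollouts does the same job.

```python
import hashlib


def routes_to_new_engine(request_key, rollout_percent):
    """Stable bucketing: the same key always lands in the same bucket from 0 to 99."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


keys = [f"user-{i}" for i in range(10_000)]   # hypothetical routing keys
share_at_5 = sum(routes_to_new_engine(key, 5) for key in keys) / len(keys)
print(f"{share_at_5:.1%} of keys read from the new engine at a 5% rollout")
```

Because the bucket depends only on the key, raising the percentage adds keys without reshuffling the ones already cut over, and setting it back to zero is the rollback.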
Frequently Asked Questions
Is pgvector slower than a dedicated vector database?
For nearest-neighbour queries at moderate scale, no meaningful difference exists for most applications — tuned pgvector HNSW returns results in single-digit milliseconds, and the RAG request is dominated by embedding and generation latency anyway. Dedicated engines pull ahead on complex filtered queries and at very large scale where distributed sharding matters. Below roughly ten million vectors per table, the database is rarely your bottleneck, so “slower” is usually the wrong question to optimise.
What is the scaling ceiling for pgvector?
There is no hard limit, but practical pain begins in the tens of millions of vectors per table on a single node, driven by index-build time, memory pressure, and maintenance windows rather than query speed. You can extend the ceiling with Citus sharding, halfvec to halve memory, and binary quantization, but at hundreds of millions to billions of vectors a purpose-built distributed engine like Milvus is built for that regime and pgvector is not.
Should I use HNSW or IVFFlat in pgvector?
Use HNSW for almost everything: it gives the best recall-latency curve and supports incremental inserts without a full rebuild. Choose IVFFlat only when index build time or memory is your binding constraint and you can tolerate more tuning, and remember to build it after representative data exists because its clusters are learned at build time. HNSW vs IVFFlat is a constraint question, not a quality ranking.
Does pgvector support metadata filtering and hybrid search?
Yes to both. Metadata filtering is just a SQL WHERE clause, and since 0.8.0 iterative index scans prevent the old “too few results” failure when filters are selective. Hybrid search combines a tsvector full-text query with a pgvector ANN query, fused with Reciprocal Rank Fusion in SQL — no Elasticsearch sidecar required, though you assemble it yourself rather than getting it as one built-in call.
When does a dedicated vector database actually win?
When a concrete trigger fires: a corpus past tens of millions of vectors per table, the hardest filtered-search SLOs that Qdrant’s payload indexes serve better, billion-scale workloads that need Milvus-style sharding, or a desire for zero operational overhead that Pinecone’s fully managed model provides. If none of those apply, a dedicated engine mostly adds a second datastore to keep consistent.
How do I keep my vector store swappable so migration is cheap?
Put a thin retrieval interface in front of the store with two methods — upsert(id, vector, metadata) and search(vector, k, filter) — and keep store-specific logic behind it. When you migrate, dual-write into both stores, backfill historical rows, validate recall against pgvector as ground truth, then cut reads over behind a feature flag. With that abstraction, switching from pgvector to a dedicated engine is a configuration change rather than an application rewrite.
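A minimal sketch of that interface follows. The VectorStore protocol mirrors the two methods named above; InMemoryStore, the equality-only filter mapping, and the retrieve function are hypothetical stand-ins so the example runs without any database, and a pgvector or Qdrant adapter would implement the same two methods behind the same signature.

```python
import math
from typing import Any, Dict, List, Mapping, Optional, Protocol, Sequence, Tuple

Filter = Mapping[str, Any]  # neutral filter: field name -> required value


class VectorStore(Protocol):
    def upsert(self, id: str, vector: Sequence[float], metadata: Mapping[str, Any]) -> None: ...

    def search(self, vector: Sequence[float], k: int, filter: Optional[Filter] = None) -> List[str]: ...


class InMemoryStore:
    """Exact L2 search over a dict; a placeholder for a real store adapter."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[List[float], Dict[str, Any]]] = {}

    def upsert(self, id, vector, metadata):
        self._rows[id] = (list(vector), dict(metadata))

    def search(self, vector, k, filter=None):
        wanted = filter or {}
        candidates = [
            (math.dist(vector, vec), row_id)
            for row_id, (vec, meta) in self._rows.items()
            if all(meta.get(field) == value for field, value in wanted.items())
        ]
        return [row_id for _, row_id in sorted(candidates)[:k]]


def retrieve(store: VectorStore, query: Sequence[float], tenant: str) -> List[str]:
    """Application code depends only on the protocol, never on a store's dialect."""
    return store.search(query, k=2, filter={"tenant": tenant})


store = InMemoryStore()
store.upsert("a", [0.0, 0.0], {"tenant": "t1"})
store.upsert("b", [1.0, 0.0], {"tenant": "t2"})
store.upsert("c", [2.0, 0.0], {"tenant": "t1"})
print(retrieve(store, [0.0, 0.0], tenant="t1"))
```

Keeping the filter as a plain mapping is the design choice that matters: each adapter translates it into a WHERE clause or a payload filter, so no store-specific syntax leaks into callers.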
Further Reading
- Embedding models benchmark: OpenAI, Cohere, Voyage, BGE (2026) — choosing the embeddings that feed any vector store.
- Fine-tuning vs RAG vs long context (2026) — deciding whether retrieval is the right architecture at all.
- LLM gateway architecture (2026) — keeping the retrieval layer swappable behind a control plane.
- pgvector documentation and changelog — authoritative reference for index parameters and version history.
- PostgreSQL: pgvector 0.7.0 release notes — the release that added halfvec, binary vectors, and quantization.
By Riju
