RAG Over CAD and BOM: Reference Architecture for PLM Knowledge Retrieval

Last Updated: 2026-05-16

Architecture at a glance

RAG Over CAD and BOM: Reference Architecture for PLM Knowledge Retrieval — architecture diagram

Three years into the LLM-in-enterprise wave, the “drop your PDFs into a vector store and ask questions” pattern has hit its ceiling inside engineering organisations. The teams that tried it for PLM data — engineering change notices, drawing PDFs, BOM exports, specification documents — kept landing on the same complaint from designers and manufacturing engineers: the answers sound right but cite the wrong revision, the wrong assembly, or a part that was superseded two years ago. The problem is not the LLM. The problem is that PLM knowledge is structured, versioned, and graph-shaped, and a flat vector store throws all three properties away the moment you chunk a STEP file or BOM export into 512-token windows.

This post lays out a reference architecture for RAG over CAD and BOM that has held up in production deployments through 2026 — covering ingestion of CAD geometry, BOM-as-graph modelling, hybrid retrieval with reciprocal rank fusion, and evaluation methodology specific to engineering data. The thesis throughout: graph-augmented retrieval is non-optional for PLM, and CAD embeddings only become useful when they are pinned to BOM nodes rather than floating in a generic vector index.

Why Vanilla RAG Falls Apart on Engineering Data

The standard RAG playbook — chunk text, embed with a generic model, retrieve top-k by cosine similarity, stuff into an LLM prompt — was built for FAQs, support tickets, and policy documents. PLM data violates every assumption that playbook depends on. See the failure-mode diagram in the rendered asset ./assets/arch_01.png.

Assumption 1: documents are self-contained units of meaning. They are not, in PLM. A torque specification on page 14 of SPC-7100 is meaningless without knowing which sub-assembly it applies to, what revision is current, and whether ECN-3318 superseded it last quarter. Chunk that page and you’ve stored a paragraph that reads cleanly but no longer points anywhere. The chunk wins on cosine similarity for queries containing the word “torque”, but the answer it generates is structurally wrong.

Assumption 2: lexical similarity correlates with semantic relevance. This works for prose. It collapses for engineering identifiers. ECN-2901 and ECN-3318 have near-identical embeddings because most of the surrounding language (“Engineering Change Notice”, “approved by”, “effective date”, “supersedes”) is template boilerplate. The discriminating tokens are the IDs themselves, which generic embedding models treat as low-information strings. A vanilla retriever happily returns the wrong ECN as the top hit.

Assumption 3: the retrieval unit is the document. In PLM, the retrieval unit is the part-revision-in-context. The same physical part 47-A2 means different things when used in gearbox SA-12 (where it sees axial load) versus in pump module PM-7 (where it sees radial load). A single chunk that says “47-A2 has a load rating of 4.2 kN” is incomplete without the where-used context. Vector search has no native mechanism to traverse where-used relationships.

Assumption 4: text is the only modality. Engineering knowledge lives in geometry. The reason a designer chose a 4.5 mm fillet over a sharp corner is encoded in the CAD model, in the PMI annotations, and in the GD&T tolerances — not in any prose document. Ignore the geometry and you ignore the answer to a large class of “why” questions.

The visible symptoms are familiar: confident-sounding hallucinations that cite the wrong revision, retrieval that returns the marketing whitepaper instead of the spec, answers that drop the alternate-part relationship entirely. The root cause in every case is the same — the index does not preserve the structural, version, and modal richness of the source data.

End-to-End RAG-over-PLM Reference Architecture

The reference architecture flips the vanilla pattern in three places: ingestion becomes schema-aware rather than generic chunking; storage becomes vector + graph + sparse + metadata rather than a single vector index; and retrieval becomes a planned hybrid rather than a single cosine-top-k call. The stack is illustrated in ./assets/arch_02.png.

Schema-aware ingestion

Ingestion has to respect what the source system already knows. PLM systems — Teamcenter, Aras, 3DEXPERIENCE, Windchill — expose structured APIs that return parts, revisions, BOM rows, ECNs, and document attachments with their relationships intact. The mistake teams make is exporting everything to PDF first and then RAG-ing the PDFs. You lose the structure twice: once when the PLM serialises to PDF, and again when the chunker slices the PDF without knowing which paragraph belongs to which part.

The corrected pattern: use the PLM API directly. Each part-revision becomes a node with attributes (part_id, rev, status, classification, owning_org). Each BOM row becomes an edge with attributes (quantity, position, find_number, effectivity). ECNs become event nodes linked to the parts they change. Documents stay as documents, but they’re linked into the graph at the part-revision or assembly level they specify — not just dumped into a flat folder. CAD files get parsed by a headless kernel (OpenCascade, CAD Exchanger, or vendor SDKs) before chunking; never by a generic file-type sniffer.
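
To make the ingestion output concrete, here is a minimal sketch of the node and edge records, assuming a hypothetical PLM API that returns BOM rows as flat dicts. All field names here are illustrative, not any vendor’s real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartRevisionNode:
    """One part-revision node. Attribute names are illustrative."""
    part_id: str
    rev: str
    status: str
    classification: str
    owning_org: str

@dataclass(frozen=True)
class BomEdge:
    """One BOM row: a parent assembly consumes a child part-revision."""
    parent: tuple   # (part_id, rev) of the parent assembly
    child: tuple    # (part_id, rev) of the child part
    quantity: int
    position: str
    find_number: str
    effectivity: str

def ingest_bom_rows(rows):
    """Turn raw PLM BOM-row dicts into graph nodes and edges.
    The dict keys are assumptions about what the PLM API returns."""
    nodes, edges = {}, []
    for r in rows:
        for side in ("parent", "child"):
            key = (r[f"{side}_id"], r[f"{side}_rev"])
            nodes.setdefault(key, PartRevisionNode(
                part_id=key[0], rev=key[1],
                status=r.get(f"{side}_status", "released"),
                classification=r.get(f"{side}_class", ""),
                owning_org=r.get(f"{side}_org", ""),
            ))
        edges.append(BomEdge(
            parent=(r["parent_id"], r["parent_rev"]),
            child=(r["child_id"], r["child_rev"]),
            quantity=r["quantity"],
            position=r.get("position", ""),
            find_number=r.get("find_number", ""),
            effectivity=r.get("effectivity", ""),
        ))
    return nodes, edges
```

The point of the shape is that nothing is flattened to text at this stage: nodes keep their identity, edges keep their attributes, and documents attach to nodes later.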

Hybrid retrieval

The retrieval planner is where the architecture earns its keep. A user query is rewritten by a small DSPy program (Khattab et al., Stanford NLP) that extracts entities — part IDs, ECN numbers, sub-assembly codes, free-text concepts — and then dispatches three parallel retrievals: sparse (BM25 or SPLADE) for exact-token matches on identifiers, dense (ColBERT v2 or a domain-tuned bi-encoder) for semantic concepts, and a graph walk seeded from the extracted entities, traversing two or three hops to surface related ECNs, alternate parts, and parent assemblies. The three result sets get fused with reciprocal rank fusion (RRF), reranked by a cross-encoder, and assembled into a context pack that carries provenance metadata for every chunk.

This is closer to the GraphRAG pattern published by Microsoft Research in 2024 than to classical RAG. The critical difference for PLM is that the graph is not synthesised from unstructured text via entity extraction — it already exists in the PLM. You are not building a knowledge graph; you are exposing the one the engineering organisation has been maintaining for years.

Constrained generation with citation enforcement

The generator is the least exotic part of the stack and that is intentional. A modern frontier LLM, prompted with a structured context pack and a citation schema that requires every claim to carry a (part_id, rev, doc_uri, span) tuple, produces answers that are auditable by a manufacturing engineer in seconds. A lightweight post-hoc verifier checks that every citation maps back to a span in the retrieved context — if it doesn’t, the claim is rejected and either retried or returned as “insufficient evidence”. This is the difference between a demo and a system an engineering organisation will accept liability for.
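
A minimal version of that post-hoc verifier is a deterministic span check: every claim’s cited span must literally appear in the chunk it cites. The claim and chunk shapes below are assumptions for illustration:

```python
def verify_citations(claims, context_chunks):
    """Post-hoc citation check. Each claim carries a citation dict with
    part_id, rev, doc_uri, and span; the span must appear verbatim in the
    retrieved chunk for that doc_uri. Returns (accepted, rejected)."""
    by_uri = {c["doc_uri"]: c["text"] for c in context_chunks}
    accepted, rejected = [], []
    for claim in claims:
        cit = claim["citation"]
        chunk = by_uri.get(cit["doc_uri"], "")
        if cit["span"] and cit["span"] in chunk:
            accepted.append(claim)
        else:
            # rejected claims are retried or returned as "insufficient evidence"
            rejected.append(claim)
    return accepted, rejected
```

A production verifier would normalise whitespace and units before matching, but the contract is the same: no span, no claim.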

The whole stack is observable end-to-end: query, planner output, three retrieval sub-results, fused ranking, reranker output, generation, and verifier verdict — each is logged with a trace ID. When a designer complains “the system gave me the wrong torque spec”, you can replay the trace and see exactly which retrieval branch surfaced the wrong chunk. This pairs naturally with the agentic RAG patterns we cover elsewhere when the workflow needs multi-step tool use rather than a single retrieval pass.

CAD Embeddings: Chunking Strategies for Drawings and 3D Models

CAD is the modality that text-only RAG ignores and that engineering users notice the absence of first. The naive approaches — OCR the drawing PDF, or dump the STEP file into a tokeniser — both fail badly. STEP is a structured representation of B-rep topology, not free text; tokenising it produces gibberish embeddings. OCR on drawings recovers some title-block text but loses every geometric relationship that matters. The CAD chunking pipeline is shown in ./assets/arch_03.png.

The working pattern is multi-modal CAD chunking. A headless CAD kernel parses the source file into four derived representations — geometric features, multi-view renders, point-cloud geometry, and PMI text — and each gets its own embedder.

Geometric feature extraction uses the B-rep topology to identify named features — holes, slots, bosses, fillets, ribs — and emits a structured record per feature with its dimensions and the face it lives on. These features get encoded as compact numerical vectors (or natural-language descriptions like “M8 through-hole on face F12, depth 24 mm”) which embed cleanly with a standard text encoder. This is what lets a query like “find me parts with M8 through-holes in steel plates thicker than 10 mm” return useful results.
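
The feature-to-text step can be sketched in a few lines, assuming the kernel has already emitted feature records as dicts (the field names here are hypothetical, not any kernel’s real API):

```python
def feature_to_text(feat):
    """Render one B-rep feature record as a natural-language string
    for a standard text embedder. Field names are illustrative."""
    if feat["type"] == "hole":
        thread = feat.get("thread", "")  # e.g. "M8"
        body = ("through-hole" if feat.get("through")
                else f"blind hole, depth {feat['depth_mm']} mm")
        return f"{thread} {body} on face {feat['face']}".strip()
    if feat["type"] == "fillet":
        return f"{feat['radius_mm']} mm fillet on edge {feat['edge']}"
    # fall back to a generic description for feature types not handled above
    return f"{feat['type']} on face {feat.get('face', '?')}"
```

The resulting strings embed cleanly alongside spec prose, which is what makes feature-level queries answerable by the same dense retriever.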

Multi-view rendering is the trick that makes Vision-Language Models (VLMs) usable for CAD. The tessellated mesh is rendered from 6 to 12 standard views — orthographic front/top/side plus a couple of isometrics, plus exploded views for assemblies — and each rendered image is encoded by a VLM like SigLIP or CLIP-Large. The resulting embeddings capture shape, proportion, and visual style at a level that is good enough to find “the bracket that looks like this one” without ever having indexed a single line of text from the model. For drawings specifically, render the sheet at high resolution and treat each view on the sheet as a separate embedded chunk; this preserves the relationship between an isometric and its dimensioned ortho.

Direct 3D geometry encoding is the emerging third leg. Models like Point-BERT and DGCNN (and their 2026 successors) sample the mesh as a point cloud and produce a dense geometric embedding that captures topology in a way 2D renders cannot. This is still a leading-edge research area — NVIDIA’s text-to-CAD work demonstrates that geometry-native encoders are viable, but the engineering tooling around them lags well behind text and image embeddings. In production today, multi-view VLM embeddings carry most of the weight; point-cloud encoders are useful as a re-ranking signal for queries where geometric similarity matters more than visual or feature-list similarity.

PMI and GD&T annotations are extracted as text — tolerance specifications, surface finish callouts, datum references — and embedded with a standard text model, but with a strong inductive bias toward keeping them attached to the face or feature they annotate. A PMI callout floating in a vector store without its anchor is almost useless; pinned to the feature, it answers a real class of “what tolerance is on this surface” queries.

All four representation embeddings get stored together, keyed by (part_id, revision) and linked into the BOM graph as attributes on the part node. The hybrid retriever can then ask, in a single planned query, “find parts visually similar to this bracket AND with M8 through-holes AND used in any sub-assembly under product line G7” — a query that touches all four CAD modalities plus the BOM graph and would be impossible against a flat text index.

BOM as a Graph: Why Pure Vector Search Misses Parent-Child Context

The BOM is the single most important data structure in PLM and the one most violently flattened by vanilla RAG. A multi-level BOM is a directed acyclic graph: assemblies decompose into sub-assemblies decompose into parts, with quantity, position, effectivity, and alternate-part relationships on every edge. The graph schema with vector chunks attached at part nodes is shown in ./assets/arch_04.png.

Pure vector retrieval cannot answer the simplest BOM questions. “Where is part 47-A2 used?” requires a where-used traversal — walking up the graph from the part to every assembly that consumes it. There is no cosine-similarity operator that performs this walk. “What changed in ECN-3318?” requires finding the ECN node and traversing its changes edges to the affected parts and the documents that triggered the change. “What’s the alternate for part 51-Z1?” requires following the alternate_part edge — a relationship that, in vector space, looks identical to “completely different part with similar geometry”.

The corrected pattern stores BOM in a property graph (Neo4j, Memgraph, or an embedded option like KuzuDB for smaller deployments) with a small, stable schema:

  • Nodes: Assembly, Part, Revision, ECN, Document, Specification, NCR (non-conformance report).
  • Edges: has_child (with quantity, position, effectivity attributes), alternate_part, supersedes, changes, specifies, triggered_by, manufactured_at.
  • Attributes on every node: stable IDs, current revision pointer, classification code, owning organisation.
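
To make the where-used traversal concrete, here is a dependency-free sketch over an in-memory edge list — a stand-in for the equivalent graph-database query, not a replacement for one:

```python
from collections import deque

def where_used(edges, part_key, max_hops=10):
    """Walk *up* the BOM: return every assembly that directly or
    transitively consumes `part_key`. `edges` is an iterable of
    (parent, child) node keys, e.g. (("SA-12", "B"), ("47-A2", "D"))."""
    parents_of = {}
    for parent, child in edges:
        parents_of.setdefault(child, set()).add(parent)
    seen, queue = set(), deque([(part_key, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue  # cap traversal depth defensively
        for p in parents_of.get(node, ()):
            if p not in seen:
                seen.add(p)
                queue.append((p, depth + 1))
    return seen
```

No cosine-similarity call can produce this answer set; it is a pure graph operation, which is the whole argument of this section.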

Vector chunks live as attributes on the part-revision nodes. When you embed the CAD multi-view renders, the embedding’s identity is (part_id=47-A2, rev=D, modality=cad_multiview). When the retriever does a graph walk and lands on node 47-A2 rev D, it gets the attached embeddings for free. When the retriever does a vector search and the top hit is that embedding, it gets the graph node and all its edges for free.

This is the same architectural pattern the broader graph-RAG community has converged on. The graph-augmented retrieval pillar piece goes deeper on the general design; what matters here is that for PLM, the graph already exists and is more reliable than any LLM-extracted graph you could build from documents. The work is not in constructing the graph — it’s in efficiently exposing it to retrieval.

Two pragmatic notes from production. First, do not denormalise the BOM into the vector store as repeated text chunks per part — it bloats the index without adding retrieval signal, since the graph walk is faster and more accurate than vector similarity for where-used queries. Second, version the graph. ECNs change the graph itself (a new revision is a new node and a new edge), and answering “what did this assembly look like on 2026-01-15” requires either a temporal graph or a snapshot-and-replay approach. Most teams start with snapshots and graduate to temporal graphs once it hurts enough.

The graph layer also unlocks a class of queries that pure RAG cannot: structural queries that have no good text representation. “Find all assemblies that consume more than three parts owned by the Pune design centre, where any of those parts has an open ECN” is a graph query. A vector retriever cannot answer it. A graph retriever, hooked into the planner, can — and the LLM then narrates the result.

Hybrid Retrieval: Sparse + Dense + Graph Walks

Retrieval is where the rubber meets the road. The hybrid pattern that has worked in 2026 deployments combines three retrievers run in parallel and fuses their outputs with reciprocal rank fusion (RRF). The flow is shown in ./assets/arch_05.png.

Sparse retrieval (BM25 or its learned-sparse cousin SPLADE) is the unsung hero for engineering data. PLM queries are dense with exact-match tokens — part numbers, ECN IDs, drawing numbers — and sparse retrievers handle these natively without any embedding model knowing what an ECN looks like. Skip sparse and you will pay for it on every query that mentions a specific identifier.

Dense retrieval uses a domain-tuned bi-encoder or, better, ColBERT v2 (Khattab et al.) for late-interaction matching. ColBERT’s per-token embeddings handle the long-tail of engineering vocabulary far better than a sentence-level embedding because the discriminating tokens (material grades, tolerance classes, manufacturing process names) get their own vectors instead of being averaged into a single representation. Where budget allows, a small fine-tuning pass on PLM-specific text — ECN narratives, deviation reports, supplier qualification memos — yields outsized recall gains.

Graph walk is seeded from the entities extracted in query rewriting. Given a query mentioning ECN-3318, the walk seeds at that node and traverses 2-3 hops along typed edges: changes-to-parts, parts-used-in-assemblies, assemblies-specified-by-documents. Each touched node is added to the candidate set with a rank derived from edge distance and edge type weight. Walks should be bounded and typed — unbounded walks devolve into “return the whole graph”, which is the graph-DB equivalent of returning every document.
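
A minimal bounded, typed walk looks like the sketch below: only whitelisted edge types are followed, scores decay with hop distance, and depth is capped. The scoring scheme is one illustrative choice, not a standard:

```python
def typed_graph_walk(adj, seeds, edge_weights, max_hops=3):
    """Bounded, typed walk from seed nodes. `adj` maps node ->
    list of (edge_type, neighbor); only edge types present in
    `edge_weights` are followed. Returns {node: score}."""
    scores = {}
    frontier = [(s, 1.0, 0) for s in seeds]
    while frontier:
        node, score, depth = frontier.pop()
        if depth >= max_hops:
            continue  # hard cap: unbounded walks return the whole graph
        for edge_type, nbr in adj.get(node, ()):
            w = edge_weights.get(edge_type)
            if w is None:
                continue  # untyped/unwanted edges are never followed
            s = score * w / (depth + 1)  # decay with distance, weight by type
            if s > scores.get(nbr, 0.0):
                scores[nbr] = s
                frontier.append((nbr, s, depth + 1))
    return scores
```

The candidates and their scores feed straight into the fusion step alongside the sparse and dense rankings.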

RRF fusion (Cormack et al.’s 2009 paper, still the workhorse in 2026) combines the three rankings without needing calibrated scores. The standard 1/(k+rank) formula with k=60 works out of the box; tune k only if you have hold-out evaluations showing a need. Fused candidates then pass through a cross-encoder reranker that scores each candidate against the query directly, and the top 8-12 candidates are packed into the LLM context with their provenance metadata.
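
RRF itself is only a few lines — with k=60 and 1-based ranks it fuses the three rankings without any score calibration:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion (Cormack et al., 2009):
    score(d) = sum over rankings of 1 / (k + rank).
    `rankings` is a list of ranked candidate-ID lists (rank is 1-based).
    Returns all candidates sorted by fused score, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A candidate that appears in two or three of the rankings accumulates score from each, which is why fusion favours cross-retriever agreement over any single retriever’s confidence.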

A subtle point on the planner: the query router should be allowed to skip retrievers. A query that is purely structural (“list all parts in assembly A1 with revision newer than 2025-12-01”) should go directly to the graph and not waste tokens on vector retrieval. A query that is purely conceptual (“explain the difference between austenitic and martensitic stainless steel in the context of our medical fasteners”) should lean dense and skip the graph walk. A DSPy program that learns when to invoke which retriever — trained on a few hundred labelled examples — outperforms a fixed all-three-always pipeline by a meaningful margin and saves compute besides.

Evaluation: How to Measure Engineering RAG Quality

Engineering RAG cannot be evaluated by feel, and it cannot be evaluated by generic benchmarks. The metrics that matter in 2026 deployments are domain-specific and operational.

Recall@k on a held-out query set is the foundation. Build a set of 200-500 real engineering queries with hand-labelled gold-standard answer chunks (the chunks an experienced engineer would point to). Report recall@5 and recall@20 for the fused retriever, and break it down by query type: identifier-lookup, where-used, why-changed, find-similar, structural. The breakdown matters more than the aggregate — a system can have respectable mean recall while being unusable for one important query class.
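
A per-query-type recall@k computation can be this simple (the data shapes are illustrative):

```python
def recall_at_k(results_by_query, gold_by_query, k, query_types=None):
    """Recall@k with a per-query-type breakdown. `results_by_query` maps
    query-id -> ranked chunk-id list; `gold_by_query` maps query-id ->
    set of gold chunk-ids; `query_types` maps query-id -> label such as
    'where-used'. Returns {label: recall} plus an 'all' aggregate."""
    hits, totals = {}, {}
    for qid, gold in gold_by_query.items():
        if not gold:
            continue
        topk = set(results_by_query.get(qid, [])[:k])
        r = len(topk & gold) / len(gold)
        for label in ("all", (query_types or {}).get(qid, "unlabelled")):
            hits[label] = hits.get(label, 0.0) + r
            totals[label] = totals.get(label, 0) + 1
    return {label: hits[label] / totals[label] for label in hits}
```

Reporting only the “all” number hides exactly the failure mode described above, so the breakdown keys are the ones worth watching week to week.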

Exact-attribute match is the metric specific to PLM. For a query like “what is the current revision of part 47-A2”, the answer is either correct or wrong; there is no partial credit. Track exact-match accuracy on a curated set of attribute queries covering revision, material, tolerance, mass, supplier, and effectivity. This is the metric that correlates most directly with engineer trust.

Citation faithfulness measures whether every claim in the generated answer is supported by a chunk in the retrieved context. Compute it automatically with a small judge model (or a deterministic span-matching pass for structured claims) over a sampled stream of production answers. A faithful answer cites real chunks; an unfaithful one cites chunks that don’t actually contain the claimed fact. Track faithfulness as a leading indicator — it usually degrades before users complain.
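
The deterministic span-matching variant is straightforward — the claim and chunk shapes below are assumptions for illustration:

```python
def faithfulness(answer_claims, retrieved_chunks):
    """Span-matching faithfulness for structured claims: the fraction of
    claims whose cited text appears verbatim in the cited chunk. Free-text
    claims would go to a judge model instead of this exact-match pass."""
    chunks = {c["chunk_id"]: c["text"] for c in retrieved_chunks}
    if not answer_claims:
        return 1.0  # vacuously faithful: nothing was claimed
    supported = sum(
        1 for cl in answer_claims
        if cl["span"] in chunks.get(cl["chunk_id"], "")
    )
    return supported / len(answer_claims)
```

Run over a sampled stream of production answers, a drop in this number is the early warning the section describes.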

Hallucination rate, distinct from faithfulness, measures answers that contain claims with no corresponding citation at all. The verifier described earlier should drive this to near-zero; if it’s not, the verifier is too lenient.

Latency budgets matter operationally. A target of P50 < 3 seconds and P95 < 8 seconds for end-to-end query response is achievable with the architecture above; anything slower and engineers will go back to manual PLM search. Profile the graph walks first — they are usually the longest pole — and cap walk depth and breadth aggressively. The closed-loop pattern from field data applies here: instrument production queries, mine the slow ones for tuning opportunities, and feed them back into the retriever-routing program.

Failure Modes and Anti-patterns

A short list of patterns to avoid, drawn from teams that learned the hard way.

Embedding the entire BOM as text. Some teams flatten the BOM into nested text representations and embed each level. The vector index bloats, query latency degrades, and the recall improvement over a proper graph walk is negative. The graph is the right tool; don’t paper over it with vectors.

Single embedding model for all modalities. Using the same encoder for spec prose, ECN narratives, PMI annotations, and CAD drawings is convenient and wrong. The modalities have different signal distributions; one model averages them into mediocrity. Use modality-specific embedders and let the hybrid retriever combine them.

Ignoring revisions. The most common production incident in PLM RAG: the system returns a chunk from rev C of a part when rev E is current. Always filter retrieval by current-revision-as-of-query-time unless the query explicitly asks for history. Make the filter part of the retriever, not the LLM’s responsibility.
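
The filter is nearly a one-liner once every candidate carries its (part_id, rev) identity — a sketch, with illustrative shapes:

```python
def filter_current_revision(candidates, current_rev):
    """Keep only chunks indexed under the current revision of their part.
    `current_rev` maps part_id -> current rev (fetched from the PLM);
    each candidate dict carries the part_id and rev it was indexed under."""
    return [
        c for c in candidates
        if c.get("rev") == current_rev.get(c.get("part_id"))
    ]
```

Because it runs inside the retriever, a stale rev C chunk never reaches the LLM context, which is the only reliable way to enforce the rule.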

No provenance enforcement. Letting the LLM generate answers without a strict citation schema is a one-way ticket to plausible-sounding hallucinations that get cited as fact in change-control meetings. The citation requirement should be enforced at decode time (constrained generation) or post-hoc by the verifier, not left to prompt suggestion.

Treating MBSE/SysML models as documents. SysML v2 models carry structure that should be ingested as graph nodes and relationships, not as text dumps. The SysML v2 and PLM integration patterns cover this in detail — apply the same graph-first thinking as for BOM.

Practical Recommendations

If you are starting a RAG-over-PLM project in mid-2026, the following sequence has the best risk-adjusted return.

Start with the graph, not the embeddings. Stand up a property graph populated from your PLM’s API for one product line. Get the where-used and ECN-traversal queries working end-to-end before you embed anything. Most of the value of the architecture is in the graph.

Add text retrieval next, with sparse before dense. BM25 over your ECN narratives and spec documents will surprise you with how much it handles. Add a dense retriever (ColBERT v2 if you can run it, a strong bi-encoder otherwise) once the sparse baseline is in place.

Bring CAD in last. Multi-view VLM embeddings are powerful but expensive to compute and store. Prove the rest of the stack before you scale CAD embedding to the full part library. Begin with the parts that appear most frequently in user queries.

Instrument from day one. Trace every query through the planner, retrievers, fusion, reranker, and generator. The traces are the data you will use to tune routing, weighting, and prompts. A RAG system without query traces is a black box you cannot improve.

Evaluate against engineers, not benchmarks. A short weekly review with two or three senior engineers, walking through 20 production queries and labelling the answers, will improve the system faster than any public benchmark. The closed-loop continuous-improvement pattern we apply to field data applies just as well to RAG quality. And if you are still selecting the broader stack, our PLM digital transformation tools guide covers the foundation pieces this architecture sits on.

FAQ

Q: Can I use a general-purpose RAG framework like LangChain or LlamaIndex for PLM RAG?
You can use them as glue. The retrievers, graph integration, and CAD ingestion are not in their default abstractions, so you will be writing custom components either way. Treat the framework as orchestration plumbing rather than as a turnkey solution. DSPy is a better fit for the planner/router layer specifically because it lets you optimise the routing program against your own evaluation set.

Q: Do I need a graph database, or can I fake it with relational tables?
For small product lines (a few thousand parts), Postgres with recursive CTEs handles where-used adequately. Past tens of thousands of parts with multi-level BOMs, a graph database starts paying back its operational cost in query latency and developer productivity. The decision is operational, not theoretical.

Q: How do I handle ITAR, EAR, or other export-controlled data in a RAG system?
Apply access control at retrieval time, not at generation time. The retriever must filter candidates by the querying user’s clearance before they reach the LLM. Never let restricted chunks enter the LLM context and rely on the prompt to keep them out — that’s a guaranteed leak. Tag chunks with classification metadata at ingestion, enforce filters in the retriever, and audit retrieval traces against access logs.
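
A minimal retrieval-time filter looks like this, assuming chunks were tagged with classification metadata at ingestion (the tag names are illustrative):

```python
def filter_by_clearance(candidates, user_clearances):
    """Enforce export control in the retriever: a chunk may enter the LLM
    context only if the querying user holds every classification tag on it.
    Untagged chunks are treated as unrestricted in this sketch."""
    return [
        c for c in candidates
        if set(c.get("classification_tags", [])) <= user_clearances
    ]
```

In a real deployment the policy would likely be deny-by-default for untagged chunks; the essential property is that the filter runs before, not after, the LLM sees anything.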

Q: What’s the ROI signal that tells me this is working?
The two leading indicators are time-to-answer for engineering change impact analysis (queries like “what does ECN-3318 affect?”) and reduction in PLM search abandonment. Both are measurable from existing PLM telemetry. Adoption among manufacturing engineers — who are the most skeptical user group — is the lagging indicator that matters.

Q: How does this interact with multi-CAD environments?
The architecture is CAD-system-agnostic at the embedding layer because everything goes through neutral formats (STEP, JT) before being parsed. The differences show up in PMI extraction (vendor-specific) and in PLM API integration (very vendor-specific — Teamcenter, Aras, Windchill, and 3DEXPERIENCE each have their own model). Plan for vendor-specific adapters at the ingestion edge and standard schemas inward of that.

References

  • Khattab, O. et al. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. Stanford NLP. https://arxiv.org/abs/2112.01488
  • Khattab, O. et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. Stanford NLP. https://github.com/stanfordnlp/dspy
  • Edge, D. et al. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research, 2024. https://arxiv.org/abs/2404.16130
  • Yu, X. et al. Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling. CVPR 2022. https://arxiv.org/abs/2111.14819
  • Wang, Y. et al. Dynamic Graph CNN for Learning on Point Clouds (DGCNN). ACM TOG 2019.
  • NVIDIA Research. Text-to-CAD generative models for mechanical design (research blog and publications, 2024–2026).
  • Cormack, G. et al. Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR 2009.
  • Aras Innovator documentation — BOM and Part Structure modelling. https://www.aras.com/
  • Siemens Teamcenter — BOM management and where-used queries (official docs).
  • Dassault Systèmes 3DEXPERIENCE — Engineering BOM and EBOM-to-MBOM transformation (official docs).
