
Vector Embeddings for Code: How AI Really Understands Your Codebase

A technical deep-dive into how vector embeddings power semantic code search. Learn how AI transforms code into meaning and why it matters for developers.

Semantiq Team
February 8, 2026 · 16 min read
Tags: vector-embeddings, semantic-search, machine-learning, code-understanding

Vector embeddings transform code into numerical coordinates in high-dimensional space, where semantically similar code sits close together. This enables AI-powered semantic search that understands what your code does, not just what it says. Modern code embedding models like Voyage-3 and local ONNX models power tools like Semantiq to find related functions, detect duplicates, and understand codebases at scale—all by treating code as meaning, not text.

What Are Vector Embeddings?#

Imagine you're trying to explain where a restaurant is located. You could say "near the park, two blocks from Main Street," but coordinates like (40.7589, -73.9851) are more precise. Vector embeddings work the same way for meaning.

Instead of describing code with words, embeddings represent it as a point in high-dimensional space—typically 768, 1024, or even 1536 dimensions. A function that sorts an array might be at coordinates [0.23, -0.45, 0.67, ...], while another sorting function (even in a different language) sits nearby because they share semantic meaning.

The core idea: distance in embedding space correlates with semantic similarity. Functions that do similar things cluster together, regardless of their syntax, variable names, or programming language.

Plain Text
Text representation:
"function that sorts an array in ascending order"

Vector representation:
[0.234, -0.456, 0.678, 0.123, -0.890, 0.345, ...] (1024 dimensions)
                         ↓
         Numerical coordinates in meaning-space

This transformation from symbols to semantics is what enables AI to "understand" code in a way that keyword search never could.

From Text to Meaning: How Embedding Models Work#

Converting source code to vector embeddings takes several steps, powered by transformer-based neural networks.

Step 1: Tokenization#

First, code is broken into tokens—not just words, but meaningful units including operators, keywords, and special characters. A tokenizer might split getUserById(42) into ['get', 'User', 'By', 'Id', '(', '42', ')'], preserving semantic structure.

Modern code tokenizers understand:

  • CamelCase and snake_case conventions
  • Programming language keywords (async, const, class)
  • Operators and syntax (=>, ::, ?.)
  • Common patterns like function signatures
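To make this concrete, here is a toy sketch of identifier-aware splitting. The hand-written rules are for illustration only — production models use learned subword tokenizers such as BPE or WordPiece, not regexes:

```typescript
// Toy tokenizer: split identifiers on snake_case and camelCase boundaries
// and keep multi-character operators as single tokens. Illustrative only;
// real embedding models use learned subword vocabularies.
function tokenizeCode(code: string): string[] {
  const rawTokens =
    code.match(/[A-Za-z_][A-Za-z0-9_]*|=>|::|\?\.|\d+|[^\s\w]/g) ?? [];
  return rawTokens.flatMap(tok => {
    if (/^[A-Za-z_]/.test(tok)) {
      return tok
        .split('_')                       // snake_case boundaries
        .filter(Boolean)
        .flatMap(part => part.split(/(?=[A-Z])/)) // camelCase boundaries
        .map(p => p.toLowerCase());
    }
    return [tok]; // operators, digits, punctuation pass through
  });
}

console.log(tokenizeCode('getUserById(42)'));
// → ['get', 'user', 'by', 'id', '(', '42', ')']
```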

Step 2: Transformer Encoding#

The tokens pass through a transformer model—the same architecture behind GPT and BERT, but trained specifically on code. Transformers use self-attention mechanisms to understand relationships between tokens:

TypeScript
// The model learns that these tokens are related:
async function fetchUser(id: string): Promise<User> { /* ... */ }
//    ↑ `async` pairs with the `Promise<User>` return type

Each transformer layer builds increasingly abstract representations, from syntax to semantics. Early layers capture patterns like "this is a function declaration," while deeper layers understand "this fetches data asynchronously."

Step 3: Pooling to Fixed Dimensions#

Transformer outputs are variable-length (one vector per token), but we need a single fixed-size vector for the entire code snippet. Pooling strategies include:

  • Mean pooling: Average all token vectors
  • CLS token pooling: Use a special classification token
  • Max pooling: Take maximum values across dimensions

The result is a dense vector—a single point in high-dimensional space that represents the code's meaning.
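As a sketch of the most common strategy, here is mean pooling over a toy 4-dimensional model output (the `number[][]` per-token layout is an assumption for illustration; real runtimes hand you a flat tensor):

```typescript
// Mean pooling: collapse a variable-length sequence of per-token vectors
// into one fixed-size vector by averaging each dimension independently.
function meanPool(tokenVectors: number[][]): number[] {
  const dims = tokenVectors[0].length;
  const pooled = new Array(dims).fill(0);
  for (const vec of tokenVectors) {
    for (let d = 0; d < dims; d++) pooled[d] += vec[d];
  }
  return pooled.map(sum => sum / tokenVectors.length);
}

// Three token vectors in a toy 4-dimensional space:
const tokens = [
  [1, 0, 2, 0],
  [3, 0, 0, 2],
  [2, 0, 1, 1],
];
console.log(meanPool(tokens)); // → [2, 0, 1, 1]
```

However many tokens the snippet has, the output always has `dims` entries — that fixed size is what lets every chunk live in the same vector space.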

Why Code Needs Specialized Models#

General-purpose text embeddings struggle with code because:

  1. Syntax carries meaning: user.getName() and user?.getName() are semantically different due to one character
  2. Context hierarchies: A variable's meaning depends on its scope (local, class, module)
  3. Cross-language patterns: Promise<T> in TypeScript and Future[T] in Scala represent the same concept
  4. Structure matters: Indentation, brackets, and whitespace are semantic, not decorative

Code embedding models are trained on billions of lines of real code from GitHub, Stack Overflow, and documentation to learn these patterns.

Code Embeddings vs Text Embeddings#

Let's see why specialized code embeddings outperform general text models with concrete examples.

Example 1: Semantically Identical, Textually Different#

These functions are semantically identical but share almost no keywords:

Python
def total_price(items):
    return sum(item['price'] for item in items)

JavaScript
const calculateSum = (products) =>
  products.reduce((acc, p) => acc + p.cost, 0);

Text embedding similarity: ~0.35 (poor match)
Code embedding similarity: ~0.89 (strong match)

Code embeddings recognize both implement "sum of prices from a collection," while text embeddings see different languages, variable names, and keywords.

Example 2: Function Signatures vs Implementations#

TypeScript
// Declaration
interface UserRepository {
  findById(id: string): Promise<User | null>;
}

// Implementation
class PostgresUserRepo implements UserRepository {
  async findById(id: string): Promise<User | null> {
    const result = await this.db.query(
      'SELECT * FROM users WHERE id = $1', [id]
    );
    return result.rows[0] || null;
  }
}

Code embeddings understand that:

  • The interface defines a contract
  • The implementation fulfills that contract
  • Both are related but serve different purposes
  • The SQL query is part of the implementation strategy

Text embeddings would miss these architectural relationships.

Example 3: Cross-Language Relationships#

Go
// Go
type Result[T any] struct {
    Value T
    Err   error
}

Rust
// Rust
enum Result<T, E> {
    Ok(T),
    Err(E),
}

Code embeddings recognize both implement the "Result monad" pattern despite completely different syntax and keywords. This enables finding equivalent patterns across languages.

The Embedding Pipeline#

Here's how a production code embedding system works, step by step:

Step 1: Parse Code Structure#

Use a parser like Tree-sitter to extract syntactic structure:

TypeScript
import Parser from 'tree-sitter';
import TypeScript from 'tree-sitter-typescript';

const parser = new Parser();
// tree-sitter-typescript exports { typescript, tsx } grammars
parser.setLanguage(TypeScript.typescript);

const sourceCode = `
function calculateDiscount(price: number, rate: number): number {
  return price * (1 - rate);
}
`;

const tree = parser.parse(sourceCode);
// The tree is an AST with function boundaries, parameters, and types

Step 2: Chunk Intelligently#

Split code into meaningful units—not arbitrary character limits, but semantic boundaries:

TypeScript
// Good chunking (by function)
chunk1 = "function calculateDiscount(price: number, rate: number): number { ... }"
chunk2 = "function applyTax(amount: number, taxRate: number): number { ... }"

// Bad chunking (by character count)
chunk1 = "function calculateDiscount(price: nu"
chunk2 = "mber, rate: number): number { return"

Respect:

  • Function/method boundaries
  • Class definitions
  • Module boundaries
  • Comment blocks (docstrings)
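As a toy illustration of function-boundary chunking (a regex sketch — real systems walk the Tree-sitter AST instead, which handles nesting, classes, and comments correctly):

```typescript
// Toy chunker: split a source file at top-level `function` keywords.
// A regex cannot handle nested or class methods; it only illustrates
// the idea of chunking at semantic boundaries rather than byte counts.
function chunkByFunction(source: string): string[] {
  return source
    .split(/(?=^function\s)/m) // lookahead keeps the keyword in each chunk
    .map(chunk => chunk.trim())
    .filter(Boolean);
}

const file = `
function calculateDiscount(price, rate) { return price * (1 - rate); }
function applyTax(amount, taxRate) { return amount * (1 + taxRate); }
`;
console.log(chunkByFunction(file).length); // → 2
```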

Step 3: Generate Embeddings#

Run chunks through an embedding model (local ONNX example):

TypeScript
import { InferenceSession, Tensor } from 'onnxruntime-node';

async function embedCode(code: string): Promise<number[]> {
  // Load model
  const session = await InferenceSession.create('code-embedding-model.onnx');

  // Tokenize (int64 tensors require a BigInt64Array)
  const tokens = tokenize(code); // [101, 2339, 2129, ...]
  const inputTensor = new Tensor(
    'int64',
    BigInt64Array.from(tokens.map(t => BigInt(t))),
    [1, tokens.length]
  );

  // Run inference
  const outputs = await session.run({ input_ids: inputTensor });
  const embedding = outputs.last_hidden_state.data;

  // Mean pooling
  return meanPool(embedding); // [0.234, -0.456, ...] (768 dims)
}

Step 4: Store in Vector Database#

Index embeddings for fast similarity search:

SQL
-- Using SQLite with the sqlite-vec extension (vec0 virtual tables)
CREATE VIRTUAL TABLE code_embeddings USING vec0(
  chunk_id TEXT PRIMARY KEY,
  file_path TEXT,
  code TEXT,
  embedding FLOAT[768]
);

INSERT INTO code_embeddings VALUES (
  'func_123',
  'src/utils/math.ts',
  'function calculateDiscount(...)',
  vec_f32('[0.234, -0.456, ...]')
);

Step 5: Query with Cosine Similarity#

Search for semantically similar code:

TypeScript
async function searchSimilarCode(query: string, topK = 10) {
  // Embed the query
  const queryEmbedding = await embedCode(query);

  // Cosine *distance* is small for similar vectors,
  // so similarity = 1 - distance
  const results = await db.query(`
    SELECT
      chunk_id,
      file_path,
      code,
      1 - vec_distance_cosine(embedding, ?) AS similarity
    FROM code_embeddings
    ORDER BY similarity DESC
    LIMIT ?
  `, [queryEmbedding, topK]);

  return results.filter(r => r.similarity > 0.7); // Threshold
}

Cosine Similarity: Finding Related Code#

Cosine similarity measures the angle between two vectors, ranging from -1 (opposite) to 1 (identical). For code embeddings, it's the gold standard metric.

The Math (Simplified)#

Given two embedding vectors A and B:

Plain Text
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:
- A · B = dot product (sum of element-wise multiplication)
- ||A|| = magnitude of A (square root of sum of squares)

Visual Analogy#

Imagine two arrows in 3D space:

  • Same direction (parallel arrows): similarity = 1.0
  • Perpendicular (90° angle): similarity = 0.0
  • Opposite direction (180° angle): similarity = -1.0

In 768-dimensional space, the same principle applies. Similar code "points" in similar semantic directions.
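The formula translates directly into a few lines of code. With toy 3-dimensional vectors, the three cases above look like this:

```typescript
// Cosine similarity: dot product divided by the product of magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

console.log(cosineSimilarity([1, 0, 1], [1, 1, 0])); // ≈ 0.5 (60° apart)
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // ≈ 1 (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // ≈ 0 (perpendicular)
```

Note that magnitude cancels out: `[1, 2, 3]` and `[2, 4, 6]` score ≈ 1 because only the direction in meaning-space matters, not the vector's length.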

Practical Thresholds#

From real-world code search systems:

Similarity   Interpretation    Use Case
0.95-1.0     Near duplicates   Detect copy-paste code
0.85-0.95    Highly related    Find alternative implementations
0.75-0.85    Related           Discover similar patterns
0.65-0.75    Loosely related   Explore related concepts
< 0.65       Unrelated         Filter out noise

Why It Beats Keyword Matching#

TypeScript
// Query: "validate email address"

// Keyword match: LOW (different words)
function checkEmailFormat(str: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(str);
}

// Cosine similarity: 0.87 (HIGH)
// Model understands: regex validation + email pattern = email validation

Embeddings capture intent and meaning, not just token overlap.

Leading Code Embedding Models in 2026#

Code embedding models have advanced substantially over the past two years. Here's a comparison of leading options:

Model                          Dimensions                 Context Length  Deployment   Best For
Voyage-3-large                 2048 (flexible: 256-2048)  32K tokens      API (Cloud)  Highest accuracy, large contexts
OpenAI text-embedding-3-large  3072 (flexible: 256-3072)  8K tokens       API (Cloud)  General code + docs, high quality
EmbeddingGemma (308M)          768 (flexible: 128-768)    2K tokens       Local ONNX   Privacy, on-device, fast inference
Nomic Embed v1.5               768 (flexible: 64-768)     8K tokens      Local ONNX   Open source, reproducible
StarEncoder                    768                        1K tokens       Local        Code-native, 86 languages
CodeBERT                       768                        512 tokens      Local        Legacy, smaller contexts

Choosing the Right Model#

For maximum accuracy (cloud OK):

  • Voyage-3-large or OpenAI text-embedding-3-large
  • Best semantic understanding, supports long files

For privacy and local deployment:

  • EmbeddingGemma or Nomic Embed
  • Run entirely offline with ONNX Runtime
  • Semantiq uses this approach

For specialized code tasks:

  • StarEncoder embeddings for multi-language codebases
  • Fine-tune on your domain (security, web, systems)

Performance considerations:

  • Embedding 1000 functions:
    • Cloud API: ~2-5 seconds (parallelized)
    • Local ONNX (CPU): ~15-30 seconds
    • Local ONNX (GPU): ~3-8 seconds

How Semantiq Uses Embeddings#

Semantiq uses vector embeddings as part of a hybrid search strategy, combining semantic understanding with traditional text search for optimal results.

ONNX-Powered Local Inference#

Semantiq runs embedding models entirely on your machine using ONNX Runtime:

TypeScript
// Simplified architecture
class SemanticIndexer {
  private model: InferenceSession;
  private vectorDB: VectorStore;

  async initialize() {
    // Load optimized ONNX model (~300MB)
    this.model = await InferenceSession.create(
      'models/embedding-gemma-308m.onnx',
      { executionProviders: ['cuda', 'cpu'] } // GPU if available
    );
  }

  async indexRepository(repoPath: string) {
    // Parse with Tree-sitter
    const chunks = await parseCodeStructure(repoPath);

    // Batch embed (efficient)
    const embeddings = await this.batchEmbed(chunks);

    // Store locally (SQLite + FTS5)
    await this.vectorDB.insert(chunks, embeddings);
  }
}

Benefits:

  • No data leaves your machine
  • Works offline
  • No API costs
  • Fast local inference (~5ms per chunk on modern CPUs)

Hybrid Search Strategy#

Semantiq doesn't rely solely on embeddings. It combines:

  1. Vector search (semantic): Find conceptually similar code
  2. Ripgrep (exact): Match precise patterns and symbols
  3. FTS5 (full-text): Index identifiers and comments
TypeScript
async function hybridSearch(query: string) {
  const [semanticResults, exactResults, textResults] = await Promise.all([
    vectorSearch(query),   // Embedding similarity
    ripgrepSearch(query),  // Regex + literal matches
    fts5Search(query)      // Token-based text search
  ]);

  // Merge and rank by combined score
  return mergeResults(semanticResults, exactResults, textResults);
}

This hybrid approach achieves:

  • Recall: Embeddings find semantically related code you'd miss with keywords
  • Precision: Exact search eliminates false positives
  • Speed: FTS5 provides instant identifier lookup
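One common way to merge ranked lists from several engines is reciprocal rank fusion. This is a sketch of the idea only — the `Hit` shape and damping constant are illustrative assumptions, not Semantiq's actual merge logic:

```typescript
interface Hit { id: string }

// Reciprocal rank fusion: each engine contributes 1 / (k + rank) to an
// item's combined score, so items ranked highly by several engines win.
function mergeResults(...rankings: Hit[][]): Hit[] {
  const k = 60; // conventional RRF damping constant
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((hit, rank) => {
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => ({ id }));
}

const semantic = [{ id: 'auth.ts' }, { id: 'login.ts' }];
const exact    = [{ id: 'login.ts' }, { id: 'token.ts' }];
const fullText = [{ id: 'login.ts' }];
console.log(mergeResults(semantic, exact, fullText).map(h => h.id));
// → ['login.ts', 'auth.ts', 'token.ts']
```

Rank fusion has a useful property here: it needs no score normalization, which matters because cosine similarities, ripgrep hits, and FTS5 ranks are not directly comparable numbers.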

Adaptive ML Thresholds#

Semantiq learns optimal similarity thresholds based on your codebase:

TypeScript
// Analyze codebase structure
const stats = analyzeCodebase(repo);

// Adjust thresholds based on:
// - Code duplication level (high dup → raise threshold)
// - Language diversity (multi-lang → lower threshold)
// - Project size (large → stricter filtering)

const threshold = baseThreshold * (1 + stats.duplicationFactor * 0.3)
                                * (1 - stats.languageDiversity * 0.2);

Smaller, focused codebases use lower thresholds (0.65) to surface more results. Large monorepos use higher thresholds (0.80) to filter noise.

Practical Applications#

Vector embeddings enable several code intelligence features:

1. Semantic Code Search#

Find functions by describing what they do:

Plain Text
Query: "parse JWT token and extract user claims"

Results (even if no keywords match):
- decodeAuthToken(token: string): UserClaims
- extractJWTPayload(jwt: string): Claims
- parseBearer(authHeader: string): TokenData

2. Duplicate Detection#

Find copy-pasted or reimplemented code:

TypeScript
// Original
function calculateShipping(weight: number, distance: number) {
  const baseRate = 5.0;
  return baseRate + (weight * 0.5) + (distance * 0.1);
}

// Detected duplicate (0.94 similarity)
function getDeliveryCost(kg: number, km: number) {
  const base = 5.0;
  return base + kg * 0.5 + km * 0.1;
}

3. Refactoring Suggestions#

Identify candidates for extraction:

Plain Text
High similarity cluster detected:
- processUserData() [similarity: 0.92]
- handleCustomerInfo() [similarity: 0.91]
- transformAccountDetails() [similarity: 0.90]

Suggestion: Extract common pattern into shared utility

4. Vulnerability Detection#

Find similar patterns to known vulnerabilities:

TypeScript
// Known SQL injection pattern (in training data)
const query = "SELECT * FROM users WHERE id = " + userId;

// Detected similar pattern (0.88 similarity)
const sql = `DELETE FROM sessions WHERE user = ${userInput}`;
// ↑ Flagged as risky

5. Codebase Clustering#

Visualize code organization and identify architectural boundaries:

Plain Text
Cluster 1 (Authentication): 45 functions, avg similarity 0.82
Cluster 2 (Database Access): 67 functions, avg similarity 0.79
Cluster 3 (API Handlers): 89 functions, avg similarity 0.75

Outliers: 12 functions with low cluster similarity
→ Potential candidates for refactoring or better organization
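A greedy single-pass grouping over embeddings can sketch the idea. Real systems use proper clustering such as k-means or HDBSCAN over the embedding space; the 3-dimensional vectors here are toys:

```typescript
// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, ma = 0, mb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; ma += a[i] ** 2; mb += b[i] ** 2;
  }
  return dot / (Math.sqrt(ma) * Math.sqrt(mb));
}

// Greedy clustering sketch: assign each embedding to the first cluster
// whose representative it is similar enough to, else start a new cluster.
function clusterEmbeddings(embeddings: number[][], threshold = 0.8): number[] {
  const representatives: number[][] = [];
  return embeddings.map(emb => {
    for (let c = 0; c < representatives.length; c++) {
      if (cosine(emb, representatives[c]) >= threshold) return c;
    }
    representatives.push(emb);
    return representatives.length - 1;
  });
}

// Two similar vectors and one outlier:
const labels = clusterEmbeddings([
  [1, 0.1, 0], [0.9, 0.2, 0.1], [0, 0, 1],
]);
console.log(labels); // → [0, 0, 1]
```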

Building Your Own: A Minimal Example#

Here's a simplified educational example in TypeScript showing how to build a basic code embedding search system:

TypeScript
// minimal-code-search.ts
import { InferenceSession, Tensor } from 'onnxruntime-node';
import Database from 'better-sqlite3';

// Simple tokenizer (real systems use proper tokenizers like tiktoken)
function tokenize(code: string): number[] {
  // Simplified: split on whitespace and map to token IDs
  const words = code.toLowerCase().split(/\s+/);
  return words.map(w => w.charCodeAt(0) % 1000); // Dummy mapping
}

// Load embedding model
async function createEmbedder(modelPath: string) {
  const session = await InferenceSession.create(modelPath);

  return async (code: string): Promise<number[]> => {
    const tokens = tokenize(code);
    const inputTensor = new Tensor('int64',
      new BigInt64Array(tokens.map(t => BigInt(t))),
      [1, tokens.length]
    );

    const outputs = await session.run({ input_ids: inputTensor });
    const embedding = Array.from(outputs.last_hidden_state.data as Float32Array);

    // Mean pooling over the flat [tokens, dims] row-major output
    const dims = 768;
    const pooled = new Array(dims).fill(0);
    for (let i = 0; i < embedding.length; i++) {
      pooled[i % dims] += embedding[i];
    }
    return pooled.map(x => x / (embedding.length / dims));
  };
}

// Vector database (SQLite)
class VectorDB {
  private db: Database.Database;

  constructor(dbPath: string) {
    this.db = new Database(dbPath);
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS embeddings (
        id INTEGER PRIMARY KEY,
        code TEXT,
        embedding BLOB
      )
    `);
  }

  insert(code: string, embedding: number[]) {
    const blob = Buffer.from(new Float32Array(embedding).buffer);
    this.db.prepare('INSERT INTO embeddings (code, embedding) VALUES (?, ?)')
      .run(code, blob);
  }

  search(queryEmbedding: number[], topK = 5) {
    const rows = this.db.prepare('SELECT id, code, embedding FROM embeddings').all();

    const results = rows.map((row: any) => ({
      code: row.code,
      similarity: cosineSimilarity(
        queryEmbedding,
        // Respect the Buffer's byte offset when viewing it as floats
        Array.from(new Float32Array(
          row.embedding.buffer, row.embedding.byteOffset, row.embedding.byteLength / 4
        ))
      )
    }));

    return results
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, topK);
  }
}

// Cosine similarity
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Usage
async function main() {
  const embed = await createEmbedder('model.onnx');
  const db = new VectorDB('code.db');

  // Index some code
  const snippets = [
    'function sort(arr) { return arr.sort(); }',
    'const sum = (nums) => nums.reduce((a,b) => a+b, 0);',
    'function multiply(a, b) { return a * b; }'
  ];

  for (const code of snippets) {
    const embedding = await embed(code);
    db.insert(code, embedding);
  }

  // Search
  const query = 'add numbers together';
  const queryEmbedding = await embed(query);
  const results = db.search(queryEmbedding);

  console.log('Results for:', query);
  results.forEach(r =>
    console.log(`${r.similarity.toFixed(3)}: ${r.code}`)
  );
}

main();

This example demonstrates the core concepts. Production systems add:

  • Proper tokenization (WordPiece, BPE)
  • Batch processing for efficiency
  • Incremental index updates
  • Advanced vector stores (FAISS, Milvus)
  • Query optimization

The Future of Code Embeddings#

Here's what's coming next:

Multi-Modal Code Understanding#

Future models will embed not just code, but:

  • Architecture diagrams: Relate UML to implementations
  • Documentation: Link prose explanations to code
  • Runtime traces: Connect behavior to source
  • Commit messages: Understand intent and evolution
Plain Text
Query: "authentication flow with OAuth"

Results:
- Code: OAuthHandler.authenticate()
- Diagram: auth-flow.png (sequence diagram)
- Docs: "OAuth Integration Guide" (page 12)
- Test: test_oauth_flow.ts

Cross-Repository Understanding#

Imagine embeddings that span:

  • Your codebase + dependencies
  • Public libraries on npm/PyPI
  • Stack Overflow solutions
  • GitHub code examples

Search for "rate limiting middleware" and find:

  1. Your existing implementation
  2. Express.js rate-limiter package
  3. Similar patterns in other repos
  4. Related Stack Overflow answers

All semantically ranked and compared.

Real-Time Incremental Updates#

Current systems reindex entire codebases. Future systems will:

  • Embed files as you edit (under 50ms latency)
  • Update only changed functions
  • Maintain consistency across refactors
  • Propagate changes through dependency graphs

Specialized Domain Models#

We'll see embedding models fine-tuned for:

  • Security: Recognize vulnerability patterns
  • Performance: Identify optimization opportunities
  • Testing: Suggest test cases based on code coverage
  • Migration: Map deprecated APIs to modern equivalents

Smaller, Faster Models#

The trend toward edge computing will drive:

  • Sub-100MB models running on laptops
  • Hardware acceleration (NPU, GPU)
  • Quantization to 4-bit and 8-bit precision
  • Embeddings generated in under 1ms per function
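Quantization itself is simple at heart. Here is a sketch of symmetric int8 quantization for a single embedding vector — one scale per vector, each float mapped to a signed byte, cutting storage to a quarter of float32 at a small precision cost:

```typescript
// Symmetric int8 quantization: scale so the largest |value| maps to 127.
function quantize(v: number[]): { scale: number; data: Int8Array } {
  const maxAbs = Math.max(...v.map(Math.abs)) || 1; // avoid divide-by-zero
  const scale = maxAbs / 127;
  return { scale, data: Int8Array.from(v, x => Math.round(x / scale)) };
}

// Reverse the mapping; values come back approximately equal.
function dequantize(q: { scale: number; data: Int8Array }): number[] {
  return [...q.data].map(b => b * q.scale);
}

const original = [0.234, -0.456, 0.678];
const restored = dequantize(quantize(original));
// restored ≈ original, at 1/4 the storage of float32
```

Because cosine similarity only cares about direction, embeddings tolerate this kind of rounding well, which is why 8-bit (and even 4-bit) storage works in practice.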

Conclusion#

Vector embeddings have changed how AI understands code. By converting syntax into semantics, they enable search systems that grasp intent, find patterns across languages, and surface insights impossible with keyword matching.

Whether you're building code search tools, analyzing security vulnerabilities, or exploring a new codebase, embeddings provide a foundation for intelligent code understanding. Tools like Semantiq bring this technology to your local machine—no cloud required, no data shared, just fast, semantic code search.


Try Semantiq to experience semantic code search powered by local vector embeddings.
