Vector embeddings transform code into numerical coordinates in high-dimensional space, where semantically similar code sits close together. This enables AI-powered semantic search that understands what your code does, not just what it says. Modern code embedding models like Voyage-3 and local ONNX models power tools like Semantiq to find related functions, detect duplicates, and understand codebases at scale—all by treating code as meaning, not text.

What Are Vector Embeddings?

Imagine you're trying to explain where a restaurant is located. You could say "near the park, two blocks from Main Street," but coordinates like (40.7589, -73.9851) are more precise. Vector embeddings work the same way for meaning.

Instead of describing code with words, embeddings represent it as a point in high-dimensional space—typically 768, 1024, or even 1536 dimensions. A function that sorts an array might be at coordinates [0.23, -0.45, 0.67, ...], while another sorting function (even in a different language) sits nearby because they share semantic meaning.

The core idea: distance in embedding space correlates with semantic similarity. Functions that do similar things cluster together, regardless of their syntax, variable names, or programming language.

Plain Text

Text representation:
"function that sorts an array in ascending order"

Vector representation:
[0.234, -0.456, 0.678, 0.123, -0.890, 0.345, ...] (1024 dimensions)
                    ↓
         Numerical coordinates in meaning-space

This transformation from symbols to semantics is what enables AI to "understand" code in a way that keyword search never could.

From Text to Meaning: How Embedding Models Work

Converting source code to vector embeddings takes several steps, powered by transformer-based neural networks.

Step 1: Tokenization

First, code is broken into tokens—not just words, but meaningful units including operators, keywords, and special characters. A tokenizer might split getUserById(42) into ['get', 'User', 'By', 'Id', '(', '42', ')'], preserving semantic structure.

Tokenizers built for code are designed to split cleanly around:

CamelCase and snake_case conventions
Programming language keywords (async, const, class)
Operators and syntax (=>, ::, ?.)
Common patterns like function signatures
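
As a rough illustration of the splitting described above, the sketch below breaks identifiers on camelCase and snake_case boundaries and keeps a few multi-character operators intact. The names `tokenizeCode`, `splitIdentifier`, and `TOKEN_PATTERN` are hypothetical, and real subword tokenizers (BPE, WordPiece) learn their splits from data rather than applying fixed rules like these.

```typescript
// Hypothetical pre-tokenizer pattern: identifiers, numbers, a few
// multi-character operators, then any other non-whitespace character.
const TOKEN_PATTERN = /[A-Za-z_$][A-Za-z0-9_$]*|\d+|=>|\?\.|::|\S/g;

// Split an identifier on snake_case and camelCase boundaries.
function splitIdentifier(identifier: string): string[] {
    return identifier
        .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // mark camelCase boundaries
        .split(/[_\s]+/)                         // split on underscores and marks
        .filter(part => part.length > 0);
}

function tokenizeCode(code: string): string[] {
    const rawTokens: string[] = code.match(TOKEN_PATTERN) || [];
    const tokens: string[] = [];
    for (const raw of rawTokens) {
        if (/^[A-Za-z_$]/.test(raw)) {
            tokens.push(...splitIdentifier(raw)); // identifiers get sub-split
        } else {
            tokens.push(raw);                     // numbers, operators, punctuation
        }
    }
    return tokens;
}

console.log(tokenizeCode("getUserById(42)"));
// [ 'get', 'User', 'By', 'Id', '(', '42', ')' ]
```

The output matches the `getUserById(42)` split described earlier; a production tokenizer would go on to map each piece to a vocabulary ID.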

Step 2: Transformer Encoding

The tokens pass through a transformer model—the same architecture behind GPT and BERT, but trained specifically on code. Transformers use self-attention mechanisms to understand relationships between tokens:

TypeScript

// The model learns that these tokens are related:
async function fetchUser(id: string): Promise<User> {
//     ↑ relates to ↑       ↑ relates to return type

Each transformer layer builds increasingly abstract representations, from syntax to semantics. Early layers capture patterns like "this is a function declaration," while deeper layers understand "this fetches data asynchronously."

Step 3: Pooling to Fixed Dimensions

Transformer outputs are variable-length (one vector per token), but we need a single fixed-size vector for the entire code snippet. Pooling strategies include:

Mean pooling: Average all token vectors
CLS token pooling: Use a special classification token
Max pooling: Take maximum values across dimensions

The result is a dense vector—a single point in high-dimensional space that represents the code's meaning.
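
The three strategies are easy to compare on a toy input. This is a minimal sketch that assumes the transformer output has already been unpacked into one `number[]` per token; the function names are illustrative and do not come from any particular library.

```typescript
type TokenVectors = number[][]; // one row per token, one column per dimension

// Guard against empty input and return the number of dimensions.
function requireTokens(tokens: TokenVectors): number {
    if (tokens.length === 0) throw new Error("Cannot pool an empty sequence");
    return tokens[0].length;
}

// Mean pooling: average each dimension across all tokens.
function meanPool(tokens: TokenVectors): number[] {
    const dims = requireTokens(tokens);
    const sums = new Array<number>(dims).fill(0);
    for (const vec of tokens) {
        for (let d = 0; d < dims; d++) sums[d] += vec[d];
    }
    return sums.map(s => s / tokens.length);
}

// CLS pooling: use the vector of the first (classification) token.
function clsPool(tokens: TokenVectors): number[] {
    requireTokens(tokens);
    return tokens[0].slice();
}

// Max pooling: keep the largest value seen in each dimension.
function maxPool(tokens: TokenVectors): number[] {
    const dims = requireTokens(tokens);
    const best = new Array<number>(dims).fill(-Infinity);
    for (const vec of tokens) {
        for (let d = 0; d < dims; d++) best[d] = Math.max(best[d], vec[d]);
    }
    return best;
}

// Three tokens with two dimensions each (toy values).
const tokenVectors: TokenVectors = [[1, 4], [3, 0], [2, 2]];
console.log(meanPool(tokenVectors)); // [ 2, 2 ]
console.log(clsPool(tokenVectors));  // [ 1, 4 ]
console.log(maxPool(tokenVectors));  // [ 3, 4 ]
```

Whichever strategy is used, the output length equals the model's hidden size regardless of how many tokens went in.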

Why Code Needs Specialized Models

General-purpose text embeddings struggle with code because:

Syntax carries meaning: user.getName() and user?.getName() are semantically different due to one character
Context hierarchies: A variable's meaning depends on its scope (local, class, module)
Cross-language patterns: Promise<T> in TypeScript and Future[T] in Scala represent the same concept
Structure matters: Brackets delimit scope, and in languages like Python indentation is part of the syntax, not decoration

Code embedding models are trained on billions of lines of real code from GitHub, Stack Overflow, and documentation to learn these patterns.

Code Embeddings vs Text Embeddings

Let's see why specialized code embeddings outperform general text models with concrete examples.

Example 1: Semantically Identical, Textually Different

These functions do the same job (summing one numeric field across a collection) but share almost no keywords:

Python

def total_price(items):
    return sum(item['price'] for item in items)

JavaScript

const calculateSum = (products) =>
    products.reduce((acc, p) => acc + p.cost, 0);

Text embedding similarity: ~0.35 (poor match)
Code embedding similarity: ~0.89 (strong match)

Code embeddings recognize both implement "sum of prices from a collection," while text embeddings see different languages, variable names, and keywords.

Example 2: Function Signatures vs Implementations

TypeScript

// Declaration
interface UserRepository {
    findById(id: string): Promise<User | null>;
}

// Implementation
class PostgresUserRepo implements UserRepository {
    async findById(id: string): Promise<User | null> {
        const result = await this.db.query(
            'SELECT * FROM users WHERE id = $1', [id]
        );
        return result.rows[0] || null;
    }
}

Code embeddings understand that:

The interface defines a contract
The implementation fulfills that contract
Both are related but serve different purposes
The SQL query is part of the implementation strategy

Text embeddings would miss these architectural relationships.

Example 3: Cross-Language Relationships

Go

// Go
type Result[T any] struct {
    Value T
    Err   error
}

Rust

// Rust
enum Result<T, E> {
    Ok(T),
    Err(E),
}

Code embeddings recognize both implement the "Result monad" pattern despite completely different syntax and keywords. This enables finding equivalent patterns across languages.

The Embedding Pipeline

Here's how a production code embedding system works, step by step:

Step 1: Parse Code Structure

Use a parser like Tree-sitter to extract syntactic structure:

TypeScript

import Parser from 'tree-sitter';
import TypeScript from 'tree-sitter-typescript';

const parser = new Parser();
// tree-sitter-typescript exports two grammars: `typescript` and `tsx`
parser.setLanguage(TypeScript.typescript);

const sourceCode = `
function calculateDiscount(price: number, rate: number): number {
    return price * (1 - rate);
}
`;

const tree = parser.parse(sourceCode);
// The syntax tree exposes function boundaries, parameters, and type annotations
console.log(tree.rootNode.toString());

Step 2: Chunk Intelligently

Split code into meaningful units—not arbitrary character limits, but semantic boundaries:

TypeScript

// Good chunking (by function)
const goodChunks = [
    "function calculateDiscount(price: number, rate: number): number { ... }",
    "function applyTax(amount: number, taxRate: number): number { ... }",
];

// Bad chunking (by character count)
const badChunks = [
    "function calculateDiscount(price: nu",
    "mber, rate: number): number { return",
];

Respect:

Function/method boundaries
Class definitions
Module boundaries
Comment blocks (docstrings)
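
For illustration only, here is a naive chunker that finds top-level blocks by counting braces. It assumes balanced braces and ignores braces inside strings or comments, so it is a stand-in for the parser-driven boundaries described above, not a replacement for Tree-sitter. The name `chunkByTopLevelBlocks` is hypothetical.

```typescript
// Naive chunker: emits one chunk each time brace depth returns to zero.
// Assumption: braces are balanced and never appear in strings or comments.
function chunkByTopLevelBlocks(source: string): string[] {
    const chunks: string[] = [];
    let depth = 0;
    let start = 0;
    for (let i = 0; i < source.length; i++) {
        const ch = source[i];
        if (ch === "{") {
            depth++;
        } else if (ch === "}") {
            depth--;
            if (depth === 0) {
                // A top-level block just closed: emit everything since the last chunk.
                const chunk = source.slice(start, i + 1).trim();
                if (chunk.length > 0) chunks.push(chunk);
                start = i + 1;
            }
        }
    }
    // Keep any trailing code that sits outside a block.
    const rest = source.slice(start).trim();
    if (rest.length > 0) chunks.push(rest);
    return chunks;
}

const sampleSource = `
function calculateDiscount(price: number, rate: number): number {
    return price * (1 - rate);
}

function applyTax(amount: number, taxRate: number): number {
    if (taxRate < 0) { throw new Error("negative rate"); }
    return amount * (1 + taxRate);
}
`;

const functionChunks = chunkByTopLevelBlocks(sampleSource);
console.log(functionChunks.length); // 2
```

Each chunk here is a whole function, including the nested `if` block, which is the property the character-count split above fails to preserve.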

Step 3: Generate Embeddings

Run chunks through an embedding model (local ONNX example):

TypeScript

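import { InferenceSession, Tensor } from 'onnxruntime-node';

// Assumed helpers, not shown here: a tokenizer matching the model's vocabulary
// and a mean-pooling function over the flattened [tokens, hidden] output.
declare function tokenize(code: string): number[];
declare function meanPool(hiddenStates: Float32Array, hiddenSize: number): number[];

// Load the model once and reuse the session across calls
const sessionPromise = InferenceSession.create('code-embedding-model.onnx');

async function embedCode(code: string): Promise<number[]> {
    const session = await sessionPromise;

    // Tokenize; int64 tensors require BigInt values
    const tokens = tokenize(code); // [101, 2339, 2129, ...]
    const inputIds = BigInt64Array.from(tokens.map(t => BigInt(t)));
    const inputTensor = new Tensor('int64', inputIds, [1, tokens.length]);

    // Run inference (input and output names depend on how the model was exported)
    const outputs = await session.run({ input_ids: inputTensor });
    const hiddenStates = outputs.last_hidden_state.data as Float32Array;

    // Mean pooling
    return meanPool(hiddenStates, 768); // [0.234, -0.456, ...] (768 dims)
}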

Step 4: Store in Vector Database

Index embeddings for fast similarity search:

SQL

-- Using SQLite with the sqlite-vec extension (vec0 virtual table)
CREATE VIRTUAL TABLE code_embeddings USING vec0(
    chunk_id TEXT PRIMARY KEY,
    file_path TEXT,
    code TEXT,
    embedding FLOAT[768]
);

INSERT INTO code_embeddings (chunk_id, file_path, code, embedding) VALUES (
    'func_123',
    'src/utils/math.ts',
    'function calculateDiscount(...)',
    vec_f32('[0.234, -0.456, ...]')
);

Step 5: Query with Cosine Similarity

Search for semantically similar code:

TypeScript

// `db` is assumed to be a SQLite client with the sqlite-vec extension loaded
async function searchSimilarCode(query: string, topK = 10) {
    // Embed the query
    const queryEmbedding = await embedCode(query);

    // Find nearest neighbors. vec_distance_cosine returns a distance
    // (0 = same direction), so convert it to a similarity before ranking.
    const results = await db.query(`
        SELECT
            chunk_id,
            file_path,
            code,
            1 - vec_distance_cosine(embedding, ?) AS similarity
        FROM code_embeddings
        ORDER BY similarity DESC
        LIMIT ?
    `, [JSON.stringify(queryEmbedding), topK]);

    return results.filter(r => r.similarity > 0.7); // Threshold
}

Cosine similarity measures the angle between two vectors, ranging from -1 (opposite direction) to 1 (same direction). Because it ignores vector length and compares direction only, it is a standard metric for comparing code embeddings.

The Math (Simplified)

Given two embedding vectors A and B:

Plain Text

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:
- A · B = dot product (sum of element-wise multiplication)
- ||A|| = magnitude of A (square root of sum of squares)

Visual Analogy

Imagine two arrows in 3D space:

Same direction (parallel arrows): similarity = 1.0
Perpendicular (90° angle): similarity = 0.0
Opposite direction (180° angle): similarity = -1.0

In 768-dimensional space, the same principle applies. Similar code "points" in similar semantic directions.
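
The formula and the three arrow cases can be checked directly. A minimal sketch in plain TypeScript, with no library assumed:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error("Vectors must have the same length");
    let dot = 0, sumSqA = 0, sumSqB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];     // A · B
        sumSqA += a[i] * a[i];  // ||A||² accumulates here
        sumSqB += b[i] * b[i];  // ||B||² accumulates here
    }
    const denom = Math.sqrt(sumSqA) * Math.sqrt(sumSqB);
    return denom === 0 ? 0 : dot / denom; // treat a zero vector as "no similarity"
}

console.log(cosineSimilarity([1, 0, 0], [2, 0, 0]));  // 1  (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0]));  // 0  (perpendicular)
console.log(cosineSimilarity([1, 0, 0], [-1, 0, 0])); // -1 (opposite direction)
```

Note that the first pair scores 1 even though the vectors differ in length: only direction matters.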

Practical Thresholds

Rough guidelines from real-world code search systems (exact cut-offs vary by model and should be calibrated on your own data):

Similarity	Interpretation	Use Case
0.95-1.0	Near duplicates	Detect copy-paste code
0.85-0.95	Highly related	Find alternative implementations
0.75-0.85	Related	Discover similar patterns
0.65-0.75	Loosely related	Explore related concepts
< 0.65	Unrelated	Filter out noise

Why It Beats Keyword Matching

TypeScript

// Query: "validate email address"

// Keyword match: LOW (different words)
function checkEmailFormat(str: string): boolean {
    return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(str);
}

// Cosine similarity: 0.87 (HIGH)
// Model understands: regex validation + email pattern = email validation

Embeddings capture intent and meaning, not just token overlap.

Leading Code Embedding Models in 2026

Code embedding models have improved a lot in the past two years. Here's a comparison of leading options:

Model	Dimensions	Context Length	Deployment	Best For
Voyage-3-large	2048 (flexible: 256-2048)	32K tokens	API (Cloud)	Highest accuracy, large contexts
OpenAI text-embedding-3-large	3072 (flexible: 256-3072)	8K tokens	API (Cloud)	General code + docs, high quality
EmbeddingGemma (308M)	768 (flexible: 128-768)	2K tokens	Local ONNX	Privacy, on-device, fast inference
Nomic Embed v1.5	768 (flexible: 64-768)	8K tokens	Local ONNX	Open source, reproducible
StarEncoder	768	1K tokens	Local	Code-native, 86 languages
CodeBERT	768	512 tokens	Local	Legacy, smaller contexts
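
The "flexible" ranges in the Dimensions column generally refer to Matryoshka-style embeddings, where a vector can be shortened by keeping only its leading components. Assuming a model trained that way, a shortened vector is typically re-normalized before use. The helper name `truncateEmbedding` is hypothetical; check each model's documentation for the sizes it actually supports.

```typescript
// Keep the first `dims` components and rescale the result to unit length.
function truncateEmbedding(embedding: number[], dims: number): number[] {
    if (dims <= 0 || dims > embedding.length) {
        throw new Error(`dims must be between 1 and ${embedding.length}`);
    }
    const head = embedding.slice(0, dims);
    const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
    return norm === 0 ? head : head.map(x => x / norm);
}

const fullVector = [3, 4, 12, 0]; // toy 4-dimensional vector
console.log(truncateEmbedding(fullVector, 2)); // [ 0.6, 0.8 ]
```

Shorter vectors cut storage and search cost roughly in proportion to the dimensions dropped, usually at some cost in accuracy.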

Choosing the Right Model

For maximum accuracy (cloud OK):

Voyage-3-large or OpenAI text-embedding-3-large
Best semantic understanding, supports long files

For privacy and local deployment:

EmbeddingGemma or Nomic Embed
Run entirely offline with ONNX Runtime
Semantiq uses this approach

For specialized code tasks:

StarEncoder embeddings for multi-language codebases
Fine-tune on your domain (security, web, systems)

Performance considerations:

Embedding 1000 functions:
- Cloud API: ~2-5 seconds (parallelized)
- Local ONNX (CPU): ~15-30 seconds
- Local ONNX (GPU): ~3-8 seconds

How Semantiq Uses Embeddings

Semantiq uses vector embeddings as part of a hybrid search strategy, combining semantic understanding with traditional text search for optimal results.

ONNX-Powered Local Inference

Semantiq runs embedding models entirely on your machine using ONNX Runtime:

TypeScript

import { InferenceSession } from 'onnxruntime-node';

// Simplified architecture. VectorStore, parseCodeStructure, and batchEmbed
// stand in for components that are not shown here.
class SemanticIndexer {
    private model!: InferenceSession;
    private vectorDB!: VectorStore;

    async initialize() {
        // Load optimized ONNX model (~300MB)
        this.model = await InferenceSession.create(
            'models/embedding-gemma-308m.onnx',
            { executionProviders: ['cuda', 'cpu'] } // GPU if available
        );
    }

    async indexRepository(repoPath: string) {
        // Parse with Tree-sitter
        const chunks = await parseCodeStructure(repoPath);

        // Batch embed (efficient)
        const embeddings = await this.batchEmbed(chunks);

        // Store locally (SQLite + FTS5)
        await this.vectorDB.insert(chunks, embeddings);
    }
}

Benefits:

No data leaves your machine
Works offline
No API costs
Fast local inference (on the order of tens of milliseconds per chunk on a modern CPU, in line with the estimate above)

Hybrid Search Strategy

Semantiq doesn't rely solely on embeddings. It combines:

Vector search (semantic): Find conceptually similar code
Ripgrep (exact): Match precise patterns and symbols
FTS5 (full-text): Index identifiers and comments

TypeScript

async function hybridSearch(query: string) {
    const [semanticResults, exactResults, textResults] = await Promise.all([
        vectorSearch(query),      // Embedding similarity
        ripgrepSearch(query),     // Regex + literal matches
        fts5Search(query)         // Token-based text search
    ]);

    // Merge and rank by combined score
    return mergeResults(semanticResults, exactResults, textResults);
}

This hybrid approach achieves:

Recall: Embeddings find semantically related code you'd miss with keywords
Precision: Exact search eliminates false positives
Speed: FTS5 provides instant identifier lookup
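
How the three result lists are merged is not specified here. One common option, shown as an assumption rather than as Semantiq's actual method, is Reciprocal Rank Fusion (RRF), which combines ranked lists without requiring their scores to be comparable. `mergeByReciprocalRank`, `SearchHit`, and the constant `k = 60` are choices made for this sketch.

```typescript
interface SearchHit { id: string; }

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per result,
// so items ranked well by several searches rise to the top.
function mergeByReciprocalRank(
    resultLists: SearchHit[][],
    k = 60
): { id: string; score: number }[] {
    const scores = new Map<string, number>();
    for (const list of resultLists) {
        list.forEach((hit, index) => {
            const rank = index + 1; // ranks start at 1
            scores.set(hit.id, (scores.get(hit.id) || 0) + 1 / (k + rank));
        });
    }
    return Array.from(scores.entries())
        .map(([id, score]) => ({ id, score }))
        .sort((a, b) => b.score - a.score);
}

// Toy ranked lists standing in for vector, ripgrep, and FTS5 results.
const semanticHits: SearchHit[] = [{ id: "a" }, { id: "b" }, { id: "c" }];
const exactHits: SearchHit[] = [{ id: "b" }, { id: "d" }];
const textHits: SearchHit[] = [{ id: "b" }, { id: "a" }];

const merged = mergeByReciprocalRank([semanticHits, exactHits, textHits]);
console.log(merged.map(hit => hit.id)); // [ 'b', 'a', 'd', 'c' ]
```

Result `b` wins because all three searches return it, even though the semantic search alone ranked `a` higher.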

Adaptive ML Thresholds

Semantiq learns optimal similarity thresholds based on your codebase:

TypeScript

// Analyze codebase structure
const stats = analyzeCodebase(repo);

// Adjust thresholds based on:
// - Code duplication level (high dup → raise threshold)
// - Language diversity (multi-lang → lower threshold)
// - Project size (large → stricter filtering)

const threshold = baseThreshold * (1 + stats.duplicationFactor * 0.3)
                                * (1 - stats.languageDiversity * 0.2);

Smaller, focused codebases use lower thresholds (0.65) to surface more results. Large monorepos use higher thresholds (0.80) to filter noise.
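
One practical detail the formula leaves open is what happens when the product drifts outside a sensible range. The sketch below wraps the same formula and clamps the result to the 0.65 to 0.80 band mentioned above. The default `baseThreshold` of 0.7 and the premise that both factors lie in [0, 1] are assumptions made for this example, not documented Semantiq behavior.

```typescript
interface CodebaseStats {
    duplicationFactor: number; // assumed to lie in [0, 1]
    languageDiversity: number; // assumed to lie in [0, 1]
}

// Same formula as above, clamped so the threshold stays within [min, max].
function adaptiveThreshold(
    stats: CodebaseStats,
    baseThreshold = 0.7,
    min = 0.65,
    max = 0.8
): number {
    const raw = baseThreshold
        * (1 + stats.duplicationFactor * 0.3)
        * (1 - stats.languageDiversity * 0.2);
    return Math.min(max, Math.max(min, raw));
}

console.log(adaptiveThreshold({ duplicationFactor: 0, languageDiversity: 0 })); // 0.7
console.log(adaptiveThreshold({ duplicationFactor: 1, languageDiversity: 0 })); // 0.8  (raw value is about 0.91, clamped)
console.log(adaptiveThreshold({ duplicationFactor: 0, languageDiversity: 1 })); // 0.65 (raw value is about 0.56, clamped)
```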

Practical Applications

Vector embeddings enable several code intelligence features:

1. Semantic Code Search

Find functions by describing what they do:

Plain Text

Query: "parse JWT token and extract user claims"

Results (even if no keywords match):
- decodeAuthToken(token: string): UserClaims
- extractJWTPayload(jwt: string): Claims
- parseBearer(authHeader: string): TokenData

2. Duplicate Detection

Find copy-pasted or reimplemented code:

TypeScript

// Original
function calculateShipping(weight: number, distance: number) {
    const baseRate = 5.0;
    return baseRate + (weight * 0.5) + (distance * 0.1);
}

// Detected duplicate (0.94 similarity)
function getDeliveryCost(kg: number, km: number) {
    const base = 5.0;
    return base + kg * 0.5 + km * 0.1;
}
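
A minimal sketch of how such pairs could be found: normalize each embedding once, then compare every pair and keep those above a threshold. The toy three-dimensional vectors stand in for real model output, and `findDuplicatePairs` is a hypothetical name. Comparing all pairs is quadratic, so larger codebases would use an approximate nearest-neighbor index instead.

```typescript
interface EmbeddedFunction { label: string; embedding: number[]; }

// Scale a vector to unit length so that dot product equals cosine similarity.
function toUnitLength(v: number[]): number[] {
    const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
    return norm === 0 ? v : v.map(x => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
    return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Compare every pair once and keep those at or above the threshold.
function findDuplicatePairs(
    items: EmbeddedFunction[],
    threshold = 0.95
): [string, string][] {
    const units = items.map(item => toUnitLength(item.embedding)); // normalize once
    const pairs: [string, string][] = [];
    for (let i = 0; i < items.length; i++) {
        for (let j = i + 1; j < items.length; j++) {
            if (dotProduct(units[i], units[j]) >= threshold) {
                pairs.push([items[i].label, items[j].label]);
            }
        }
    }
    return pairs;
}

// Toy vectors standing in for real embeddings.
const indexedFunctions: EmbeddedFunction[] = [
    { label: "calculateShipping", embedding: [1.0, 0.5, 0.1] },
    { label: "getDeliveryCost", embedding: [1.0, 0.5, 0.12] },
    { label: "formatDate", embedding: [0.0, 0.1, 1.0] },
];

console.log(findDuplicatePairs(indexedFunctions));
// [ [ 'calculateShipping', 'getDeliveryCost' ] ]
```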

3. Refactoring Suggestions

Identify candidates for extraction:

Plain Text

High similarity cluster detected:
- processUserData() [similarity: 0.92]
- handleCustomerInfo() [similarity: 0.91]
- transformAccountDetails() [similarity: 0.90]

Suggestion: Extract common pattern into shared utility

4. Vulnerability Detection

Find similar patterns to known vulnerabilities:

TypeScript

// Known SQL injection pattern (in training data)
const query = "SELECT * FROM users WHERE id = " + userId;

// Detected similar pattern (0.88 similarity)
const sql = `DELETE FROM sessions WHERE user = ${userInput}`;
//                                              ↑ Flagged as risky

5. Codebase Clustering

Visualize code organization and identify architectural boundaries:

Plain Text

Cluster 1 (Authentication): 45 functions, avg similarity 0.82
Cluster 2 (Database Access): 67 functions, avg similarity 0.79
Cluster 3 (API Handlers): 89 functions, avg similarity 0.75

Outliers: 12 functions with low cluster similarity
→ Potential candidates for refactoring or better organization
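
One simple way to produce groupings like these is leader clustering: each function joins the first cluster whose first member is similar enough, or starts a new cluster. This sketch assumes unit-length embeddings (so the dot product equals cosine similarity) and uses toy two-dimensional vectors. `clusterByLeader` is a hypothetical name, and real tools may well use k-means or hierarchical clustering instead.

```typescript
interface EmbeddedSymbol {
    label: string;
    embedding: number[]; // assumed to be unit length
}

function unitDot(a: number[], b: number[]): number {
    return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Leader clustering: the first member of each cluster acts as its "leader".
function clusterByLeader(items: EmbeddedSymbol[], threshold = 0.75): EmbeddedSymbol[][] {
    const clusters: EmbeddedSymbol[][] = [];
    for (const item of items) {
        let placed = false;
        for (const cluster of clusters) {
            if (unitDot(cluster[0].embedding, item.embedding) >= threshold) {
                cluster.push(item); // similar enough to this cluster's leader
                placed = true;
                break;
            }
        }
        if (!placed) clusters.push([item]); // start a new cluster
    }
    return clusters;
}

const symbols: EmbeddedSymbol[] = [
    { label: "login", embedding: [1, 0] },
    { label: "logout", embedding: [0.8, 0.6] },
    { label: "queryUsers", embedding: [0, 1] },
];

console.log(clusterByLeader(symbols).map(c => c.map(s => s.label)));
// [ [ 'login', 'logout' ], [ 'queryUsers' ] ]
```

The result depends on input order, which is the main trade-off of this approach against its simplicity. Functions that end up alone in a cluster correspond to the outliers listed above.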

Building Your Own: A Minimal Example

Here's a simplified educational example in TypeScript showing how to build a basic code embedding search system:

TypeScript

// minimal-code-search.ts
import { InferenceSession, Tensor } from 'onnxruntime-node';
import Database from 'better-sqlite3';

const EMBEDDING_DIMS = 768; // must match the model's hidden size

// Simple tokenizer (real systems use proper tokenizers like tiktoken)
function tokenize(code: string): number[] {
    // Simplified: split on whitespace and map to token IDs.
    // Empty strings are dropped so every word has a first character.
    const words = code.toLowerCase().split(/\s+/).filter(w => w.length > 0);
    return words.map(w => w.charCodeAt(0) % 1000); // Dummy mapping
}

// Load embedding model
async function createEmbedder(modelPath: string) {
    const session = await InferenceSession.create(modelPath);

    return async (code: string): Promise<number[]> => {
        const tokens = tokenize(code);
        const inputTensor = new Tensor('int64',
            new BigInt64Array(tokens.map(t => BigInt(t))),
            [1, tokens.length]
        );

        // Input and output names depend on how the model was exported
        const outputs = await session.run({ input_ids: inputTensor });
        const hidden = outputs.last_hidden_state.data as Float32Array;

        // Mean pooling over the flattened [tokens, dims] output
        const pooled = new Array<number>(EMBEDDING_DIMS).fill(0);
        for (let i = 0; i < hidden.length; i++) {
            pooled[i % EMBEDDING_DIMS] += hidden[i];
        }
        const tokenCount = hidden.length / EMBEDDING_DIMS;
        return pooled.map(x => x / tokenCount);
    };
}

interface EmbeddingRow {
    id: number;
    code: string;
    embedding: Buffer;
}

// Decode a BLOB back into floats. A Node Buffer can be a view into a larger
// ArrayBuffer, so copy its bytes first instead of reading `.buffer` directly.
function blobToVector(blob: Buffer): number[] {
    const copy = new Uint8Array(blob);
    return Array.from(new Float32Array(copy.buffer));
}

// Vector database (SQLite)
class VectorDB {
    private db: Database.Database;

    constructor(dbPath: string) {
        this.db = new Database(dbPath);
        this.db.exec(`
            CREATE TABLE IF NOT EXISTS embeddings (
                id INTEGER PRIMARY KEY,
                code TEXT,
                embedding BLOB
            )
        `);
    }

    insert(code: string, embedding: number[]) {
        const blob = Buffer.from(new Float32Array(embedding).buffer);
        this.db.prepare('INSERT INTO embeddings (code, embedding) VALUES (?, ?)')
            .run(code, blob);
    }

    // Brute-force search: score every stored row against the query
    search(queryEmbedding: number[], topK = 5) {
        const rows = this.db
            .prepare('SELECT id, code, embedding FROM embeddings')
            .all() as EmbeddingRow[];

        const results = rows.map(row => ({
            code: row.code,
            similarity: cosineSimilarity(queryEmbedding, blobToVector(row.embedding))
        }));

        return results
            .sort((a, b) => b.similarity - a.similarity)
            .slice(0, topK);
    }
}

// Cosine similarity
function cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dotProduct += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    const denominator = Math.sqrt(magA) * Math.sqrt(magB);
    return denominator === 0 ? 0 : dotProduct / denominator; // avoid NaN for zero vectors
}

// Usage
async function main() {
    const embed = await createEmbedder('model.onnx');
    const db = new VectorDB('code.db');

    // Index some code
    const snippets = [
        'function sort(arr) { return arr.sort(); }',
        'const sum = (nums) => nums.reduce((a,b) => a+b, 0);',
        'function multiply(a, b) { return a * b; }'
    ];

    for (const code of snippets) {
        const embedding = await embed(code);
        db.insert(code, embedding);
    }

    // Search
    const query = 'add numbers together';
    const queryEmbedding = await embed(query);
    const results = db.search(queryEmbedding);

    console.log('Results for:', query);
    results.forEach(r =>
        console.log(`${r.similarity.toFixed(3)}: ${r.code}`)
    );
}

main().catch(console.error);

This example demonstrates the core concepts. Production systems add:

Proper tokenization (WordPiece, BPE)
Batch processing for efficiency
Incremental index updates
Advanced vector stores (FAISS, Milvus)
Query optimization

The Future of Code Embeddings

Here's what's coming next:

Future models will embed not just code, but:

Architecture diagrams: Relate UML to implementations
Documentation: Link prose explanations to code
Runtime traces: Connect behavior to source
Commit messages: Understand intent and evolution

Plain Text

Query: "authentication flow with OAuth"

Results:
- Code: OAuthHandler.authenticate()
- Diagram: auth-flow.png (sequence diagram)
- Docs: "OAuth Integration Guide" (page 12)
- Test: test_oauth_flow.ts

Cross-Repository Understanding

Imagine embeddings that span:

Your codebase + dependencies
Public libraries on npm/PyPI
Stack Overflow solutions
GitHub code examples

Search for "rate limiting middleware" and find:

Your existing implementation
Express.js rate-limiter package
Similar patterns in other repos
Related Stack Overflow answers

All semantically ranked and compared.

Real-Time Incremental Updates

Many current systems reindex entire codebases. Future systems will:

Embed files as you edit (under 50ms latency)
Update only changed functions
Maintain consistency across refactors
Propagate changes through dependency graphs
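
Updating only changed functions can be approximated today with content hashing: store a hash per chunk and re-embed only when it changes. This is a minimal sketch that assumes chunks carry a stable `id`. The class and method names are hypothetical, and the 32-bit FNV-1a hash is used for brevity; a real index would likely use a stronger hash such as SHA-256 to avoid collisions.

```typescript
// FNV-1a 32-bit hash, used here only as a cheap change detector.
function fnv1a(text: string): number {
    let hash = 0x811c9dc5;
    for (let i = 0; i < text.length; i++) {
        hash ^= text.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
    }
    return hash;
}

interface CodeChunk { id: string; code: string; }

class IncrementalIndex {
    private hashes = new Map<string, number>();

    // Returns only the chunks whose content changed since the last call.
    chunksToReembed(chunks: CodeChunk[]): CodeChunk[] {
        const changed: CodeChunk[] = [];
        const seen = new Set<string>();
        for (const chunk of chunks) {
            seen.add(chunk.id);
            const hash = fnv1a(chunk.code);
            if (this.hashes.get(chunk.id) !== hash) {
                this.hashes.set(chunk.id, hash);
                changed.push(chunk);
            }
        }
        // Forget chunks that no longer exist so stale entries do not linger.
        for (const id of Array.from(this.hashes.keys())) {
            if (!seen.has(id)) this.hashes.delete(id);
        }
        return changed;
    }
}

const incrementalIndex = new IncrementalIndex();
const firstPass = incrementalIndex.chunksToReembed([
    { id: "math.ts#add", code: "const add = (a, b) => a + b;" },
    { id: "math.ts#mul", code: "const mul = (a, b) => a * b;" },
]);
const secondPass = incrementalIndex.chunksToReembed([
    { id: "math.ts#add", code: "const add = (a, b) => a + b;" },
    { id: "math.ts#mul", code: "const mul = (x, y) => x * y;" }, // edited
]);
console.log(firstPass.length, secondPass.length); // 2 1
```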

Specialized Domain Models

We'll see embedding models fine-tuned for:

Security: Recognize vulnerability patterns
Performance: Identify optimization opportunities
Testing: Suggest test cases based on code coverage
Migration: Map deprecated APIs to modern equivalents

Smaller, Faster Models

The trend toward edge computing will drive:

Sub-100MB models running on laptops
Hardware acceleration (NPU, GPU)
Quantization to 4-bit and 8-bit precision
Embeddings generated in under 1ms per function
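
To give a sense of what 8-bit quantization involves, the sketch below applies symmetric int8 quantization to a single embedding: values are scaled so the largest magnitude maps to 127, then rounded. The function names are hypothetical, and production runtimes usually quantize model weights as well as stored vectors.

```typescript
interface QuantizedVector {
    values: Int8Array; // one byte per dimension instead of four for float32
    scale: number;     // multiply by this to recover approximate floats
}

// Symmetric int8 quantization: map [-maxAbs, maxAbs] onto [-127, 127].
function quantizeInt8(embedding: number[]): QuantizedVector {
    const maxAbs = embedding.reduce((m, x) => Math.max(m, Math.abs(x)), 0);
    const scale = maxAbs === 0 ? 1 : maxAbs / 127;
    const values = new Int8Array(embedding.map(x => Math.round(x / scale)));
    return { values, scale };
}

function dequantizeInt8(q: QuantizedVector): number[] {
    return Array.from(q.values).map(v => v * q.scale);
}

const original = [0.5, -1, 0.25];
const quantized = quantizeInt8(original);
const restored = dequantizeInt8(quantized);

// Each restored value is within one quantization step of the original.
console.log(quantized.values.length, restored.length); // 3 3
```

The rounding error per component is at most half of `scale`, which is why similarity rankings tend to survive quantization even though individual values shift slightly.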

Conclusion

Vector embeddings changed how AI understands code. By converting syntax into semantics, they enable search systems that grasp intent, find patterns across languages, and surface insights impossible with keyword matching.

Whether you're building code search tools, analyzing security vulnerabilities, or exploring a new codebase, embeddings provide a foundation for intelligent code understanding. Tools like Semantiq bring this technology to your local machine—no cloud required, no data shared, just fast, semantic code search.

Try Semantiq to experience semantic code search powered by local vector embeddings.

What Are Vector Embeddings?

Plain Text

1Text representation:
2"function that sorts an array in ascending order"
3
4Vector representation:
5[0.234, -0.456, 0.678, 0.123, -0.890, 0.345, ...] (1024 dimensions)
6                    ↓
7         Numerical coordinates in meaning-space

This transformation from symbols to semantics is what enables AI to "understand" code in a way that keyword search never could.

From Text to Meaning: How Embedding Models Work

Converting source code to vector embeddings takes several steps, powered by transformer-based neural networks.

Step 1: Tokenization

Modern code tokenizers understand:

CamelCase and snake_case conventions
Programming language keywords (async, const, class)
Operators and syntax (=>, ::, ?.)
Common patterns like function signatures

Step 2: Transformer Encoding

TypeScript

// The model learns that these tokens are related:
async function fetchUser(id: string): Promise<User> {
//     ↑ relates to ↑       ↑ relates to return type

Step 3: Pooling to Fixed Dimensions

Transformer outputs are variable-length (one vector per token), but we need a single fixed-size vector for the entire code snippet. Pooling strategies include:

Mean pooling: Average all token vectors
CLS token pooling: Use a special classification token
Max pooling: Take maximum values across dimensions

The result is a dense vector—a single point in high-dimensional space that represents the code's meaning.

Why Code Needs Specialized Models

General-purpose text embeddings struggle with code because:

Syntax carries meaning: user.getName() and user?.getName() are semantically different due to one character
Context hierarchies: A variable's meaning depends on its scope (local, class, module)
Cross-language patterns: Promise<T> in TypeScript and Future[T] in Scala represent the same concept
Structure matters: Indentation, brackets, and whitespace are semantic, not decorative

Code embedding models are trained on billions of lines of real code from GitHub, Stack Overflow, and documentation to learn these patterns.

Code Embeddings vs Text Embeddings

Let's see why specialized code embeddings outperform general text models with concrete examples.

Example 1: Semantically Identical, Textually Different

These functions are semantically identical but share almost no keywords:

Python

def total_price(items):
    return sum(item['price'] for item in items)

JavaScript

const calculateSum = (products) =>
    products.reduce((acc, p) => acc + p.cost, 0);

Text embedding similarity: ~0.35 (poor match) Code embedding similarity: ~0.89 (strong match)

Code embeddings recognize both implement "sum of prices from a collection," while text embeddings see different languages, variable names, and keywords.

Example 2: Function Signatures vs Implementations

TypeScript

1// Declaration
2interface UserRepository {
3    findById(id: string): Promise<User | null>;
4}
5
6// Implementation
7class PostgresUserRepo implements UserRepository {
8    async findById(id: string): Promise<User | null> {
9        const result = await this.db.query(
10            'SELECT * FROM users WHERE id = $1', [id]
11        );
12        return result.rows[0] || null;
13    }
14}

Code embeddings understand that:

The interface defines a contract
The implementation fulfills that contract
Both are related but serve different purposes
The SQL query is part of the implementation strategy

Text embeddings would miss these architectural relationships.

Example 3: Cross-Language Relationships

1// Go
2type Result[T any] struct {
3    Value T
4    Err   error
5}

Rust

1// Rust
2enum Result<T, E> {
3    Ok(T),
4    Err(E),
5}

Code embeddings recognize both implement the "Result monad" pattern despite completely different syntax and keywords. This enables finding equivalent patterns across languages.

The Embedding Pipeline

Here's how a production code embedding system works, step by step:

Step 1: Parse Code Structure

Use a parser like Tree-sitter to extract syntactic structure:

TypeScript

1import Parser from 'tree-sitter';
2import TypeScript from 'tree-sitter-typescript';
3
4const parser = new Parser();
5parser.setLanguage(TypeScript);
6
7const sourceCode = `
8function calculateDiscount(price: number, rate: number): number {
9    return price * (1 - rate);
10}
11`;
12
13const tree = parser.parse(sourceCode);
14// Tree contains AST with function boundaries, parameters, types

Step 2: Chunk Intelligently

Split code into meaningful units—not arbitrary character limits, but semantic boundaries:

TypeScript

1// Good chunking (by function)
2chunk1 = "function calculateDiscount(price: number, rate: number): number { ... }"
3chunk2 = "function applyTax(amount: number, taxRate: number): number { ... }"
4
5// Bad chunking (by character count)
6chunk1 = "function calculateDiscount(price: nu"
7chunk2 = "mber, rate: number): number { return"

Respect:

Function/method boundaries
Class definitions
Module boundaries
Comment blocks (docstrings)

Step 3: Generate Embeddings

Run chunks through an embedding model (local ONNX example):

TypeScript

1import { InferenceSession, Tensor } from 'onnxruntime-node';
2
3async function embedCode(code: string): Promise<number[]> {
4    // Load model
5    const session = await InferenceSession.create('code-embedding-model.onnx');
6
7    // Tokenize
8    const tokens = tokenize(code); // [101, 2339, 2129, ...]
9    const inputTensor = new Tensor('int64', tokens, [1, tokens.length]);
10
11    // Run inference
12    const outputs = await session.run({ input_ids: inputTensor });
13    const embedding = outputs.last_hidden_state.data;
14
15    // Mean pooling
16    return meanPool(embedding); // [0.234, -0.456, ...] (768 dims)
17}

Step 4: Store in Vector Database

Index embeddings for fast similarity search:

SQL

1-- Using SQLite with vector extension (FTS5 + custom similarity)
2CREATE VIRTUAL TABLE code_embeddings USING vec0(
3    chunk_id TEXT PRIMARY KEY,
4    file_path TEXT,
5    code TEXT,
6    embedding FLOAT[768]
7);
8
9INSERT INTO code_embeddings VALUES (
10    'func_123',
11    'src/utils/math.ts',
12    'function calculateDiscount(...)',
13    vec_f32('[0.234, -0.456, ...]')
14);

Step 5: Query with Cosine Similarity

Search for semantically similar code:

TypeScript

1async function searchSimilarCode(query: string, topK = 10) {
2    // Embed the query
3    const queryEmbedding = await embedCode(query);
4
5    // Find nearest neighbors by cosine similarity
6    const results = await db.query(`
7        SELECT
8            chunk_id,
9            file_path,
10            code,
11            vec_cosine_distance(embedding, ?) as similarity
12        FROM code_embeddings
13        ORDER BY similarity DESC
14        LIMIT ?
15    `, [queryEmbedding, topK]);
16
17    return results.filter(r => r.similarity > 0.7); // Threshold
18}

Cosine similarity measures the angle between two vectors, ranging from -1 (opposite) to 1 (identical). For code embeddings, it's the gold standard metric.

The Math (Simplified)

Given two embedding vectors A and B:

Plain Text

1cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
2
3Where:
4- A · B = dot product (sum of element-wise multiplication)
5- ||A|| = magnitude of A (square root of sum of squares)

Visual Analogy

Imagine two arrows in 3D space:

Same direction (parallel arrows): similarity = 1.0
Perpendicular (90° angle): similarity = 0.0
Opposite direction (180° angle): similarity = -1.0

In 768-dimensional space, the same principle applies. Similar code "points" in similar semantic directions.

Practical Thresholds

From real-world code search systems:

Similarity	Interpretation	Use Case
0.95-1.0	Near duplicates	Detect copy-paste code
0.85-0.95	Highly related	Find alternative implementations
0.75-0.85	Related	Discover similar patterns
0.65-0.75	Loosely related	Explore related concepts
< 0.65	Unrelated	Filter out noise

Why It Beats Keyword Matching

TypeScript

1// Query: "validate email address"
2
3// Keyword match: LOW (different words)
4function checkEmailFormat(str: string): boolean {
5    return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(str);
6}
7
8// Cosine similarity: 0.87 (HIGH)
9// Model understands: regex validation + email pattern = email validation

Embeddings capture intent and meaning, not just token overlap.

Leading Code Embedding Models in 2026

Code embedding models have improved a lot in the past two years. Here's a comparison of leading options:

Model	Dimensions	Context Length	Deployment	Best For
Voyage-3-large	2048 (flexible: 256-2048)	32K tokens	API (Cloud)	Highest accuracy, large contexts
OpenAI text-embedding-3-large	3072 (flexible: 256-3072)	8K tokens	API (Cloud)	General code + docs, high quality
EmbeddingGemma (308M)	768 (flexible: 128-768)	2K tokens	Local ONNX	Privacy, on-device, fast inference
Nomic Embed v1.5	768 (flexible: 64-768)	8K tokens	Local ONNX	Open source, reproducible
StarEncoder	768	1K tokens	Local	Code-native, 86 languages
CodeBERT	768	512 tokens	Local	Legacy, smaller contexts

Choosing the Right Model

For maximum accuracy (cloud OK):

Voyage-3-large or OpenAI text-embedding-3-large
Best semantic understanding, supports long files

For privacy and local deployment:

EmbeddingGemma or Nomic Embed
Run entirely offline with ONNX Runtime
Semantiq uses this approach

For specialized code tasks:

StarCoder embeddings for multi-language codebases
Fine-tune on your domain (security, web, systems)

Performance considerations:

Embedding 1000 functions:
- Cloud API: ~2-5 seconds (parallelized)
- Local ONNX (CPU): ~15-30 seconds
- Local ONNX (GPU): ~3-8 seconds

How Semantiq Uses Embeddings

Semantiq uses vector embeddings as part of a hybrid search strategy, combining semantic understanding with traditional text search for optimal results.

ONNX-Powered Local Inference

Semantiq runs embedding models entirely on your machine using ONNX Runtime:

TypeScript

// Simplified architecture
class SemanticIndexer {
    private model: InferenceSession;
    private vectorDB: VectorStore;

    async initialize() {
        // Load optimized ONNX model (~300MB)
        this.model = await InferenceSession.create(
            'models/embedding-gemma-308m.onnx',
            { executionProviders: ['cuda', 'cpu'] } // GPU if available
        );
    }

    async indexRepository(repoPath: string) {
        // Parse with Tree-sitter
        const chunks = await parseCodeStructure(repoPath);

        // Batch embed (efficient)
        const embeddings = await this.batchEmbed(chunks);

        // Store locally (SQLite + FTS5)
        await this.vectorDB.insert(chunks, embeddings);
    }
}

Benefits:

No data leaves your machine
Works offline
No API costs
Fast local inference (~5ms per chunk on modern CPUs)
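The "batch embed" step in the indexer above is not spelled out in the article. One plausible shape, offered as a sketch only: slice the inputs into fixed-size groups and hand each group to the model in one call. `EmbedBatch`, `embedInBatches`, and the default batch size of 32 are all assumptions, not Semantiq's actual API.

```typescript
// Hypothetical signature: embeds several texts in a single model call.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedInBatches(
    texts: string[],
    embedBatch: EmbedBatch,
    batchSize = 32
): Promise<number[][]> {
    if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
    const all: number[][] = [];
    for (let start = 0; start < texts.length; start += batchSize) {
        const batch = texts.slice(start, start + batchSize);
        const vectors = await embedBatch(batch);
        if (vectors.length !== batch.length) {
            throw new Error("embedBatch returned a different number of vectors than inputs");
        }
        for (const vector of vectors) all.push(vector);
    }
    return all; // same order as `texts`
}
```

Batching amortizes per-call overhead, and the right batch size depends on available memory and on how long the chunks are.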

Hybrid Search Strategy

Semantiq doesn't rely solely on embeddings. It combines:

Vector search (semantic): Find conceptually similar code
Ripgrep (exact): Match precise patterns and symbols
FTS5 (full-text): Index identifiers and comments

TypeScript

async function hybridSearch(query: string) {
    const [semanticResults, exactResults, textResults] = await Promise.all([
        vectorSearch(query),      // Embedding similarity
        ripgrepSearch(query),     // Regex + literal matches
        fts5Search(query)         // Token-based text search
    ]);

    // Merge and rank by combined score
    return mergeResults(semanticResults, exactResults, textResults);
}

This hybrid approach achieves:

Recall: Embeddings find semantically related code you'd miss with keywords
Precision: Exact search eliminates false positives
Speed: FTS5 provides instant identifier lookup
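The article does not say how the three result lists are merged. One common option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as Semantiq's method. It needs only each result's position in its list, so scores from different engines never have to be put on the same scale. `SearchHit` and `mergeByReciprocalRank` are hypothetical names, and each list is assumed to contain an id at most once.

```typescript
interface SearchHit { id: string; }

// RRF: score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
// k = 60 is the conventional default; larger k flattens the rank differences.
function mergeByReciprocalRank(lists: SearchHit[][], k = 60): { id: string; score: number }[] {
    const scores = new Map<string, number>();
    for (const list of lists) {
        list.forEach((hit, index) => {
            const previous = scores.get(hit.id) ?? 0;
            scores.set(hit.id, previous + 1 / (k + index + 1));
        });
    }
    const merged: { id: string; score: number }[] = [];
    scores.forEach((score, id) => merged.push({ id, score }));
    return merged.sort((a, b) => b.score - a.score);
}
```

A hit that appears in several lists accumulates score from each, so code found by both the semantic and the exact search tends to rise to the top.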

Adaptive ML Thresholds

Semantiq learns optimal similarity thresholds based on your codebase:

TypeScript

// Analyze codebase structure
const stats = analyzeCodebase(repo);

// Adjust thresholds based on:
// - Code duplication level (high dup → raise threshold)
// - Language diversity (multi-lang → lower threshold)
// - Project size (large → stricter filtering)

const threshold = baseThreshold * (1 + stats.duplicationFactor * 0.3)
                                 * (1 - stats.languageDiversity * 0.2);

Smaller, focused codebases use lower thresholds (0.65) to surface more results. Large monorepos use higher thresholds (0.80) to filter noise.
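Here is the same formula as a runnable sketch with one worked value. The clamp to the range [0.65, 0.95] is an added assumption so the result stays a usable cosine threshold, and both factors are assumed to lie between 0 and 1. Project size is left out because the formula above does not include it.

```typescript
interface CodebaseStats {
    duplicationFactor: number;  // assumed range: 0 (no duplication) to 1
    languageDiversity: number;  // assumed range: 0 (single language) to 1
}

function adaptiveThreshold(base: number, stats: CodebaseStats, min = 0.65, max = 0.95): number {
    const raw = base
        * (1 + stats.duplicationFactor * 0.3)
        * (1 - stats.languageDiversity * 0.2);
    return Math.min(max, Math.max(min, raw)); // clamp is an assumption, not from the formula
}

// Worked example: 0.75 * (1 + 0.5 * 0.3) * (1 - 0.25 * 0.2) = 0.75 * 1.15 * 0.95 ≈ 0.819
const example = adaptiveThreshold(0.75, { duplicationFactor: 0.5, languageDiversity: 0.25 });
```

Without the clamp, a heavily duplicated single-language repository with a high base could push the threshold above 1.0, at which point nothing would match.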

Practical Applications

Vector embeddings enable several code intelligence features:

1. Semantic Code Search

Find functions by describing what they do:

Plain Text

Query: "parse JWT token and extract user claims"

Results (even if no keywords match):
- decodeAuthToken(token: string): UserClaims
- extractJWTPayload(jwt: string): Claims
- parseBearer(authHeader: string): TokenData

2. Duplicate Detection

Find copy-pasted or reimplemented code:

TypeScript

// Original
function calculateShipping(weight: number, distance: number) {
    const baseRate = 5.0;
    return baseRate + (weight * 0.5) + (distance * 0.1);
}

// Detected duplicate (0.94 similarity)
function getDeliveryCost(kg: number, km: number) {
    const base = 5.0;
    return base + kg * 0.5 + km * 0.1;
}
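Given embeddings for each function, duplicate detection reduces to finding pairs whose similarity clears a threshold. The sketch below compares every pair, which is fine for a few thousand functions but too slow for large indexes. `Embedded`, `findDuplicatePairs`, and the 0.9 default are illustrative assumptions.

```typescript
interface Embedded { name: string; vector: number[]; }

function cosine(a: number[], b: number[]): number {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    const denom = Math.sqrt(magA) * Math.sqrt(magB);
    return denom === 0 ? 0 : dot / denom;
}

// Compare each pair once (O(n^2)) and keep those at or above the threshold.
function findDuplicatePairs(items: Embedded[], threshold = 0.9) {
    const pairs: { a: string; b: string; similarity: number }[] = [];
    for (let i = 0; i < items.length; i++) {
        for (let j = i + 1; j < items.length; j++) {
            const similarity = cosine(items[i].vector, items[j].vector);
            if (similarity >= threshold) {
                pairs.push({ a: items[i].name, b: items[j].name, similarity });
            }
        }
    }
    return pairs.sort((x, y) => y.similarity - x.similarity);
}
```

For larger codebases, an approximate nearest-neighbor index would replace the nested loop, and only each function's top few neighbors would be checked against the threshold.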

3. Refactoring Suggestions

Identify candidates for extraction:

Plain Text

High similarity cluster detected:
- processUserData() [similarity: 0.92]
- handleCustomerInfo() [similarity: 0.91]
- transformAccountDetails() [similarity: 0.90]

Suggestion: Extract common pattern into shared utility

4. Vulnerability Detection

Find similar patterns to known vulnerabilities:

TypeScript

// Known SQL injection pattern (in training data)
const query = "SELECT * FROM users WHERE id = " + userId;

// Detected similar pattern (0.88 similarity)
const sql = `DELETE FROM sessions WHERE user = ${userInput}`;
//                                              ↑ Flagged as risky

5. Codebase Clustering

Visualize code organization and identify architectural boundaries:

Plain Text

Cluster 1 (Authentication): 45 functions, avg similarity 0.82
Cluster 2 (Database Access): 67 functions, avg similarity 0.79
Cluster 3 (API Handlers): 89 functions, avg similarity 0.75

Outliers: 12 functions with low cluster similarity
→ Potential candidates for refactoring or better organization
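A report like this needs some clustering step, which the article leaves unspecified. As a deliberately simple stand-in, the sketch below uses greedy "leader" clustering: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. The names and the 0.75 threshold are assumptions, the result depends on input order, and production tools more often use k-means or density-based methods.

```typescript
function cosineOf(a: number[], b: number[]): number {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    const denom = Math.sqrt(magA) * Math.sqrt(magB);
    return denom === 0 ? 0 : dot / denom;
}

// Returns clusters as lists of indexes into `vectors`; the first index is the leader.
function leaderClusters(vectors: number[][], threshold = 0.75): number[][] {
    const clusters: number[][] = [];
    for (let i = 0; i < vectors.length; i++) {
        let placed = false;
        for (const cluster of clusters) {
            if (cosineOf(vectors[cluster[0]], vectors[i]) >= threshold) {
                cluster.push(i);
                placed = true;
                break;
            }
        }
        if (!placed) clusters.push([i]);
    }
    return clusters;
}

// Singleton clusters are the "outliers" in a report like the one above.
function outliers(clusters: number[][]): number[] {
    return clusters.filter(c => c.length === 1).map(c => c[0]);
}
```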

Building Your Own: A Minimal Example

Here's a simplified educational example in TypeScript showing how to build a basic code embedding search system:

TypeScript

// minimal-code-search.ts
import { InferenceSession, Tensor } from 'onnxruntime-node';
import Database from 'better-sqlite3';

// Placeholder tokenizer: real systems must use the tokenizer that ships with the model
// (WordPiece, BPE, ...). With this dummy mapping the similarity scores are not meaningful.
function tokenize(code: string): number[] {
    // Simplified: split on whitespace and map each word to a fake token ID.
    // Empty strings are dropped so charCodeAt(0) never returns NaN.
    const words = code.toLowerCase().split(/\s+/).filter(w => w.length > 0);
    return words.map(w => w.charCodeAt(0) % 1000); // Dummy mapping
}

// Load embedding model
async function createEmbedder(modelPath: string) {
    const session = await InferenceSession.create(modelPath);

    return async (code: string): Promise<number[]> => {
        const tokens = tokenize(code);
        const inputTensor = new Tensor('int64',
            new BigInt64Array(tokens.map(t => BigInt(t))),
            [1, tokens.length]
        );

        // Input and output names depend on how the model was exported;
        // many models also require an attention_mask input.
        const outputs = await session.run({ input_ids: inputTensor });
        const hidden = outputs.last_hidden_state;       // shape: [1, seqLen, dims]
        const embedding = hidden.data as Float32Array;

        // Mean pooling: average each hidden dimension over all tokens
        const dims = hidden.dims[hidden.dims.length - 1];
        const seqLen = embedding.length / dims;
        const pooled = new Array<number>(dims).fill(0);
        for (let i = 0; i < embedding.length; i++) {
            pooled[i % dims] += embedding[i];
        }
        return pooled.map(x => x / seqLen);
    };
}

interface EmbeddingRow {
    id: number;
    code: string;
    embedding: Buffer;
}

// Copy the BLOB before reinterpreting it: a Node Buffer can be a view into a larger
// pooled ArrayBuffer, so reading blob.buffer directly may expose unrelated bytes.
function blobToVector(blob: Buffer): number[] {
    const bytes = new Uint8Array(blob); // copies exactly blob.length bytes
    return Array.from(new Float32Array(bytes.buffer));
}

// Vector database (SQLite)
class VectorDB {
    private db: Database.Database;

    constructor(dbPath: string) {
        this.db = new Database(dbPath);
        this.db.exec(`
            CREATE TABLE IF NOT EXISTS embeddings (
                id INTEGER PRIMARY KEY,
                code TEXT,
                embedding BLOB
            )
        `);
    }

    insert(code: string, embedding: number[]) {
        // Store as raw float32 bytes
        const blob = Buffer.from(new Float32Array(embedding).buffer);
        this.db.prepare('INSERT INTO embeddings (code, embedding) VALUES (?, ?)')
            .run(code, blob);
    }

    // Brute-force scan: scores every stored row against the query
    search(queryEmbedding: number[], topK = 5) {
        const rows = this.db.prepare('SELECT id, code, embedding FROM embeddings')
            .all() as EmbeddingRow[];

        const results = rows.map(row => ({
            code: row.code,
            similarity: cosineSimilarity(queryEmbedding, blobToVector(row.embedding))
        }));

        return results
            .sort((a, b) => b.similarity - a.similarity)
            .slice(0, topK);
    }
}

// Cosine similarity
function cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dotProduct += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    const denominator = Math.sqrt(magA) * Math.sqrt(magB);
    return denominator === 0 ? 0 : dotProduct / denominator; // avoid NaN for zero vectors
}

// Usage
async function main() {
    const embed = await createEmbedder('model.onnx');
    const db = new VectorDB('code.db');

    // Index some code
    const snippets = [
        'function sort(arr) { return arr.sort(); }',
        'const sum = (nums) => nums.reduce((a,b) => a+b, 0);',
        'function multiply(a, b) { return a * b; }'
    ];

    for (const code of snippets) {
        const embedding = await embed(code);
        db.insert(code, embedding);
    }

    // Search
    const query = 'add numbers together';
    const queryEmbedding = await embed(query);
    const results = db.search(queryEmbedding);

    console.log('Results for:', query);
    results.forEach(r =>
        console.log(`${r.similarity.toFixed(3)}: ${r.code}`)
    );
}

main().catch(console.error);

This example demonstrates the core concepts. Production systems add:

Proper tokenization (WordPiece, BPE)
Batch processing for efficiency
Incremental index updates
Advanced vector stores (FAISS, Milvus)
Query optimization

The Future of Code Embeddings

Here's what's coming next:

Multi-Modal Code Understanding

Future models will embed not just code, but:

Architecture diagrams: Relate UML to implementations
Documentation: Link prose explanations to code
Runtime traces: Connect behavior to source
Commit messages: Understand intent and evolution

Plain Text

Query: "authentication flow with OAuth"

Results:
- Code: OAuthHandler.authenticate()
- Diagram: auth-flow.png (sequence diagram)
- Docs: "OAuth Integration Guide" (page 12)
- Test: test_oauth_flow.ts

Cross-Repository Understanding

Imagine embeddings that span:

Your codebase + dependencies
Public libraries on npm/PyPI
Stack Overflow solutions
GitHub code examples

Search for "rate limiting middleware" and find:

Your existing implementation
Express.js rate-limiter package
Similar patterns in other repos
Related Stack Overflow answers

All semantically ranked and compared.

Real-Time Incremental Updates

Many current systems reindex entire codebases. Future systems will:

Embed files as you edit (under 50ms latency)
Update only changed functions
Maintain consistency across refactors
Propagate changes through dependency graphs
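"Update only changed functions" can be approximated today with content hashing: re-embed a chunk only when the hash of its source text differs from the one recorded last time. This is a minimal sketch under that assumption. `Chunk`, `contentHash`, and `selectChangedChunks` are hypothetical names, and deleted chunks and dependency propagation are not handled.

```typescript
import { createHash } from "node:crypto";

interface Chunk { id: string; code: string; }

function contentHash(code: string): string {
    return createHash("sha256").update(code).digest("hex");
}

// Returns the chunks that are new or changed since the last call,
// and records their hashes in `seen` (mutated in place).
function selectChangedChunks(chunks: Chunk[], seen: Map<string, string>): Chunk[] {
    const changed: Chunk[] = [];
    for (const chunk of chunks) {
        const hash = contentHash(chunk.code);
        if (seen.get(chunk.id) !== hash) {
            changed.push(chunk);
            seen.set(chunk.id, hash);
        }
    }
    return changed;
}
```

Only the returned chunks need to go through the embedding model, so editing one function in a large file costs one inference instead of a full reindex.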

Specialized Domain Models

We'll see embedding models fine-tuned for:

Security: Recognize vulnerability patterns
Performance: Identify optimization opportunities
Testing: Suggest test cases based on code coverage
Migration: Map deprecated APIs to modern equivalents

Smaller, Faster Models

The trend toward edge computing will drive:

Sub-100MB models running on laptops
Hardware acceleration (NPU, GPU)
Quantization to 4-bit and 8-bit precision
Embeddings generated in under 1ms per function
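To show what 8-bit quantization does arithmetically, here is a sketch of symmetric int8 quantization applied to a single vector: values are scaled so the largest magnitude maps to 127, then rounded. Real model quantization is done by tooling, per tensor or per channel, over the model's weights. The function names here are illustrative only.

```typescript
interface QuantizedVector { values: Int8Array; scale: number; }

// Symmetric quantization: x ≈ values[i] * scale, with values in [-127, 127].
function quantizeInt8(vector: number[]): QuantizedVector {
    let maxAbs = 0;
    for (let i = 0; i < vector.length; i++) maxAbs = Math.max(maxAbs, Math.abs(vector[i]));
    const scale = maxAbs === 0 ? 1 : maxAbs / 127;
    const values = new Int8Array(vector.length);
    for (let i = 0; i < vector.length; i++) values[i] = Math.round(vector[i] / scale);
    return { values, scale };
}

function dequantizeInt8(q: QuantizedVector): number[] {
    const out: number[] = [];
    for (let i = 0; i < q.values.length; i++) out.push(q.values[i] * q.scale);
    return out;
}
```

Each value drops from 4 bytes (float32) to 1 byte, at the cost of a rounding error of at most about half of `scale` per component.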

Conclusion

Try Semantiq to experience semantic code search powered by local vector embeddings.
