Optimize tool discovery

When Virtual MCP Server (vMCP) aggregates many backend MCP servers, the total number of tools exposed to clients can grow quickly. The optimizer addresses this by filtering tools per request, reducing token usage and improving tool selection accuracy.

For the desktop/CLI approach using the MCP Optimizer container, see the MCP Optimizer tutorial. This guide covers the Kubernetes operator implementation using the VirtualMCPServer and EmbeddingServer CRDs.

Overview

Benefits

Reduced token usage: Only relevant tools are included in context, not the entire toolset
Improved tool selection: Hybrid semantic and keyword search surfaces the best tools for each query
Simplified clients: Clients see only two tools (find_tool and call_tool) regardless of how many backends exist

How it works

1. An AI client sends a prompt that requires tool assistance.
2. The AI calls find_tool with keywords extracted from the prompt.
3. vMCP performs hybrid semantic and keyword search across all backend tools.
4. Only the most relevant tools (up to 8 by default) are returned.
5. The AI calls call_tool to execute the selected tool, and vMCP routes the request to the appropriate backend.
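
The flow above can be sketched as a toy model in Python. This is not vMCP's implementation: the Tool class, the CATALOG contents, and the word-overlap ranking are hypothetical stand-ins (the real optimizer uses hybrid semantic and keyword search). The sketch only illustrates how two meta-tools can front any number of backend tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    backend: str
    handler: Callable[[dict], str]


# Hypothetical aggregated catalog; a real deployment discovers tools from its backends.
CATALOG = [
    Tool("create_issue", "Create a new issue in a GitHub repository", "github",
         lambda args: f"github: created issue '{args['title']}'"),
    Tool("send_message", "Send a chat message to a Slack channel", "slack",
         lambda args: f"slack: sent '{args['text']}'"),
    Tool("query_metrics", "Run a PromQL query against Prometheus metrics", "prometheus",
         lambda args: f"prometheus: ran '{args['query']}'"),
]


def find_tool(keywords: str, max_tools: int = 8) -> list[str]:
    """Return the names of the best-matching tools (toy word-overlap ranking)."""
    wanted = set(keywords.lower().split())
    scored = []
    for tool in CATALOG:
        text = f"{tool.name} {tool.description}".lower().replace("_", " ")
        overlap = len(wanted & set(text.split()))
        if overlap:
            scored.append((overlap, tool.name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:max_tools]]


def call_tool(name: str, arguments: dict) -> str:
    """Route the call to whichever backend owns the named tool."""
    for tool in CATALOG:
        if tool.name == name:
            return tool.handler(arguments)
    raise KeyError(f"unknown tool: {name}")
```

In this model the client never sees the three backend tools directly: it calls find_tool("create github issue"), receives a short list of candidate names, and then passes one of them to call_tool.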

Quick start

Step 1: Create an EmbeddingServer

Create an EmbeddingServer with default settings. This deploys a text embeddings inference (TEI) server using the BAAI/bge-small-en-v1.5 model:

embedding-server.yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: my-embedding
  namespace: toolhive-system
spec: {}

Tip: Wait for the EmbeddingServer to reach the Running phase before proceeding. The first startup may take a few minutes while the model downloads.

kubectl get embeddingserver my-embedding -n toolhive-system -w

Step 2: Add the embedding reference to VirtualMCPServer

Add embeddingServerRef to your existing VirtualMCPServer. This is the only change needed: when the field is set, the operator enables the optimizer with default settings. Add an explicit optimizer block only if you want to tune the parameters.

VirtualMCPServer resource
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
  namespace: toolhive-system
spec:
  embeddingServerRef:
    name: my-embedding
  config:
    groupRef: my-group
  incomingAuth:
    type: anonymous

Step 3: Verify

Check that the VirtualMCPServer is ready:

kubectl get virtualmcpserver my-vmcp -n toolhive-system

Look for READY: True in the output. Once ready, clients connecting to the vMCP endpoint see only find_tool and call_tool instead of the full backend toolset.

EmbeddingServer resource

The EmbeddingServer CRD manages the lifecycle of a text embeddings inference server. An empty spec: {} uses all defaults, which is sufficient for most deployments. For the complete field reference, see the EmbeddingServer CRD specification.

Warning: The default TEI image (ghcr.io/huggingface/text-embeddings-inference) is amd64-only. If you are running on ARM64 (for example, Apple Silicon with Kind), you must pre-load or build an ARM64-compatible image.

Tune the optimizer

To customize optimizer behavior, add the optimizer block under spec.config in your VirtualMCPServer resource:

VirtualMCPServer resource
spec:
  config:
    groupRef: my-group
    optimizer:
      embeddingServiceTimeout: 30s
      maxToolsToReturn: 8
      hybridSearchSemanticRatio: '0.5'
      semanticDistanceThreshold: '1.0'

Parameter reference

| Parameter | Description | Default |
| --- | --- | --- |
| `embeddingServiceTimeout` | HTTP request timeout for calls to the embedding service | `30s` |
| `maxToolsToReturn` | Maximum number of tools returned per search (1 to 50) | `8` |
| `hybridSearchSemanticRatio` | Balance between semantic and keyword search. `0.0` = all keyword, `1.0` = all semantic | `"0.5"` |
| `semanticDistanceThreshold` | Maximum distance for semantic results. `0` = identical, `2` = completely unrelated. Results beyond this threshold are filtered out | `"1.0"` |
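
The 0-to-2 range of semanticDistanceThreshold is consistent with cosine distance (1 minus cosine similarity), although this guide does not name the metric. The sketch below assumes cosine distance and uses made-up two-dimensional vectors to show how such a threshold filters candidates; the function names are illustrative only.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def within_threshold(query: list[float],
                     candidates: dict[str, list[float]],
                     threshold: float) -> list[str]:
    """Keep candidates whose distance to the query does not exceed the threshold."""
    return [name for name, vector in candidates.items()
            if cosine_distance(query, vector) <= threshold]


# Assumed toy embeddings, chosen so the distances are exactly 0, 1, and 2.
QUERY = [1.0, 0.0]
CANDIDATES = {
    "same_direction": [2.0, 0.0],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
```

With the default threshold of 1.0, within_threshold(QUERY, CANDIDATES, 1.0) keeps the first two candidates and drops the opposite one; lowering the threshold to 0.6 keeps only the closest match.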

Note: hybridSearchSemanticRatio and semanticDistanceThreshold are string-encoded floats (for example, "0.5" not 0.5). Floating-point fields are discouraged in Kubernetes CRDs because they do not round-trip portably across languages and serializers, so these values are stored as strings.
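
A consumer of these fields has to convert the strings back to numbers and check them against the ranges in the parameter reference. The following is a minimal sketch of that step, assuming the optimizer block has already been loaded into a Python dict; the helper names are hypothetical and the operator's actual validation is not shown in this guide.

```python
def parse_string_float(name: str, value: object, low: float, high: float) -> float:
    """Convert a string-encoded float and check it against an inclusive range."""
    if not isinstance(value, str):
        # Mirrors the CRD convention: 0.5 without quotes is not a string.
        raise TypeError(f"{name} must be a quoted string, got {type(value).__name__}")
    number = float(value)  # raises ValueError for non-numeric text
    if not low <= number <= high:
        raise ValueError(f"{name}={number} is outside [{low}, {high}]")
    return number


def parse_optimizer(block: dict) -> dict:
    """Parse the two string-encoded fields, falling back to the documented defaults."""
    return {
        "hybridSearchSemanticRatio": parse_string_float(
            "hybridSearchSemanticRatio",
            block.get("hybridSearchSemanticRatio", "0.5"), 0.0, 1.0),
        "semanticDistanceThreshold": parse_string_float(
            "semanticDistanceThreshold",
            block.get("semanticDistanceThreshold", "1.0"), 0.0, 2.0),
    }
```

Passing an unquoted value such as 0.5 raises TypeError in this sketch, which is the same mistake the note above warns about.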

Tuning guidance

Lower semanticDistanceThreshold (for example, "0.6") for higher precision: only very close matches are returned
Raise semanticDistanceThreshold (for example, "1.4") for higher recall: broader matches are included
Increase maxToolsToReturn if the AI frequently cannot find the right tool; decrease it to save tokens
Adjust hybridSearchSemanticRatio toward "1.0" if tool names are not descriptive, or toward "0.0" if exact keyword matching is more useful
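
To see why the ratio matters when tool names are not descriptive, the sketch below assumes the two scores are combined with a simple linear blend. The guide only states that "0.0" is all keyword and "1.0" is all semantic, so the blend formula, the tool names, and the scores here are illustrative assumptions rather than the optimizer's actual ranking.

```python
def hybrid_score(semantic: float, keyword: float, ratio: float) -> float:
    """Assumed linear blend: ratio weights the semantic score, the rest goes to keyword."""
    return ratio * semantic + (1.0 - ratio) * keyword


def rank(tools: dict[str, tuple[float, float]], ratio: float) -> list[str]:
    """Order tool names by blended score, best first."""
    return sorted(tools, key=lambda name: hybrid_score(*tools[name], ratio), reverse=True)


# Made-up (semantic, keyword) scores in [0, 1] for one query.
TOOLS = {
    "tool_a7": (0.9, 0.1),       # cryptic name: strong semantic match, weak keyword match
    "create_issue": (0.4, 0.8),  # descriptive name: strong keyword match
}
```

With a ratio of 0.8 the cryptically named tool ranks first because its description matches the query semantically; with a ratio of 0.2 the keyword match on create_issue wins instead.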

Advanced example

This example shows a production-oriented configuration with high availability for the embedding server, persistent model caching, and tuned optimizer parameters.

The EmbeddingServer runs two replicas with resource limits and a persistent volume for model caching, so restarts don't re-download the model:

embedding-server-advanced.yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: EmbeddingServer
metadata:
  name: prod-embedding
  namespace: toolhive-system
spec:
  replicas: 2
  resources:
    requests:
      cpu: '500m'
      memory: '512Mi'
    limits:
      cpu: '2'
      memory: '1Gi'
  modelCache:
    enabled: true
    storageSize: 5Gi

The VirtualMCPServer uses a shorter embedding timeout (15s) because the EmbeddingServer is co-located in the same namespace with low-latency access. Increase this value if the embedding service is remote or under high load:

vmcp-with-optimizer.yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: prod-vmcp
  namespace: toolhive-system
spec:
  embeddingServerRef:
    name: prod-embedding
  config:
    groupRef: prod-tools
    optimizer:
      embeddingServiceTimeout: 15s
      maxToolsToReturn: 10
      hybridSearchSemanticRatio: '0.6'
      semanticDistanceThreshold: '0.8'
  incomingAuth:
    type: oidc
    oidcConfig:
      type: inline
      inline:
        issuer: https://auth.example.com
        audience: vmcp-prod

Related information

MCP Optimizer tutorial: desktop/CLI setup
Optimizing LLM context: background on tool filtering and context pollution
Configure vMCP servers
EmbeddingServer CRD specification
Virtual MCP Server overview: conceptual overview of vMCP
VirtualMCPServer CRD specification
