Overview: This practical guide walks .NET Core developers from zero to production-ready LLM integration. It covers architecture, coding patterns, OpenAI/Azure/Anthropic/Ollama examples, streaming, retrieval-augmented generation (RAG), security, testing, deployment, observability, cost control, prompt engineering, FAQs and common interview questions.

Prerequisites

  • .NET 6.0 / 7.0 / 8.0 SDK installed
  • Visual Studio 2022 / VS Code
  • API keys for the target providers (OpenAI, Azure OpenAI, Anthropic) or a local Ollama instance
  • Basic knowledge of async/await, dependency injection, HttpClientFactory, and REST
  • Optional: Redis or other caching backend for production

Architecture Patterns

Design your LLM integration to be provider-agnostic, testable, and resilient. Key patterns:

  • Adapter / Provider pattern: Create an ILLMService interface and multiple provider adapters (OpenAI, Azure, Anthropic, Local) implementing the interface.
  • Gateway / Facade: Expose a single application gateway that routes requests, enforces quotas, and handles fallbacks.
  • RAG pipeline: A retrieval layer (vector DB) + reranker + generation model.
  • Streaming: Use SSE or WebSockets via SignalR to stream tokens to web clients.
  • Queueing & Background Processing: Offload expensive tasks to background workers using message queues (RabbitMQ, Azure Service Bus).

Project Setup (.NET)

Create a new Web API project and a solution organized into projects:

  • MyApp.Api — API endpoints
  • MyApp.Core — Interfaces, DTOs, domain logic
  • MyApp.Infrastructure — Provider implementations, HTTP clients, storage
  • MyApp.Workers — Background workers

Essential NuGet packages

dotnet add package Microsoft.Extensions.Http
dotnet add package Microsoft.Extensions.Configuration
dotnet add package Newtonsoft.Json # optional; the examples below use the built-in System.Text.Json
dotnet add package Serilog.AspNetCore
dotnet add package StackExchange.Redis # optional

Core LLM Service Interface

Create a provider-agnostic contract used by the application.

namespace MyApp.Core.AI
{
    public record LlmRequest(string Prompt, int MaxTokens = 1024, double Temperature = 0.7);

    public record LlmResponse(string Text, int TokensUsed, IDictionary<string, object>? Metadata = null);

    public interface ILLMService
    {
        Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default);
        IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, CancellationToken ct = default);
    }
}

This keeps business logic independent from provider specifics and simplifies testing with mocks.

OpenAI Integration (Example)

Basic non-streaming implementation using HttpClientFactory.

public class OpenAIService : ILLMService
{
    private readonly HttpClient _http;
    private readonly ILogger<OpenAIService> _logger;
    private readonly string _apiKey;

    public OpenAIService(HttpClient http, IConfiguration config, ILogger<OpenAIService> logger)
    {
        _http = http;
        _logger = logger;
        _apiKey = config["OpenAI:ApiKey"]
            ?? throw new InvalidOperationException("OpenAI:ApiKey is not configured.");
        _http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", _apiKey);
    }

    public async Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default)
    {
        var payload = new
        {
            model = "gpt-4o-mini", // choose as needed
            messages = new[] { new { role = "user", content = request.Prompt } },
            max_tokens = request.MaxTokens,
            temperature = request.Temperature
        };

        var resp = await _http.PostAsJsonAsync("https://api.openai.com/v1/chat/completions", payload, ct);
        resp.EnsureSuccessStatusCode();
        var json = await resp.Content.ReadFromJsonAsync<JsonElement>(cancellationToken: ct);

        // parse response
        var text = json.GetProperty("choices")[0].GetProperty("message").GetProperty("content").GetString();
        var tokens = json.GetProperty("usage").GetProperty("total_tokens").GetInt32();
        return new LlmResponse(text ?? string.Empty, tokens);
    }

    public async IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, [EnumeratorCancellation] CancellationToken ct = default)
    {
        // Implement SSE streaming parsing; see the streaming section below.
        yield break;
    }
}

Notes:

  1. Use typed HttpClientFactory registration in Program.cs.
  2. Never hardcode the API key—use environment variables or managed secrets (Key Vault).
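
The typed-client registration from note 1 might look like this in Program.cs (a minimal sketch; names and the timeout value are illustrative):

```csharp
// Program.cs: register OpenAIService as a typed HttpClient so the factory
// manages handler lifetimes, and expose it to consumers behind ILLMService.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient<OpenAIService>(client =>
{
    client.BaseAddress = new Uri("https://api.openai.com/");
    client.Timeout = TimeSpan.FromSeconds(100); // LLM calls can be slow
});
builder.Services.AddScoped<ILLMService>(sp => sp.GetRequiredService<OpenAIService>());

var app = builder.Build();
```

Resolving ILLMService through the typed client keeps the rest of the application unaware of which provider is wired in.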

Azure OpenAI Integration

Azure OpenAI uses deployment names and region endpoints. Example endpoint format: https://<your-resource>.openai.azure.com/openai/deployments/<deployment-name>/chat/completions?api-version=2023-05-15.

Authentication can be either API key or Azure AD—prefer managed identity in production.
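
A minimal adapter sketch for the API-key path is below; the endpoint, deployment name, and api-version values are placeholders to be taken from configuration:

```csharp
// Sketch of an Azure OpenAI adapter: same chat-completions payload as OpenAI,
// but the URL embeds a deployment name and auth uses an "api-key" header.
public class AzureOpenAIService : ILLMService
{
    private readonly HttpClient _http;
    private readonly string _deployment;

    public AzureOpenAIService(HttpClient http, IConfiguration config)
    {
        _http = http;
        _http.BaseAddress = new Uri(config["AzureOpenAI:Endpoint"]!); // https://<your-resource>.openai.azure.com/
        _http.DefaultRequestHeaders.Add("api-key", config["AzureOpenAI:ApiKey"]);
        _deployment = config["AzureOpenAI:Deployment"]!;
    }

    public async Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default)
    {
        var payload = new
        {
            messages = new[] { new { role = "user", content = request.Prompt } },
            max_tokens = request.MaxTokens,
            temperature = request.Temperature
        };

        var url = $"openai/deployments/{_deployment}/chat/completions?api-version=2023-05-15";
        var resp = await _http.PostAsJsonAsync(url, payload, ct);
        resp.EnsureSuccessStatusCode();
        var json = await resp.Content.ReadFromJsonAsync<JsonElement>(cancellationToken: ct);

        var text = json.GetProperty("choices")[0].GetProperty("message").GetProperty("content").GetString();
        var tokens = json.GetProperty("usage").GetProperty("total_tokens").GetInt32();
        return new LlmResponse(text ?? string.Empty, tokens);
    }

    public IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, CancellationToken ct = default)
        => throw new NotImplementedException();
}
```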

Anthropic (Claude) Integration

Anthropic uses a different message schema and headers. When implementing an adapter, map your ILLMService to the provider-specific payload. Be aware of differences in role names, streaming format, and rate limits.
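
A non-streaming adapter sketch, assuming the Anthropic Messages API (auth via an x-api-key header plus an anthropic-version header, a required max_tokens field, and response text under content[0].text):

```csharp
// Sketch of an Anthropic (Claude) adapter mapped onto ILLMService.
public class AnthropicService : ILLMService
{
    private readonly HttpClient _http;

    public AnthropicService(HttpClient http, IConfiguration config)
    {
        _http = http;
        _http.DefaultRequestHeaders.Add("x-api-key", config["Anthropic:ApiKey"]);
        _http.DefaultRequestHeaders.Add("anthropic-version", "2023-06-01");
    }

    public async Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default)
    {
        var payload = new
        {
            model = "claude-3-5-sonnet-latest", // choose as needed
            max_tokens = request.MaxTokens,     // required by this API
            temperature = request.Temperature,
            messages = new[] { new { role = "user", content = request.Prompt } }
        };

        var resp = await _http.PostAsJsonAsync("https://api.anthropic.com/v1/messages", payload, ct);
        resp.EnsureSuccessStatusCode();
        var json = await resp.Content.ReadFromJsonAsync<JsonElement>(cancellationToken: ct);

        var text = json.GetProperty("content")[0].GetProperty("text").GetString();
        var tokens = json.GetProperty("usage").GetProperty("input_tokens").GetInt32()
                   + json.GetProperty("usage").GetProperty("output_tokens").GetInt32();
        return new LlmResponse(text ?? string.Empty, tokens);
    }

    public IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, CancellationToken ct = default)
        => throw new NotImplementedException(); // Anthropic streams SSE content_block_delta events
}
```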

Local LLMs (Ollama, Llama2, Mistral)

Local deployment is ideal for privacy, reduced API spend, or offline usage. Ollama exposes an HTTP API—your adapter should call the local endpoint. Manage model lifecycle carefully: preload models, monitor memory, and fallback to cloud if needed.
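
A local adapter sketch, assuming Ollama's default /api/generate endpoint on port 11434 (the model name is whatever you have pulled locally):

```csharp
// Sketch of an Ollama adapter: the non-streaming response carries the text
// in "response" and the generated-token count in "eval_count".
public class OllamaService : ILLMService
{
    private readonly HttpClient _http;

    public OllamaService(HttpClient http)
    {
        _http = http;
        _http.BaseAddress ??= new Uri("http://localhost:11434/");
    }

    public async Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default)
    {
        var payload = new
        {
            model = "mistral",        // any locally pulled model
            prompt = request.Prompt,
            stream = false,
            options = new { temperature = request.Temperature, num_predict = request.MaxTokens }
        };

        var resp = await _http.PostAsJsonAsync("api/generate", payload, ct);
        resp.EnsureSuccessStatusCode();
        var json = await resp.Content.ReadFromJsonAsync<JsonElement>(cancellationToken: ct);

        var text = json.GetProperty("response").GetString();
        var tokens = json.TryGetProperty("eval_count", out var n) ? n.GetInt32() : 0;
        return new LlmResponse(text ?? string.Empty, tokens);
    }

    public IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, CancellationToken ct = default)
        => throw new NotImplementedException(); // with stream = true, Ollama emits one JSON object per line
}
```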

Streaming: SSE and SignalR

Streaming tokens to the client improves perceived latency. Two common approaches:

  • SSE (Server-Sent Events): Good for one-way streams (server → client) and simple to implement with HttpResponse.BodyWriter.
  • SignalR / WebSockets: Use when you need bi-directional communication or more control.
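
On the provider side, OpenAI-style streaming is itself SSE: each event line is "data: {json}" with the token under choices[0].delta.content, and the stream ends with "data: [DONE]". A sketch of the StreamGenerateAsync body stubbed out earlier:

```csharp
// Sketch of StreamGenerateAsync parsing an OpenAI-style SSE stream.
public async IAsyncEnumerable<LlmResponse> StreamGenerateAsync(
    LlmRequest request, [EnumeratorCancellation] CancellationToken ct = default)
{
    var payload = new
    {
        model = "gpt-4o-mini",
        messages = new[] { new { role = "user", content = request.Prompt } },
        max_tokens = request.MaxTokens,
        temperature = request.Temperature,
        stream = true
    };

    using var msg = new HttpRequestMessage(HttpMethod.Post, "https://api.openai.com/v1/chat/completions")
    {
        Content = JsonContent.Create(payload)
    };
    // ResponseHeadersRead starts reading before the full body arrives.
    using var resp = await _http.SendAsync(msg, HttpCompletionOption.ResponseHeadersRead, ct);
    resp.EnsureSuccessStatusCode();

    using var reader = new StreamReader(await resp.Content.ReadAsStreamAsync(ct));
    while (!reader.EndOfStream && !ct.IsCancellationRequested)
    {
        var line = await reader.ReadLineAsync();
        if (string.IsNullOrWhiteSpace(line) || !line.StartsWith("data: ")) continue;

        var data = line["data: ".Length..];
        if (data == "[DONE]") yield break;

        var json = JsonDocument.Parse(data).RootElement;
        var delta = json.GetProperty("choices")[0].GetProperty("delta");
        if (delta.TryGetProperty("content", out var content))
            yield return new LlmResponse(content.GetString() ?? "", 0);
    }
}
```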

Example: Basic SSE streaming controller

[ApiController]
[Route("api/stream")]
public class StreamController : ControllerBase
{
    private readonly ILLMService _llm;
    public StreamController(ILLMService llm) => _llm = llm;

    [HttpGet]
    public async Task Stream([FromQuery] string prompt)
    {
        Response.ContentType = "text/event-stream";
        var req = new LlmRequest(prompt);
        await foreach (var token in _llm.StreamGenerateAsync(req, HttpContext.RequestAborted))
        {
            var bytes = Encoding.UTF8.GetBytes($"data: {JsonSerializer.Serialize(token)}\n\n");
            await Response.BodyWriter.WriteAsync(bytes, HttpContext.RequestAborted);
            await Response.BodyWriter.FlushAsync(HttpContext.RequestAborted);
        }
    }
}

On the client, parse SSE events and append tokens as they arrive.
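
A browser client would typically use EventSource; for a .NET client, a minimal consumer of the endpoint above might look like this (the URL is a placeholder):

```csharp
// Minimal .NET SSE consumer: read the response line by line as it streams.
using var http = new HttpClient();
using var resp = await http.GetAsync(
    "https://localhost:5001/api/stream?prompt=hello",
    HttpCompletionOption.ResponseHeadersRead);
using var reader = new StreamReader(await resp.Content.ReadAsStreamAsync());

while (!reader.EndOfStream)
{
    var line = await reader.ReadLineAsync();
    if (line is null || !line.StartsWith("data: ")) continue;

    var token = JsonSerializer.Deserialize<LlmResponse>(line["data: ".Length..]);
    Console.Write(token?.Text); // append tokens as they arrive
}
```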

Retrieval-Augmented Generation (RAG)

RAG combines a vector search (semantic retrieval) with LLM generation. High-level steps:

  1. Ingest documents and create vector embeddings (OpenAI embeddings, E5, or local embedder).
  2. Store vectors in a vector DB (Pinecone, Milvus, Weaviate, Qdrant, or a DB with vector extension).
  3. On query: embed the query → retrieve top-K documents → optionally rerank → build a prompt that includes retrieved passages → call LLM.
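
The query-time flow above can be sketched as follows. IEmbeddingService and the in-memory store are illustrative placeholders for a real embedder and vector DB, which would do the similarity search server-side:

```csharp
// Sketch of query-time RAG: embed, retrieve top-K by cosine similarity,
// build a grounded prompt, then generate.
public record DocChunk(string Content, float[] Vector);

public class RagService
{
    private readonly IEmbeddingService _embedder; // hypothetical interface
    private readonly ILLMService _llm;
    private readonly List<DocChunk> _store;       // stand-in for a vector DB

    public RagService(IEmbeddingService embedder, ILLMService llm, List<DocChunk> store)
        => (_embedder, _llm, _store) = (embedder, llm, store);

    public async Task<LlmResponse> AskAsync(string query, int topK = 3, CancellationToken ct = default)
    {
        var queryVec = await _embedder.EmbedAsync(query, ct);

        var docs = _store
            .OrderByDescending(d => Cosine(queryVec, d.Vector))
            .Take(topK)
            .Select((d, i) => $"Doc {i + 1}:\n{d.Content}");

        var prompt =
            "Answer using only the provided documents. " +
            "If the answer is not in the documents, say \"I don't know\".\n\n" +
            $"Context:\n{string.Join("\n\n", docs)}\n\nUser: {query}\nAnswer:";

        return await _llm.GenerateAsync(new LlmRequest(prompt), ct);
    }

    private static double Cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-10);
    }
}
```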

Example RAG prompt template

System: You are an assistant that answers using only the provided documents. If the answer is not in the documents, say "I don't know".
Context:
{# for each doc in retrieved_docs #}

Doc {index}:
{doc.content}
{# end #}

User: {user_query}
Answer:

Important: truncate context to fit model context window; prefer short extractive summaries; consider using chunking and overlapping windows during ingestion.
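
A naive character-based chunker with overlapping windows, for illustration (real pipelines usually split on token counts or sentence boundaries; assumes size > overlap):

```csharp
// Slide a window of `size` characters forward by (size - overlap) each step,
// so consecutive chunks share `overlap` characters of context.
static IEnumerable<string> Chunk(string text, int size = 1000, int overlap = 200)
{
    for (int start = 0; start < text.Length; start += size - overlap)
    {
        yield return text.Substring(start, Math.Min(size, text.Length - start));
        if (start + size >= text.Length) yield break; // last window reached the end
    }
}
```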

Security Best Practices

  • Secrets: Use Azure Key Vault / AWS Secrets Manager / environment variables; do not commit keys.
  • Input validation & prompt injection: Sanitize user inputs, use instruction separation, and implement guardrails.
  • Audit logging: Log requests, responses metadata, tokens used, timestamps—avoid logging user PII in plain text.
  • Least privilege: Use scoped API keys and rate limits per key.
  • Data residency: For regulated data, prefer region-locked endpoints or local models.

Testing & CI

Because LLM outputs are nondeterministic, focus testing on behavior and infrastructure:

  • Unit tests: Mock ILLMService with deterministic responses using Moq or NSubstitute.
  • Integration tests: Run with test keys and use snapshot assertions for important interactions.
  • Contract tests: Verify expected JSON shapes from provider endpoints.
  • Load & chaos testing: Ensure resilience when provider latency spikes or throttling occurs.
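
A unit test against a mocked ILLMService might look like this with Moq and xUnit (SummarizerService is a hypothetical consumer used for illustration):

```csharp
// The mock returns a deterministic response, so assertions are stable across runs.
[Fact]
public async Task Summarizer_ReturnsLlmText()
{
    var llm = new Mock<ILLMService>();
    llm.Setup(x => x.GenerateAsync(It.IsAny<LlmRequest>(), It.IsAny<CancellationToken>()))
       .ReturnsAsync(new LlmResponse("summary", 42));

    var sut = new SummarizerService(llm.Object); // hypothetical service under test
    var result = await sut.SummarizeAsync("long text");

    Assert.Equal("summary", result);
    llm.Verify(x => x.GenerateAsync(It.IsAny<LlmRequest>(), It.IsAny<CancellationToken>()), Times.Once);
}
```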

Deployment Considerations

Deployment checklist:

  • Containerize with Docker; tune resource limits.
  • Configure health checks to verify LLM connectivity and model availability.
  • Use blue/green or canary deployments for safe rollouts.
  • Enable feature flags to toggle LLM features.
  • Use autoscaling with CPU/memory and queue length triggers; monitor request latency and error rates.

Monitoring & Cost Management

  • Track token usage per feature and user; expose cost dashboards to stakeholders.
  • Implement model cascading: cheap model → better model if needed.
  • Cache frequent prompts & completions; memoize deterministic outputs.
  • Alert on usage spikes and unexpected errors.
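
The caching tactic can be implemented as a decorator over ILLMService; a sketch using IMemoryCache (key scheme and TTL are illustrative, and memoization is only safe for deterministic settings such as temperature = 0):

```csharp
// Decorator that memoizes completions keyed by a hash of prompt + parameters.
public class CachingLlmService : ILLMService
{
    private readonly ILLMService _inner;
    private readonly IMemoryCache _cache;

    public CachingLlmService(ILLMService inner, IMemoryCache cache)
        => (_inner, _cache) = (inner, cache);

    public async Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default)
    {
        var key = Convert.ToHexString(SHA256.HashData(
            Encoding.UTF8.GetBytes($"{request.Prompt}|{request.MaxTokens}|{request.Temperature}")));

        if (_cache.TryGetValue(key, out LlmResponse? cached) && cached is not null)
            return cached;

        var response = await _inner.GenerateAsync(request, ct);
        _cache.Set(key, response, TimeSpan.FromHours(1));
        return response;
    }

    public IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, CancellationToken ct = default)
        => _inner.StreamGenerateAsync(request, ct); // streaming bypasses the cache
}
```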

Advanced Topics

Chain-of-Thought & Reasoning

Use chain-of-thought or stepwise prompting for complex reasoning tasks. Consider using smaller specialized models for internal chain-of-thought processing and pass final distilled content to the user-facing model.

Hybrid Models & Multi-Provider Failover

Implement multi-provider strategies to reduce vendor lock-in and improve uptime: try OpenAI → Anthropic → Local model, with intelligent retries and result merging.
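
A simple version of that fallback chain, again behind ILLMService (ordering and retry policy are application decisions; this sketch just moves to the next provider on failure):

```csharp
// Try providers in order; log and fall through when one throws.
public class FallbackLlmService : ILLMService
{
    private readonly IReadOnlyList<ILLMService> _providers;
    private readonly ILogger<FallbackLlmService> _logger;

    public FallbackLlmService(IReadOnlyList<ILLMService> providers, ILogger<FallbackLlmService> logger)
        => (_providers, _logger) = (providers, logger);

    public async Task<LlmResponse> GenerateAsync(LlmRequest request, CancellationToken ct = default)
    {
        Exception? last = null;
        foreach (var provider in _providers)
        {
            try { return await provider.GenerateAsync(request, ct); }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                _logger.LogWarning(ex, "Provider {Provider} failed; trying next", provider.GetType().Name);
                last = ex;
            }
        }
        throw new InvalidOperationException("All LLM providers failed.", last);
    }

    public IAsyncEnumerable<LlmResponse> StreamGenerateAsync(LlmRequest request, CancellationToken ct = default)
        => _providers[0].StreamGenerateAsync(request, ct); // mid-stream failover needs extra care
}
```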

Data Labeling & Feedback Loops

Collect user feedback to create supervised fine-tuning datasets or to improve prompt templates. Monitor hallucinations and build a human-in-the-loop review path for high-risk decisions.

FAQ

Q: Which model should I pick for production?
A: It depends on latency, cost, accuracy, and data-safety requirements. Use cheaper models for trivial tasks and reserve large models for complex reasoning. Consider local models for strict privacy.
Q: How do I prevent prompt injection?
A: Isolate instructions from user data, sanitize inputs, and implement allowlists/deny-lists for commands. Use system messages that explicitly disallow following embedded instructions in retrieved text.
Q: How do I handle long conversations?
A: Use summarization, sliding windows, or RAG to keep context within token limits. Persist conversation state and only include the most relevant history.
Q: What are typical cost-reduction tactics?
A: Model cascading, caching, batching requests, offloading non-real-time jobs to cheaper models, and optimizing prompt length.
Q: How to test LLMs deterministically?
A: Use mocked provider responses for unit testing. For integration tests, use deterministic system prompts and snapshot common outputs.

Interview Questions (and model answers)

Basic

  1. Q: What is an LLM and how does it differ from a regular ML model?
    A: An LLM is a transformer-based model trained on large corpora for language understanding/generation. Compared to smaller models, LLMs have larger parameter counts and are optimized for generalization across language tasks.
  2. Q: Why use an adapter pattern for provider integrations?
    A: It decouples application logic from provider specifics, making it simple to swap or test providers and implement provider-specific behavior behind a uniform interface.

Intermediate

  1. Q: Explain RAG and its benefits.
    A: RAG augments generation with retrieved, relevant documents from a vector DB, improving factuality and enabling up-to-date or domain-specific answers without retraining the model.
  2. Q: How would you implement streaming from an LLM to a Web UI in .NET?
    A: Use SSE or SignalR. The server reads streaming tokens from the provider's SSE/stream API and forwards parsed token events to the client; handle reconnection, backpressure, and token buffering.

Advanced

  1. Q: How do you prevent sensitive data leakage to 3rd-party LLM providers?
    A: Use data minimization, anonymization, on-prem or private model hosting, region-specific endpoints, and token policies to avoid sending PII. Use local models for the highest privacy guarantees.
  2. Q: Design an architecture to support 1M monthly LLM requests while controlling cost and latency.
    A: Use autoscaling APIs, implement caching, model cascading, priority queuing, background processing for non-real-time tasks, vector DB for RAG to reduce calls, and multi-region deployments with traffic shaping.

References

  • Based on the original guide: Complete Guide: Integrating LLMs into .NET Core Applications.