What is the best way to integrate LLMs into .NET Core applications?

The best approach is to integrate LLMs through a dedicated service layer in .NET Core, using async HTTP clients, abstraction interfaces, and provider-agnostic architecture. This ensures scalability, security, and maintainability.

Is .NET Core suitable for production-grade LLM applications?

Yes. .NET Core is well suited for production LLM systems due to its high-performance async model, strong dependency injection, cloud-native support, and excellent observability tooling.

How do I prevent hallucinations when using LLMs in .NET?

Hallucinations can be reduced by using Retrieval-Augmented Generation (RAG), injecting verified context into prompts, validating outputs, and limiting open-ended generation in enterprise workflows.

Should LLM calls be made directly from controllers in ASP.NET Core?

No. LLM calls should be routed through a dedicated service layer that handles prompt construction, validation, retries, logging, and provider abstraction.

What are common enterprise use cases for LLMs in .NET Core?

Common use cases include AI-powered chatbots, semantic search, document summarization, SEO content analysis, automated reporting, and internal developer assistants.

How can I optimize LLM performance and cost in .NET applications?

Optimization strategies include response caching, prompt deduplication, batching background jobs, monitoring token usage, and separating synchronous and asynchronous LLM workloads.

Is Retrieval-Augmented Generation mandatory for enterprise LLM systems?

While not mandatory, RAG is strongly recommended for enterprise systems that require factual accuracy, domain-specific knowledge, and reduced hallucination risk.

Can LLMs be safely used in regulated industries with .NET Core?

Yes, provided that strict controls are implemented, including PII masking, prompt sanitization, audit logging, access control, and compliance with data residency requirements.

Complete Guide: Integrating LLMs into .NET Core Applications

Code + Architecture

Large Language Models (LLMs) have transformed software development, enabling applications to understand natural language, generate human-like responses, and automate complex tasks. For .NET Core developers, integrating LLMs opens new possibilities for building intelligent applications, chatbots, code assistants, and content generation tools. This comprehensive guide provides production-ready code, architectural patterns, and best practices for implementing LLM integration in .NET Core applications.

Understanding LLM Integration in .NET Core Applications

Before diving into implementation, it’s essential to understand what LLM integration entails. LLMs architecture and training forms the foundation of these powerful models. An LLM (Large Language Model) is a neural network trained on vast amounts of text data to understand and generate human-like language. Popular LLMs include OpenAI’s GPT models, Anthropic’s Claude, Google’s Gemini, and Meta’s Llama.

Integrating LLMs into .NET Core applications involves connecting your application to these AI models through APIs or local implementations, managing prompts and responses, handling authentication, implementing error handling, and optimizing performance for production environments. The integration typically follows a client-server architecture where your .NET application acts as the client, making requests to LLM providers.

Prerequisites and Environment Setup

To begin integrating LLMs into your .NET Core application, you need .NET 6.0 or later installed, an IDE like Visual Studio 2022 or Visual Studio Code, API keys from your chosen LLM provider (OpenAI, Azure OpenAI, Anthropic, etc.), and basic understanding of async/await patterns in C#. You’ll also need NuGet packages for HTTP client management and JSON serialization.

Start by creating a new .NET Core Web API or Console application. For this guide, we’ll use a Web API project that can serve as a backend for web applications, mobile apps, or microservices architectures.

Setting Up the .NET Core Project Structure

A well-organized project structure is crucial for maintainable LLM integration. Create the following folder structure: Services for LLM service implementations, Models for request and response DTOs, Configuration for settings and API key management, Interfaces for abstraction layers, and Middleware for request/response handling.

First, install the required NuGet packages. Open your terminal in the project directory and run these commands to install necessary dependencies for HTTP client functionality, configuration management, dependency injection, and JSON processing.

Implementing Core LLM Service Architecture

The core architecture involves creating service interfaces, implementing concrete services for different LLM providers, configuring dependency injection, and managing API credentials securely. Let’s start with the base interface that defines common LLM operations.

Create an interface that abstracts LLM operations. This interface will support both streaming and non-streaming responses, allowing your application to handle real-time token generation for better user experience. The interface should include methods for generating completions, streaming responses, and managing conversation context.

Next, implement configuration models to store API keys and endpoints securely. Never hardcode API keys directly in your source code. Instead, use the .NET configuration system with environment variables or Azure Key Vault for production deployments.

OpenAI Integration Implementation

OpenAI provides one of the most popular LLM APIs, including GPT-4, GPT-3.5-turbo, and GPT-4-turbo models. The implementation requires proper HTTP client configuration, request formatting, response parsing, and error handling. Create a service class that implements the LLM interface specifically for OpenAI’s API.

The OpenAI service implementation handles authentication via bearer tokens, constructs proper request payloads with messages array format, manages temperature and token parameters, implements retry logic for transient failures, and parses JSON responses into strongly-typed models. The service should support both chat completions and streaming responses.

For streaming implementations, use Server-Sent Events (SSE) to receive tokens as they’re generated. This provides a better user experience by displaying responses progressively rather than waiting for the entire response. The streaming implementation uses HttpClient with response streaming enabled, processes each JSON line as it arrives, and invokes callbacks for each token received.

Azure OpenAI Service Integration

Azure OpenAI Service offers enterprise-grade LLM integration with enhanced security, compliance features, and regional deployment options. The integration is similar to standard OpenAI but with Azure-specific authentication and endpoints. Azure uses API keys or Azure Active Directory authentication, region-specific endpoints, deployment names instead of model names, and potentially different rate limits.

The Azure OpenAI implementation requires configuring the endpoint URL with your Azure resource name and deployment ID, using the correct API version in the query string, and implementing Azure-specific error handling for throttling and quota management.

Anthropic Claude Integration

Anthropic’s Claude models offer excellent reasoning capabilities and longer context windows. AI assistants like Claude, ChatGPT, and others each have unique strengths for different use cases. The Claude API uses a different message format than OpenAI, requiring specific implementation adaptations.

Claude’s API uses anthropic-version headers, different message role formatting, system prompts as separate parameters, and streaming via Server-Sent Events. The implementation must handle Claude’s specific response format and error codes while maintaining the same interface contract as other providers.

Local LLM Integration with Ollama

For applications requiring offline capabilities, data privacy, or reduced API costs, local LLM deployment using Ollama provides an excellent solution. Ollama enables running models like Llama 2, Mistral, and CodeLlama on local infrastructure. The integration differs from cloud-based services as it requires local model management, different endpoint structures, and hardware consideration for performance optimization.

The Ollama integration connects to a locally running Ollama server, manages model loading and unloading, handles model-specific configurations, and implements fallback logic when models aren’t available. The local approach offers zero API costs, complete data privacy, and no internet dependency but requires adequate hardware resources.

Implementing Advanced Features

Production-ready LLM integration requires several advanced features beyond basic API calls. These include conversation history management, token counting and optimization, rate limiting and quota management, caching strategies, and error handling with retry logic.

For conversation history management, implement a context manager that maintains message history, manages context window limits, implements sliding window or summarization for long conversations, and persists conversation state across requests. Token counting helps optimize costs and stay within model limits by counting tokens before sending requests, truncating messages when necessary, and implementing smart context pruning strategies.

Rate limiting protects your application from exceeding API quotas. Implement token bucket or leaky bucket algorithms, track requests per minute/hour, implement queue-based request handling, and provide user feedback when limits are approached. Caching frequently used responses reduces costs and improves response times through memory or Redis-based caching, cache invalidation strategies, and partial response caching for common queries.

Security Best Practices

Security is paramount when integrating LLMs, especially when handling user data and API keys. Never expose API keys in client-side code or version control. Store credentials in Azure Key Vault, AWS Secrets Manager, or environment variables. Use managed identities when deploying to cloud platforms.

Implement input validation and sanitization to prevent prompt injection attacks. Validate all user inputs before including them in prompts, implement content filtering for inappropriate requests, and use parameterized prompts to separate instructions from user content. For production applications, implement audit logging of all LLM interactions, monitor for unusual patterns or potential abuse, and maintain compliance with data protection regulations.

Performance Optimization Strategies

LLM API calls can be slow and expensive, making optimization crucial for production applications. AI engineering best practices for production systems emphasize performance and reliability. Implement several optimization techniques: parallel processing for multiple requests, request batching where supported, connection pooling for HTTP clients, and response caching for identical requests.

For streaming implementations, use asynchronous processing throughout the pipeline, implement backpressure handling to prevent memory issues, and optimize chunk sizes for efficient data transfer. Consider using gRPC instead of REST for high-throughput scenarios, implementing circuit breakers to prevent cascade failures, and using CDNs for static prompt templates.

Testing LLM Integrations

Testing LLM integrations presents unique challenges due to non-deterministic outputs and API dependencies. Implement comprehensive testing strategies including unit tests with mocked responses, integration tests with test API keys, load testing for performance validation, and chaos testing for resilience verification.

For unit testing, create mock implementations of the LLM service interface, define expected behaviors for various scenarios, and test error handling and edge cases. Use libraries like Moq or NSubstitute for mocking. Integration testing should use dedicated test API keys, implement snapshot testing for response validation, and test rate limiting and retry logic. Consider using recorded responses to reduce API costs during development.

Monitoring and Observability

Production LLM integrations require comprehensive monitoring to track performance, costs, and issues. Implement logging for all LLM requests and responses including request timestamps, token usage, latency metrics, and error rates. Use structured logging with tools like Serilog or NLog for better queryability.

Track key metrics including average response time, token consumption per request, error rates by error type, and cost per request. Implement alerts for anomalies such as sudden cost spikes, error rate increases, or unusual latency patterns. Use Application Performance Monitoring (APM) tools like Application Insights, Datadog, or New Relic for comprehensive visibility.

Cost Management and Optimization

LLM API usage can become expensive quickly, especially at scale. Implement cost management strategies including token budget tracking, usage analytics per user/feature, cost allocation by department or project, and automated spending alerts. Monitor token consumption trends, identify expensive operations, and optimize prompt engineering to reduce token usage.

Consider implementing tiered service levels where premium users get faster models, free tier uses smaller/cheaper models, and background tasks use the most cost-effective options. Use model cascading by starting with cheaper models and escalating to more powerful models only when necessary. This approach can reduce costs by 60-80% while maintaining quality for most requests.

Deployment Considerations

Deploying LLM-integrated applications requires careful planning for scalability, reliability, and cost management. Consider containerization using Docker for consistent deployments, orchestration with Kubernetes for scaling, and managed services like Azure Container Apps or AWS ECS for simplified operations.

Implement health checks that verify LLM API connectivity, validate configuration settings, and monitor service dependencies. Use deployment strategies like blue-green deployments for zero-downtime updates, canary releases to test changes with subset of users, and feature flags to enable/disable LLM features dynamically. Ensure your deployment pipeline includes automated testing, security scanning, and performance validation.

Real-World Implementation Examples

Let’s explore practical implementations for common use cases. For a customer support chatbot, the system needs to maintain conversation context, integrate with knowledge bases, handle multi-turn conversations, and escalate to human agents when necessary. The implementation uses the conversation history manager to maintain context, retrieval-augmented generation (RAG) for knowledge base integration, and sentiment analysis to detect frustrated users.

For code generation and review tools, similar to Claude Code and other AI coding assistants, implement specialized prompts for code generation, syntax validation for generated code, security scanning for vulnerabilities, and integration with development workflows. The system can generate code from natural language descriptions, review existing code for issues, suggest optimizations and improvements, and generate unit tests automatically.

Content generation systems can create blog posts, product descriptions, and marketing copy. Implement template-based generation for consistency, style guidelines enforcement, plagiarism checking, and SEO optimization. The system maintains brand voice, generates variations for A/B testing, and optimizes content for search engines.

Error Handling and Resilience Patterns

Robust error handling is essential for production LLM integrations. Implement the circuit breaker pattern to prevent cascade failures by opening the circuit after consecutive failures, allowing time for service recovery, and gradually testing service availability. Use the retry pattern with exponential backoff for transient failures, different strategies for different error types, and maximum retry limits to prevent infinite loops.

Implement fallback mechanisms including degraded functionality when LLM is unavailable, cached responses for common queries, and user-friendly error messages. Use the bulkhead pattern to isolate LLM failures from other application components, prevent resource exhaustion, and maintain partial functionality during outages. Monitor all errors with detailed logging, alert on critical failures, and track error trends for proactive resolution.

Compliance and Data Privacy

When integrating LLMs, especially in regulated industries, compliance with data protection laws is crucial. Implement data minimization by sending only necessary data to LLMs, removing personally identifiable information, and using anonymization techniques. Ensure compliance with GDPR for European users, CCPA for California residents, and HIPAA for healthcare applications.

Implement user consent mechanisms for AI processing, provide data deletion capabilities, and maintain audit trails of all data processing. Consider data residency requirements by using region-specific endpoints, storing data in compliant locations, and documenting data flows for audits. For sensitive data, consider using local models or private deployments instead of public APIs.

Scaling LLM Integration

As your application grows, scaling LLM integration becomes critical. Implement horizontal scaling by adding more service instances, load balancing across multiple regions, and distributing load across multiple LLM providers. Use message queues for asynchronous processing including background job processing, priority queue for urgent requests, and dead letter queues for failed requests.

Implement caching at multiple levels including CDN caching for static content, application-level caching for common responses, and database caching for structured data. Use distributed caching with Redis or Memcached for high availability and fast access times. Consider implementing a multi-model strategy using different models for different tasks, selecting models based on request complexity, and balancing cost versus quality requirements.

Integration with .NET Core Features

Leverage .NET Core’s powerful features to enhance your LLM integration. Use minimal APIs for lightweight endpoints, SignalR for real-time streaming updates, and gRPC for high-performance inter-service communication. Implement background services for long-running LLM tasks, hosted services for scheduled operations, and worker services for queue processing.

Utilize .NET’s dependency injection for service registration, configuration binding for settings management, and options pattern for configuration validation. Implement middleware for request logging, authentication, and rate limiting. Use IHttpClientFactory for efficient HTTP client management, named clients for different LLM providers, and typed clients for better type safety.

Advanced Prompt Engineering

Effective prompt engineering significantly impacts LLM performance and cost. Optimizing AI chatbot interactions requires sophisticated prompting techniques. Implement template-based prompts with parameter substitution, few-shot learning with examples, and chain-of-thought prompting for complex reasoning.

Create a prompt library with reusable templates, version control for prompts, and A/B testing for optimization. Implement dynamic prompts that adapt based on user context, conversation history, and task complexity. Use system messages effectively to set behavior and constraints, define output formats, and establish safety guidelines. Test prompts thoroughly across different scenarios, measure effectiveness with metrics, and iterate based on performance data.

Building RAG Systems

Retrieval-Augmented Generation (RAG) enhances LLM responses with external knowledge. In .NET Core, implement RAG by integrating vector databases like Pinecone or Weaviate, embedding APIs for text vectorization, and search algorithms for relevant context retrieval. The RAG pipeline involves processing documents into chunks, generating embeddings for each chunk, storing vectors in a database, and retrieving relevant context for queries.

For implementation, create a document processor that splits documents into semantic chunks, generates embeddings using APIs or local models, and stores chunks with metadata. Build a retrieval service that converts queries to embeddings, performs similarity searches, and ranks results by relevance. Integrate retrieved context into LLM prompts by formatting context clearly, handling context window limits, and citing sources in responses.

Future-Proofing Your Integration

The LLM landscape evolves rapidly with new models, capabilities, and providers emerging regularly. Future-proof your integration by maintaining abstraction layers that separate business logic from provider-specific code, using interfaces for all LLM operations, and implementing provider-agnostic data models. Stay updated with AI tools and best practices as the field advances.

Design for extensibility by supporting multiple LLM providers simultaneously, allowing runtime provider switching, and implementing feature flags for new capabilities. Monitor the LLM ecosystem for new developments, participate in communities and forums, and experiment with emerging models and techniques. Plan for model upgrades with version management strategies, backward compatibility considerations, and gradual migration paths.

Conclusion

Integrating LLMs into .NET Core applications opens tremendous possibilities for building intelligent, user-friendly applications. This guide covered the complete integration process from basic setup to advanced production patterns. Key takeaways include implementing robust architecture with proper abstraction layers, securing API keys and sensitive data, optimizing for performance and cost, implementing comprehensive error handling, and maintaining observability through logging and monitoring.

Start with a simple integration for a specific use case, gradually add advanced features as you gain experience, and continuously monitor and optimize your implementation. The field of LLMs is rapidly evolving, so staying current with new developments and best practices is essential. With proper implementation, AI application development in .NET Core can transform your applications and deliver exceptional value to users.

Remember that successful LLM integration is not just about technical implementation but also about user experience, cost management, and ethical considerations. Build systems that are reliable, scalable, and responsible. Test thoroughly, monitor continuously, and iterate based on real-world usage. Whether you’re building chatbots, code assistants, content generators, or entirely new categories of AI-powered applications, .NET Core provides a robust foundation for LLM integration.

Frequently Asked Questions

What are the best LLM providers for .NET Core integration?

The best LLM provider depends on your specific requirements. OpenAI GPT-4 offers excellent general capabilities and broad language support. Azure OpenAI Service provides enterprise features, compliance, and regional deployment options. Anthropic Claude excels at reasoning and longer context windows. For cost-sensitive or offline applications, local models through Ollama offer complete control and privacy. Many production applications use multiple providers with fallback strategies to ensure reliability and optimize costs.

How much does it cost to integrate LLMs into a .NET application?

LLM integration costs vary significantly based on usage volume, model selection, and implementation approach. OpenAI GPT-4 costs approximately $0.03 per 1K input tokens and $0.06 per 1K output tokens. GPT-3.5-turbo is more economical at $0.0015 per 1K tokens. Azure OpenAI has similar pricing with added infrastructure costs. Local deployment with Ollama has zero per-request costs but requires hardware investment (GPU servers cost $1000-$5000). For typical applications, expect $100-$1000 monthly API costs for moderate usage, scaling with user base and feature complexity.

How do I handle rate limits when integrating LLMs?

Implement rate limiting strategies including request queuing to manage concurrent requests, exponential backoff for retry logic, and token bucket algorithms to enforce limits. Monitor rate limit headers in API responses, implement user-level quotas for fair usage, and provide feedback when limits are approached. For high-traffic applications, consider using multiple API keys to increase throughput, implementing request batching where possible, and caching frequent responses to reduce API calls. Most providers offer rate limit increases for production applications with proven use cases.

What security measures should I implement for LLM integration?

Security measures should include storing API keys in secure vaults like Azure Key Vault, never exposing keys in client-side code or version control, and using managed identities in cloud deployments. Implement input validation to prevent prompt injection attacks, content filtering for inappropriate requests, and audit logging for all LLM interactions. Ensure data encryption in transit using HTTPS, implement authentication and authorization for API endpoints, and comply with data protection regulations like GDPR and CCPA. Regular security audits and penetration testing help identify vulnerabilities.

How can I optimize LLM costs in production?

Cost optimization strategies include using smaller models for simpler tasks, implementing aggressive caching for repeated queries, and optimizing prompt engineering to reduce token usage. Use model cascading by starting with cheaper models and escalating only when necessary. Implement token budgets per user or feature, monitor costs continuously with alerts, and analyze usage patterns to identify optimization opportunities. Consider local deployment for high-volume use cases, batch processing for non-urgent requests, and prompt compression techniques to reduce token counts while maintaining quality.

Can I use multiple LLM providers in the same application?

Yes, using multiple LLM providers is a recommended best practice for production applications. Implement a provider abstraction layer with a common interface, use factory patterns for provider instantiation, and implement fallback logic when primary provider fails. Different providers offer different strengths: use GPT-4 for complex reasoning, Claude for long documents, and local models for privacy-sensitive data. This multi-provider strategy improves reliability, optimizes costs by routing to appropriate providers, and prevents vendor lock-in. Implement provider selection logic based on request characteristics, costs, and availability.

How do I test LLM integrations effectively?

Testing LLM integrations requires multiple strategies due to non-deterministic outputs. Implement unit tests with mocked responses using frameworks like Moq, integration tests with dedicated test API keys, and snapshot testing for response validation. Use recorded HTTP interactions to reduce API costs during development. Implement golden dataset testing with known good responses, measure response quality with automated metrics, and conduct A/B testing for prompt variations. Load testing validates performance under scale, while chaos engineering tests resilience. Maintain test prompts in version control and continuously validate against production patterns.

What’s the difference between streaming and non-streaming responses?

Non-streaming responses wait for the complete LLM output before returning, which can take 10-30 seconds for long responses. Streaming responses deliver tokens as they’re generated, providing immediate user feedback and better perceived performance. Streaming is essential for chat applications where users expect real-time responses. Implement streaming in .NET Core using Server-Sent Events (SSE) or WebSockets, process tokens as they arrive, and update UI progressively. Streaming increases implementation complexity but significantly improves user experience. Most modern LLM integrations support both modes, allowing selection based on use case requirements.

How do I handle context window limitations?

Context window limitations restrict how much text you can send to an LLM. GPT-4-turbo supports 128K tokens while GPT-3.5-turbo supports 16K tokens. Implement token counting before sending requests using libraries like SharpToken, truncate messages when approaching limits, and use sliding window approaches for long conversations. Summarize older messages to maintain context while reducing token usage, implement conversation segmentation for very long interactions, and consider using models with larger context windows for document analysis. Store full conversation history separately from what’s sent to the LLM.

What are the best practices for prompt engineering in .NET applications?

Effective prompt engineering involves creating structured templates with clear instructions, using few-shot learning with relevant examples, and implementing chain-of-thought prompting for complex tasks. Maintain a prompt library with version control, separate system messages from user content, and use parameter substitution for dynamic prompts. Test prompts across diverse scenarios, measure effectiveness with metrics like accuracy and relevance, and iterate based on real usage data. Implement prompt compression techniques to reduce token usage, use role-based prompts for different scenarios, and maintain consistency across the application. Document prompt patterns and share learnings across teams.

If you want a deep conceptual understanding of how transformer-based models work internally, you should read
Transformer Models Explained: The Architecture Behind ChatGPT and Modern LLMs.

Code Examples and Implementation Samples

Below are production-ready code samples demonstrating the concepts discussed throughout this guide. These examples can be directly integrated into your .NET Core applications.

1. LLM Service Interface Definition

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

namespace LLMIntegration.Interfaces
{
    public interface ILLMService
    {
        Task<LLMResponse> GenerateCompletionAsync(
            string prompt,
            LLMOptions options = null,
            CancellationToken cancellationToken = default);

        IAsyncEnumerable<string> StreamCompletionAsync(
            string prompt,
            LLMOptions options = null,
            CancellationToken cancellationToken = default);

        Task<LLMResponse> GenerateChatCompletionAsync(
            List<ChatMessage> messages,
            LLMOptions options = null,
            CancellationToken cancellationToken = default);
    }

    public class LLMOptions
    {
        public string Model { get; set; } = "gpt-3.5-turbo";
        public double Temperature { get; set; } = 0.7;
        public int MaxTokens { get; set; } = 1000;
        public double TopP { get; set; } = 1.0;
        public int FrequencyPenalty { get; set; } = 0;
        public int PresencePenalty { get; set; } = 0;
    }

    public class ChatMessage
    {
        public string Role { get; set; }
        public string Content { get; set; }
    }

    public class LLMResponse
    {
        public string Content { get; set; }
        public int TokensUsed { get; set; }
        public string Model { get; set; }
        public TimeSpan ResponseTime { get; set; }
    }
}

2. OpenAI Service Implementation

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net.Http;
using System.Net.Http.Json;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;
using LLMIntegration.Interfaces;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;

namespace LLMIntegration.Services
{
    public class OpenAIService : ILLMService
    {
        private readonly HttpClient _httpClient;
        private readonly ILogger<OpenAIService> _logger;
        private readonly string _apiKey;
        private readonly string _baseUrl = "https://api.openai.com/v1";

        public OpenAIService(
            HttpClient httpClient,
            IConfiguration configuration,
            ILogger<OpenAIService> logger)
        {
            _httpClient = httpClient;
            _logger = logger;
            _apiKey = configuration["OpenAI:ApiKey"];
            
            _httpClient.DefaultRequestHeaders.Add("Authorization", $"Bearer {_apiKey}");
        }

        public async Task<LLMResponse> GenerateChatCompletionAsync(
            List<ChatMessage> messages,
            LLMOptions options = null,
            CancellationToken cancellationToken = default)
        {
            options ??= new LLMOptions();
            var stopwatch = Stopwatch.StartNew();

            var requestBody = new
            {
                model = options.Model,
                messages = messages.Select(m => new { role = m.Role, content = m.Content }),
                temperature = options.Temperature,
                max_tokens = options.MaxTokens,
                top_p = options.TopP,
                frequency_penalty = options.FrequencyPenalty,
                presence_penalty = options.PresencePenalty
            };

            try
            {
                var response = await _httpClient.PostAsJsonAsync(
                    $"{_baseUrl}/chat/completions",
                    requestBody,
                    cancellationToken);

                response.EnsureSuccessStatusCode();

                var result = await response.Content.ReadFromJsonAsync<OpenAIResponse>(
                    cancellationToken: cancellationToken);

                stopwatch.Stop();

                _logger.LogInformation(
                    "OpenAI completion generated. Tokens: {Tokens}, Time: {Time}ms",
                    result.Usage.TotalTokens,
                    stopwatch.ElapsedMilliseconds);

                return new LLMResponse
                {
                    Content = result.Choices[0].Message.Content,
                    TokensUsed = result.Usage.TotalTokens,
                    Model = result.Model,
                    ResponseTime = stopwatch.Elapsed
                };
            }
            catch (HttpRequestException ex)
            {
                _logger.LogError(ex, "HTTP error calling OpenAI API");
                throw;
            }
        }

        public async IAsyncEnumerable<string> StreamCompletionAsync(
            string prompt,
            LLMOptions options = null,
            [EnumeratorCancellation] CancellationToken cancellationToken = default)
        {
            options ??= new LLMOptions();

            var messages = new List<ChatMessage>
            {
                new ChatMessage { Role = "user", Content = prompt }
            };

            var requestBody = new
            {
                model = options.Model,
                messages = messages.Select(m => new { role = m.Role, content = m.Content }),
                temperature = options.Temperature,
                max_tokens = options.MaxTokens,
                stream = true
            };

            var request = new HttpRequestMessage(HttpMethod.Post, $"{_baseUrl}/chat/completions")
            {
                Content = JsonContent.Create(requestBody)
            };

            using var response = await _httpClient.SendAsync(
                request,
                HttpCompletionOption.ResponseHeadersRead,
                cancellationToken);

            response.EnsureSuccessStatusCode();

            using var stream = await response.Content.ReadAsStreamAsync(cancellationToken);
            using var reader = new System.IO.StreamReader(stream);

            while (!reader.EndOfStream && !cancellationToken.IsCancellationRequested)
            {
                var line = await reader.ReadLineAsync();

                if (string.IsNullOrWhiteSpace(line) || !line.StartsWith("data: "))
                    continue;

                var data = line.Substring(6);

                if (data == "[DONE]")
                    break;

                try
                {
                    var json = JsonSerializer.Deserialize<StreamResponse>(data);
                    var content = json?.Choices?[0]?.Delta?.Content;

                    if (!string.IsNullOrEmpty(content))
                    {
                        yield return content;
                    }
                }
                catch (JsonException ex)
                {
                    _logger.LogWarning(ex, "Failed to parse streaming response");
                }
            }
        }

        public Task<LLMResponse> GenerateCompletionAsync(
            string prompt,
            LLMOptions options = null,
            CancellationToken cancellationToken = default)
        {
            var messages = new List<ChatMessage>
            {
                new ChatMessage { Role = "user", Content = prompt }
            };

            return GenerateChatCompletionAsync(messages, options, cancellationToken);
        }

        private class OpenAIResponse
        {
            public string Model { get; set; }
            public List<Choice> Choices { get; set; }
            public Usage Usage { get; set; }
        }

        private class Choice
        {
            public Message Message { get; set; }
        }

        private class Message
        {
            public string Content { get; set; }
        }

        private class Usage
        {
            public int TotalTokens { get; set; }
        }

        private class StreamResponse
        {
            public List<StreamChoice> Choices { get; set; }
        }

        private class StreamChoice
        {
            public Delta Delta { get; set; }
        }

        private class Delta
        {
            public string Content { get; set; }
        }
    }
}

3. Configuration Models and Settings

using System.ComponentModel.DataAnnotations;

namespace LLMIntegration.Configuration
{
    public class LLMConfiguration
    {
        [Required]
        public string Provider { get; set; }

        public OpenAISettings OpenAI { get; set; }
        public AzureOpenAISettings AzureOpenAI { get; set; }
        public AnthropicSettings Anthropic { get; set; }
        public OllamaSettings Ollama { get; set; }
    }

    public class OpenAISettings
    {
        [Required]
        public string ApiKey { get; set; }
        public string BaseUrl { get; set; } = "https://api.openai.com/v1";
        public string DefaultModel { get; set; } = "gpt-3.5-turbo";
    }

    public class AzureOpenAISettings
    {
        [Required]
        public string ApiKey { get; set; }
        [Required]
        public string Endpoint { get; set; }
        [Required]
        public string DeploymentName { get; set; }
        public string ApiVersion { get; set; } = "2023-12-01-preview";
    }

    public class AnthropicSettings
    {
        [Required]
        public string ApiKey { get; set; }
        public string BaseUrl { get; set; } = "https://api.anthropic.com/v1";
        public string DefaultModel { get; set; } = "claude-3-sonnet-20240229";
    }

    public class OllamaSettings
    {
        public string BaseUrl { get; set; } = "http://localhost:11434";
        public string DefaultModel { get; set; } = "llama2";
    }
}

4. Dependency Injection Setup in Program.cs

using LLMIntegration.Configuration;
using LLMIntegration.Interfaces;
using LLMIntegration.Services;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

builder.Services.Configure<LLMConfiguration>(
    builder.Configuration.GetSection("LLM"));

builder.Services.AddHttpClient<ILLMService, OpenAIService>()
    .SetHandlerLifetime(TimeSpan.FromMinutes(5))
    .AddPolicyHandler(GetRetryPolicy())
    .AddPolicyHandler(GetCircuitBreakerPolicy());

builder.Services.AddMemoryCache();
builder.Services.AddSingleton<IConversationManager, ConversationManager>();
builder.Services.AddSingleton<IRateLimiter, TokenBucketRateLimiter>();

var app = builder.Build();

if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI();
}

app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();
app.Run();

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .OrResult(msg => msg.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        .WaitAndRetryAsync(
            3,
            retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)));
}

static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
{
    return HttpPolicyExtensions
        .HandleTransientHttpError()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));
}

5. Conversation Manager Implementation

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using LLMIntegration.Interfaces;

namespace LLMIntegration.Services
{
    public interface IConversationManager
    {
        void AddMessage(string conversationId, ChatMessage message);
        List<ChatMessage> GetMessages(string conversationId, int maxMessages = 10);
        void ClearConversation(string conversationId);
        int GetTokenCount(string conversationId);
    }

    public class ConversationManager : IConversationManager
    {
        private readonly ConcurrentDictionary<string, List<ChatMessage>> _conversations;
        private const int MaxTokensPerConversation = 4000;

        public ConversationManager()
        {
            _conversations = new ConcurrentDictionary<string, List<ChatMessage>>();
        }

        public void AddMessage(string conversationId, ChatMessage message)
        {
            var messages = _conversations.GetOrAdd(conversationId, _ => new List<ChatMessage>());

            lock (messages)
            {
                messages.Add(message);
                TrimConversation(messages);
            }
        }

        public List<ChatMessage> GetMessages(string conversationId, int maxMessages = 10)
        {
            if (_conversations.TryGetValue(conversationId, out var messages))
            {
                lock (messages)
                {
                    return messages.TakeLast(maxMessages).ToList();
                }
            }

            return new List<ChatMessage>();
        }

        public void ClearConversation(string conversationId)
        {
            _conversations.TryRemove(conversationId, out _);
        }

        public int GetTokenCount(string conversationId)
        {
            if (_conversations.TryGetValue(conversationId, out var messages))
            {
                lock (messages)
                {
                    return messages.Sum(m => EstimateTokens(m.Content));
                }
            }

            return 0;
        }

        private void TrimConversation(List<ChatMessage> messages)
        {
            while (GetTotalTokens(messages) > MaxTokensPerConversation && messages.Count > 1)
            {
                messages.RemoveAt(0);
            }
        }

        private int GetTotalTokens(List<ChatMessage> messages)
        {
            return messages.Sum(m => EstimateTokens(m.Content));
        }

        private int EstimateTokens(string text)
        {
            return (int)Math.Ceiling(text.Length / 4.0);
        }
    }
}

6. Rate Limiter Implementation

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

namespace LLMIntegration.Services
{
    public interface IRateLimiter
    {
        Task<bool> TryAcquireAsync(string key, CancellationToken cancellationToken = default);
    }

    public class TokenBucketRateLimiter : IRateLimiter
    {
        private readonly ConcurrentDictionary<string, TokenBucket> _buckets;
        private readonly int _capacity;
        private readonly TimeSpan _refillInterval;
        private readonly int _tokensPerRefill;

        public TokenBucketRateLimiter(
            int capacity = 10,
            TimeSpan? refillInterval = null,
            int tokensPerRefill = 1)
        {
            _buckets = new ConcurrentDictionary<string, TokenBucket>();
            _capacity = capacity;
            _refillInterval = refillInterval ?? TimeSpan.FromMinutes(1);
            _tokensPerRefill = tokensPerRefill;
        }

        public async Task<bool> TryAcquireAsync(string key, CancellationToken cancellationToken = default)
        {
            var bucket = _buckets.GetOrAdd(key, _ => new TokenBucket(_capacity));
            return await bucket.TryConsumeAsync(_tokensPerRefill, _refillInterval, cancellationToken);
        }

        private class TokenBucket
        {
            private readonly int _capacity;
            private int _tokens;
            private DateTime _lastRefill;
            private readonly SemaphoreSlim _semaphore;

            public TokenBucket(int capacity)
            {
                _capacity = capacity;
                _tokens = capacity;
                _lastRefill = DateTime.UtcNow;
                _semaphore = new SemaphoreSlim(1, 1);
            }

            public async Task<bool> TryConsumeAsync(
                int tokensToConsume,
                TimeSpan refillInterval,
                CancellationToken cancellationToken)
            {
                await _semaphore.WaitAsync(cancellationToken);

                try
                {
                    Refill(refillInterval);

                    if (_tokens >= tokensToConsume)
                    {
                        _tokens -= tokensToConsume;
                        return true;
                    }

                    return false;
                }
                finally
                {
                    _semaphore.Release();
                }
            }

            private void Refill(TimeSpan refillInterval)
            {
                var now = DateTime.UtcNow;
                var timePassed = now - _lastRefill;
                var refillCount = (int)(timePassed.TotalMilliseconds / refillInterval.TotalMilliseconds);

                if (refillCount > 0)
                {
                    _tokens = Math.Min(_capacity, _tokens + refillCount);
                    _lastRefill = now;
                }
            }
        }
    }
}

7. API Controller Implementation

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using LLMIntegration.Interfaces;
using LLMIntegration.Services;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Logging;

namespace LLMIntegration.Controllers
{
    [ApiController]
    [Route("api/[controller]")]
    public class ChatController : ControllerBase
    {
        private readonly ILLMService _llmService;
        private readonly IConversationManager _conversationManager;
        private readonly IRateLimiter _rateLimiter;
        private readonly ILogger<ChatController> _logger;

        public ChatController(
            ILLMService llmService,
            IConversationManager conversationManager,
            IRateLimiter rateLimiter,
            ILogger<ChatController> logger)
        {
            _llmService = llmService;
            _conversationManager = conversationManager;
            _rateLimiter = rateLimiter;
            _logger = logger;
        }

        [HttpPost("completion")]
        public async Task<ActionResult<ChatResponse>> GenerateCompletion(
            [FromBody] ChatRequest request,
            CancellationToken cancellationToken)
        {
            var userId = GetUserId();

            if (!await _rateLimiter.TryAcquireAsync(userId, cancellationToken))
            {
                return StatusCode(429, "Rate limit exceeded. Please try again later.");
            }

            try
            {
                _conversationManager.AddMessage(
                    request.ConversationId,
                    new ChatMessage { Role = "user", Content = request.Message });

                var messages = _conversationManager.GetMessages(request.ConversationId);

                var options = new LLMOptions
                {
                    Model = request.Model ?? "gpt-3.5-turbo",
                    Temperature = request.Temperature ?? 0.7,
                    MaxTokens = request.MaxTokens ?? 1000
                };

                var response = await _llmService.GenerateChatCompletionAsync(
                    messages,
                    options,
                    cancellationToken);

                _conversationManager.AddMessage(
                    request.ConversationId,
                    new ChatMessage { Role = "assistant", Content = response.Content });

                return Ok(new ChatResponse
                {
                    Message = response.Content,
                    TokensUsed = response.TokensUsed,
                    Model = response.Model,
                    ConversationId = request.ConversationId
                });
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Error generating completion");
                return StatusCode(500, "An error occurred processing your request");
            }
        }

        [HttpPost("stream")]
        public async Task StreamCompletion(
            [FromBody] ChatRequest request,
            CancellationToken cancellationToken)
        {
            var userId = GetUserId();

            if (!await _rateLimiter.TryAcquireAsync(userId, cancellationToken))
            {
                Response.StatusCode = 429;
                await Response.WriteAsync("Rate limit exceeded");
                return;
            }

            Response.ContentType = "text/event-stream";

            await foreach (var token in _llmService.StreamCompletionAsync(
                request.Message,
                new LLMOptions { Model = request.Model ?? "gpt-3.5-turbo" },
                cancellationToken))
            {
                await Response.WriteAsync($"data: {token}\n\n");
                await Response.Body.FlushAsync(cancellationToken);
            }
        }

        [HttpDelete("conversation/{conversationId}")]
        public IActionResult ClearConversation(string conversationId)
        {
            _conversationManager.ClearConversation(conversationId);
            return NoContent();
        }

        private string GetUserId()
        {
            return User?.Identity?.Name ?? HttpContext.Connection.RemoteIpAddress?.ToString() ?? "anonymous";
        }
    }

    public class ChatRequest
    {
        public string ConversationId { get; set; }
        public string Message { get; set; }
        public string Model { get; set; }
        public double? Temperature { get; set; }
        public int? MaxTokens { get; set; }
    }

    public class ChatResponse
    {
        public string Message { get; set; }
        public int TokensUsed { get; set; }
        public string Model { get; set; }
        public string ConversationId { get; set; }
    }
}

8. appsettings.json Configuration

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*",
  "LLM": {
    "Provider": "OpenAI",
    "OpenAI": {
      "ApiKey": "your-openai-api-key-here",
      "BaseUrl": "https://api.openai.com/v1",
      "DefaultModel": "gpt-3.5-turbo"
    },
    "AzureOpenAI": {
      "ApiKey": "your-azure-openai-key-here",
      "Endpoint": "https://your-resource.openai.azure.com",
      "DeploymentName": "your-deployment-name",
      "ApiVersion": "2023-12-01-preview"
    },
    "Anthropic": {
      "ApiKey": "your-anthropic-key-here",
      "BaseUrl": "https://api.anthropic.com/v1",
      "DefaultModel": "claude-3-sonnet-20240229"
    },
    "Ollama": {
      "BaseUrl": "http://localhost:11434",
      "DefaultModel": "llama2"
    }
  }
}

9. Caching Service with Decorator Pattern

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using LLMIntegration.Interfaces;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Logging;

namespace LLMIntegration.Services
{
    public interface ICachedLLMService : ILLMService
    {
        void ClearCache();
    }

    public class CachedLLMService : ICachedLLMService
    {
        private readonly ILLMService _innerService;
        private readonly IMemoryCache _cache;
        private readonly ILogger<CachedLLMService> _logger;
        private readonly TimeSpan _cacheDuration = TimeSpan.FromHours(1);

        public CachedLLMService(
            ILLMService innerService,
            IMemoryCache cache,
            ILogger<CachedLLMService> logger)
        {
            _innerService = innerService;
            _cache = cache;
            _logger = logger;
        }

        public async Task<LLMResponse> GenerateCompletionAsync(
            string prompt,
            LLMOptions options = null,
            CancellationToken cancellationToken = default)
        {
            var cacheKey = GenerateCacheKey(prompt, options);

            if (_cache.TryGetValue(cacheKey, out LLMResponse cachedResponse))
            {
                _logger.LogInformation("Cache hit for prompt");
                return cachedResponse;
            }

            var response = await _innerService.GenerateCompletionAsync(prompt, options, cancellationToken);

            _cache.Set(cacheKey, response, _cacheDuration);

            return response;
        }

        public async Task<LLMResponse> GenerateChatCompletionAsync(
            List<ChatMessage> messages,
            LLMOptions options = null,
            CancellationToken cancellationToken = default)
        {
            var cacheKey = GenerateCacheKey(string.Join("|", messages.Select(m => $"{m.Role}:{m.Content}")), options);

            if (_cache.TryGetValue(cacheKey, out LLMResponse cachedResponse))
            {
                _logger.LogInformation("Cache hit for chat completion");
                return cachedResponse;
            }

            var response = await _innerService.GenerateChatCompletionAsync(messages, options, cancellationToken);

            _cache.Set(cacheKey, response, _cacheDuration);

            return response;
        }

        public IAsyncEnumerable<string> StreamCompletionAsync(
            string prompt,
            LLMOptions options = null,
            CancellationToken cancellationToken = default)
        {
            return _innerService.StreamCompletionAsync(prompt, options, cancellationToken);
        }

        public void ClearCache()
        {
            if (_cache is MemoryCache memoryCache)
            {
                memoryCache.Compact(1.0);
            }
        }

        private string GenerateCacheKey(string content, LLMOptions options)
        {
            var keyContent = $"{content}|{options?.Model}|{options?.Temperature}|{options?.MaxTokens}";
            using var sha256 = SHA256.Create();
            var hashBytes = sha256.ComputeHash(Encoding.UTF8.GetBytes(keyContent));
            return Convert.ToBase64String(hashBytes);
        }
    }
}

10. Usage Example in a Console Application

using System;
using System.Threading.Tasks;
using LLMIntegration.Interfaces;
using LLMIntegration.Services;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

namespace LLMIntegration.Examples
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var configuration = new ConfigurationBuilder()
                .AddJsonFile("appsettings.json")
                .AddEnvironmentVariables()
                .Build();

            var services = new ServiceCollection();
            services.AddLogging(builder => builder.AddConsole());
            services.AddHttpClient();
            services.AddSingleton<IConfiguration>(configuration);
            services.AddSingleton<ILLMService, OpenAIService>();

            var serviceProvider = services.BuildServiceProvider();
            var llmService = serviceProvider.GetRequiredService<ILLMService>();

            Console.WriteLine("Enter your prompt (or 'exit' to quit):");
            
            while (true)
            {
                Console.Write("\nYou: ");
                var prompt = Console.ReadLine();

                if (prompt?.ToLower() == "exit")
                    break;

                Console.Write("Assistant: ");

                await foreach (var token in llmService.StreamCompletionAsync(prompt))
                {
                    Console.Write(token);
                }

                Console.WriteLine();
            }
        }
    }
}