All code from this tutorial is available as a fully functional demo at github.com/bullrico/code_examples. You can clone and run it in under 2 minutes with docker compose up. The demo includes mock AI responses so you can test without a GPU or model installation.

Why I Stopped Trying to Make Rails Do Everything

When I first started experimenting with AI features, I naturally tried to do everything in Rails-that’s just what felt comfortable. But honestly? It was a mess. I spent weeks wrestling with Ruby’s ML libraries (which, let’s be real, aren’t great), watching my Rails app’s memory usage balloon, and getting response times that made me cringe. I kept thinking “there has to be a better way.” Eventually, I gave up on the “everything in Rails” approach and split things up: Rails for what it’s good at, Python for AI stuff. This post is basically my notes on how that worked out - spoiler alert: pretty well, but not without some bumps along the way.

What You’ll Need

Here’s what I’m working with - you might need to adjust based on your setup:
  • Rails 8 (I’m using Solid Queue and Solid Cache, which turned out way simpler than Redis)
  • Python 3.11+
  • Some kind of GPU (I’ve tested this on an NVIDIA RTX 4070)
  • Decent RAM (16GB seems to be the sweet spot)
  • Docker and Docker Compose (honestly makes everything easier)
  • PostgreSQL (whatever recent version you have)
Fair warning: I originally tried to run this on CPU-only and it was painfully slow. GPU really does make a difference for local models.

The Architecture That Actually Worked

After some trial and error, here’s the setup that actually worked for me: Rails 8 handles the web layer, admin dashboard, and business logic using Solid Queue for background jobs and Solid Cache for caching-no Redis needed. The FastAPI service handles all AI processing on port 8001. PostgreSQL stores everything. Each service runs in its own Docker container with shared networking. This looks clean in theory, but getting here took some iteration. My first version had the Rails app talking directly to a Python script, then I tried making everything run on the same server, then I overcomplicated it with message queues. This Docker Compose approach is what finally clicked.

What I Tried First

Before I get into the solution that worked, let me share what I tried first - maybe you’ll recognize some of these mistakes. First, I tried using ruby-openai and various ML gems. The integration was clean, but performance was terrible and I kept running into memory issues with larger models. Then I tried calling Python scripts via system calls (roughly the sketch below). It worked for about five minutes until I realized how brittle and slow it was. After these failures, I finally accepted that maybe Rails doesn’t need to do everything, and that’s when I started exploring separate services.
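For the curious, the shell-out attempt looked roughly like this - a simplified sketch, not in the demo repo, with a hypothetical analyze.py standing in for whatever script you'd call:
# Don't do this - kept only to show why I abandoned it
require "shellwords"
require "json"

class NaiveAiService
  def self.analyze_text(text)
    # Spawns a fresh Python process per call: slow startup, no timeout,
    # and you parse stdout and hope nothing else got printed to it
    output = `python3 analyze.py #{Shellwords.escape(text)}`
    raise "Python script failed" unless $?.success?
    JSON.parse(output)
  end
end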

Building the Python Service

I’ll start with the Python side since that’s where I spent most of my debugging time. Fair warning: this took me a few iterations to get right.
mkdir rails_ai_service  # I called mine "ai_experiment" first, but this is clearer
cd rails_ai_service
Here are the dependencies I settled on after trying a bunch of different combinations:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/ai_service/requirements.txt

fastapi==0.115.0
uvicorn==0.32.0
pydantic==2.10.3
python-dotenv==1.0.1
python-jose[cryptography]==3.3.0
slowapi==0.1.9
# llama-cpp-python is installed separately in Docker with GPU support
Install the dependencies:
pip install -r requirements.txt
Now create your FastAPI application in main.py:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/ai_service/main.py

import os
import logging
from fastapi import FastAPI, HTTPException, Depends, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import BaseModel
from typing import Optional
import time
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import hashlib
import hmac
import random

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Rate limiting
limiter = Limiter(key_func=get_remote_address)
app = FastAPI(title="Rails AI Service", version="1.0.0")
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Security
security = HTTPBearer()
API_SECRET = os.getenv("API_SECRET", "default-secret-change-in-production")

def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    """Verify the API token"""
    token = credentials.credentials
    expected_token = hashlib.sha256(API_SECRET.encode()).hexdigest()
    if not hmac.compare_digest(token, expected_token):
        raise HTTPException(status_code=403, detail="Invalid authentication")
    return token

# Mock mode for testing without actual model
MOCK_MODE = os.getenv("MOCK_MODE", "true").lower() == "true"

if MOCK_MODE:
    logger.info("Running in MOCK MODE - no actual model loaded")
    llm = None
else:
    # This would load the actual model in production
    try:
        from llama_cpp import Llama
        model_path = os.getenv("MODEL_PATH", "/models/llama-2-7b-chat.Q4_K_M.gguf")
        n_gpu_layers = int(os.getenv("N_GPU_LAYERS", "-1"))
        
        if n_gpu_layers == -1:
            logger.info("Using GPU acceleration with all layers")
        elif n_gpu_layers == 0:
            logger.warning("GPU disabled, using CPU (will be slow)")
        else:
            logger.info(f"Using GPU acceleration with {n_gpu_layers} layers")
        
        llm = Llama(
            model_path=model_path,
            n_ctx=4096,  # Context window
            n_gpu_layers=n_gpu_layers,
            n_threads=8,
            verbose=False
        )
        logger.info(f"Loaded model from {model_path}")
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        logger.info("Falling back to mock mode")
        MOCK_MODE = True
        llm = None

class TextAnalysisRequest(BaseModel):
    text: str
    prompt: Optional[str] = "Analyze the sentiment of this text:"
    max_tokens: Optional[int] = 500
    temperature: Optional[float] = 0.3

class TextAnalysisResponse(BaseModel):
    result: str
    model_used: str
    tokens_used: int
    processing_time: float
    success: bool

@app.get("/health")
async def health_check():
    """Health check endpoint for monitoring"""
    return {"status": "healthy", "service": "rails-ai-service"}

@app.post("/analyze-text", response_model=TextAnalysisResponse)
@limiter.limit("10/minute")
async def analyze_text(
    request: Request,
    analysis_request: TextAnalysisRequest,
    token: str = Depends(verify_token)
):
    """Analyze text using local Llama model with rate limiting and authentication"""
    start_time = time.time()
    
    try:
        logger.info(f"Processing text analysis")
        
        if MOCK_MODE or llm is None:
            # Generate a mock response for testing
            sentiments = ["positive", "negative", "neutral", "mixed"]
            themes = ["technology", "business", "personal growth", "innovation", "collaboration"]
            
            selected_sentiment = random.choice(sentiments)
            selected_themes = random.sample(themes, k=min(3, len(themes)))
            
            result_text = f"""Based on the analysis of the provided text:

Sentiment: The overall sentiment appears to be {selected_sentiment}.

Key Themes Identified:
{chr(10).join(f'- {theme.capitalize()}' for theme in selected_themes)}

Summary: The text contains approximately {len(analysis_request.text.split())} words.

[Note: This is a mock response for testing]"""
            
            tokens = len(analysis_request.text.split()) + 150
            model_name = "mock-model-for-testing"
        else:
            # Llama 2 chat prompt format ([INST] / <<SYS>> wrappers)
            full_prompt = f"""[INST] <<SYS>>
{analysis_request.prompt}
<</SYS>>

{analysis_request.text} [/INST]"""
            
            # Generate response
            response = llm(
                full_prompt,
                max_tokens=analysis_request.max_tokens,
                temperature=analysis_request.temperature,
                stop=["[INST]", "</s>"],
                echo=False
            )
            
            result_text = response['choices'][0]['text'].strip()
            tokens = response['usage']['total_tokens']
            model_name = "llama-2-7b-chat"
        
        processing_time = time.time() - start_time
        
        result = TextAnalysisResponse(
            result=result_text,
            model_used=model_name,
            tokens_used=tokens,
            processing_time=processing_time,
            success=True
        )
        
        logger.info(f"Analysis completed in {processing_time:.2f}s")
        return result
        
    except Exception as e:
        processing_time = time.time() - start_time
        logger.error(f"Error processing request: {str(e)}")
        
        raise HTTPException(
            status_code=500,
            detail={
                "error": str(e),
                "processing_time": processing_time,
                "success": False
            }
        )

if __name__ == "__main__":
    import uvicorn
    port = int(os.getenv("PORT", 8001))
    uvicorn.run(app, host="0.0.0.0", port=port)
Create a .env file:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/.env.example

API_SECRET=your-secure-random-secret-here
MODEL_PATH=/models/llama-2-7b-chat.Q4_K_M.gguf
Start your service:
python main.py
Your FastAPI service is now running on http://localhost:8001 - in mock mode by default, or with GPU acceleration once you set MOCK_MODE=false and point MODEL_PATH at a real model. You can test it by visiting http://localhost:8001/docs to see the automatic API documentation. With a real model, startup takes a while because the weights load into GPU memory; once they’re loaded, requests are much faster.
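You can also hit the endpoint directly before any Rails code exists. Here’s a quick Ruby sanity check I find handy - not part of the demo repo, and it assumes the same API_SECRET the service was started with:
# quick_check.rb - throwaway smoke test, not in the demo repo
require "net/http"
require "json"
require "digest"

# The service expects "Bearer <SHA256 of API_SECRET>" (see verify_token in main.py)
token = Digest::SHA256.hexdigest(ENV.fetch("API_SECRET", "default-secret-change-in-production"))

uri = URI("http://localhost:8001/analyze-text")
request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json",
                                   "Authorization" => "Bearer #{token}")
request.body = { text: "Shipping this feature felt great." }.to_json

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts JSON.parse(response.body)["result"]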

Integrating with Rails Using Faraday

Faraday has been my go-to HTTP client for production Rails apps - the middleware support for retries, timeouts, and instrumentation is excellent. Add to your Gemfile:
gem 'faraday', '~> 2.7'
gem 'faraday-retry', '~> 2.2'
gem 'connection_pool', '~> 2.4'
Create a service object in app/services/ai_service.rb:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/rails/app/services/ai_service.rb

require 'digest'

class AiService
  class << self
    def analyze_text(text, prompt: nil)
      body = {
        text: text,
        prompt: prompt
      }.compact
      
      Rails.logger.info "Sending AI analysis request for #{text.length} characters"
      
      response = connection.post('/analyze-text', body.to_json)
      
      if response.success?
        result = JSON.parse(response.body)
        Rails.logger.info "AI analysis completed in #{result['processing_time']}s"
        result.with_indifferent_access
      else
        Rails.logger.error "AI service error: #{response.status} - #{response.body}"
        handle_error_response(response)
      end
    rescue Faraday::TimeoutError => e
      Rails.logger.error "AI service timeout: #{e.message}"
      { success: false, error: 'Request timed out - try again' }
    rescue Faraday::Error => e
      Rails.logger.error "AI service connection error: #{e.message}"
      { success: false, error: 'Service temporarily unavailable' }
    end
    
    private
    
    def connection
      @connection ||= Faraday.new(
        url: ENV.fetch('AI_SERVICE_URL', 'http://localhost:8001'),
        headers: { 
          'Content-Type' => 'application/json',
          'Authorization' => "Bearer #{api_token}"
        }
      ) do |f|
        f.request :retry, 
                  max: 3, 
                  interval: 0.5,
                  backoff_factor: 2,
                  exceptions: [Faraday::TimeoutError, Faraday::ConnectionFailed]
        f.options.timeout = 30
        f.options.open_timeout = 10
        f.adapter Faraday.default_adapter
      end
    end
    
    def handle_error_response(response)
      case response.status
      when 500
        { success: false, error: 'AI service error - please try again' }
      when 400
        { success: false, error: 'Invalid request format' }
      when 503
        { success: false, error: 'AI service temporarily unavailable' }
      else
        { success: false, error: "Unexpected error: #{response.status}" }
      end
    end
    
    def api_token
      # Generate token from shared secret
      Digest::SHA256.hexdigest(ENV.fetch('API_SECRET', 'default-secret-change-in-production'))
    end
  end
end
Add to your Rails .env file:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/.env.example

AI_SERVICE_URL=http://localhost:8001
API_SECRET=your-secure-random-secret-here
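With both variables set and the Python service running, it’s worth a quick sanity check from rails console before wiring up any controllers (the example text is arbitrary):
# In rails console - not part of the demo repo
result = AiService.analyze_text("I really enjoyed working with this team.")
result[:success]  # => true
result[:result]   # => the analysis text (mock or real, depending on MOCK_MODE)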
Use it in your controllers:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/rails/app/controllers/posts_controller.rb

class PostsController < ApplicationController
  def analyze
    @post = Post.find(params[:id])
    
    result = AiService.analyze_text(
      @post.content,
      prompt: "Analyze the sentiment and key themes of this blog post:"
    )
    
    if result[:success]
      @post.update!(
        ai_analysis: result[:result],
        analysis_tokens_used: result[:tokens_used]
      )
      render json: { status: 'success', analysis: result[:result] }
    else
      render json: { status: 'error', message: result[:error] }, status: 422
    end
  end
end
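The controller assumes a member route for analyze, which I haven’t shown. A minimal sketch for config/routes.rb - the exact shape is my assumption, so check the demo repo’s routes.rb if you’re following along:
# config/routes.rb (sketch)
Rails.application.routes.draw do
  resources :posts do
    member do
      post :analyze  # POST /posts/:id/analyze
    end
  end
end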

Background Processing with Solid Queue

For longer-running AI tasks, Rails 8’s Solid Queue works perfectly without Redis:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/rails/app/jobs/ai_analysis_job.rb

class AiAnalysisJob < ApplicationJob
  queue_as :ai_processing
  
  def perform(post_id)
    post = Post.find(post_id)
    result = AiService.analyze_text(post.content)
    
    if result[:success]
      post.update!(ai_analysis: result[:result])
    end
  end
end
Queue it from your controller:
# In your controller
AiAnalysisJob.perform_later(@post.id)
That’s it - Solid Queue handles retries, concurrency limits, and persistence automatically.
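One optional addition: a local model can only handle a few requests at once (more on that in the performance notes below), and Solid Queue’s concurrency controls let you cap AI jobs without extra infrastructure. A hedged sketch - the limit of 2 is a guess you’d tune to your GPU, and this isn’t in the demo:
class AiAnalysisJob < ApplicationJob
  queue_as :ai_processing

  # Solid Queue concurrency control: run at most 2 of these jobs at a time
  limits_concurrency to: 2, key: "ai_service"

  def perform(post_id)
    post = Post.find(post_id)
    result = AiService.analyze_text(post.content)
    post.update!(ai_analysis: result[:result]) if result[:success]
  end
end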

Production Lessons from Real-World AI Integration

After building several AI-powered features in production Rails apps (including an LLM routing platform I’m currently working on), here are the key architectural insights:

Always Set Timeouts

AI services can be slow. Set aggressive timeouts to keep your Rails app responsive:
# ai_service.rb above uses a 30s timeout; pick a value based on how the call is used
f.options.timeout = 15  # for synchronous, user-facing calls
f.options.timeout = 60  # for background jobs that can afford to wait
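Because the service object memoizes a single Faraday connection, another option is to override the timeout per request rather than per connection. A hedged sketch of how that could look inside AiService (not in the demo as written):
# Hypothetical variant of the POST inside AiService.analyze_text
response = connection.post('/analyze-text') do |req|
  req.body = body.to_json
  req.options.timeout = 60  # give background-job calls more headroom
end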

Monitor Token Usage

With a local model there’s no per-token bill, but token counts are still the best proxy for load - and if you ever switch to a hosted API, they turn into real money. Track usage either way:
# Example: app/models/post.rb (not included in demo)
class Post < ApplicationRecord
  after_update :track_token_usage, if: :saved_change_to_analysis_tokens_used?
  
  private
  
  def track_token_usage
    Rails.logger.info "Post #{id} used #{analysis_tokens_used} tokens"
    # Send to your monitoring service
  end
end
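Both this callback and the controller earlier assume posts has ai_analysis and analysis_tokens_used columns. The schema isn’t shown in this post, so here’s a hedged migration sketch:
# Hypothetical migration - adjust names and types to your schema
class AddAiAnalysisToPosts < ActiveRecord::Migration[8.0]
  def change
    add_column :posts, :ai_analysis, :text
    add_column :posts, :analysis_tokens_used, :integer
  end
end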

Cache with Solid Cache

Rails 8’s Solid Cache stores cache data in PostgreSQL-perfect for AI responses:
# Example: app/services/cached_ai_service.rb (not included in demo)
class CachedAiService
  def self.analyze_with_cache(text)
    cache_key = "ai_analysis:#{Digest::MD5.hexdigest(text)}"

    # Solid Cache handles this seamlessly; raising inside the block skips caching
    Rails.cache.fetch(cache_key, expires_in: 1.day) do
      result = AiService.analyze_text(text)
      result[:success] ? result : raise("Do not cache failed results")
    end
  rescue StandardError => e
    Rails.logger.error "Cache error: #{e.message}"
    AiService.analyze_text(text)
  end
end
Configure Solid Cache in config/application.rb:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/rails/config/application.rb
config.cache_store = :solid_cache_store

Handle Failures Gracefully

AI services fail. Plan for it:
# Example usage (already handled in ai_service.rb)
def analyze_text_safe(text)
  result = AiService.analyze_text(text)
  
  if result[:success]
    result[:result]
  else
    "Analysis temporarily unavailable"
  end
end

Docker Compose Setup for Development

The best way to manage multiple services is with Docker Compose. Create a docker-compose.yml:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/docker-compose.yml

services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-app_user}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-app_password}
      POSTGRES_DB: ${POSTGRES_DB:-app_development}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-app_user}"]
      interval: 5s
      timeout: 5s
      retries: 5

  rails:
    image: ruby:3.3.0
    working_dir: /app
    command: >
      bash -c "
      gem install rails &&
      rails new . --api --database=postgresql --skip-git --force &&
      bundle add faraday faraday-retry connection_pool &&
      cp /custom_code/ai_service.rb app/services/ &&
      cp /custom_code/test_controller.rb app/controllers/ &&
      cp /custom_code/routes.rb config/ &&
      rails db:create 2>/dev/null || true &&
      rails server -b 0.0.0.0"
    volumes:
      - ./rails_app:/custom_code:ro
      - rails_storage:/app
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://${POSTGRES_USER:-app_user}:${POSTGRES_PASSWORD:-app_password}@postgres:5432/${POSTGRES_DB:-app_development}
      AI_SERVICE_URL: http://fastapi:8001
      API_SECRET: ${API_SECRET:-default-secret-change-in-production}
      RAILS_ENV: development
      RAILS_LOG_TO_STDOUT: "true"
      SOLID_QUEUE_IN_PROCESS: "true"
    depends_on:
      postgres:
        condition: service_healthy

  fastapi:
    build:
      context: ./ai_service
      dockerfile: Dockerfile
    command: uvicorn main:app --host 0.0.0.0 --port 8001 --reload
    volumes:
      - ./ai_service:/app
      - ./models:/models
    ports:
      - "8001:8001"
    environment:
      API_SECRET: ${API_SECRET:-default-secret-change-in-production}
      MODEL_PATH: /models/llama-2-7b-chat.Q4_K_M.gguf
      LOG_LEVEL: info
      MOCK_MODE: "true"

volumes:
  postgres_data:
  rails_storage:
Create a .env file for your secrets:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/.env.example

API_SECRET=your-secure-random-secret-here
POSTGRES_USER=app_user
POSTGRES_PASSWORD=secure_password_here
POSTGRES_DB=app_development
Download a Llama 2 7B Chat model in GGUF format (publicly available):
mkdir models
cd models
# Download the Llama 2 7B Chat model (Q4_K_M quantization for a good balance of size and quality)
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
cd ..
For production deployments, you can use a proper Dockerfile. View examples at github.com/bullrico/code_examples. Start everything with one command:
docker compose up
Your services are now available at http://localhost:3000 (Rails) and http://localhost:8001 (FastAPI), with PostgreSQL on port 5432. Your complete AI-powered Rails stack is now running. The compose file ships with MOCK_MODE set to true; flip it to false once a model is in ./models and the Llama model runs entirely on your hardware - no API keys, no external dependencies, no usage limits.

Dockerfile for Production FastAPI Service

For production deployments with GPU support, see the complete Dockerfile examples at github.com/bullrico/code_examples.

Production Deployment

For production with GPU support, cloud GPU providers are the simplest route: RunPod is great for NVIDIA GPUs with pay-per-hour billing, Replicate offers easy deployment for AI models, and Modal provides serverless GPU compute.

On-Premise Deployment

If you have your own GPU server, deploy using the same Docker setup. The repository includes a simple Dockerfile:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/ai_service/Dockerfile

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8001

CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
For GPU support, you would swap in a CUDA base image and install llama-cpp-python built with GPU support.

Performance Considerations

With local models on GPU:
  • Response time: 200-500ms for most queries (after model is loaded)
  • First startup: 10-30 seconds while the model loads into GPU memory (this code loads the model at startup, so the wait happens before the first request)
  • Concurrent requests: Limited by GPU memory (typically 3-5 for 7B model)
  • Use background jobs for bulk processing or when response time isn’t critical - a quick bulk-enqueue sketch follows this list
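For bulk processing, queueing through Solid Queue keeps the GPU busy without blocking web requests. A minimal sketch, assuming the Post model and AiAnalysisJob from earlier:
# Not in the demo - enqueue analysis for every post that hasn't been processed yet
Post.where(ai_analysis: nil).find_each do |post|
  AiAnalysisJob.perform_later(post.id)
end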

Final Thoughts

Running Llama 2 locally changes everything. No API keys, no rate limits, no per-token costs - just pure performance on your own hardware. The combination of Rails 8’s simplified stack (Solid Queue, Solid Cache) with local AI models gives you complete control over your AI infrastructure. This architecture has proven invaluable in production: by separating Rails for web and FastAPI for AI, each tool does what it does best. Your data never leaves your servers, you can customize models for your specific use case, and you’re not dependent on any external AI provider. Rails developers don’t need to become ML experts - you just need a GPU and this simple integration pattern. Your Rails app remains the conductor, while Python handles the AI heavy lifting. Have questions about integrating AI with Rails? Want to share your own experiences? Drop me a line - I’d love to hear from you.