Stop forcing Rails to do AI (at least for now). Learn how to build a FastAPI service for AI calls and integrate it properly with your Rails app, with real production code.
Your First AI Feature in Rails: Calling Python Services the Right Way
Complete Working Code Available
All code from this tutorial is available as a fully functional demo at github.com/bullrico/code_examples.
Clone and run it in under 2 minutes:
git clone https://github.com/bullrico/code_examples.git
cd code_examples/01_rails_python_ai_services
docker compose up
The demo includes mock AI responses so you can test without a GPU or model installation.
Why Rails Developers Should Stop Trying to Make Rails Do Everything
After 17 years with Rails and two years deep in AI development, I’ve learned something that might surprise you: Rails isn’t the right tool for AI workloads. And that’s perfectly fine.

When I started building AI features for production applications, my first instinct was to cram everything into Rails. After all, if you have a hammer, everything looks like a nail, right? But after months of fighting with Ruby’s ML ecosystem, dealing with memory bloat from AI libraries, and watching response times crawl, I made a decision that changed everything: I moved AI processing to Python services and never looked back.

This isn’t about abandoning Rails; it’s about using the right tool for each job. Rails excels at web applications, databases, and business logic. Python excels at AI, data science, and machine learning. Let’s build a system that leverages both.
Rails 8 handles the web layer, admin dashboard, and business logic using Solid Queue for background jobs and Solid Cache for caching—no Redis needed. The FastAPI service handles all AI processing. PostgreSQL stores everything.
Let’s start with the Python side. Create a new directory for your AI service:
mkdir rails_ai_service
cd rails_ai_service
Create a requirements.txt file:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/ai_service/requirements.txt
fastapi==0.115.0
uvicorn==0.32.0
pydantic==2.10.3
python-dotenv==1.0.1
python-jose[cryptography]==3.3.0
slowapi==0.1.9
# llama-cpp-python is installed separately in Docker with GPU support
Install the dependencies:
pip install -r requirements.txt
Now create your FastAPI application in main.py:
# https://github.com/bullrico/code_examples/blob/main/01_rails_python_ai_services/ai_service/main.py
import os
import logging
from fastapi import FastAPI, HTTPException, Depends, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from pydantic import BaseModel
from typing import Optional
import time
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
import hashlib
import hmac
import random

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Rate limiting
limiter = Limiter(key_func=get_remote_address)
app = FastAPI(title="Rails AI Service", version="1.0.0")
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Security
security = HTTPBearer()
API_SECRET = os.getenv("API_SECRET", "default-secret-change-in-production")


def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    """Verify the API token"""
    token = credentials.credentials
    expected_token = hashlib.sha256(API_SECRET.encode()).hexdigest()
    if not hmac.compare_digest(token, expected_token):
        raise HTTPException(status_code=403, detail="Invalid authentication")
    return token


# Mock mode for testing without actual model
MOCK_MODE = os.getenv("MOCK_MODE", "true").lower() == "true"

if MOCK_MODE:
    logger.info("Running in MOCK MODE - no actual model loaded")
    llm = None
else:
    # This would load the actual model in production
    try:
        from llama_cpp import Llama

        model_path = os.getenv("MODEL_PATH", "/models/llama-2-7b-chat.Q4_K_M.gguf")
        n_gpu_layers = int(os.getenv("N_GPU_LAYERS", "-1"))

        if n_gpu_layers == -1:
            logger.info("Using GPU acceleration with all layers")
        elif n_gpu_layers == 0:
            logger.warning("GPU disabled, using CPU (will be slow)")
        else:
            logger.info(f"Using GPU acceleration with {n_gpu_layers} layers")

        llm = Llama(
            model_path=model_path,
            n_ctx=4096,  # Context window
            n_gpu_layers=n_gpu_layers,
            n_threads=8,
            verbose=False
        )
        logger.info(f"Loaded model from {model_path}")
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        logger.info("Falling back to mock mode")
        MOCK_MODE = True
        llm = None


class TextAnalysisRequest(BaseModel):
    text: str
    prompt: Optional[str] = "Analyze the sentiment of this text:"
    max_tokens: Optional[int] = 500
    temperature: Optional[float] = 0.3


class TextAnalysisResponse(BaseModel):
    result: str
    model_used: str
    tokens_used: int
    processing_time: float
    success: bool


@app.get("/health")
async def health_check():
    """Health check endpoint for monitoring"""
    return {"status": "healthy", "service": "rails-ai-service"}


@app.post("/analyze-text", response_model=TextAnalysisResponse)
@limiter.limit("10/minute")
async def analyze_text(
    request: Request,
    analysis_request: TextAnalysisRequest,
    token: str = Depends(verify_token)
):
    """Analyze text using local Llama model with rate limiting and authentication"""
    start_time = time.time()

    try:
        logger.info(f"Processing text analysis")

        if MOCK_MODE or llm is None:
            # Generate a mock response for testing
            sentiments = ["positive", "negative", "neutral", "mixed"]
            themes = ["technology", "business", "personal growth", "innovation", "collaboration"]

            selected_sentiment = random.choice(sentiments)
            selected_themes = random.sample(themes, k=min(3, len(themes)))

            result_text = f"""Based on the analysis of the provided text:

Sentiment: The overall sentiment appears to be {selected_sentiment}.

Key Themes Identified:
{chr(10).join(f'- {theme.capitalize()}' for theme in selected_themes)}

Summary: The text contains approximately {len(analysis_request.text.split())} words.

[Note: This is a mock response for testing]"""

            tokens = len(analysis_request.text.split()) + 150
            model_name = "mock-model-for-testing"
        else:
            # Llama 2 Chat format
            full_prompt = f"""[INST] <<SYS>>
{analysis_request.prompt}
<</SYS>>

{analysis_request.text} [/INST]"""

            # Generate response
            response = llm(
                full_prompt,
                max_tokens=analysis_request.max_tokens,
                temperature=analysis_request.temperature,
                stop=["[INST]", "</s>"],
                echo=False
            )

            result_text = response['choices'][0]['text'].strip()
            tokens = response['usage']['total_tokens']
            model_name = "llama-2-7b-chat"

        processing_time = time.time() - start_time

        result = TextAnalysisResponse(
            result=result_text,
            model_used=model_name,
            tokens_used=tokens,
            processing_time=processing_time,
            success=True
        )

        logger.info(f"Analysis completed in {processing_time:.2f}s")
        return result

    except Exception as e:
        processing_time = time.time() - start_time
        logger.error(f"Error processing request: {str(e)}")
        raise HTTPException(
            status_code=500,
            detail={
                "error": str(e),
                "processing_time": processing_time,
                "success": False
            }
        )


if __name__ == "__main__":
    import uvicorn
    port = int(os.getenv("PORT", 8001))
    uvicorn.run(app, host="0.0.0.0", port=port)
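Start the service with python main.py. It listens on http://localhost:8001 (or the port set in PORT), and you can test it by visiting http://localhost:8001/docs to see the automatic API documentation. With the default MOCK_MODE=true no model is loaded; set MOCK_MODE=false and point MODEL_PATH at a model file to run with GPU acceleration.

Note: With a real model, the first request may be slower while the model warms up in GPU memory. Subsequent requests will be much faster.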
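On the Rails side, the token to send is the SHA-256 hex digest of the shared API_SECRET, because that is what verify_token compares against. The following is a minimal sketch of building a matching request with Ruby's standard-library Net::HTTP, chosen here only so the snippet runs without extra gems. The helper names bearer_token and build_analysis_request are hypothetical, and the demo's own client may be organized differently.

```ruby
require "digest"
require "json"
require "net/http"
require "uri"

# Hypothetical helper: derive the token the FastAPI service expects,
# i.e. the SHA-256 hex digest of the shared secret.
def bearer_token(secret)
  Digest::SHA256.hexdigest(secret)
end

# Hypothetical helper: build (but do not send) a POST for /analyze-text.
# The JSON field names match TextAnalysisRequest in main.py.
def build_analysis_request(base_url, secret, text, max_tokens: 500, temperature: 0.3)
  uri = URI.join(base_url, "/analyze-text")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{bearer_token(secret)}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate({ text: text, max_tokens: max_tokens, temperature: temperature })
  request
end

# Sending it against a running service could look like this:
#   uri = URI("http://localhost:8001")
#   req = build_analysis_request(uri.to_s, ENV.fetch("API_SECRET"), "Rails is great")
#   res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
#   puts JSON.parse(res.body)["result"] if res.is_a?(Net::HTTPSuccess)
```

Both sides must read the same API_SECRET value; if they differ, the service answers with a 403.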
Faraday is a widely used HTTP client for production Rails apps, with excellent middleware support. Add it to your Gemfile:
gem "faraday"
After building several AI-powered features in production Rails apps (including an LLM routing platform I’m currently working on), here are the key architectural insights:
AI services can be slow. Set aggressive timeouts to keep your Rails app responsive:
# Set appropriate timeouts (already included in ai_service.rb)
f.options.timeout = 15 # For synchronous calls
f.options.timeout = 60 # For background jobs
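To make the two budgets concrete, here is a small sketch that applies the same 15 and 60 second limits using Ruby's standard-library Net::HTTP instead of Faraday. The TIMEOUTS table, the 2 second connect limit, and the build_http helper are illustrative assumptions, not part of the demo.

```ruby
require "net/http"
require "uri"

# Read timeouts in seconds, taken from the values above; the keys are hypothetical.
TIMEOUTS = { sync: 15, background: 60 }.freeze

# Returns a configured Net::HTTP object that has not connected yet.
def build_http(base_url, mode: :sync)
  uri = URI(base_url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = 2                     # assumed value: fail fast if the service is down
  http.read_timeout = TIMEOUTS.fetch(mode)  # how long to wait for the model's answer
  http
end
```

A controller action would use the default :sync mode, while a background job could pass mode: :background to allow slower generations without tying up a web request.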
AI APIs charge by tokens. Track usage to avoid surprise bills:
# Example: app/models/post.rb (not included in demo)
class Post < ApplicationRecord
  after_update :track_token_usage, if: :saved_change_to_analysis_tokens_used?

  private

  def track_token_usage
    Rails.logger.info "Post #{id} used #{analysis_tokens_used} tokens"
    # Send to your monitoring service
  end
end
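Logging each call is a start, but a running total makes it easier to notice when usage drifts. The following plain-Ruby sketch shows one possible shape for that. TokenTally and its budget parameter are hypothetical names, and the in-memory counter is an assumption for illustration; a real app would persist the total or forward it to monitoring.

```ruby
# Hypothetical in-memory tally of tokens_used values returned by the AI service.
class TokenTally
  attr_reader :total

  def initialize(budget:)
    @budget = budget
    @total = 0
  end

  # Adds one response's tokens_used and returns true once the budget is exceeded.
  def record(tokens_used)
    @total += Integer(tokens_used)
    over_budget?
  end

  def over_budget?
    @total > @budget
  end
end
```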
# Example usage (already handled in ai_service.rb)
def analyze_text_safe(text)
  result = AiService.analyze_text(text)
  if result[:success]
    result[:result]
  else
    "Analysis temporarily unavailable"
  end
end
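The pattern above relies on AiService.analyze_text returning a hash with a :success key even when the Python service is slow or down. One way that could be arranged, sketched here with a hypothetical with_ai_fallback helper (not the demo's implementation), is to rescue timeout and connection errors and convert them into a failure hash:

```ruby
require "net/http"
require "socket"

# Hypothetical wrapper: runs the block that performs the HTTP call and turns
# timeouts or connection failures into the { success: false } shape.
def with_ai_fallback
  yield
rescue Timeout::Error, SystemCallError, SocketError => e
  # Net::OpenTimeout and Net::ReadTimeout are Timeout::Error subclasses;
  # Errno::ECONNREFUSED is a SystemCallError.
  { success: false, error: e.class.name }
end
```

Any other exception still propagates, which keeps genuine bugs visible instead of hiding them behind the fallback message.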
Download a Llama 2 model (7B parameter version, publicly available):
mkdir models
cd models
# Download Llama 2 7B Chat model (Q4_K_M quantization for good balance of size/quality)
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
cd ..
For production deployments, you can use a proper Dockerfile. View examples at github.com/bullrico/code_examples.
Start everything with one command:
docker compose up
Your complete AI-powered Rails stack is now running with local GPU acceleration. The Llama model runs entirely on your hardware - no API keys, no external dependencies, no usage limits.
Running Llama 2 locally changes everything. No API keys, no rate limits, no per-token costs: just pure performance on your own hardware. The combination of Rails 8’s simplified stack (Solid Queue, Solid Cache) with local AI models gives you complete control over your AI infrastructure.

This architecture has proven invaluable in production. By separating Rails for web and FastAPI for AI, each tool does what it does best. Your data never leaves your servers, you can customize models for your specific use case, and you’re not dependent on any external AI provider.

Rails developers don’t need to become ML experts; you just need a GPU and this simple integration pattern. Your Rails app remains the conductor, while Python handles the AI heavy lifting.