Your First AI Feature in Rails: Calling Python Services the Right Way
Complete Working Code Available
All code from this tutorial is available as a fully functional demo at github.com/bullrico/code_examples. Clone and run it in under 2 minutes. The demo includes mock AI responses so you can test without a GPU or model installation.
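The exact steps live in the repo’s README; assuming the demo sits at the repository root with its own docker-compose.yml, getting it running looks roughly like this:

```bash
git clone https://github.com/bullrico/code_examples.git
cd code_examples
docker compose up --build
```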
Why Rails Developers Should Stop Trying to Make Rails Do Everything
After 17 years with Rails and two years deep in AI development, I’ve learned something that might surprise you: Rails isn’t the right tool for AI workloads. And that’s perfectly fine.
When I started building AI features for production applications, my first instinct was to cram everything into Rails. After all, if you have a hammer, everything looks like a nail, right? But after months of fighting with Ruby’s ML ecosystem, dealing with memory bloat from AI libraries, and watching response times crawl, I made a decision that changed everything: I moved AI processing to Python services and never looked back.
This isn’t about abandoning Rails; it’s about using the right tool for each job. Rails excels at web applications, databases, and business logic. Python excels at AI, data science, and machine learning. Let’s build a system that leverages both.
Prerequisites
Before we dive in, make sure you have:
- Rails 8 (we’ll use Solid Queue and Solid Cache to avoid Redis)
- Python 3.11+ with GPU support
- GPU requirements:
  - NVIDIA GPU with at least 6GB VRAM (for a 7B Q4_K_M quantized model)
  - Or an Apple Silicon Mac with 8GB+ unified memory (M1/M2/M3)
- System RAM: 16GB minimum (for Rails, PostgreSQL, and services)
- Docker and Docker Compose installed
- PostgreSQL 15+
The Architecture: Multi-Service with Docker Compose
Here’s what we’re building: three services running in isolated containers on a shared internal Docker network, deployable with one command via Docker Compose.
- Rails 8 (port 3000): Solid Queue, Solid Cache, and the web UI
- FastAPI (port 8001): OpenAI/LLMs, GPU support, async API
- PostgreSQL (port 5432): app data, job queue, and cache store
Setting Up Your FastAPI Service
Let’s start with the Python side. Create a new directory for your AI service and add a requirements.txt file:
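The demo repo has the authoritative version; as a minimal sketch, assuming a local GGUF Llama 2 model served with llama-cpp-python (the version pins below are illustrative), it might look like this:

```
fastapi>=0.110
uvicorn[standard]>=0.29
pydantic>=2.6
python-dotenv>=1.0
llama-cpp-python>=0.2.60
```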
Then create main.py:
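Again, a minimal sketch rather than the demo’s exact code: it assumes llama-cpp-python, a MODEL_PATH environment variable pointing at the GGUF file, and a single /generate endpoint whose field names are illustrative.

```python
# main.py -- minimal FastAPI wrapper around a local Llama 2 GGUF model
import os

from dotenv import load_dotenv
from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

load_dotenv()

app = FastAPI(title="AI Service")

# Loaded lazily so the container starts quickly; the first request pays the
# model-load cost mentioned in the note below.
_llm = None


def get_llm() -> Llama:
    global _llm
    if _llm is None:
        _llm = Llama(
            model_path=os.environ["MODEL_PATH"],  # path to the 7B Q4_K_M .gguf file
            n_gpu_layers=int(os.getenv("N_GPU_LAYERS", "-1")),  # -1 offloads all layers to the GPU
            n_ctx=4096,
        )
    return _llm


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256


class GenerateResponse(BaseModel):
    text: str
    tokens_used: int


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest):
    result = get_llm().create_completion(prompt=req.prompt, max_tokens=req.max_tokens)
    return GenerateResponse(
        text=result["choices"][0]["text"],
        tokens_used=result["usage"]["total_tokens"],
    )
```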
And add a .env file:
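The variable names here are assumptions that match the main.py sketch above:

```
MODEL_PATH=/models/llama-2-7b-chat.Q4_K_M.gguf
N_GPU_LAYERS=-1
```

Install the requirements and start the service, for example with `uvicorn main:app --host 0.0.0.0 --port 8001`.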
The service will now be running at http://localhost:8001 with GPU acceleration. You can test it by visiting http://localhost:8001/docs to see the automatic API documentation.
Note: The first request will be slower as the model loads into GPU memory. Subsequent requests will be much faster.
Integrating with Rails Using Faraday
Faraday is the preferred HTTP client for production Rails apps: it’s fast, widely used, and has excellent middleware support. Add it to your Gemfile, then create app/services/ai_service.rb:
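Assuming `gem "faraday"` is in the Gemfile, a minimal sketch of the service class could look like the following; the /generate endpoint and response keys mirror the FastAPI sketch above, and the demo repo’s version is the one to copy:

```ruby
# app/services/ai_service.rb
class AiService
  class Error < StandardError; end

  def initialize(base_url: ENV.fetch("AI_SERVICE_URL", "http://localhost:8001"))
    @connection = Faraday.new(url: base_url) do |f|
      f.request :json          # encode request bodies as JSON
      f.response :json         # parse JSON responses into a Hash
      f.response :raise_error  # raise Faraday errors on 4xx/5xx responses
    end
  end

  # Returns a Hash like { "text" => "...", "tokens_used" => 123 }
  def generate(prompt, max_tokens: 256)
    response = @connection.post("/generate", { prompt: prompt, max_tokens: max_tokens })
    response.body
  rescue Faraday::Error => e
    raise Error, "AI service request failed: #{e.message}"
  end
end
```

Calling it is a one-liner: `AiService.new.generate("Explain Solid Queue in one sentence.")["text"]`.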
Then add the service URL to your .env file:
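The variable name is an assumption matching the service class sketch above:

```
AI_SERVICE_URL=http://localhost:8001
```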
Background Processing with Solid Queue
For longer-running AI tasks, Rails 8’s Solid Queue works perfectly without Redis:
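As a sketch (the job, the Document model, and the ai_summary column are illustrative, not taken from the demo repo):

```ruby
# app/jobs/ai_generation_job.rb
class AiGenerationJob < ApplicationJob
  queue_as :default

  # Solid Queue stores and processes this job in PostgreSQL -- no Redis required.
  def perform(document_id)
    document = Document.find(document_id)
    result = AiService.new.generate("Summarize this document:\n#{document.body}")
    document.update!(ai_summary: result["text"])
  end
end
```

Enqueue it from anywhere with `AiGenerationJob.perform_later(document.id)`.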
Production Lessons from Real-World AI Integration
After building several AI-powered features in production Rails apps (including an LLM routing platform I’m currently working on), here are the key architectural insights:
1. Always Set Timeouts
AI services can be slow. Set aggressive timeouts to keep your Rails app responsive:
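With Faraday this is a connection option; the values below are illustrative starting points, not the demo’s settings:

```ruby
Faraday.new(
  url: ENV.fetch("AI_SERVICE_URL", "http://localhost:8001"),
  request: {
    open_timeout: 2,  # seconds allowed to establish the connection
    timeout: 15       # seconds allowed for the complete response
  }
) do |f|
  f.request :json
  f.response :json
  f.response :raise_error
end
```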
2. Monitor Token Usage
AI APIs charge by tokens. Track usage to avoid surprise bills:
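One lightweight approach (AiUsage is a hypothetical model with feature:string and tokens:integer columns; the tokens_used key comes from the FastAPI sketch above):

```ruby
result = AiService.new.generate(prompt)
AiUsage.create!(feature: "summarization", tokens: result["tokens_used"])

# Quick visibility into this month's consumption:
AiUsage.where(created_at: Time.current.all_month).sum(:tokens)
```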
3. Cache with Solid Cache
Rails 8’s Solid Cache stores cache data in PostgreSQL, which is perfect for AI responses. Enable it in config/application.rb:
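A minimal setup (in Rails 8, Solid Cache is already the default in production, so this line may exist in your environment config instead):

```ruby
# config/application.rb
config.cache_store = :solid_cache_store
```

Then identical prompts can hit the cache instead of the GPU:

```ruby
Rails.cache.fetch(["ai-response", Digest::SHA256.hexdigest(prompt)], expires_in: 12.hours) do
  AiService.new.generate(prompt)
end
```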
4. Handle Failures Gracefully
AI services fail. Plan for it:
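A sketch of a graceful fallback, building on the AiService::Error raised in the service class above (ai_summary_for and the document argument are illustrative):

```ruby
def ai_summary_for(document)
  AiService.new.generate("Summarize:\n#{document.body}")["text"]
rescue AiService::Error => e
  Rails.logger.warn("AI summary unavailable: #{e.message}")
  nil  # render the page without a summary instead of failing the request
end
```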
Docker Compose Setup for Development
The best way to manage multiple services is with Docker Compose. Create a docker-compose.yml:
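A sketch of the shape of that file; the service and directory names (rails_app, ai_service, the models volume) are assumptions, and the demo repo contains the full working version, including GPU passthrough for the FastAPI container:

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  ai:
    build: ./ai_service
    ports:
      - "8001:8001"
    env_file: .env
    volumes:
      - ./models:/models   # mount the GGUF model file from the host

  web:
    build: ./rails_app
    ports:
      - "3000:3000"
    env_file: .env
    environment:
      AI_SERVICE_URL: http://ai:8001   # the service name resolves on the internal network
    depends_on:
      - db
      - ai

volumes:
  postgres_data:
```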
Then add a .env file for your secrets:
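For example (these names match the sketches above; keep this file out of version control):

```
POSTGRES_PASSWORD=change_me
MODEL_PATH=/models/llama-2-7b-chat.Q4_K_M.gguf
N_GPU_LAYERS=-1
```

With everything in place, `docker compose up --build` starts the whole stack, and you can reach: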
- Rails app: http://localhost:3000
- FastAPI docs: http://localhost:8001/docs
- PostgreSQL: localhost:5432
Dockerfile for Production FastAPI Service
For production deployments with GPU support, see the complete Dockerfile examples at github.com/bullrico/code_examples.
Production Deployment
For production with GPU support, there are two main options.
Option 1: Cloud GPU Providers
- RunPod: Great for NVIDIA GPUs, pay per hour
- Replicate: Easy deployment for AI models
- Modal: Serverless GPU compute
Option 2: On-Premise
If you have your own GPU server, deploy using the same Docker setup. The repository includes a simple Dockerfile:
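Roughly what that looks like for a CUDA host, as a sketch only (the base image, tags, and build flags are assumptions; the repo’s Dockerfile is the one to use):

```dockerfile
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

RUN apt-get update && \
    apt-get install -y python3 python3-pip build-essential cmake && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Build llama-cpp-python against CUDA so inference actually runs on the GPU.
ENV CMAKE_ARGS="-DGGML_CUDA=on"
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8001
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```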
Performance Considerations
With local models on GPU:
- Response time: 200-500ms for most queries (after the model is loaded)
- First request: 10-30 seconds while model loads into GPU memory
- Concurrent requests: Limited by GPU memory (typically 3-5 for 7B model)
- Use background jobs: For bulk processing or when response time isn’t critical
Final Thoughts
Running Llama 2 locally changes everything. No API keys, no rate limits, no per-token costs: just pure performance on your own hardware. The combination of Rails 8’s simplified stack (Solid Queue, Solid Cache) with local AI models gives you complete control over your AI infrastructure.
This architecture has proven invaluable in production. By separating Rails for web and FastAPI for AI, each tool does what it does best. Your data never leaves your servers, you can customize models for your specific use case, and you’re not dependent on any external AI provider.
Rails developers don’t need to become ML experts; you just need a GPU and this simple integration pattern. Your Rails app remains the conductor, while Python handles the AI heavy lifting.
What’s Next?
What else would you like me to write about or shed light on? Here are some ideas:
- Streaming responses for real-time AI chat
- Building RAG systems with Rails and Python
- Advanced error handling and circuit breakers
- Monitoring and observability for AI services
- Fine-tuning Llama models for specific domains
- Implementing semantic search with vector databases