NVIDIA RTX 5090 - Blackwell for the Home Lab
Full specs and benchmarks for the NVIDIA GeForce RTX 5090 - 32GB GDDR7, 1,792 GB/s bandwidth, Blackwell architecture, and what it means for local AI inference.

TL;DR
- 32GB GDDR7 at 1,792 GB/s - the most VRAM and bandwidth ever on a consumer GPU, enough to run 30B-class models comfortably at 4-bit quantization and 70B-class models at aggressive 3-bit quants
- Blackwell architecture with ~400 TFLOPS estimated FP8 performance - roughly 20% faster than the RTX 4090 in raw throughput
- 575W TDP means you need a serious PSU and adequate cooling - this card draws more than some entire systems
- MSRP is $1,999 but street prices have been $2,200-$2,500+ due to demand - availability has been inconsistent since the January 2025 launch
- For home lab inference, the 32GB VRAM is the headline feature - it pushes the ceiling on what you can run locally without multi-GPU setups
Overview
The NVIDIA GeForce RTX 5090 is the first Blackwell-architecture consumer GPU and the most powerful graphics card NVIDIA has ever put into a desktop form factor. Launched in January 2025, it brings 32GB of GDDR7 memory - an 8GB jump over the RTX 4090's 24GB GDDR6X - along with a wider 512-bit memory bus that pushes bandwidth to 1,792 GB/s. For anyone running local AI inference, those numbers matter more than any gaming benchmark.
The Blackwell architecture underneath is a generational improvement over Ada Lovelace. The GPU die is built on TSMC's 4NP process node and packs roughly 21,760 CUDA cores alongside fifth-generation Tensor Cores with native FP4 support. NVIDIA has not published official FP8 TFLOPS for the consumer SKU, but based on the Tensor Core count and clock speeds, the community consensus estimate sits around 400 TFLOPS FP8 - a meaningful uplift from the RTX 4090's 330 TFLOPS. The practical impact for inference workloads scales more with memory bandwidth than raw compute in most scenarios, and 1,792 GB/s is a 78% improvement over the 4090.
The elephant in the room is power draw. At 575W TDP, the RTX 5090 requires a 16-pin 12V-2x6 power connector and NVIDIA recommends a minimum 1,000W PSU. In a home lab setting, that translates to real electricity costs. If you are running inference jobs for hours at a time, the RTX 5090 will add noticeably to your power bill compared to the 450W RTX 4090 or the ~90W M4 Max. Whether the performance premium justifies that power premium depends entirely on your workload and how much you value tokens per second versus tokens per watt.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | NVIDIA |
| Product Family | GeForce RTX 50 Series |
| Architecture | Blackwell (GB202) |
| Process Node | TSMC 4NP |
| CUDA Cores | 21,760 |
| Tensor Cores | 680 (5th gen) |
| RT Cores | 170 (4th gen) |
| Memory | 32GB GDDR7 |
| Memory Bus | 512-bit |
| Memory Bandwidth | 1,792 GB/s |
| FP8 Performance | ~400 TFLOPS (estimated) |
| FP16 Performance | ~200 TFLOPS (estimated) |
| FP4 Support | Yes (native) |
| TDP | 575W |
| Power Connector | 16-pin 12V-2x6 |
| Recommended PSU | 1,000W |
| PCIe Interface | PCIe 5.0 x16 |
| Slot Width | 2-slot (Founders Edition); 3-3.5-slot (partner cards) |
| MSRP | $1,999 |
| Release Date | January 2025 |
| NVENC Encoders | 3x (AV1) |
| CUDA Compute Capability | 12.0 (sm_120) |
Performance Benchmarks
The following benchmarks focus on local AI inference performance, which is what matters for home lab use. All tests use llama.cpp with the noted quantization unless stated otherwise. Numbers represent generation speed (tokens per second) at batch size 1.
Inference Throughput
| Benchmark | RTX 5090 (32GB) | RTX 4090 (24GB) | RTX 3090 (24GB) | M4 Max 128GB |
|---|---|---|---|---|
| Llama 3.1 8B Q4_K_M (tok/s) | ~155 | ~130 | ~70 | ~55 |
| Llama 3.1 8B Q8_0 (tok/s) | ~110 | ~95 | ~48 | ~42 |
| Llama 3.1 70B Q4_K_M (tok/s) | ~22* | N/A (VRAM) | N/A (VRAM) | ~18 |
| Mistral 7B Q4_K_M (tok/s) | ~160 | ~135 | ~75 | ~58 |
| Mixtral 8x7B Q4_K_M (tok/s) | ~45 | ~35 | N/A (VRAM) | ~28 |
| Qwen2.5 72B Q4_K_M (tok/s) | ~20* | N/A (VRAM) | N/A (VRAM) | ~17 |
| Code Llama 34B Q4_K_M (tok/s) | ~42 | ~30 | ~15 | ~22 |
| Phi-3 14B Q4_K_M (tok/s) | ~85 | ~70 | ~36 | ~40 |
| Max Model Size (Q4_K_M) | ~55B | ~38B | ~38B | ~200B+ |
*70B+ models on the RTX 5090 use partial offloading or aggressive quantization to fit in 32GB. The M4 Max with 128GB unified memory can run these models fully in-memory without offloading, which is why it keeps pace despite lower raw throughput.
Power and Efficiency
| Metric | RTX 5090 | RTX 4090 | RTX 3090 | M4 Max |
|---|---|---|---|---|
| TDP | 575W | 450W | 350W | ~90W (system) |
| Typical Inference Power | ~350-450W | ~250-350W | ~200-280W | ~60-80W (system) |
| System Idle Power | ~80-100W* | ~65-85W* | ~55-75W* | ~15-20W |
| Tokens per Watt (8B Q4) | ~0.37 | ~0.43 | ~0.28 | ~0.79 |
| Annual Cost (8hr/day, $0.12/kWh) | ~$155-$195 | ~$110-$150 | ~$85-$120 | ~$25-$35 |
*System power estimates assume a typical desktop build (Ryzen 7/9 or Core i7/i9, 32-64GB DDR5, NVMe SSD). The M4 Max figure is for the complete MacBook Pro system.
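The efficiency rows above are straightforward arithmetic, and reproducing them is a useful sanity check. A minimal sketch, assuming GPU-only wall power near the midpoint of the quoted inference ranges (the annual-cost row in the table also folds in host system draw, which is why its figures run somewhat higher than a GPU-only calculation):

```python
# Back-of-envelope check of the efficiency table: tokens-per-watt and
# annual electricity cost for an 8 h/day duty cycle at $0.12/kWh.

def tokens_per_watt(tok_s: float, watts: float) -> float:
    return tok_s / watts

def annual_cost(watts: float, hours_per_day: float = 8,
                usd_per_kwh: float = 0.12) -> float:
    """Yearly electricity cost in USD for a given sustained draw."""
    return watts / 1000 * hours_per_day * 365 * usd_per_kwh

# (device, 8B Q4 tok/s, assumed wall power during inference in W)
systems = [("RTX 5090", 155, 420), ("RTX 4090", 130, 300),
           ("RTX 3090", 70, 250), ("M4 Max", 55, 70)]
for name, tok_s, w in systems:
    print(f"{name}: {tokens_per_watt(tok_s, w):.2f} tok/W, "
          f"~${annual_cost(w):.0f}/yr GPU-only")
```

The 420 W and 300 W midpoints are assumptions for illustration; plug in your own measured wall power for a meaningful comparison.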
Community benchmarks from llama.cpp, vLLM, and Ollama users show that the RTX 5090 consistently delivers 15-25% higher tokens/second than the RTX 4090 on models that fit in 24GB. The real advantage shows up on larger models - the 32GB VRAM ceiling lets you run Mixtral 8x7B, Llama 3.1 70B (with quantization), and other models that simply do not fit on 24GB cards without painful CPU offloading.
That said, the tokens-per-watt story is not great. The Apple M4 Max delivers roughly twice the power efficiency at the wall, though at significantly lower absolute throughput for models that fit in GPU VRAM. If you are running inference 8+ hours a day, the electricity cost difference between the RTX 5090 and an M4 Max system is $120-$160 per year - not trivial over the lifetime of the hardware.
Prompt Processing (Time to First Token)
One area where the RTX 5090's compute advantage over the M4 Max becomes more apparent is prompt processing. When evaluating long prompts (thousands of tokens of context), the operation is compute-bound rather than memory-bandwidth-bound. The RTX 5090 processes a 4,000-token prompt on Llama 3.1 8B roughly 3-4x faster than the M4 Max, resulting in significantly lower time-to-first-token for long-context queries.
| Prompt Length | RTX 5090 (8B Q4) | RTX 4090 (8B Q4) | M4 Max (8B Q4) |
|---|---|---|---|
| 512 tokens | ~0.08s | ~0.10s | ~0.25s |
| 2,048 tokens | ~0.25s | ~0.32s | ~0.85s |
| 4,096 tokens | ~0.48s | ~0.60s | ~1.60s |
| 8,192 tokens | ~0.95s | ~1.15s | ~3.10s |
For interactive chat with short prompts, this difference is invisible. For RAG pipelines, code analysis, or long-document summarization where you are feeding thousands of tokens of context on every query, the RTX 5090's prompt processing speed meaningfully reduces latency.
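The table implies a prefill throughput for each device - prompt length divided by time-to-first-token - which is a handy way to check the "3-4x faster" claim. A quick sketch using the 4,096-token row:

```python
# Implied prompt-processing (prefill) throughput from the TTFT table:
# prompt tokens divided by seconds to first token.

ttft_4k = {  # seconds to first token at a 4,096-token prompt (table above)
    "RTX 5090": 0.48,
    "RTX 4090": 0.60,
    "M4 Max":   1.60,
}
prefill = {name: 4096 / t for name, t in ttft_4k.items()}
for name, rate in prefill.items():
    print(f"{name}: ~{rate:.0f} prompt tokens/s")
print(f"RTX 5090 vs M4 Max: {prefill['RTX 5090'] / prefill['M4 Max']:.1f}x")
```

The ratio comes out around 3.3x at this prompt length, consistent with the 3-4x range stated above.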
Framework-Specific Performance
Different inference frameworks extract different levels of performance from the RTX 5090. The Blackwell architecture's new features (FP4, larger L2 cache, improved Tensor Core scheduling) benefit some frameworks more than others.
llama.cpp (GGUF)
llama.cpp is the most popular inference framework for home lab use, and Blackwell support was added within weeks of the RTX 5090's launch. Performance with llama.cpp depends primarily on the GGUF quantization format used.
| Model | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|
| Llama 3.1 8B | ~155 tok/s | ~135 tok/s | ~110 tok/s | ~65 tok/s |
| Mistral 7B | ~160 tok/s | ~140 tok/s | ~115 tok/s | ~68 tok/s |
| Qwen2.5 14B | ~85 tok/s | ~75 tok/s | ~58 tok/s | ~35 tok/s |
| Code Llama 34B | ~42 tok/s | ~36 tok/s | ~25 tok/s | N/A (VRAM) |
| Mixtral 8x7B | ~45 tok/s | ~38 tok/s | N/A (VRAM) | N/A (VRAM) |
The Q4_K_M to Q5_K_M jump is worth noting. Q5_K_M produces slightly better output quality (closer to FP16 reference) at a cost of roughly 10-15% throughput. On the RTX 5090, the absolute speed is high enough that Q5_K_M is worth considering for quality-sensitive workloads where you would normally default to Q4_K_M to save VRAM on a 24GB card.
vLLM
vLLM is the standard for production inference serving. Its PagedAttention mechanism and continuous batching make it ideal for multi-user deployments. On the RTX 5090, vLLM can serve Llama 3.1 8B to 4-6 concurrent users at 25-35 tokens/second per user, depending on prompt length and batch dynamics.
For single-user interactive use, llama.cpp is faster. vLLM's overhead (HTTP server, tokenizer, sampling) adds latency per request. But for any scenario where you are serving inference to multiple clients - a local API, a web interface, a development team sharing a single GPU - vLLM's batching and scheduling make better use of the hardware.
Ollama
Ollama wraps llama.cpp in a user-friendly CLI and API layer. Performance is within 5% of raw llama.cpp - the overhead is minimal. For home lab users who want to run inference without thinking about quantization formats, layer counts, and GPU memory management, Ollama is the easiest path. Install it, pull a model, and start chatting.
The RTX 5090 is fully supported in Ollama with automatic GPU detection and memory management. Ollama will automatically load as many model layers as fit in VRAM and offload the rest to CPU - though you should avoid CPU offloading on this card, as the performance cliff is severe. If a model does not fit entirely in 32GB, consider a smaller model or more aggressive quantization rather than partial offloading.
TensorRT-LLM
NVIDIA's TensorRT-LLM provides the highest throughput for production deployments by compiling models into optimized inference graphs. On the RTX 5090, TensorRT-LLM can deliver 15-30% higher throughput than llama.cpp on well-supported model architectures (Llama, Mistral, Falcon). The tradeoff is complexity - building and running TensorRT-LLM engines requires more setup than running a GGUF file in llama.cpp, and not every model architecture is supported.
For most home lab users, the simplicity of llama.cpp or Ollama outweighs TensorRT-LLM's performance advantage. But if you are setting up a dedicated inference server that will run the same model continuously, the one-time setup cost of TensorRT-LLM is worth the ongoing throughput benefit.
Real-World Workflow Examples
Numbers in a table are useful, but here is what the RTX 5090 actually feels like in daily use.
Local coding assistant (Qwen2.5 Coder 14B). Running Qwen2.5 Coder 14B at Q4_K_M through Ollama, connected to VS Code via Continue.dev. The model fits comfortably in 32GB with room for a 16K-token context window. Response latency is typically 0.3-0.5 seconds for the first token, then ~85 tokens/second generation. Code completions feel instant. Multi-file refactoring suggestions with 4,000+ tokens of context take about 1 second for the first token. This is a fast, responsive local coding assistant that matches the feel of cloud API calls.
RAG pipeline testing (Llama 3.1 8B + long context). Building and testing a retrieval-augmented generation pipeline locally. Each query embeds retrieved documents (2,000-4,000 tokens of context) and generates a response. The RTX 5090 processes the context in 0.3-0.5 seconds and generates at ~155 tokens/second. You can iterate on retrieval strategies, prompt templates, and chunking approaches with near-instant feedback. This is significantly faster than testing against a cloud API with rate limits.
Model evaluation (comparing Llama 3.1 8B vs Mistral 7B vs Qwen2.5 7B). Running a benchmark suite across three models. On the RTX 5090, you can generate 1,000 responses from each model in roughly 3-4 minutes per model with batched generation (assuming 100-token average responses; at batch size 1 the same run takes closer to 11 minutes). Total evaluation time for three models: 10-12 minutes. On an RTX 3090, the same evaluation takes 20-25 minutes. Not a dramatic difference for a one-time evaluation, but it adds up during iterative prompt engineering.
Pushing the VRAM ceiling (Mixtral 8x7B). The 32GB advantage shows up when you load Mixtral 8x7B in Q4_K_M - approximately 26GB. On 24GB cards, this model simply does not fit. On the RTX 5090, it loads with 6GB to spare for KV cache. At ~45 tokens/second, it is responsive enough for interactive use. The quality difference between Mixtral 8x7B and a 7B model is significant for complex reasoning tasks, and the RTX 5090 is the cheapest single-GPU way to access it.
Key Capabilities
32GB VRAM Ceiling. This is the single most important spec for home lab inference. The jump from 24GB to 32GB does not sound dramatic, but it crosses critical thresholds. Mixtral 8x7B in Q4_K_M needs about 26GB - impossible on 24GB cards, comfortable on 32GB. Llama 3.1 70B in Q3_K_M fits in roughly 30GB. You can run Code Llama 34B at full Q8 precision. For anyone who has spent time carefully tuning quantization levels to squeeze models into 24GB, the extra headroom is transformative.
The 32GB also gives you more room for KV cache. Running a 7B model on a 24GB card leaves about 10GB for context, limiting effective context length at higher batch sizes. On the RTX 5090, you have 18GB of headroom after loading the same model - enough for significantly longer contexts or concurrent requests if you are serving multiple users from a home lab setup.
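How much of that headroom a given context actually consumes is easy to estimate: the FP16 KV cache is 2 (K and V) x layers x KV heads x head dimension x 2 bytes per token. A sketch using Llama 3.1 8B's published architecture (32 layers, GQA with 8 KV heads, head dimension 128):

```python
# KV-cache sizing: determines how much post-weights VRAM headroom a
# given context length consumes. Defaults match Llama 3.1 8B.

def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 K+V cache size in GiB for a given context length."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token / 2**30

for ctx in (4_096, 16_384, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):.2f} GiB")
```

At 128 KiB per token, Llama 3.1 8B's full 128K context costs 16 GiB of cache - comfortable on the 5090's ~27 GB of post-weights headroom, impossible alongside the weights on a 24GB card.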
Blackwell Tensor Cores with FP4. The fifth-generation Tensor Cores add native FP4 support alongside FP8 and FP16. FP4 inference is still early - framework support is limited and quality can degrade depending on the model - but for quantization-friendly architectures, it doubles the effective compute throughput versus FP8. As the ecosystem matures and more models are trained with FP4-aware quantization, this could become a significant advantage.
Early FP4 results from the community are mixed but promising. On models specifically trained with quantization-aware techniques, FP4 inference on the RTX 5090 shows minimal quality degradation compared to FP8 while nearly doubling throughput. On older models not designed for FP4, the quality loss is noticeable. This is a forward-looking feature - its value will increase as model developers adapt their training pipelines.
CUDA Compute Capability 12.0 Ecosystem. The RTX 5090 introduces CUDA compute capability 12.0 (sm_120), which means it benefits from NVIDIA's entire inference software stack - CUDA, cuDNN, TensorRT, TensorRT-LLM, and the growing ecosystem of optimized kernels. This matters because NVIDIA's software moat is arguably as important as its hardware lead. Every major inference framework (llama.cpp, vLLM, TGI, Ollama) has first-class CUDA support, and the RTX 5090 inherits all of it - though early adopters needed CUDA 12.8+ builds before sm_120 support settled across frameworks.
GDDR7 Memory Technology. The move from GDDR6X to GDDR7 is not just about bandwidth numbers. GDDR7 uses PAM3 signaling (three-level pulse amplitude modulation) instead of GDDR6X's PAM4, which improves signal integrity and power efficiency per bit transferred. The practical result is that the RTX 5090 achieves its 1,792 GB/s bandwidth at a lower power-per-bit ratio than you would get by simply scaling up GDDR6X clocks. This is a foundational technology shift that will carry forward into future generations.
PCIe 5.0 Support. The RTX 5090 is NVIDIA's first consumer GPU with PCIe 5.0 x16, providing roughly 64 GB/s of bandwidth in each direction to the host system. For pure inference with a single GPU, PCIe 4.0 versus 5.0 makes little practical difference - the bottleneck is GPU memory bandwidth, not host-to-device transfer. But for workloads that involve frequent model loading, CPU-GPU data sharing, or communication between a GPU and NVMe storage (like paging model layers from disk), the doubled bus bandwidth reduces overhead.
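The practical effect on model loading is easy to bound. A sketch of the best-case host-to-device copy time over each bus generation (real loads are usually gated by NVMe read speed and deserialization rather than PCIe, so treat these as floors):

```python
# Lower bound on host-to-device copy time for a model load over PCIe,
# ignoring disk and deserialization time (which usually dominate).

def copy_floor_s(model_gb: float, bus_gbs: float) -> float:
    """Best-case seconds to push model_gb across a bus of bus_gbs GB/s."""
    return model_gb / bus_gbs

for gen, bw in [("PCIe 5.0 x16", 64), ("PCIe 4.0 x16", 32)]:
    t = copy_floor_s(26, bw)  # Mixtral 8x7B Q4_K_M, ~26 GB
    print(f"{gen}: >= {t:.2f} s")
```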
Build Considerations
Building a system around the RTX 5090 requires more planning than plugging a GPU into an existing rig. Here is what you need to account for.
Power Supply. NVIDIA recommends a minimum 1,000W PSU. In practice, a quality 1,000W unit works, but a 1,200W PSU gives you comfortable headroom, especially if you are running a high-end CPU (Ryzen 9 7950X or Intel Core i9-14900K) alongside the GPU. Look for units with a native 12V-2x6 connector rather than using an adapter. Seasonic, Corsair, and be quiet! all have well-reviewed options in this range.
Cooling. The Founders Edition is a 2-slot card with a dual flow-through cooler; most AIB partner cards are 3-3.5-slot designs with massive heatsinks. Either way, the card exhausts heat into the case rather than out the back, so you need strong case airflow. A mesh-front case with at least two 140mm intake fans and one 120mm exhaust is the bare minimum. If your case is compact or poorly ventilated, the GPU will thermal-throttle under sustained inference loads, negating the performance you paid a premium for.
Motherboard and Physical Clearance. Partner cards run up to 3.5 slots wide and well over 330mm long, so the RTX 5090 does not fit in every case (the 2-slot Founders Edition is the compact outlier). Measure your case's GPU clearance before purchasing. Also verify that your motherboard has an x16 PCIe 5.0 slot - older boards with PCIe 4.0 will work but leave bandwidth on the table.
CPU Pairing. For inference workloads, the CPU matters less than you might think. The GPU does the heavy lifting, and a modern mid-range processor (Ryzen 7 7800X3D, Core i7-14700K, or equivalent) is sufficient. You do not need a 16-core workstation CPU unless you are also doing CPU-bound work like dataset preprocessing alongside inference. 32GB of system DDR5 RAM is enough for most setups; 64GB is helpful if you are doing partial CPU offloading for very large models.
Pricing and Availability
The RTX 5090 launched at a $1,999 MSRP in January 2025, but availability has been erratic. Street prices have ranged from $2,200 to $2,500+ depending on the AIB partner and market conditions. Supply appears to be improving in early 2026, but expect to pay above MSRP unless you catch a restocking window.
| Comparison | MSRP | Street Price (Typical) | VRAM |
|---|---|---|---|
| RTX 5090 | $1,999 | $2,200-$2,500 | 32GB |
| RTX 4090 | $1,599 | $1,800-$2,000 | 24GB |
| RTX 3090 (used) | N/A | $700-$900 | 24GB |
| M4 Max MacBook Pro (128GB) | N/A | $4,399+ | 128GB unified |
For dedicated inference machines, some builders are comparing the cost of one RTX 5090 versus two used RTX 3090s. Two 3090s give you 48GB total VRAM for $1,400-$1,800 - but you need multi-GPU support in your framework and a motherboard with the right PCIe lane configuration. A single RTX 5090 is simpler to set up and avoids the overhead of tensor parallelism across GPUs.
The total build cost for an RTX 5090 inference rig looks roughly like this:
| Component | Estimated Cost |
|---|---|
| RTX 5090 | $2,200-$2,500 |
| CPU (Ryzen 7 7800X3D or similar) | $300-$400 |
| Motherboard (B650/X670, PCIe 5.0) | $200-$350 |
| RAM (32GB DDR5-5600) | $80-$120 |
| PSU (1,200W 80+ Gold) | $180-$250 |
| Case (mesh front, GPU clearance) | $100-$150 |
| NVMe SSD (1TB) | $80-$100 |
| Total | $3,140-$3,870 |
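The total row is just the column sums of the component ranges, which is easy to verify:

```python
# Column sums of the build-cost table's (low, high) component ranges.
parts = {
    "RTX 5090": (2200, 2500), "CPU": (300, 400), "Motherboard": (200, 350),
    "RAM": (80, 120), "PSU": (180, 250), "Case": (100, 150),
    "NVMe SSD": (80, 100),
}
lo = sum(p[0] for p in parts.values())
hi = sum(p[1] for p in parts.values())
print(f"${lo:,}-${hi:,}")  # -> $3,140-$3,870
```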
Compare that to the M4 Max MacBook Pro at $4,399 for the 128GB configuration. The RTX 5090 build is cheaper and significantly faster on models that fit in 32GB. The M4 Max costs more but handles models up to ~200B parameters without any offloading. Different tools for different jobs.
Strengths
- 32GB GDDR7 - the most VRAM on any consumer GPU, crossing critical model-size thresholds
- 1,792 GB/s memory bandwidth - 78% faster than the RTX 4090, directly benefits memory-bound inference
- ~400 TFLOPS FP8 estimated - fastest consumer GPU for AI compute by a wide margin
- Native FP4 Tensor Core support for future-proofing as ultra-low precision inference matures
- Full NVIDIA CUDA ecosystem support - every inference framework works out of the box
- 512-bit memory bus enables consistent throughput at high batch sizes
- NVENC triple encoder for video workloads if you also use the card for content creation
Weaknesses
- 575W TDP draws significantly more power than any competitor in this comparison
- $1,999 MSRP and $2,200+ street prices make it the most expensive consumer GPU option
- Requires a high-end 1,000W PSU and robust cooling - not a drop-in upgrade for most systems
- Partner cards are physically massive 3-3.5-slot designs that may not fit in compact cases, and even the 2-slot Founders Edition dumps its heat into the case
- Still only 32GB - models above ~55B parameters still require quantization or CPU offloading
- Tokens-per-watt is worse than the RTX 4090 and significantly worse than Apple Silicon
- Availability has been inconsistent since launch with persistent above-MSRP pricing
Who Should Buy This
The RTX 5090 is the right choice if you meet all three of these criteria: you need the fastest possible single-GPU inference speed, you regularly work with models in the 25-55B parameter range that do not fit on 24GB cards, and you have the budget and infrastructure (PSU, cooling, case) to support a 575W GPU.
If your workloads are mostly 7B-14B models, the RTX 4090 is the smarter buy. Same CUDA ecosystem, 85% of the speed, $400-$700 cheaper, and 125W less power draw.
If you need to run 70B+ models without multi-GPU complexity, the Apple M4 Max with 128GB is the better tool for the job - slower per-token, but it can hold the entire model in memory without quantization hacks.
If budget is your primary constraint, two used RTX 3090s give you 48GB of total VRAM for less than the cost of a single RTX 5090, with the added benefit of NVLink support that the 5090 does not have.
VRAM Usage Guide - What Fits in 32GB
The RTX 5090's 32GB VRAM opens up models that are impossible on 24GB cards. Here is a comprehensive compatibility table.
| Model | Q3_K_M | Q4_K_M | Q5_K_M | Q8_0 | FP16 | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | 3.8GB | 5.0GB | 5.7GB | 8.5GB | 16GB | Fits easily at any level |
| Mistral 7B | 3.4GB | 4.4GB | 5.1GB | 7.7GB | 14.5GB | Fits at all levels |
| Gemma 2 9B | 4.5GB | 5.8GB | 6.6GB | 9.8GB | 18GB | Fits at all levels |
| Phi-3 Medium 14B | 6.5GB | 8.5GB | 9.6GB | 14.3GB | 28GB | FP16 fits on 32GB |
| Qwen2.5 14B | 6.6GB | 8.6GB | 9.7GB | 14.5GB | 28GB | FP16 tight but works |
| Code Llama 34B | 15GB | 20GB | 23GB | N/A | N/A | Up to Q5_K_M |
| Mixtral 8x7B | 20GB | 26GB | 30GB | N/A | N/A | Q4_K_M fits - impossible on 24GB |
| Llama 3.1 70B | 30GB | 40GB | N/A | N/A | N/A | Q3_K_M only - tight fit |
| Qwen2.5 72B | 31GB | 41GB | N/A | N/A | N/A | Q3_K_M at the absolute limit |
Memory figures include model weights only. KV cache adds 1-4GB depending on context length. Leave at least 1-2GB of headroom for stable operation.
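Putting the footnote into a formula: a model fits when weights plus KV cache plus safety headroom stay under total VRAM. A minimal fit check, with the 2 GB KV and 1.5 GB headroom defaults being illustrative assumptions rather than hard rules:

```python
# Quick fit check against a 32 GB card: weights + KV cache + headroom
# must stay under total VRAM. Weight figures come from the table above.

def fits(weights_gb: float, kv_gb: float = 2.0, headroom_gb: float = 1.5,
         vram_gb: float = 32.0) -> bool:
    return weights_gb + kv_gb + headroom_gb <= vram_gb

print(fits(26.0))             # Mixtral 8x7B Q4_K_M: fits with room to spare
print(fits(30.0))             # Llama 3.1 70B Q3_K_M: fails at a 2 GB KV budget
print(fits(30.0, kv_gb=0.4))  # ...but squeaks in with a very short context
```

This is exactly why the 70B rows below are flagged as tight fits: the weights alone leave almost nothing for cache.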
The sweet spot is the middle of the table - models that fit at Q4_K_M or Q5_K_M on 32GB but do not fit on 24GB cards. Mixtral 8x7B at Q4_K_M (26GB) is the cleanest example: it simply cannot run on an RTX 4090, but it loads comfortably on the RTX 5090 with 6GB to spare.
The 70B models at Q3_K_M are technically possible but not recommended for daily use. At 30-31GB for weights, you have almost no room for KV cache, which limits context length severely. For 70B models, the Apple M4 Max 128GB is the correct tool.
RTX 5090 vs RTX 4090 - Decision Matrix
Since this is the most common comparison, here is a side-by-side decision matrix.
| Factor | RTX 5090 Wins | RTX 4090 Wins | Notes |
|---|---|---|---|
| VRAM | 32GB vs 24GB | - | 5090 runs Mixtral 8x7B, larger models |
| Bandwidth | 1,792 vs 1,008 GB/s | - | 78% faster, directly impacts tok/s |
| Throughput (7B) | 15-20% faster | - | Diminishing returns at this model size |
| Price | - | $400-$700 cheaper | Street prices, both above MSRP |
| Power draw | - | 125W less TDP | ~$40-$60/year electricity savings |
| PSU requirement | - | 850W vs 1,000W | Cheaper, more common PSUs |
| Ecosystem maturity | - | 3 years vs 14 months | More community knowledge for 4090 |
| Used market | - | $1,400-$1,600 used | No used 5090 market yet |
| Future-proofing | FP4, PCIe 5.0 | - | Value depends on ecosystem adoption |
The bottom line: If you are buying new today and your budget allows it, the RTX 5090 is the better long-term investment - the 32GB VRAM ceiling will age better as models grow. If you are budget-conscious or your workloads fit in 24GB, the RTX 4090 (especially used) delivers 85% of the experience at 60-75% of the cost.
Related Coverage
- NVIDIA RTX 4090 - The previous generation and still the most common home lab GPU
- NVIDIA RTX 3090 - The budget value king for 24GB local inference
- Apple M4 Max - Unified memory alternative for running very large models
- Cambricon MLU590 - Chinese inference ASIC with 192GB HBM2e
Frequently Asked Questions
Is the RTX 5090 worth it over the RTX 4090 for AI inference?
Only if you need the extra VRAM. For models under 38B parameters, the RTX 4090 delivers 85% of the performance at 60-75% of the cost. The RTX 5090's value proposition is specific: models in the 25-55B range that do not fit on 24GB cards. If you do not regularly run those models, the 4090 is the better buy.
Can the RTX 5090 run Llama 3.1 70B?
Technically yes, at Q3_K_M quantization (30GB). Practically, it is a tight fit with minimal room for KV cache. You will be limited to short context lengths (under 2,048 tokens) and the Q3_K_M quality is noticeably worse than Q4_K_M. For comfortable 70B inference, you need either a dual RTX 3090 setup (48GB) or an Apple M4 Max (128GB).
What PSU do I need?
A quality 1,000W PSU minimum, 1,200W recommended. Look for units with native 12V-2x6 connectors. Seasonic PRIME TX-1000, Corsair HX1200, and be quiet! Dark Power Pro 12 1200W are well-tested options. Do not use adapters from older 8-pin connectors.
Does the RTX 5090 support NVLink for multi-GPU setups?
No. NVIDIA removed consumer NVLink with the RTX 40 series (Ada Lovelace) and it has not returned with the RTX 50 series (Blackwell). Multi-GPU communication on the RTX 5090 goes through PCIe 5.0 x16, which provides roughly 64 GB/s in each direction. This is adequate for basic tensor parallelism but significantly slower than NVLink. If you want NVLink multi-GPU, the RTX 3090 is the last consumer card that supports it.
How does it compare to the RTX 5080 for AI workloads?
The RTX 5080 has 16GB GDDR7 at 960 GB/s - half the VRAM and roughly half the bandwidth of the RTX 5090. For AI inference, 16GB is a severe limitation. You can run 7B models comfortably and 14B models with Q4 quantization, but anything larger does not fit. At ~$999 MSRP, the RTX 5080 is not a recommended AI card unless your workloads are exclusively small models.
Recommended Inference Software Stack
The RTX 5090 works with the same software as the RTX 4090, but the extra VRAM opens up additional options.
Personal use with large models: Ollama. The same simple experience as the 4090, but now you can ollama pull mixtral (8x7B) and run it natively. Models that were impossible on 24GB cards work out of the box on the 5090.
Production serving: vLLM with PagedAttention. The extra 8GB of VRAM translates to more KV cache space, meaning longer contexts and more concurrent users. Expect 5-7 concurrent users on Llama 3.1 8B versus 4-6 on the 4090.
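The concurrency gain from the extra 8GB can be bounded from the KV budget alone. A sketch assuming Llama 3.1 8B served in FP16 (~16 GB weights) with an 8K context per user at the ~128 KiB/token KV rate typical for this architecture; actual concurrency usually sits below this ceiling because compute becomes the limit first:

```python
# Ceiling on concurrent sequences from KV-cache memory alone.
# Assumes FP16 weights and a fixed per-sequence context budget.

def max_sequences(vram_gb: float, weights_gb: float,
                  kv_per_seq_gb: float, reserve_gb: float = 2.0) -> int:
    """How many full-context sequences fit in the post-weights VRAM."""
    return int((vram_gb - weights_gb - reserve_gb) / kv_per_seq_gb)

# 8,192 tokens x ~128 KiB/token (Llama 3.1 8B FP16 KV) = ~1.0 GiB/sequence
print(max_sequences(32, 16, 1.0))  # RTX 5090
print(max_sequences(24, 16, 1.0))  # RTX 4090
```

The memory ceiling alone is far above the 5-7 users quoted above, which illustrates why vLLM serving on a single card is throughput-limited, not KV-limited, at these model sizes.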
Maximum throughput: TensorRT-LLM for the highest throughput on supported models, or llama.cpp for broader model compatibility. The RTX 5090's FP4 Tensor Cores are exposed through TensorRT-LLM for experimental FP4 inference.
Image and video generation: The triple NVENC encoders and 32GB VRAM make the RTX 5090 excellent for video workflows. Flux Pro models run at full quality, and video generation models that require 28-30GB of VRAM are possible on the 5090 but not on 24GB cards.
Long-Term Outlook
The RTX 5090 is well-positioned for the next 2-3 years of local AI inference development. Here is what works in its favor and what might erode its advantage.
32GB becomes the new standard. As model developers increasingly target the RTX 5090's 32GB as a deployment floor (in addition to the established 24GB tier), you will see more models and quantization presets optimized specifically for 32GB. This is already happening - several model creators now publish "32GB-optimized" GGUF variants alongside their 24GB versions.
FP4 ecosystem maturation. The RTX 5090's native FP4 Tensor Core support is underutilized today because framework and model support is limited. Over the next 12-18 months, expect FP4-aware quantization tools and FP4-trained models to proliferate. When this happens, the RTX 5090 will effectively double its useful compute throughput versus FP8 on supported models - a free performance upgrade through software.
The RTX 6090 question. NVIDIA's next consumer flagship will likely arrive in early 2027. If it jumps to 48GB GDDR7 (plausible given NVIDIA's cadence of VRAM increases), the RTX 5090's 32GB ceiling will become its primary limitation. However, the RTX 5090 will remain a strong used-market card for years after the 6090 launches, similar to how the RTX 3090 remains popular three generations later.
Power costs matter more over time. As electricity prices trend upward in many markets, the RTX 5090's 575W TDP becomes a more significant cost factor over a 3-4 year ownership period. The Apple M4 Max at 60-80W system draw will save $300-$600 in electricity over three years at current US rates, partially offsetting its higher purchase price.
