vllm.config.attention ¶
AttentionConfig ¶
Configuration for attention mechanisms in vLLM.
Source code in vllm/config/attention.py
backend class-attribute instance-attribute ¶
backend: AttentionBackendEnum | None = None
Attention backend to use. If None, will be selected automatically.
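As a minimal sketch of forcing a particular backend rather than relying on automatic selection; the import path for `AttentionBackendEnum` and the `FLASH_ATTN` member name are assumptions inferred from the annotation above, not confirmed by this page:

```python
# Hypothetical sketch: pin the attention backend explicitly.
# Import locations and the FLASH_ATTN member name are assumptions.
from vllm.config.attention import AttentionConfig
from vllm.attention.backends.registry import AttentionBackendEnum  # path assumed

cfg = AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN)

# Leaving backend=None keeps the default automatic selection.
auto_cfg = AttentionConfig()
assert auto_cfg.backend is None
```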
disable_flashinfer_prefill class-attribute instance-attribute ¶
disable_flashinfer_prefill: bool = False
Whether to disable FlashInfer prefill.
disable_flashinfer_q_quantization class-attribute instance-attribute ¶
disable_flashinfer_q_quantization: bool = False
If set, do not quantize Q to FP8 when using an FP8 KV cache.
flash_attn_max_num_splits_for_cuda_graph class-attribute instance-attribute ¶
flash_attn_max_num_splits_for_cuda_graph: int = 32
Maximum number of Flash Attention splits for CUDA graph decode.
flash_attn_version class-attribute instance-attribute ¶
flash_attn_version: Literal[2, 3] | None = None
Force vLLM to use a specific FlashAttention version (2 or 3). Only valid when using the FlashAttention backend.
use_cudnn_prefill class-attribute instance-attribute ¶
use_cudnn_prefill: bool = False
Whether to use cuDNN prefill.
use_prefill_decode_attention class-attribute instance-attribute ¶
use_prefill_decode_attention: bool = False
Use separate prefill and decode attention kernels instead of the unified Triton kernel.
use_trtllm_attention class-attribute instance-attribute ¶
use_trtllm_attention: bool | None = None
If set to True or False, explicitly enable or disable the TRTLLM attention backend in FlashInfer. If None, auto-detect which attention backend to use in FlashInfer.
use_trtllm_ragged_deepseek_prefill class-attribute instance-attribute ¶
use_trtllm_ragged_deepseek_prefill: bool = False
Whether to use TRTLLM ragged DeepSeek prefill.
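As a rough usage sketch combining several of the fields documented above; the field names come from this page, but whether these particular values make sense together depends on the installed backends and hardware:

```python
from vllm.config.attention import AttentionConfig

# Sketch only: construct the config with a few of the documented knobs.
cfg = AttentionConfig(
    flash_attn_version=3,                         # force FlashAttention 3
    flash_attn_max_num_splits_for_cuda_graph=16,  # cap splits for CUDA graph decode
    use_cudnn_prefill=False,
    use_trtllm_attention=None,                    # auto-detect in FlashInfer
)
```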
__post_init__ ¶
Source code in vllm/config/attention.py
_set_from_env_if_set ¶
Set a field from its environment variable if that variable is set, emitting a deprecation warning.
Source code in vllm/config/attention.py
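The method body is not shown here; the following is only a hedged sketch of the pattern the description implies (read a legacy environment variable, copy it onto the field, and warn about the deprecation). The class, signature, boolean parsing, and environment-variable name are all illustrative assumptions, not the actual vLLM implementation.

```python
import os
import warnings
from dataclasses import dataclass


@dataclass
class _EnvFallbackSketch:
    # Illustrative stand-in, not the real AttentionConfig.
    use_cudnn_prefill: bool = False

    def _set_from_env_if_set(self, field_name: str, env_var: str) -> None:
        # Sketch of the described behavior: if the legacy env var is set,
        # copy it onto the field and warn that the env var is deprecated.
        value = os.environ.get(env_var)
        if value is None:
            return
        warnings.warn(
            f"{env_var} is deprecated; set AttentionConfig.{field_name} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: boolean fields are encoded as "0"/"1" in the env var.
        setattr(self, field_name, value not in ("0", "false", "False"))


cfg = _EnvFallbackSketch()
# The env var name below is hypothetical, used only for illustration.
cfg._set_from_env_if_set("use_cudnn_prefill", "VLLM_USE_CUDNN_PREFILL")
```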
compute_hash ¶
compute_hash() -> str
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
Source code in vllm/config/attention.py
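A hash like this is typically folded into a larger cache key (for example, for compiled graphs); a hedged sketch of how the returned string might be consumed, where the surrounding key construction is an illustrative assumption:

```python
import hashlib

from vllm.config.attention import AttentionConfig

cfg = AttentionConfig(flash_attn_version=3)

# compute_hash() returns a string identifying only the options that affect
# the computation graph; here it is folded into a broader cache key.
graph_key = hashlib.sha256(
    ("attention:" + cfg.compute_hash()).encode()
).hexdigest()
```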
validate_backend_before classmethod ¶
Enable parsing of the backend enum type from a string.
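A hedged sketch of what a pydantic-style "before" validator that parses strings into the enum generally looks like; the stand-in classes, member names, and accepted spellings are assumptions, and the real vLLM signature may differ:

```python
from enum import Enum

from pydantic import BaseModel, field_validator


class BackendEnumSketch(str, Enum):
    # Stand-in for AttentionBackendEnum; member names are assumptions.
    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"


class AttentionConfigSketch(BaseModel):
    # Illustrative stand-in for AttentionConfig, not the real class.
    backend: BackendEnumSketch | None = None

    @field_validator("backend", mode="before")
    @classmethod
    def validate_backend_before(cls, value):
        # Accept either an enum member or its string name, so that
        # backend="flash_attn" parses the same as the enum value.
        if isinstance(value, str):
            return BackendEnumSketch[value.upper()]
        return value


assert AttentionConfigSketch(backend="flash_attn").backend is BackendEnumSketch.FLASH_ATTN
```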