The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Updated Jun 9, 2024 - Python
Official release of InternLM2 7B and 20B base and chat models. 200K context support
📖 A curated list of Awesome LLM Inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
FlashInfer: Kernel Library for LLM Serving
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Train LLMs (bloom, llama, baichuan2-7b, chatglm3-6b) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Python package for rematerialization-aware gradient checkpointing
Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.
A simple PyTorch implementation of Flash MultiHead Attention
Poplar implementation of FlashAttention for IPU