About the Llama 2 Model
Llama 2 is a collection of pretrained and fine-tuned large language models developed by Meta AI, with parameter sizes ranging from 7 billion to 70 billion. It represents a significant advance over its predecessor, Llama 1, featuring a 40% larger pretraining corpus, double the context length, and grouped-query attention (in the 70B variant; see Notes). The fine-tuned Llama 2-Chat variants are optimized for dialogue use cases and are competitive with some closed-source models on helpfulness and safety evaluations.
Overview
- Use Case: Commercial and research applications including assistant-like chat for tuned models and various natural language generation tasks for pretrained models
- Creator: Meta AI
- Architecture: Auto-regressive transformer; chat variants are trained with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). A prompt-format sketch for the chat variants follows this list.
- Parameters: 7B
- Release Date: 2023-07
- License: Custom commercial license (Meta Llama 2 Community License)
- Context Length: 4,096 tokens
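The chat variants expect inputs in the Llama 2-Chat prompt template described in Meta's llama repository: [INST]/[/INST] wraps each user turn, and <<SYS>>/<</SYS>> wraps the system prompt inside the first user message. Below is a minimal sketch of building such a prompt; the <s> and </s> markers stand in for the tokenizer's BOS/EOS special tokens, and the helper name is ours, not Meta's.

```python
# Minimal sketch of the Llama 2-Chat prompt template (per Meta's llama repo).
# <s> / </s> stand in for the tokenizer's BOS / EOS special tokens.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_chat_prompt(system, turns):
    """turns: list of (user, assistant) pairs; the final assistant entry
    may be None when the model's reply is still pending."""
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # The system prompt is folded into the first user message.
            user = B_SYS + system + E_SYS + user
        prompt += f"<s>{B_INST} {user} {E_INST}"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt

print(build_chat_prompt("You are a helpful assistant.",
                        [("Summarize Llama 2 in one sentence.", None)]))
```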
GPU Memory Requirements
Default (FP16) inference requires approximately 14 GB of GPU memory for the model weights alone (7 billion parameters × 2 bytes per parameter), before accounting for activations and the KV cache. A loading sketch follows the table below.
| Quantization | Memory (GB) | Notes |
|---|---|---|
| FP16 | 14 | - |
| INT8 | 7 | - |
| INT4 | 4 | Using GPTQ or bitsandbytes quantization |
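The table follows directly from bytes per parameter: roughly 7e9 × 2 B ≈ 14 GB at FP16, 7e9 × 1 B ≈ 7 GB at INT8, and 7e9 × 0.5 B ≈ 3.5 GB at INT4, plus quantization overhead. The sketch below loads the 7B checkpoint with Hugging Face transformers in FP16, with an optional 4-bit path via bitsandbytes; it assumes transformers, accelerate, and bitsandbytes are installed and that access to the gated meta-llama/Llama-2-7b-hf repo has been granted.

```python
# Sketch: loading Llama 2 7B with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # gated; requires license acceptance

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Default FP16 load: ~14 GB of weights (7e9 params x 2 bytes),
# plus activations and the KV cache on top.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Alternative: 4-bit NF4 quantization via bitsandbytes (~4 GB of weights).
# quant_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.float16,
# )
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID, quantization_config=quant_config, device_map="auto"
# )

inputs = tokenizer("Llama 2 is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```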
Training Data
Pretraining used 2 trillion tokens of data from publicly available sources (data cutoff: September 2022). Fine-tuning data includes publicly available instruction datasets and over 1 million new human-annotated examples.
Evaluation Benchmarks
- MMLU
- TriviaQA
- Natural Questions
- GSM8K
- HumanEval
- BIG-Bench Hard
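As a hedged sketch, a subset of these benchmarks can be scored locally with EleutherAI's lm-evaluation-harness (an assumption on our part; the paper reports numbers from Meta's own evaluation pipeline, and harness task names and the simple_evaluate signature may differ across versions):

```python
# Sketch: scoring Llama 2 7B on two of the benchmarks above with
# EleutherAI's lm-evaluation-harness (pip install lm-eval). Task names
# and this API reflect harness v0.4.x and may differ in other versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. accuracy
```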
References
- https://arxiv.org/abs/2307.09288
- https://ai.meta.com/resources/models-and-libraries/llama-downloads/
- https://github.com/facebookresearch/llama
Notes
- Parameter count listed is for 7B variant; 13B and 70B variants also available
- GPU memory requirements are for 7B model at FP16
- The 7B model's training used 184,320 GPU hours on A100-80GB hardware; total training across all Llama 2 models was approximately 3.3M GPU hours, with estimated total emissions of 539 tCO2eq
- Does not include Meta user data in training
- Grouped-query attention (GQA) is only used in the 70B variant; the 7B and 13B models use standard multi-head attention (MHA). A minimal sketch of the difference follows below.
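The sketch below contrasts MHA and GQA: in GQA, several query heads share one key/value head, which shrinks the KV cache. The head counts here are illustrative, not the actual Llama 2 configuration.

```python
# Sketch: grouped-query attention vs. multi-head attention in PyTorch.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    n_kv_heads == n_heads gives MHA; n_kv_heads < n_heads gives GQA."""
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_heads // n_kv_heads
    # Each KV head is shared by `group_size` consecutive query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

batch, seq, head_dim = 1, 8, 64
q = torch.randn(batch, 32, seq, head_dim)  # 32 query heads
k = torch.randn(batch, 8, seq, head_dim)   # 8 KV heads shared across queries
v = torch.randn(batch, 8, seq, head_dim)
out = attention(q, k, v)
print(out.shape)  # torch.Size([1, 32, 8, 64])
```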