Commit 93df682

Adding Intro to LLMs
1 parent aceb5e2 commit 93df682

9 files changed: +162, -0 lines changed

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "b323bf88-cf1d-43b4-abb0-0b7c35bb5f42",
"metadata": {},
"source": [
"# **Introduction to LLMs**\n",
"\n",
"**Credits:** The following short notes on LLMs were created after reading the phenomenal **\"Quick Start Guide To LLMs\"** by **Sinan Ozdemir**.\n",
"\n",
"### **A little bit about Transformers:**\n",
"<img style=\"float: right;\" width=\"400\" height=\"400\" src=\"data/images/transformer.jpeg\">\n",
"\n",
"1. Sequence-to-sequence model.\n",
"2. Has two main components: an encoder and a decoder.\n",
"3. The **encoder** is tasked with taking in raw text, splitting it up into its core components, converting them into vectors, and using **self-attention** to understand the context of the text.\n",
"4. The Transformer's self-attention mechanism allows each word to \"attend to\" all other words in the sequence, which enables it to capture long-term dependencies and contextual relationships between words. The goal is to understand each word as it relates to the other tokens in the input text.\n",
"5. A **decoder** excels at generating text by using a modified type of attention (i.e. **cross-attention**) to predict the next best token.\n",
"6. Transformers are **trained** to solve a specific NLP task called **Language Modeling**.\n",
"7. **Limitation:** Transformers are still limited to an input context window (i.e. the maximum length of text they can process at any given moment).\n",
"\n",
"### **Attention**\n",
"1. It is a mechanism that assigns different weights to different parts of the input, allowing the model to prioritize and emphasize the most important information while performing tasks like translation or summarization.\n",
"2. Attention allows a model to focus on different parts of the input dynamically, leading to improved performance (a minimal sketch follows below).\n",
"\n",
"### **What is Language Modeling?**\n",
"1. Language Modeling involves the creation of statistical/deep learning models for predicting the likelihood of a sequence of tokens in a specified vocabulary.\n",
"2. The two types of language modeling tasks are: \n",
" a. Autoencoding Task \n",
" b. Autoregressive Task \n",
"3. **Autoregressive Language Models** are trained to predict the next token in a sentence, based on the previous tokens in the phrase. These models correspond to the **decoder** part of the transformer model. A mask is applied to the full sentence so that the attention heads can only see the tokens that came before. These models are ideal for text generation, e.g. **GPT**.\n",
"4. **Autoencoding Language Models** are trained to reconstruct the original sentence from a corrupted version of the input. These models correspond to the **encoder** part of the transformer model. The full input is passed and no mask is applied. Autoencoding models create a bidirectional representation of the whole sentence. They can be fine-tuned for a variety of tasks, but their main application is sentence classification or token classification, e.g. **BERT**.\n",
"5. **Combinations of autoregressive and autoencoding language models** are more versatile and flexible in generating text. It has been shown that combination models can generate more diverse and creative text in different contexts compared to pure decoder-based autoregressive models, due to their ability to capture additional context using the encoder, e.g. **T5**. Both task types are sketched below.\n",
"\n",
"### **LLMs are:**\n",
"1. Usually derived from the Transformer architecture (but not necessarily) by training on large amounts of text data.\n",
"2. Designed to understand and generate human language, code, and much more.\n",
"3. Highly parallelized and scalable.\n",
"4. Examples: BERT, GPT and T5.\n",
"5. Techniques like stop word removal, stemming, and truncation are neither used nor necessary for LLMs. LLMs are designed to handle the inherent complexity and variability of human language, including the use of stop words and variations in word forms like tenses and misspellings.\n",
"6. Every LLM on the market has been **pre-trained** on a large corpus of text data and on a specific language-modeling-related task.\n",
"7. **Remember:** How an LLM is **pre-trained** and **fine-tuned** makes all the difference.\n",
"\n",
"### **Pre-Training, Transfer Learning and Fine-Tuning**\n",
"<img style=\"float: right;\" width=\"400\" height=\"400\" src=\"data/images/transfer_learning.jpeg\">\n",
"\n",
"1. **Pre-training** of an LLM happens on a large corpus of text data and on a specific language-modeling-related task. During this phase, the LLM tries to learn and understand general language and the relationships between words.\n",
"2. **Transfer Learning** is a technique used in machine learning to leverage the knowledge gained from one task to improve performance on another related task. A pre-trained model has already learned a lot of information about the language and the relationships between words, and this information can be used as a starting point to improve performance on a new task. \n",
" **a.** Transfer learning for LLMs involves taking an LLM that has been pre-trained on one corpus of text data and then fine-tuning it for a specific downstream task, such as text classification or text generation, by updating the model's parameters with task-specific data. \n",
" **b.** Transfer learning allows LLMs to be **fine-tuned** for specific tasks with much smaller amounts of task-specific data than would be required if the model were trained from scratch. This greatly reduces the amount of time and resources required to train LLMs. \n",
"<img style=\"float: right;\" width=\"400\" height=\"400\" src=\"data/images/fine_tuning_loop.jpeg\">\n",
"3. **Fine-tuning** involves training the LLM on a smaller, task-specific dataset to adjust its parameters for the specific task at hand. The basic fine-tuning loop is more or less the same: \n",
" **a.** Define the model you want to fine-tune as well as the fine-tuning parameters (e.g. learning rate). \n",
" **b.** Aggregate some training data. \n",
" **c.** Compute loss and gradients. \n",
" **d.** Update the model via backpropagation. \n",
"4. The Transformers package from Hugging Face provides a neat and clean interface for training and fine-tuning LLMs; the loop is sketched below.\n",
"\n",
"### **Alignment in LLMs**\n",
"1. Alignment in language models refers to how well the model can respond to input prompts in a way that matches the user's expectations. Put another way, an aligned LLM has an objective that matches a human's objective.\n",
"2. A popular method of aligning language models is to incorporate Reinforcement Learning into the training loop.\n",
"3. Reinforcement Learning from Human Feedback (RLHF) is a popular method of aligning pre-trained LLMs that uses human feedback to enhance their performance.\n",
"\n",
"### **Popular Modern LLMs**\n",
"\n",
"#### **1. BERT (Bidirectional Encoder Representations from Transformers)**\n",
"<img style=\"float: right;\" width=\"300\" height=\"300\" src=\"data/images/bert_oov.jpeg\">\n",
"\n",
"1. By Google - Autoencoding Language Model\n",
"2. Pretrained on: \n",
" **a.** English Wikipedia - at the time, 2.5 billion words \n",
" **b.** Book Corpus - 800 million words \n",
"3. BERT's tokenizer handles OOV (out-of-vocabulary, i.e. previously unknown) tokens by breaking them up into smaller chunks of known tokens (see the tokenizer sketch after this list).\n",
"4. Trained on two language-modeling-specific tasks: \n",
" **a.** **Masked Language Modeling (MLM), aka the Autoencoding Task** - helps BERT recognize token interactions within a sentence. \n",
" **b.** **Next Sentence Prediction (NSP) Task** - helps BERT understand how tokens interact with each other between sentences. \n",
"<img style=\"float: right;\" width=\"300\" height=\"300\" src=\"data/images/bert_language_model_task.jpeg\">\n",
"5. BERT uses three embedding layers for a given piece of text: token embeddings, segment embeddings and position embeddings.\n",
"6. BERT uses the encoder of the Transformer and ignores the decoder to become exceedingly good at processing/understanding massive amounts of text very quickly relative to other, slower LLMs that focus on generating text one token at a time.\n",
"7. BERT itself doesn't classify text or summarize documents, but it is often used as a pre-trained model for downstream NLP tasks. \n",
"<img style=\"float: right;\" width=\"300\" height=\"300\" src=\"data/images/bert_classification.jpeg\">\n",
"8. A year later, RoBERTa by Facebook AI was shown not to require the NSP task. It matched and even beat the original BERT model's performance in many areas.\n",
"\n",
"\n",
"#### **2. GPT (Generative Pre-Trained Transformer)**\n",
"\n",
"1. By OpenAI - Autoregressive Language Model\n",
"2. Pretrained on: proprietary data (data for which the rights of ownership are restricted, so that the ability to freely distribute the data is limited).\n",
"3. An autoregressive language model that uses attention to predict the next token in a sequence based on the previous tokens.\n",
"4. GPT relies on the decoder portion of the Transformer and ignores the encoder to become exceptionally good at generating text one token at a time; next-token prediction is sketched below.\n",
"\n",
"#### **3. T5 (Text-to-Text Transfer Transformer)**\n",
"<img style=\"float: right;\" width=\"400\" height=\"400\" src=\"data/images/t5.jpeg\">\n",
"\n",
"1. By Google - a combination of autoencoding and autoregressive language model.\n",
"2. T5 uses both the encoder and the decoder of the Transformer to become highly versatile in both processing and generating text.\n",
"3. T5-based models can perform a wide range of NLP tasks, from text classification to text generation (see the sketch below).\n",
"\n",
"#### **4. Domain Specific LLMs**\n",
"\n",
"1. BioGPT - Trained on large-scale biomedical literature (more than 2 million articles). Developed by Microsoft Research.\n",
"2. SciBERT\n",
"3. BlueBERT\n",
"\n",
"### **Applications:**\n",
"#### **1. Medical Domain**\n",
"1. Electronic Medical Record (EMR) Processing\n",
"2. Clinical Trial Matching\n",
"3. Drug Discovery\n",
"\n",
"#### **2. Finance**\n",
"1. Fraud Detection\n",
"2. Sentiment Analysis of Financial News\n",
"3. Trading Strategies\n",
"4. Customer Service Automation via Chatbots and Virtual Assistants\n",
"\n",
"#### **3. And many more**\n",
"1. Text Classification\n",
"2. Text Summarization\n",
"3. Chatbots\n",
"4. Information Retrieval\n",
"\n",
"### **Quick Summary**\n",
"1. What really sets the Transformer apart from other deep learning architectures is its ability to capture long-term dependencies and relationships between tokens using the attention mechanism.\n",
"2. Attention is the crucial component of the Transformer.\n",
"3. A key factor behind the Transformer's effectiveness as a language model is that it is highly parallelizable, allowing for faster training and efficient processing of text.\n",
"4. LLMs are pre-trained on a large corpus and fine-tuned on smaller datasets for specific tasks.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79ed93c8-4db0-4cef-bcaf-3e5a209aaf49",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}