Update fine-tuning tips. #572

143 changes: 99 additions & 44 deletions ai-quick-actions/fine-tuning-tips.md

Table of Contents:

* [Introduction](#introduction)
* [Methods](#methods)
* [Datasets](#datasets)
* [Completion Model](#completion-model)
* [Chat Model](#chat-model)
* [Vision Model](#vision-model)
* [Unsupervised Training](#unsupervised-training)
* [Tokenized Data](#tokenized-data)
* [Fine-Tune a Model](#fine-tune-a-model)
* [Advanced Options](#advanced-options)

See also:

- [Home](README.md)
- [Policies](policies/README.md)
- [CLI](cli-tips.md)

![AQUA](web_assets/model-explorer.png)

## Methods

The primary method used by AI Quick Actions for fine-tuning is Low-Rank Adaptation ([LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora)). LoRA stands out as a parameter-efficient fine-tuning method that allows for the adaptation of pre-trained models to specific tasks without the need to retrain the entire model. This technique is particularly beneficial for those who wish to leverage the power of LLMs while operating within the constraints of limited computational resources.


All linear modules in the model are used as the `target_modules` for LoRA fine-tuning.
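
For reference, the snippet below is a minimal sketch of an equivalent LoRA setup using the Hugging Face `peft` library. It is not the exact code AI Quick Actions runs, and the parameter values and model name are illustrative.

```
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# A sketch with illustrative values, not the AI Quick Actions internals.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=32,                         # rank of the low-rank update matrices
    lora_alpha=16,                # scaling applied to the LoRA update
    lora_dropout=0.05,            # dropout applied to the LoRA layers
    target_modules="all-linear",  # adapt all linear modules
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```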

## Datasets

The success of fine-tuning LLMs heavily relies on the quality and diversity of the training dataset. Preparing a dataset for fine-tuning involves several critical steps to ensure the model can effectively learn and adapt to the specific domain or task at hand. The process begins with collecting or creating a dataset that is representative of the domain or task, ensuring it covers the necessary variations and nuances. Once the dataset is assembled, it must be preprocessed, which includes cleaning the data by removing irrelevant information, normalizing text, and possibly anonymizing sensitive information to adhere to privacy standards.

Fine-tuning with AI Quick Actions requires a dataset in JSONL format. Each row in the JSONL file must be valid JSON, and all rows in the file must have the same JSON format.

Many models are available both as a completion model (e.g., `meta-llama/Llama-3.1-8B` and `mistralai/Mistral-7B-v0.3`) and as a chat model (e.g., `meta-llama/Llama-3.1-8B-Instruct` and `mistralai/Mistral-7B-Instruct-v0.3`). Completion models are base pretrained models that take a given prompt or context and generate coherent, contextually relevant text continuations. Chat models are fine-tuned from the completion model using a [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) and are designed to engage in conversations.

### Completion Model

For completion models, such as `meta-llama/Llama-3.1-8B` and `mistralai/Mistral-7B-v0.3`, the training data needs to be formatted as prompt (the input or instruction) and completion (the desired output) pairs.

**Prompt/Completion format**:
```
{"prompt": "Where's the headquarter of Oracle?", "completion": "Austin, TX"}
{"prompt": "Who's Shakespeare?", "completion": "William Shakespeare was ..."}
{"prompt": "How far is New York from Boston?", "completion": "215 miles via I-95N"}
```

The `prompt` is the input to the LLM and the `completion` is the expected output. Depending on your task, you may want to format the `prompt` with a specific template.
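
As an illustration, a dataset in this format can be written out with a few lines of Python; the records and the file name below are only examples.

```
import json

records = [
    {"prompt": "Where's the headquarter of Oracle?", "completion": "Austin, TX"},
    {"prompt": "Who's Shakespeare?", "completion": "William Shakespeare was ..."},
]

# Write one JSON object per line (JSONL); every row shares the same schema.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```
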
### Chat Model

Chat models, such as `meta-llama/Llama-3.1-8B-Instruct` and `mistralai/Mistral-7B-Instruct-v0.3`, are instruction-tuned using a `chat_template`. The conversational format is recommended for fine-tuning chat models, since it lets you specify multiple messages with different roles.

**Conversational format**:
```
{"conversations": [{"role": "system", "content": "You are helpful assistant."}, {"role": "user", "content": "Where's the headquarter of Oracle?"}, {"role": "assistant", "content": "Austin, TX"}]}
{"conversations": [{"role": "system", "content": "You are helpful assistant."}, {"role": "user", "content": "Who's Shakespeare?"}, {"role": "assistant", "content": "William Shakespeare was ..."}]}
{"conversations": [{"role": "system", "content": "You are helpful assistant."}, {"role": "user", "content": "How far is New York from Boston?"}, {"role": "assistant", "content": "215 miles via I-90"}]}
```

Training data in the prompt/completion format will be converted to the conversational format automatically. For example,
```
{"prompt": "Where's the headquarter of Oracle?", "completion": "Austin, TX"}
```

will be converted to

```
{"messages": [{"role": "system", "content": "You are helpful assistant."}, {"role": "user", "content": "Where's the headquarter of Oracle?"}, {"role": "assistant", "content": "Austin, TX"}]}
{"messages": [{"role": "system", "content": "You are helpful assistant."}, {"role": "user", "content": "Who's Shakespeare?"}, {"role": "assistant", "content": "William Shakespeare was ..."}]}
{"messages": [{"role": "system", "content": "You are helpful assistant."}, {"role": "user", "content": "How far is New York from Boston?"}, {"role": "assistant", "content": "215 miles via I-95N"}]}
{"conversations": [{"role": "user", "content": "Where's the headquarter of Oracle?"}, {"role": "assistant", "content": "Austin, TX"}]}
```

Note that the conversational format cannot be used for fine-tuning a completion model when a `chat_template` is not available from the tokenizer.
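
To see what a chat model's `chat_template` actually produces for a conversation, you can render it with the tokenizer. The snippet below is a sketch; the model name is illustrative and the checkpoint may be gated.

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

conversation = [
    {"role": "user", "content": "Where's the headquarter of Oracle?"},
    {"role": "assistant", "content": "Austin, TX"},
]

# Render the conversation into the templated text used for fine-tuning.
text = tokenizer.apply_chat_template(conversation, tokenize=False)
print(text)
```
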
### Vision Model

For fine-tuning [Mllama](https://huggingface.co/docs/transformers/main/model_doc/mllama) or [Phi4 multimodal](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) models, you may specify one or more images along with the text in the training data.
The images should be stored in the same directory as the JSONL file.

For prompt/completion format, only one image can be specified for each record. The relative path of the image file should be specified with `file_name` in the dataset. For example:

**Prompt/Completion format with image file**:
```
{"prompt": "what is the brand of this camera?", "completion": "dakota", "file_name": "images/5566811_bc00d504a6_o.jpg"}
{"prompt": "what does the small white text spell?", "completion": "copenhagen", "file_name": "images/4920614800_0f04f8f0a4_o.jpg"}
{"prompt": "what kind of beer is this?", "completion": "ale", "file_name": "images/5721283932_bc1e954b5c_o.jpg"}
{"prompt": "what is the brand of this camera?", "completion": "dakota", "file_name": "images/image1.jpg"}
{"prompt": "what does the small white text spell?", "completion": "copenhagen", "file_name": "images/image2.jpg"}
{"prompt": "what kind of beer is this?", "completion": "ale", "file_name": "images/image3.jpg"}
```

We support multiple variants of the conversational format:

**Conversational format with role, content and single image**:

In this format, in addition to the conversations, you specify the image with the `file_name`.
```
{"conversations": [{"role": "user", "content": "what is the brand of this camera?"}, {"role": "assistant", "content": "dakota"}], "file_name": "images/5566811_bc00d504a6_o.jpg"}
{"conversations": [{"role": "user", "content": "what does the small white text spell?"}, {"role": "assistant", "content": "copenhagen"}], "file_name": "images/4920614800_0f04f8f0a4_o.jpg"}
{"conversations": [{"role": "user", "content": "what kind of beer is this?"}, {"role": "assistant", "content": "ale"}], "file_name": "images/5721283932_bc1e954b5c_o.jpg"}
{"conversations": [{"role": "user", "content": "what is the brand of this camera?"}, {"role": "assistant", "content": "dakota"}], "file_name": "images/image1.jpg"}
{"conversations": [{"role": "user", "content": "what does the small white text spell?"}, {"role": "assistant", "content": "copenhagen"}], "file_name": "images/image2.jpg"}
{"conversations": [{"role": "user", "content": "what kind of beer is this?"}, {"role": "assistant", "content": "ale"}], "file_name": "images/image3.jpg"}
```

**Conversational format with user/assistant messages and single image**:

In this format, the conversation is specified as a pair of messages from user and assistant. The image is specified with the `file_name`.
```
{"conversations": [{"user": "what is the brand of this camera?", "assistant": "dakota"}], "file_name": "images/5566811_bc00d504a6_o.jpg"}
{"conversations": [{"user": "what does the small white text spell?", "assistant": "copenhagen"}], "file_name": "images/4920614800_0f04f8f0a4_o.jpg"}
{"conversations": [{"user": "what kind of beer is this?", "assistant": "ale"}], "file_name": "images/5721283932_bc1e954b5c_o.jpg"}
{"conversations": [{"user": "what is the brand of this camera?", "assistant": "dakota"}], "file_name": "images/image1.jpg"}
{"conversations": [{"user": "what does the small white text spell?", "assistant": "copenhagen"}], "file_name": "images/image2.jpg"}
{"conversations": [{"user": "what kind of beer is this?", "assistant": "ale"}], "file_name": "images/image3.jpg"}
```

**Conversational format with one or more images**:

In this format, the message `content` is specified as a list.
```
{"input_ids":[1,733,16289,28793,995,622,347,2078,264,7526,302,...]}
{"input_ids":[1,733,16289,28793,995,460,396,16107,13892,28723,...]}
{"messages": [{"role": "user", "content": [{"type": "image", "image": "images/image1a.jpg"}, {"type": "image", "image": "images/image1b.jpg"}, {"type": "text", "text": "what is the brand of this camera?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "dakota"}]}]}
{"messages": [{"role": "user", "content": [{"type": "image", "image": "images/image2a.jpg"}, {"type": "image", "image": "images/image2b.jpg"}, {"type": "text", "text": "what does the small white text spell?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "copenhagen"}]}]}
{"messages": [{"role": "user", "content": [{"type": "image", "image": "images/image3.jpg"}, {"type": "text", "text": "what kind of beer is this?"}]}, {"role": "assistant", "content": [{"type": "text", "text": "ale"}]}]}
```
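
Because the images must live alongside the JSONL file, a quick sanity check that every referenced image resolves can save a failed fine-tuning run. The sketch below only covers the `file_name`-based variants, and the dataset path is illustrative.

```
import json
from pathlib import Path

dataset = Path("vision-train.jsonl")  # illustrative path; images sit next to this file

missing = []
for line in dataset.read_text(encoding="utf-8").splitlines():
    record = json.loads(line)
    file_name = record.get("file_name")
    if file_name and not (dataset.parent / file_name).exists():
        missing.append(file_name)

print(f"{len(missing)} missing image(s): {missing}")
```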

### Unsupervised Training

For unsupervised fine-tuning or pre-training, you can specify the training data as a JSONL file with a `text` field, or as a plain text file.
```
{"text": "Oracle Corporation is an American multinational computer technology company headquartered in Austin, Texas."}
{"text": "Apple Inc. is an American multinational technology company headquartered in Cupertino, California, in Silicon Valley."}
```
The JSONL format gives you more control over how the text data is split into samples (see the sketch after the list below).
When validation data is not specified, a portion (as specified by `val_set_size`) of the data will be used for validation/evaluation:
* If the training data is in JSONL format, the samples are shuffled and a portion of them is used for validation/evaluation.
* If the training data is plain text, the last portion of the text is used for validation/evaluation.
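
As an example of that control, the sketch below turns a plain text corpus into one JSONL sample per paragraph; the file names are illustrative.

```
import json
from pathlib import Path

text = Path("corpus.txt").read_text(encoding="utf-8")
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

# One JSONL record per paragraph, so sample boundaries are under your control.
with open("pretrain.jsonl", "w", encoding="utf-8") as f:
    for paragraph in paragraphs:
        f.write(json.dumps({"text": paragraph}) + "\n")
```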

Note that when training with text data, the chat template will not be applied.

### Tokenized Data

For more advanced use cases, you can format and tokenize the data with your own code and use tokenized data as training data:
```
{"input_ids":[1,733,16289,28793,995,622,347,2078,264,7526,302,...]}
{"input_ids":[1,733,16289,28793,995,460,396,16107,13892,28723,...]}
```
The training code will not do any additional formatting or processing for tokenized data.
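
A sketch of producing such records with your own formatting and tokenization is shown below; the model name and prompt template are illustrative, not a requirement of the service.

```
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

pairs = [
    ("Where's the headquarter of Oracle?", "Austin, TX"),
    ("Who's Shakespeare?", "William Shakespeare was ..."),
]

# Apply your own template and tokenize; the input_ids are used as-is.
with open("tokenized.jsonl", "w", encoding="utf-8") as f:
    for prompt, completion in pairs:
        text = f"[INST] {prompt} [/INST] {completion}"  # illustrative template
        input_ids = tokenizer(text)["input_ids"]
        f.write(json.dumps({"input_ids": input_ids}) + "\n")
```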

## Fine-Tune a Model

By clicking on one of the "Ready to Fine Tune" models, you will see more details of the model. You can initiate a fine-tuning job to create a fine-tuned model by clicking on the "Fine Tune" button.


Distributed training leverages multiple GPUs to parallelize the fine-tuning. Recent advancements in distributed training frameworks have made it possible to train models with billions of parameters. Frameworks like [DeepSpeed](https://www.deepspeed.ai/) and [FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) have been developed to optimize distributed training. AI Quick Actions will configure distributed training automatically when multiple GPUs are available. It is important to note that communication between multiple nodes incurs significant overhead compared to communication between multiple GPUs within a single node. Therefore, it is highly recommended to use a single replica when possible. Multi-node fine-tuning may not perform better than single-node fine-tuning when the number of replicas is less than 5.


### Training Metrics

Once the fine-tuning job is successfully submitted, a fine-tuned model will be created in the model catalog. The model details page will be displayed and the lifecycle state will be "In progress" as the job is running. At the end of each epoch, the loss and accuracy will be calculated and updated in the metrics panel.

![FineTuneMetrics](web_assets/fine-tune-metrics.png)

The accuracy metric reflects the proportion of correct completions made by the model on a given dataset. A higher accuracy indicates that the model is performing well in terms of making correct completions. On the other hand, the loss metric represents the model's error. It quantifies how far the model's completions are from the actual target completions. The goal during training is to minimize this loss function, which typically involves optimizing the model's weights to reduce the error on the training data.

As the training progresses, monitoring both accuracy and loss provides insights into the model's learning dynamics. A decreasing loss alongside increasing accuracy suggests that the model is learning effectively. However, it's important to watch for signs of over-fitting, where the model performs exceptionally well on the training data but fails to generalize to new, unseen data. This can be detected if the validation loss stops decreasing or starts increasing, even as training loss continues to decline.
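
As a toy illustration of that pattern (the loss values are made up), the epoch where validation loss bottoms out while training loss keeps falling is where over-fitting likely begins:

```
# Illustrative loss curves: training loss keeps falling, validation loss turns up.
train_loss = [2.10, 1.60, 1.25, 1.02, 0.85]
val_loss = [2.15, 1.72, 1.50, 1.48, 1.55]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
if best_epoch < len(val_loss) - 1:
    print(f"Validation loss bottomed out at epoch {best_epoch + 1}; "
          f"later epochs are likely over-fitting.")
```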

## Advanced Options

### Fine-tuning Parameters

The service allows overriding default hyperparameters when creating a fine-tuned model. The basic configuration shows `epochs` and `learning rate`,
whereas the advanced configuration includes additional parameters that can be modified and set prior to creation of the fine-tuning job.

![Fine Tuning Parameters](web_assets/finetuning-params.png)


### Configuration Update Options

The available shapes for models in AI Quick Actions are pre-configured for fine-tuning the models available in the Fine-Tuned model tab.
However, if you need to add more shapes to the list of
For example, the field `"finetuning_params": "--trust_remote_code True"` might be set if the model needs to execute the code that resides on the Hugging Face Hub rather than natively in the Transformers library.


See also:

- [Home](README.md)
- [Policies](policies/README.md)