Llama 3.2 model
Presentation Report: AI Model Training Project - BBO Developing
1. Project Introduction
The "BBO Developing" project focuses on developing a conversational AI system powered by state-of-the-art language models (e.g., LLaMA, GPT). The system is designed to understand user input and generate appropriate responses. The development process includes multiple stages: data collection, preprocessing, model training, storage, and evaluation.
2. Overview of the Pipeline Workflow
The system is structured as a modular pipeline composed of key classes, each responsible for a distinct stage of data processing or model management:
Build_Data – Data collection and preprocessing from real-world sources
Generate_Number_Data – Synthetic logic/mathematical data generation
ChatDataset – Custom dataset preparation for language model input
BBO_Training – Full-stack model training orchestration
3. In-Depth Class-by-Class Pipeline Analysis
Class 1: Build_Data – Real-World Text Data Collection and Preprocessing
import torch
from transformers import (
    T5ForConditionalGeneration, T5Tokenizer,
    AutoModelForQuestionAnswering, AutoTokenizer,
    BartForConditionalGeneration, BartTokenizer,
    pipeline,
)

class Build_Data:
    waiting_list = []
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Question-generation model
    model_que_name = "valhalla/t5-base-qg-hl"
    model_que = T5ForConditionalGeneration.from_pretrained(model_que_name).to(device)
    tokenizer_que = T5Tokenizer.from_pretrained(model_que_name)

    # Question-answering model
    model_ans_name = "deepset/roberta-large-squad2"
    model_ans = AutoModelForQuestionAnswering.from_pretrained(model_ans_name).to(device)
    tokenizer_ans = AutoTokenizer.from_pretrained(model_ans_name)

    # Paraphrasing model
    model_para_name = "facebook/bart-large-cnn"
    model_para = BartForConditionalGeneration.from_pretrained(model_para_name).to(device)
    tokenizer_para = BartTokenizer.from_pretrained(model_para_name)

    def __init__(self, Dic_contexts, epoch=1):  # Increase the token range to obtain longer sentences
        self.epoch = epoch
        self.contexts = list(Dic_contexts.values())[0]
        self.data = []
        self.Context_To_Data()

    def Answering(self, context, question):
        nlp = pipeline('question-answering', model=self.model_ans, tokenizer=self.tokenizer_ans)
        context_chunks = self.chunk_text(context, self.tokenizer_ans, chunk_size=64)
        answers = []
        for chunk in context_chunks:
            answer = nlp({'context': chunk, 'question': question})
            answers.append(answer['answer'])
        # Join the per-chunk answers, skipping the first chunk's answer
        final_answer = ' '.join(answers[1:])
        return self.paraphasing(final_answer)

    def chunk_text(self, text, tokenizer, chunk_size=512):
        tokens = tokenizer.encode(text)
        chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        return [tokenizer.decode(chunk) for chunk in chunks]

    def paraphasing(self, answer):
        nlp = pipeline('text2text-generation', model=self.model_para, tokenizer=self.tokenizer_para)
        # Generate a more coherent version of the answer
        output = nlp(f"paraphrase: {answer}")
        # Return the improved sentence
        return output[0]['generated_text']

    def Context_To_Data(self):
        for i in range(self.epoch):
            for index, context in enumerate(self.contexts):
                print(f"\rData generation progress: {index+1}/{len(self.contexts)}", end="", flush=True)
                question = self.Questioning(context)
                answer = self.Answering(context, question)
                self.data.append({"context": context, "question": question, "answer": answer})

    def Questioning(self, context):
        input_text = f"generate question: {context}"
        input_ids = self.tokenizer_que.encode(input_text, return_tensors="pt").to(self.device)
        outputs = self.model_que.generate(
            input_ids,
            max_length=300,  # Increase the question length
            do_sample=True,
            top_k=40,
            top_p=0.95,
            temperature=0.6,
            repetition_penalty=1.6,
            num_return_sequences=1
        )
        question = self.tokenizer_que.decode(outputs[0], skip_special_tokens=True)
        return question

    def cal_token(self, text):
        # Rough word count over an iterable of sentences
        return sum(len(sentence.split()) for sentence in text)
1.1. Purpose and Role:
Acts as a general-purpose preprocessor for all downstream training processes
Converts real-world documents into structured prompt-response pairs
1.2. Internal Pipeline:
Load data from multiple formats: .docx, .txt, and browser-scraped HTML (using selenium)
Clean text by removing special characters and enforcing UTF-8 encoding
Segment paragraphs or sentences into prompt-response pairs
Store results in an internal queue (waiting_list) or save to intermediate files
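As a rough illustration of this loading-and-cleaning stage, the sketch below reads .txt and .docx files and normalizes the text before handing it to Build_Data. The folder name, the clean_text and load_contexts helpers, and the use of python-docx are illustrative assumptions, not part of the project code.
import re
from pathlib import Path
from docx import Document  # python-docx, assumed to be installed

def clean_text(raw):
    # Enforce UTF-8, strip special characters, and collapse whitespace
    text = raw.encode("utf-8", errors="ignore").decode("utf-8")
    text = re.sub(r"[^\w\s.,?!:;'\"-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def load_contexts(folder="raw_documents"):
    # Hypothetical loader: gathers paragraphs from .txt and .docx files
    contexts = []
    for path in Path(folder).glob("*"):
        if path.suffix == ".txt":
            contexts.extend(path.read_text(encoding="utf-8").split("\n\n"))
        elif path.suffix == ".docx":
            contexts.extend(p.text for p in Document(path).paragraphs if p.text.strip())
    return {"contexts": [clean_text(c) for c in contexts if c.strip()]}

# The resulting dict can be passed straight to Build_Data, e.g.:
# data = Build_Data(load_contexts()).data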
1.3. Output:
list_prompt: List of input prompts (e.g., questions or commands)
list_response: Corresponding model-expected responses
1.4. Advanced Processing Features:
Filtering based on keyword presence or sentence length
Custom splitting of dialogues using delimiters like "Q:" and "A:" (see the sketch after this list)
Can incorporate additional sources such as CSV and Excel with sentence-level mapping
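A minimal sketch of the "Q:"/"A:" splitting mentioned above, assuming the dialogue is a plain string in which each question line starts with "Q:" and each answer line with "A:"; the function name and output format are illustrative, not taken from the project.
def split_dialogue(text):
    # Pair each "Q:" line with the "A:" line that follows it
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append({"prompt": question, "response": line[2:].strip()})
            question = None
    return pairs

# split_dialogue("Q: What is BBO?\nA: A conversational AI project.")
# -> [{"prompt": "What is BBO?", "response": "A conversational AI project."}]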
Class 2: Generate_Number_Data – Synthetic Logic & Math Data Generator
class Generate_Number_Data:
    def __init__(self, texts, end="", st=0, multi=1):
        self.start = st
        self.end = end
        if end == "":
            # Default to covering the full range of the provided data
            self.end = len(texts) - 1
        # Slice the source dictionary and reuse Build_Data to produce QA pairs
        self.data = Build_Data(self.extrac_dict(texts)).data

    def extrac_dict(self, dicts):
        result = {}
        for key in dicts.keys():
            result[key] = dicts[key][self.start:self.end]
        return result
2.1. Purpose:
Automates the creation of logical/math-focused training data such as arithmetic problems
2.2. Internal Workflow:
Randomly generate numeric values and operations (e.g., addition, subtraction, multiplication, division)
Convert operations into natural language questions (e.g., "What is 7 plus 5?")
Generate accurate answers (e.g., "7 + 5 = 12")
Format them into structured prompt-response pairs for training
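To make this workflow concrete, here is a minimal, self-contained sketch of what such a generator could look like; the function name, the operator set, and the output keys are illustrative assumptions and are not taken from the Generate_Number_Data code above.
import random

def generate_arithmetic_pair():
    # Build one prompt-response pair for a random arithmetic problem
    a, b = random.randint(1, 100), random.randint(1, 100)
    op_word, op_symbol, func = random.choice([
        ("plus", "+", lambda x, y: x + y),
        ("minus", "-", lambda x, y: x - y),
        ("times", "*", lambda x, y: x * y),
    ])
    prompt = f"What is {a} {op_word} {b}?"
    response = f"{a} {op_symbol} {b} = {func(a, b)}"
    return {"prompt": prompt, "response": response}

# Example: generate_arithmetic_pair()
# -> {"prompt": "What is 7 plus 5?", "response": "7 + 5 = 12"}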
2.3. Pipeline Role:
Adds diversity and structured logic to the training dataset
Enhances reasoning capability of the target language model
2.4. Expansion Capabilities:
Supports dataset generation for algebraic patterns or geometric reasoning
Capable of scaling data samples by setting quantity and difficulty level
Provides consistent input-output format for evaluation compatibility
Class 3: ChatDataset – Custom Dataset for HuggingFace Transformers

3.1. Purpose:
Converts prompt-response lists into tensor-ready format for model ingestion
3.2. Internal Processing:
Accepts list_prompt and list_response from the preprocessing step
Uses a tokenizer (from HuggingFace) to convert raw text into token IDs
Applies attention masks, padding, and truncation to maintain input consistency
Outputs training-ready samples as dictionaries: {input_ids, attention_mask, labels}
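The class body is not reproduced in this report; the following is a minimal sketch of such a dataset, assuming a HuggingFace tokenizer and a causal-LM setup in which the labels mirror input_ids. The class name, max_length default, and prompt/response concatenation are illustrative assumptions.
import torch
from torch.utils.data import Dataset

class ChatDatasetSketch(Dataset):
    # Illustrative stand-in for the project's ChatDataset
    def __init__(self, list_prompt, list_response, tokenizer, max_length=256):
        self.prompts, self.responses = list_prompt, list_response
        self.tokenizer, self.max_length = tokenizer, max_length

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        text = f"{self.prompts[idx]} {self.responses[idx]}"
        encoded = self.tokenizer(
            text,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        input_ids = encoded["input_ids"].squeeze(0)
        attention_mask = encoded["attention_mask"].squeeze(0)
        # For causal language modeling the labels are the input ids themselves
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": input_ids.clone()}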
3.3. Integration:
Fully compatible with PyTorch's DataLoader (see the example after this list)
Supports mixed batch sizes and customizable sequence lengths
Enables real-time augmentation if needed (e.g., paraphrasing)
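For example, wiring the dataset sketch above into PyTorch's DataLoader could look like this; the GPT-2 tokenizer, the sample lists, and the batch size are placeholders.
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

list_prompt = ["What is 7 plus 5?"]
list_response = ["7 + 5 = 12"]
dataset = ChatDatasetSketch(list_prompt, list_response, tokenizer, max_length=256)
loader = DataLoader(dataset, batch_size=8, shuffle=True)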
3.4. Features for Fine-Tuning:
Automatic detection of maximum input lengths
Conditional masking for multi-turn conversation modeling
Can adapt for causal vs. sequence-to-sequence tasks depending on model type
Class 4: BBO_Training – Model Training Orchestration

4.1. Objective:
Manages the complete lifecycle from model setup to training and saving
4.2. Training Workflow:
Initialize pre-trained model and tokenizer
Configure optimizer, learning rate scheduler, and loss function (e.g., CrossEntropyLoss)
Invoke train() method:
For each epoch:
Iterate over training batches
Forward pass to compute predictions
Compute loss and backpropagate gradients
Apply gradient clipping (if needed)
Log performance metrics (loss, accuracy)
Optionally save checkpoints periodically
Save final model state using save_model()
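A condensed sketch of such a training loop is shown below. It is not the project's BBO_Training implementation; the base model, optimizer settings, and checkpoint path are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)  # placeholder base model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def train(model, loader, epochs=3):
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=epochs * len(loader))
    model.train()
    for epoch in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)          # forward pass; HF models return the loss when labels are given
            loss = outputs.loss
            loss.backward()                   # backpropagate gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")  # log performance metrics
    model.save_pretrained("checkpoints/final")  # placeholder path standing in for save_model()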
4.3. Highlights:
Highly configurable architecture with modular hooks
Support for distributed training across multiple GPUs
Compatible with Transformers Trainer API and PEFT (Parameter-Efficient Fine-Tuning); a brief example follows this list
Easily extendable to integrate custom evaluation metrics and callbacks
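For instance, parameter-efficient fine-tuning with the PEFT library could be wrapped around the model from the training sketch above roughly as follows; the LoRA hyperparameters and target modules are placeholders and depend on the chosen base model.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                          # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],    # attention projection in GPT-2; varies by base model
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)  # only the LoRA adapters remain trainable
peft_model.print_trainable_parameters()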
4. Summary: Full Training Pipeline
Data Preparation:
From Build_Data
Dataset Construction:
Via ChatDataset
Model Training:
Managed by BBO_Training.train() and BBO_Training.fit()
Model Export and Evaluation:
Handled by BBO_Training.save_model() and BBO_Training.evaluate()
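Putting the stages together, an end-to-end run could look roughly like the following. It reuses the illustrative helpers defined in the sketches above (load_contexts, ChatDatasetSketch, train, tokenizer, model), which stand in for the BBO_Training interface named in this summary.
from torch.utils.data import DataLoader

raw = load_contexts()                                   # data preparation (section 1.2 sketch)
pairs = Build_Data(raw).data                            # context/question/answer records
list_prompt = [p["question"] for p in pairs]
list_response = [p["answer"] for p in pairs]
dataset = ChatDatasetSketch(list_prompt, list_response, tokenizer)   # dataset construction
loader = DataLoader(dataset, batch_size=8, shuffle=True)
train(model, loader, epochs=1)                          # model training (section 4.2 sketch)
model.save_pretrained("checkpoints/final")              # model export (placeholder path)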
5. Recommendations for Future Improvements: Strategic Shift to RAG, API Integration, and Automation
1. Transition from Fine-Tuning to Retrieval-Augmented Generation (RAG)
Advantages of RAG over Fine-Tuning:
Efficiency: Fine-tuning requires retraining and storing large models, while RAG dynamically retrieves relevant knowledge without altering the model weights.
Stability: RAG decouples knowledge storage from the model itself, reducing hallucinations and ensuring more consistent outputs.
Cost Reduction: Avoids repetitive fine-tuning, saving GPU hours and hardware costs.
Rapid Updates: Content updates can be made by refreshing the retrieval database, without needing to retrain the model.
Conclusion: RAG is a more scalable, flexible, and cost-effective approach compared to traditional fine-tuning, especially as knowledge bases frequently evolve.
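As a rough illustration of the retrieval step, the snippet below ranks stored passages by TF-IDF similarity and injects the best matches into the prompt before generation; the corpus, helper names, and top_k value are placeholders, and a production system would typically use dense embeddings and a vector database instead.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "BBO Developing builds a conversational AI pipeline.",
    "RAG retrieves passages at query time instead of retraining the model.",
]

def retrieve(query, top_k=1):
    # Rank stored passages by cosine similarity to the query
    vectorizer = TfidfVectorizer().fit(knowledge_base)
    kb_matrix = vectorizer.transform(knowledge_base)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, kb_matrix).ravel()
    return [knowledge_base[i] for i in scores.argsort()[::-1][:top_k]]

def build_prompt(query):
    # Assemble a retrieval-augmented prompt for the generator
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"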
2. Leveraging API-Based Language Models Combined with RAG
Advantages of API Utilization:
Resource Savings: Outsourcing heavy model inference to external API providers (e.g., OpenAI, Anthropic, Cohere) reduces dependency on local GPUs/CPUs.
Flexibility: Easily switch between models (e.g., GPT-4, Claude, Gemini) depending on specific project needs.
Focus on Core Development: Minimizes infrastructure management overhead, allowing the team to focus on improving retrieval systems, prompt engineering, and knowledge bases.
Conclusion: Combining RAG with API-hosted models strikes an optimal balance between performance, cost, and development speed.
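Combined with the retrieval sketch above, calling a hosted model could look roughly like this; it assumes the OpenAI Python SDK (v1+) with an API key set in the environment, and the model name is a placeholder that could be swapped for another provider.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_rag(query):
    prompt = build_prompt(query)  # retrieval-augmented prompt from the sketch above
    response = client.chat.completions.create(
        model="gpt-4o-mini",      # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content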
3. Automating Deployment and Maintenance Using n8n
Advantages of n8n Automation:
Low-Code Workflow Management: Allows setting up and managing complex workflows without heavy coding, saving time and resources.
Integration-Friendly: Native support for over 300 services (databases, APIs, cloud storage, messaging apps).
Scalable and Transparent: Easily monitor, update, and version control workflows via a user-friendly dashboard.
Event-Driven Triggers: Automate retraining triggers, dataset updates, model deployment, and monitoring alerts seamlessly.
Conclusion: Switching from traditional manual deployments to n8n-based automation significantly improves system maintainability, resilience, and operational efficiency.