๐ŸŽLlama 3.2 model

Presentation Report: AI Model Training Project - BBO Developing

1. Project Introduction

The "BBO Developing" project focuses on developing a conversational AI system powered by state-of-the-art language models (e.g., LLaMA, GPT). The system is designed to understand user input and generate appropriate responses. The development process includes multiple stages: data collection, preprocessing, model training, storage, and evaluation.

2. Overview of the Pipeline Workflow

The system is structured as a modular pipeline composed of key classes, each responsible for a distinct stage of data processing or model management:

  1. Build_Data – Data collection and preprocessing from real-world sources

  2. Generate_Number_Data – Synthetic logic/mathematical data generation

  3. ChatDataset – Custom dataset preparation for language model input

  4. BBO_Training – Full-stack model training orchestration

3. In-Depth Class-by-Class Pipeline Analysis

Class 1: Build_Data – Real-World Text Data Collection and Preprocessing

import torch
from transformers import (
    AutoModelForQuestionAnswering, AutoTokenizer,
    BartForConditionalGeneration, BartTokenizer,
    T5ForConditionalGeneration, T5Tokenizer, pipeline,
)


class Build_Data:
    waiting_list = []
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Question-generation model (T5 fine-tuned for question generation)
    model_que_name = "valhalla/t5-base-qg-hl"
    model_que = T5ForConditionalGeneration.from_pretrained(model_que_name).to(device)
    tokenizer_que = T5Tokenizer.from_pretrained(model_que_name)

    # Extractive question-answering model (RoBERTa fine-tuned on SQuAD2)
    model_ans_name = "deepset/roberta-large-squad2"
    model_ans = AutoModelForQuestionAnswering.from_pretrained(model_ans_name).to(device)
    tokenizer_ans = AutoTokenizer.from_pretrained(model_ans_name)

    # Paraphrasing / rewriting model (BART)
    model_para_name = "facebook/bart-large-cnn"
    model_para = BartForConditionalGeneration.from_pretrained(model_para_name).to(device)
    tokenizer_para = BartTokenizer.from_pretrained(model_para_name)

    def __init__(self, Dic_contexts, epoch=1):  # increase the token range to obtain longer sentences
        self.epoch = epoch
        self.contexts = list(Dic_contexts.values())[0]
        self.data = []
        self.Context_To_Data()

    def Answering(self, context, question):
        # Answer the generated question chunk by chunk, then merge and paraphrase.
        nlp = pipeline('question-answering', model=self.model_ans, tokenizer=self.tokenizer_ans)
        context_chunks = self.chunk_text(context, self.tokenizer_ans, chunk_size=64)
        answers = []
        for chunk in context_chunks:
            answer = nlp({'context': chunk, 'question': question})
            answers.append(answer['answer'])

        # Note: the answer from the first chunk is skipped when merging.
        final_answer = ' '.join(answers[1:])
        return self.paraphrasing(final_answer)

    def chunk_text(self, text, tokenizer, chunk_size=512):
        # Split the context into token chunks that fit the QA model's input window.
        tokens = tokenizer.encode(text)
        chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        return [tokenizer.decode(chunk) for chunk in chunks]

    def paraphrasing(self, answer):
        nlp = pipeline('text2text-generation', model=self.model_para, tokenizer=self.tokenizer_para)

        # Generate a more coherent version of the merged answer
        output = nlp(f"paraphrase: {answer}")

        # Return the improved sentence
        return output[0]['generated_text']

    def Context_To_Data(self):
        # Build (context, question, answer) triples for every context, repeated `epoch` times.
        for i in range(self.epoch):
            for index, context in enumerate(self.contexts):
                print(f"\rData generation progress: {index+1}/{len(self.contexts)}", end="", flush=True)
                question = self.Questioning(context)
                answer = self.Answering(context, question)
                self.data.append({"context": context, "question": question, "answer": answer})

    def Questioning(self, context):
        input_text = f"generate question: {context}"
        input_ids = self.tokenizer_que.encode(input_text, return_tensors="pt").to(self.device)
        outputs = self.model_que.generate(
            input_ids,
            max_length=300,  # increase the question length
            do_sample=True,
            top_k=40,
            top_p=0.95,
            temperature=0.6,
            repetition_penalty=1.6,
            num_return_sequences=1
        )
        question = self.tokenizer_que.decode(outputs[0], skip_special_tokens=True)
        return question

    def cal_token(self, text):
        # Rough word count over an iterable of sentences.
        return sum(len(sentence.split()) for sentence in text)

1.1. Purpose and Role:

  • Acts as a general-purpose preprocessor for all downstream training processes

  • Converts real-world documents into structured prompt-response pairs

1.2. Internal Pipeline:

  1. Load data from multiple formats: .docx, .txt, and browser-scraped HTML (using selenium); a minimal loading/cleaning sketch follows this list

  2. Clean text by removing special characters and enforcing UTF-8 encoding

  3. Segment paragraphs or sentences into prompt-response pairs

  4. Store results in an internal queue (waiting_list) or save to intermediate files
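
The loading and cleaning steps can be illustrated with a minimal sketch. The helper names (load_documents, clean_text), the cleaning rules, and the use of python-docx for .docx files are illustrative assumptions rather than the project's actual implementation; the selenium-based HTML path is omitted for brevity.

import re
from pathlib import Path

from docx import Document  # python-docx, assumed available for .docx loading


def clean_text(text):
    # Remove special characters and normalize whitespace (illustrative cleaning rule).
    text = re.sub(r"[^\w\s.,!?%-]", " ", text, flags=re.UNICODE)
    return re.sub(r"\s+", " ", text).strip()


def load_documents(folder):
    # Load .txt and .docx files from a folder and return cleaned paragraphs.
    paragraphs = []
    for path in Path(folder).iterdir():
        if path.suffix == ".txt":
            raw = path.read_text(encoding="utf-8", errors="ignore").split("\n")
        elif path.suffix == ".docx":
            raw = [p.text for p in Document(str(path)).paragraphs]
        else:
            continue  # HTML scraping via selenium would be handled separately
        paragraphs += [clean_text(p) for p in raw if p.strip()]
    return paragraphs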

1.3. Output:

  • list_prompt: List of input prompts (e.g., questions or commands)

  • list_response: Corresponding model-expected responses

1.4. Advanced Processing Features:

  • Filtering based on keyword presence or sentence length

  • Custom splitting of dialogues using delimiters like "Q:" and "A:" (see the sketch after this list)

  • Can incorporate additional sources such as CSV and Excel with sentence-level mapping
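
A minimal sketch of the keyword/length filtering and the "Q:"/"A:" dialogue splitting described above; the function names, thresholds, and regular expressions are illustrative assumptions.

import re


def filter_pairs(pairs, keywords=None, min_len=5):
    # Keep only pairs whose prompt is long enough and (optionally) mentions a keyword.
    kept = []
    for prompt, response in pairs:
        if len(prompt.split()) < min_len:
            continue
        if keywords and not any(k.lower() in prompt.lower() for k in keywords):
            continue
        kept.append((prompt, response))
    return kept


def split_dialogue(text):
    # Split a "Q: ... A: ..." transcript into (question, answer) pairs.
    blocks = re.split(r"(?=Q:)", text)
    pairs = []
    for block in blocks:
        match = re.search(r"Q:\s*(.+?)\s*A:\s*(.+)", block, flags=re.S)
        if match:
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs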

Class 2: Generate_Number_Data – Synthetic Logic & Math Data Generator

class Generate_Number_Data:
    def __init__(self, texts, end="", st=0, multi=1):
        # texts: dict of context lists; st/end select the slice to use.
        # multi is accepted but currently unused.
        self.start = st
        self.end = end
        if end == "":
            self.end = len(texts) - 1
        self.data = Build_Data(self.extract_dict(texts)).data

    def extract_dict(self, dicts):
        # Slice every context list to [start:end] and return the reduced dict,
        # which Build_Data then turns into prompt-response pairs.
        result = {}
        for key in dicts.keys():
            result[key] = dicts[key][self.start:self.end]
        return result

2.1. Purpose:

  • Automates the creation of logical/math-focused training data such as arithmetic problems

2.2. Internal Workflow:

  1. Randomly generate numeric values and operations (e.g., addition, subtraction, multiplication, division)

  2. Convert operations into natural language questions (e.g., "What is 7 plus 5?")

  3. Generate accurate answers (e.g., "7 + 5 = 12")

  4. Format them into structured prompt-response pairs for training (see the sketch after this list)
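
The workflow above corresponds to a generator along the following lines. This is a sketch limited to a few binary operations; the function name, value range, and output field names are illustrative assumptions.

import random

OPS = {"plus": lambda a, b: a + b,
       "minus": lambda a, b: a - b,
       "times": lambda a, b: a * b}


def generate_math_pairs(n_samples=100, max_value=100):
    # Produce structured prompt-response pairs for arithmetic questions,
    # e.g. "What is 7 plus 5?" -> "7 + 5 = 12".
    pairs = []
    for _ in range(n_samples):
        a, b = random.randint(0, max_value), random.randint(0, max_value)
        name, fn = random.choice(list(OPS.items()))
        symbol = {"plus": "+", "minus": "-", "times": "*"}[name]
        prompt = f"What is {a} {name} {b}?"
        response = f"{a} {symbol} {b} = {fn(a, b)}"
        pairs.append({"prompt": prompt, "response": response})
    return pairs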

2.3. Pipeline Role:

  • Adds diversity and structured logic to the training dataset

  • Enhances reasoning capability of the target language model

2.4. Expansion Capabilities:

  • Supports dataset generation for algebraic patterns or geometric reasoning

  • Capable of scaling data samples by setting quantity and difficulty level

  • Provides consistent input-output format for evaluation compatibility

Class 3: ChatDataset – Custom Dataset for HuggingFace Transformers

3.1. Purpose:

  • Converts prompt-response lists into tensor-ready format for model ingestion

3.2. Internal Processing:

  1. Accepts list_prompt and list_response from the preprocessing step

  2. Uses a tokenizer (from HuggingFace) to convert raw text into token IDs

  3. Applies attention masks, padding, and truncation to maintain input consistency

  4. Outputs training-ready samples as dictionaries: {input_ids, attention_mask, labels} (a minimal Dataset sketch follows this list)
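
Since the ChatDataset class itself is not reproduced in this report, the following is a minimal sketch of the behaviour described above (tokenization, padding/truncation, attention masks, labels). It assumes a GPT-style tokenizer with an eos_token; the max_length default and padding strategy are illustrative assumptions.

import torch
from torch.utils.data import Dataset


class ChatDataset(Dataset):
    def __init__(self, list_prompt, list_response, tokenizer, max_length=256):
        self.prompts = list_prompt
        self.responses = list_response
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # Concatenate prompt and response into a single causal-LM training sample.
        text = self.prompts[idx] + self.tokenizer.eos_token + self.responses[idx]
        enc = self.tokenizer(
            text,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding positions in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}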

3.3. Integration:

  • Fully compatible with PyTorch's DataLoader (usage shown after this list)

  • Supports mixed batch sizes and customizable sequence lengths

  • Enables real-time augmentation if needed (e.g., paraphrasing)
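
For instance, wrapping the dataset in a DataLoader might look like the following; the tokenizer choice and batch size are illustrative assumptions, and list_prompt / list_response are the outputs of the preprocessing step.

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default

dataset = ChatDataset(list_prompt, list_response, tokenizer, max_length=256)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch = next(iter(loader))
print(batch["input_ids"].shape)   # torch.Size([8, 256])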

3.4. Features for Fine-Tuning:

  • Automatic detection of maximum input lengths

  • Conditional masking for multi-turn conversation modeling (see the sketch after this list)

  • Can adapt for causal vs. sequence-to-sequence tasks depending on model type
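
Conditional masking typically means excluding the prompt tokens from the loss so the model is only trained on the response. A minimal sketch of that idea, assuming the -100 ignore index used by torch.nn.CrossEntropyLoss; the function name and example values are illustrative.

import torch


def mask_prompt_tokens(input_ids, prompt_length):
    # Copy input_ids into labels and hide the prompt span from the loss;
    # -100 is the index ignored by CrossEntropyLoss.
    labels = input_ids.clone()
    labels[:prompt_length] = -100
    return labels


input_ids = torch.tensor([101, 2023, 2003, 1037, 3231, 102])
labels = mask_prompt_tokens(input_ids, prompt_length=3)
print(labels)  # tensor([-100, -100, -100, 1037, 3231, 102])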

Class 4: BBO_Training – Model Training Orchestration

4.1. Objective:

  • Manages the complete lifecycle from model setup to training and saving

4.2. Training Workflow:

  1. Initialize pre-trained model and tokenizer

  2. Configure optimizer, learning rate scheduler, and loss function (e.g., CrossEntropyLoss)

  3. Invoke train() method:

    • For each epoch:

      • Iterate over training batches

      • Forward pass to compute predictions

      • Compute loss and backpropagate gradients

      • Apply gradient clipping (if needed)

      • Log performance metrics (loss, accuracy)

      • Optionally save checkpoints periodically

  4. Save final model state using save_model() (a minimal training-loop sketch follows this list)
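
The BBO_Training class is not reproduced in this report, so the loop below is a minimal sketch of the workflow just described (optimizer setup, forward pass, backpropagation, gradient clipping, logging, export). The base model, learning rate, epoch count, and clipping threshold are illustrative assumptions; `dataset` is a ChatDataset from the previous stage.

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)   # illustrative base model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loader = DataLoader(dataset, batch_size=8, shuffle=True)          # dataset: ChatDataset from step 2

model.train()
for epoch in range(3):
    for step, batch in enumerate(loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # HF models return the loss when labels are provided
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        optimizer.zero_grad()
        if step % 50 == 0:
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")

model.save_pretrained("bbo_model")   # final export, analogous to save_model()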

4.3. Highlights:

  • Highly configurable architecture with modular hooks

  • Support for distributed training across multiple GPUs

  • Compatible with Transformers Trainer API and PEFT (Parameter-Efficient Fine-Tuning); a LoRA sketch follows this list

  • Easily extendable to integrate custom evaluation metrics and callbacks
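
As an example of the PEFT compatibility mentioned above, a LoRA adapter could be attached roughly as follows. The base model, rank, alpha, and dropout values are illustrative assumptions, not the project's actual settings.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

lora_config = LoraConfig(
    r=8,                 # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable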

4. Summary: Full Training Pipeline

  1. Data Preparation:

    • From Build_Data

  2. Dataset Construction:

    • Via ChatDataset

  3. Model Training:

    • Managed by BBO_Training.train() and BBO_Training.fit()

  4. Model Export:

    • Handled by BBO_Training.save_model(); evaluation is performed by BBO_Training.evaluate() (an end-to-end usage sketch follows this list)
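
Put together, the intended usage of the pipeline reads roughly as follows. This is a sketch that reuses the classes shown earlier in this report; BBO_Training's constructor arguments and the choice of base model/tokenizer are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. Data preparation: question/answer pairs generated from raw contexts
builder = Build_Data({"contexts": ["Hanoi is the capital of Vietnam."]}, epoch=1)
list_prompt = [item["question"] for item in builder.data]
list_response = [item["answer"] for item in builder.data]

# 2. Dataset construction
dataset = ChatDataset(list_prompt, list_response, tokenizer)

# 3. Model training (BBO_Training's argument names are assumed here)
trainer = BBO_Training(model, tokenizer, dataset)
trainer.train()

# 4. Export and evaluation
trainer.save_model("bbo_model")
trainer.evaluate()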

5. Recommendations for Future Improvements: Strategic Shift to RAG, API Integration, and Automation

1. Transition from Fine-Tuning to Retrieval-Augmented Generation (RAG)

  • Advantages of RAG over Fine-Tuning:

    • Efficiency: Fine-tuning requires retraining and storing large models, while RAG dynamically retrieves relevant knowledge without altering the model weights.

    • Stability: RAG decouples knowledge storage from the model itself, reducing hallucinations and ensuring more consistent outputs.

    • Cost Reduction: Avoids repetitive fine-tuning, saving GPU hours and hardware costs.

    • Rapid Updates: Content updates can be made by refreshing the retrieval database, without needing to retrain the model.

  • Conclusion: RAG is a more scalable, flexible, and cost-effective approach compared to traditional fine-tuning, especially as knowledge bases frequently evolve.

2. Leveraging API-Based Language Models Combined with RAG

  • Advantages of API Utilization:

    • Resource Savings: Outsourcing heavy model inference to external API providers (e.g., OpenAI, Anthropic, Cohere) reduces dependency on local GPUs/CPUs.

    • Flexibility: Easily switch between models (e.g., GPT-4, Claude, Gemini) depending on specific project needs.

    • Focus on Core Development: Minimizes infrastructure management overhead, allowing the team to focus on improving retrieval systems, prompt engineering, and knowledge bases.

  • Conclusion: Combining RAG with API-hosted models strikes an optimal balance between performance, cost, and development speed; a minimal retrieve-then-call sketch follows.
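
The sketch below illustrates the proposed retrieve-then-generate direction. It assumes sentence-transformers for embeddings, and call_llm_api is a hypothetical placeholder standing in for whichever hosted provider (OpenAI, Anthropic, Cohere, etc.) is eventually chosen; the documents and prompt template are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

documents = [
    "BBO Developing builds conversational AI on top of LLaMA-class models.",
    "The training pipeline has data collection, preprocessing, training, and evaluation stages.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)


def call_llm_api(prompt):
    # Hypothetical wrapper; in a real deployment this would forward the prompt
    # to the chosen provider's API instead of returning a stub string.
    return f"[LLM response to: {prompt[:60]}...]"


def answer_with_rag(question, top_k=1):
    # Retrieve the most relevant documents, then let the hosted model answer from them.
    q_vec = encoder.encode([question], normalize_embeddings=True)
    scores = doc_vectors @ q_vec.T                    # cosine similarity (vectors are normalized)
    best = np.argsort(scores.ravel())[::-1][:top_k]
    context = "\n".join(documents[i] for i in best)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm_api(prompt)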

3. Automating Deployment and Maintenance Using n8n

  • Advantages of n8n Automation:

    • Low-Code Workflow Management: Allows setting up and managing complex workflows without heavy coding, saving time and resources.

    • Integration-Friendly: Native support for over 300 services (databases, APIs, cloud storage, messaging apps).

    • Scalable and Transparent: Easily monitor, update, and version control workflows via a user-friendly dashboard.

    • Event-Driven Triggers: Automate retraining triggers, dataset updates, model deployment, and monitoring alerts seamlessly.

  • Conclusion: Switching from traditional manual deployments to n8n-based automation significantly improves system maintainability, resilience, and operational efficiency.
