๐ŸŽLlama 3.2 model

Presentation Report: AI Model Training Project - BBO Developing

1. Project Introduction

The "BBO Developing" project focuses on developing a conversational AI system powered by state-of-the-art language models (e.g., LLaMA, GPT). The system is designed to understand user input and generate appropriate responses. The development process includes multiple stages: data collection, preprocessing, model training, storage, and evaluation.

2. Overview of the Pipeline Workflow

The system is structured as a modular pipeline composed of key classes, each responsible for a distinct stage of data processing or model management:

  1. Build_Data – Data collection and preprocessing from real-world sources

  2. Generate_Number_Data – Synthetic logic/mathematical data generation

  3. ChatDataset – Custom dataset preparation for language model input

  4. BBO_Training – Full-stack model training orchestration

3. In-Depth Class-by-Class Pipeline Analysis

Class 1: Build_Data – Real-World Text Data Collection and Preprocessing

import torch
from transformers import (
    AutoModelForQuestionAnswering, AutoTokenizer,
    BartForConditionalGeneration, BartTokenizer,
    T5ForConditionalGeneration, T5Tokenizer, pipeline,
)


class Build_Data:
    waiting_list = []
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Question-generation model (T5 fine-tuned for question generation)
    model_que_name = "valhalla/t5-base-qg-hl"
    model_que = T5ForConditionalGeneration.from_pretrained(model_que_name).to(device)
    tokenizer_que = T5Tokenizer.from_pretrained(model_que_name)

    # Extractive question-answering model (RoBERTa fine-tuned on SQuAD2)
    model_ans_name = "deepset/roberta-large-squad2"
    model_ans = AutoModelForQuestionAnswering.from_pretrained(model_ans_name).to(device)
    tokenizer_ans = AutoTokenizer.from_pretrained(model_ans_name)

    # Paraphrasing / rewriting model (BART)
    model_para_name = "facebook/bart-large-cnn"
    model_para = BartForConditionalGeneration.from_pretrained(model_para_name).to(device)
    tokenizer_para = BartTokenizer.from_pretrained(model_para_name)

    def __init__(self, Dic_contexts, epoch=1):  # increase the token range to obtain longer sentences
        self.epoch = epoch
        self.contexts = list(Dic_contexts.values())[0]
        self.data = []
        self.Context_To_Data()

    def Answering(self, context, question):
        # Answer the generated question chunk by chunk, then merge and paraphrase.
        nlp = pipeline('question-answering', model=self.model_ans, tokenizer=self.tokenizer_ans)
        context_chunks = self.chunk_text(context, self.tokenizer_ans, chunk_size=64)
        answers = []
        for chunk in context_chunks:
            answer = nlp({'context': chunk, 'question': question})
            answers.append(answer['answer'])

        # Note: the answer from the first chunk is skipped when merging.
        final_answer = ' '.join(answers[1:])
        return self.paraphrasing(final_answer)

    def chunk_text(self, text, tokenizer, chunk_size=512):
        # Split the context into token chunks that fit the QA model's input window.
        tokens = tokenizer.encode(text)
        chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        return [tokenizer.decode(chunk) for chunk in chunks]

    def paraphrasing(self, answer):
        nlp = pipeline('text2text-generation', model=self.model_para, tokenizer=self.tokenizer_para)

        # Generate a more coherent version of the merged answer
        output = nlp(f"paraphrase: {answer}")

        # Return the improved sentence
        return output[0]['generated_text']

    def Context_To_Data(self):
        # Build (context, question, answer) triples for every context, repeated `epoch` times.
        for i in range(self.epoch):
            for index, context in enumerate(self.contexts):
                print(f"\rData generation progress: {index+1}/{len(self.contexts)}", end="", flush=True)
                question = self.Questioning(context)
                answer = self.Answering(context, question)
                self.data.append({"context": context, "question": question, "answer": answer})

    def Questioning(self, context):
        input_text = f"generate question: {context}"
        input_ids = self.tokenizer_que.encode(input_text, return_tensors="pt").to(self.device)
        outputs = self.model_que.generate(
            input_ids,
            max_length=300,  # increase the question length
            do_sample=True,
            top_k=40,
            top_p=0.95,
            temperature=0.6,
            repetition_penalty=1.6,
            num_return_sequences=1
        )
        question = self.tokenizer_que.decode(outputs[0], skip_special_tokens=True)
        return question

    def cal_token(self, text):
        # Rough word count over an iterable of sentences.
        return sum(len(sentence.split()) for sentence in text)

1.1. Purpose and Role:

  • Acts as a general-purpose preprocessor for all downstream training processes

  • Converts real-world documents into structured prompt-response pairs

1.2. Internal Pipeline:

  1. Load data from multiple formats: .docx, .txt, and browser-scraped HTML (using selenium); a minimal loading/cleaning sketch follows this list

  2. Clean text by removing special characters and enforcing UTF-8 encoding

  3. Segment paragraphs or sentences into prompt-response pairs

  4. Store results in an internal queue (waiting_list) or save to intermediate files
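
The loading and cleaning steps can be illustrated with a minimal sketch. The helper names (load_documents, clean_text), the cleaning rules, and the use of python-docx for .docx files are illustrative assumptions rather than the project's actual implementation; the selenium-based HTML path is omitted for brevity.

import re
from pathlib import Path

from docx import Document  # python-docx, assumed available for .docx loading


def clean_text(text):
    # Remove special characters and normalize whitespace (illustrative cleaning rule).
    text = re.sub(r"[^\w\s.,!?%-]", " ", text, flags=re.UNICODE)
    return re.sub(r"\s+", " ", text).strip()


def load_documents(folder):
    # Load .txt and .docx files from a folder and return cleaned paragraphs.
    paragraphs = []
    for path in Path(folder).iterdir():
        if path.suffix == ".txt":
            raw = path.read_text(encoding="utf-8", errors="ignore").split("\n")
        elif path.suffix == ".docx":
            raw = [p.text for p in Document(str(path)).paragraphs]
        else:
            continue  # HTML scraping via selenium would be handled separately
        paragraphs += [clean_text(p) for p in raw if p.strip()]
    return paragraphs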

1.3. Output:

  • list_prompt: List of input prompts (e.g., questions or commands)

  • list_response: Corresponding model-expected responses

1.4. Advanced Processing Features:

  • Filtering based on keyword presence or sentence length

  • Custom splitting of dialogues using delimiters like "Q:" and "A:" (see the sketch after this list)

  • Can incorporate additional sources such as CSV and Excel with sentence-level mapping
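
A minimal sketch of the keyword/length filtering and the "Q:"/"A:" dialogue splitting described above; the function names, thresholds, and regular expressions are illustrative assumptions.

import re


def filter_pairs(pairs, keywords=None, min_len=5):
    # Keep only pairs whose prompt is long enough and (optionally) mentions a keyword.
    kept = []
    for prompt, response in pairs:
        if len(prompt.split()) < min_len:
            continue
        if keywords and not any(k.lower() in prompt.lower() for k in keywords):
            continue
        kept.append((prompt, response))
    return kept


def split_dialogue(text):
    # Split a "Q: ... A: ..." transcript into (question, answer) pairs.
    blocks = re.split(r"(?=Q:)", text)
    pairs = []
    for block in blocks:
        match = re.search(r"Q:\s*(.+?)\s*A:\s*(.+)", block, flags=re.S)
        if match:
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs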

Class 2: Generate_Number_Data – Synthetic Logic & Math Data Generator

class Generate_Number_Data:
    def __init__(self, texts, end="", st=0, multi=1):
        # texts: dict of context lists; st/end select the slice to use.
        # multi is accepted but currently unused.
        self.start = st
        self.end = end
        if end == "":
            self.end = len(texts) - 1
        self.data = Build_Data(self.extract_dict(texts)).data

    def extract_dict(self, dicts):
        # Slice every context list to [start:end] and return the reduced dict,
        # which Build_Data then turns into prompt-response pairs.
        result = {}
        for key in dicts.keys():
            result[key] = dicts[key][self.start:self.end]
        return result

2.1. Purpose:

  • Automates the creation of logical/math-focused training data such as arithmetic problems

2.2. Internal Workflow:

  1. Randomly generate numeric values and operations (e.g., addition, subtraction, multiplication, division)

  2. Convert operations into natural language questions (e.g., "What is 7 plus 5?")

  3. Generate accurate answers (e.g., "7 + 5 = 12")

  4. Format them into structured prompt-response pairs for training (see the sketch after this list)
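
The workflow above corresponds to a generator along the following lines. This is a sketch limited to a few binary operations; the function name, value range, and output field names are illustrative assumptions.

import random

OPS = {"plus": lambda a, b: a + b,
       "minus": lambda a, b: a - b,
       "times": lambda a, b: a * b}


def generate_math_pairs(n_samples=100, max_value=100):
    # Produce structured prompt-response pairs for arithmetic questions,
    # e.g. "What is 7 plus 5?" -> "7 + 5 = 12".
    pairs = []
    for _ in range(n_samples):
        a, b = random.randint(0, max_value), random.randint(0, max_value)
        name, fn = random.choice(list(OPS.items()))
        symbol = {"plus": "+", "minus": "-", "times": "*"}[name]
        prompt = f"What is {a} {name} {b}?"
        response = f"{a} {symbol} {b} = {fn(a, b)}"
        pairs.append({"prompt": prompt, "response": response})
    return pairs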

2.3. Pipeline Role:

  • Adds diversity and structured logic to the training dataset

  • Enhances reasoning capability of the target language model

2.4. Expansion Capabilities:

  • Supports dataset generation for algebraic patterns or geometric reasoning

  • Capable of scaling data samples by setting quantity and difficulty level

  • Provides consistent input-output format for evaluation compatibility

Class 3: ChatDataset – Custom Dataset for HuggingFace Transformers

3.1. Purpose:

  • Converts prompt-response lists into tensor-ready format for model ingestion

3.2. Internal Processing:

  1. Accepts list_prompt and list_response from the preprocessing step

  2. Uses a tokenizer (from HuggingFace) to convert raw text into token IDs

  3. Applies attention masks, padding, and truncation to maintain input consistency

  4. Outputs training-ready samples as dictionaries: {input_ids, attention_mask, labels} (a minimal Dataset sketch follows this list)
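
Since the ChatDataset class itself is not reproduced in this report, the following is a minimal sketch of the behaviour described above (tokenization, padding/truncation, attention masks, labels). It assumes a GPT-style tokenizer with an eos_token; the max_length default and padding strategy are illustrative assumptions.

import torch
from torch.utils.data import Dataset


class ChatDataset(Dataset):
    def __init__(self, list_prompt, list_response, tokenizer, max_length=256):
        self.prompts = list_prompt
        self.responses = list_response
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.prompts)

    def __getitem__(self, idx):
        # Concatenate prompt and response into a single causal-LM training sample.
        text = self.prompts[idx] + self.tokenizer.eos_token + self.responses[idx]
        enc = self.tokenizer(
            text,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding positions in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}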

3.3. Integration:

  • Fully compatible with PyTorch's DataLoader (usage shown after this list)

  • Supports mixed batch sizes and customizable sequence lengths

  • Enables real-time augmentation if needed (e.g., paraphrasing)
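
For instance, wrapping the dataset in a DataLoader might look like the following; the tokenizer choice and batch size are illustrative assumptions, and list_prompt / list_response are the outputs of the preprocessing step.

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default

dataset = ChatDataset(list_prompt, list_response, tokenizer, max_length=256)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch = next(iter(loader))
print(batch["input_ids"].shape)   # torch.Size([8, 256])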

3.4. Features for Fine-Tuning:

  • Automatic detection of maximum input lengths

  • Conditional masking for multi-turn conversation modeling (see the sketch after this list)

  • Can adapt for causal vs. sequence-to-sequence tasks depending on model type
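
Conditional masking typically means excluding the prompt tokens from the loss so the model is only trained on the response. A minimal sketch of that idea, assuming the -100 ignore index used by torch.nn.CrossEntropyLoss; the function name and example values are illustrative.

import torch


def mask_prompt_tokens(input_ids, prompt_length):
    # Copy input_ids into labels and hide the prompt span from the loss;
    # -100 is the index ignored by CrossEntropyLoss.
    labels = input_ids.clone()
    labels[:prompt_length] = -100
    return labels


input_ids = torch.tensor([101, 2023, 2003, 1037, 3231, 102])
labels = mask_prompt_tokens(input_ids, prompt_length=3)
print(labels)  # tensor([-100, -100, -100, 1037, 3231, 102])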

Class 4: BBO_Training – Model Training Orchestration

4.1. Objective:

  • Manages the complete lifecycle from model setup to training and saving

4.2. Training Workflow:

  1. Initialize pre-trained model and tokenizer

  2. Configure optimizer, learning rate scheduler, and loss function (e.g., CrossEntropyLoss)

  3. Invoke train() method:

    • For each epoch:

      • Iterate over training batches

      • Forward pass to compute predictions

      • Compute loss and backpropagate gradients

      • Apply gradient clipping (if needed)

      • Log performance metrics (loss, accuracy)

      • Optionally save checkpoints periodically

  4. Save final model state using save_model() (a minimal training-loop sketch follows this list)
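
The BBO_Training class is not reproduced in this report, so the loop below is a minimal sketch of the workflow just described (optimizer setup, forward pass, backpropagation, gradient clipping, logging, export). The base model, learning rate, epoch count, and clipping threshold are illustrative assumptions; `dataset` is a ChatDataset from the previous stage.

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)   # illustrative base model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loader = DataLoader(dataset, batch_size=8, shuffle=True)          # dataset: ChatDataset from step 2

model.train()
for epoch in range(3):
    for step, batch in enumerate(loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # HF models return the loss when labels are provided
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        optimizer.zero_grad()
        if step % 50 == 0:
            print(f"epoch {epoch} step {step} loss {loss.item():.4f}")

model.save_pretrained("bbo_model")   # final export, analogous to save_model()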

4.3. Highlights:

  • Highly configurable architecture with modular hooks

  • Support for distributed training across multiple GPUs

  • Compatible with Transformers Trainer API and PEFT (Parameter-Efficient Fine-Tuning); a LoRA sketch follows this list

  • Easily extendable to integrate custom evaluation metrics and callbacks
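
As an example of the PEFT compatibility mentioned above, a LoRA adapter could be attached roughly as follows. The base model, rank, alpha, and dropout values are illustrative assumptions, not the project's actual settings.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # illustrative base model

lora_config = LoraConfig(
    r=8,                 # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # only the adapter weights are trainable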

4. Summary: Full Training Pipeline

  1. Data Preparation:

    • From Build_Data

  2. Dataset Construction:

    • Via ChatDataset

  3. Model Training:

    • Managed by BBO_Training.train() and BBO_Training.fit()

  4. Model Export:

    • Handled by BBO_Training.save_model(); evaluation is performed by BBO_Training.evaluate() (an end-to-end usage sketch follows this list)
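
Put together, the intended usage of the pipeline reads roughly as follows. This is a sketch that reuses the classes shown earlier in this report; BBO_Training's constructor arguments and the choice of base model/tokenizer are assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. Data preparation: question/answer pairs generated from raw contexts
builder = Build_Data({"contexts": ["Hanoi is the capital of Vietnam."]}, epoch=1)
list_prompt = [item["question"] for item in builder.data]
list_response = [item["answer"] for item in builder.data]

# 2. Dataset construction
dataset = ChatDataset(list_prompt, list_response, tokenizer)

# 3. Model training (BBO_Training's argument names are assumed here)
trainer = BBO_Training(model, tokenizer, dataset)
trainer.train()

# 4. Export and evaluation
trainer.save_model("bbo_model")
trainer.evaluate()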

5. Recommendations for Future Improvements: Strategic Shift to RAG, API Integration, and Automation

1. Transition from Fine-Tuning to Retrieval-Augmented Generation (RAG)

  • Advantages of RAG over Fine-Tuning:

    • Efficiency: Fine-tuning requires retraining and storing large models, while RAG dynamically retrieves relevant knowledge without altering the model weights.

    • Stability: RAG decouples knowledge storage from the model itself, reducing hallucinations and ensuring more consistent outputs.

    • Cost Reduction: Avoids repetitive fine-tuning, saving GPU hours and hardware costs.

    • Rapid Updates: Content updates can be made by refreshing the retrieval database, without needing to retrain the model.

  • Conclusion: RAG is a more scalable, flexible, and cost-effective approach compared to traditional fine-tuning, especially as knowledge bases frequently evolve.

2. Leveraging API-Based Language Models Combined with RAG

  • Advantages of API Utilization:

    • Resource Savings: Outsourcing heavy model inference to external API providers (e.g., OpenAI, Anthropic, Cohere) reduces dependency on local GPUs/CPUs.

    • Flexibility: Easily switch between models (e.g., GPT-4, Claude, Gemini) depending on specific project needs.

    • Focus on Core Development: Minimizes infrastructure management overhead, allowing the team to focus on improving retrieval systems, prompt engineering, and knowledge bases.

  • Conclusion: Combining RAG with API-hosted models strikes an optimal balance between performance, cost, and development speed; a minimal retrieve-then-call sketch follows.
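
The sketch below illustrates the proposed retrieve-then-generate direction. It assumes sentence-transformers for embeddings, and call_llm_api is a hypothetical placeholder standing in for whichever hosted provider (OpenAI, Anthropic, Cohere, etc.) is eventually chosen; the documents and prompt template are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

documents = [
    "BBO Developing builds conversational AI on top of LLaMA-class models.",
    "The training pipeline has data collection, preprocessing, training, and evaluation stages.",
]
doc_vectors = encoder.encode(documents, normalize_embeddings=True)


def call_llm_api(prompt):
    # Hypothetical wrapper; in a real deployment this would forward the prompt
    # to the chosen provider's API instead of returning a stub string.
    return f"[LLM response to: {prompt[:60]}...]"


def answer_with_rag(question, top_k=1):
    # Retrieve the most relevant documents, then let the hosted model answer from them.
    q_vec = encoder.encode([question], normalize_embeddings=True)
    scores = doc_vectors @ q_vec.T                    # cosine similarity (vectors are normalized)
    best = np.argsort(scores.ravel())[::-1][:top_k]
    context = "\n".join(documents[i] for i in best)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm_api(prompt)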

3. Automating Deployment and Maintenance Using n8n

  • Advantages of n8n Automation:

    • Low-Code Workflow Management: Allows setting up and managing complex workflows without heavy coding, saving time and resources.

    • Integration-Friendly: Native support for over 300 services (databases, APIs, cloud storage, messaging apps).

    • Scalable and Transparent: Easily monitor, update, and version control workflows via a user-friendly dashboard.

    • Event-Driven Triggers: Automate retraining triggers, dataset updates, model deployment, and monitoring alerts seamlessly.

  • Conclusion: Switching from traditional manual deployments to n8n-based automation significantly improves system maintainability, resilience, and operational efficiency.
