
🎍 Testing Demo

General Introduction

  • Purpose of the Demo/Testing Session: To publicly release the demo version of the Cardano Constitution Chatbot to the community; to test the core features of the Telegram chatbot before entering the final deployment phase; to evaluate the system's performance under simulated load; and to collect initial feedback from a group of test users.

  • Scope of the Demo/Testing: Covers all primary interaction flows such as: Commands, Text messages, Callback Queries — with a focus on the chatbot's question-answering functionality, automated notification flows, account management features, bilingual support, and more.

  • Execution Environment: Local development environment with infrastructure configuration closely resembling the production setup.

  • Date and Time of Execution: From May 8 to May 16, 2025.

  • Demo Conductors and Participants:

    • The Bamboo Team consisting of 6 members.

    • Community Support: 22 participants.

Demo Scenarios and Results (Testcase)

  • Demo bot link: @GovernCardanoBot

Chatbot test case file

Demo Results Summary

📍 1. Context & Objective

This report summarizes the performance of a chatbot model evaluated on the data_core dataset. The dataset consists of 400 question-answer pairs, split equally into 4 batches (batch1 to batch4). Each batch contains:

  • 100 questions,

  • Corresponding gold-standard answers, and

  • Model-generated answers from the chatbot.

The objective is to quantitatively assess response quality across batches using multiple industry-standard metrics.


📊 2. Evaluation Metrics Explained

  • Duration: Avg. response time (seconds). Measures processing efficiency.

  • BLEU: Measures lexical overlap (n-gram match). Useful for fluency/formal similarity.

  • Factual Accuracy: Assesses the factual correctness of generated answers.

  • Relevance Score: Measures how well responses match the intent of the question.

  • ROUGE: Evaluates content overlap (focus on recall). Important for summarization-style tasks.

  • Semantic Similarity: Measures how close in meaning the chatbot answer is to the reference answer.
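As a rough illustration, the lexical metrics above can be approximated in a few lines of Python. This is a simplified sketch only (clipped n-gram precision for BLEU, unigram recall for ROUGE-1); the actual evaluation presumably used standard libraries such as sacreBLEU or rouge-score:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the core of BLEU (no brevity penalty)."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def rouge1_recall(candidate, reference):
    """Unigram recall, the core of ROUGE-1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(c, cand[w]) for w, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

# Hypothetical example pair, for illustration only.
gold = "the constitution defines cardano governance"
pred = "the constitution describes cardano governance rules"
print(bleu_precision(pred, gold), rouge1_recall(pred, gold))  # 0.4 0.8
```

Low BLEU with higher ROUGE and semantic scores, as seen in the results below, is typical when the chatbot paraphrases rather than repeats the gold wording.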


📈 3. Quantitative Results Overview

| data name | Batches | Duration | BLEU | Factual Accuracy | Relevance Score | ROUGE | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| data_core | Batch 1 | 7.47 | 0.02 | 0.71 | 0.79 | 0.22 | 0.54 |
| data_core | Batch 2 🔝 | 7.08 | 0.03 | 0.75 | 0.81 | 0.23 | 0.60 |
| data_core | Batch 3 | 5.78 | 0.02 | 0.71 | 0.81 | 0.21 | 0.56 |
| data_core | Batch 4 🔻 | 6.23 | 0.01 | 0.66 | 0.73 | 0.16 | 0.43 |

🔝 Batch 2 consistently outperforms others across all key metrics. 🔻 Batch 4 shows a critical decline in multiple areas.
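The best/worst batch per metric can be derived mechanically from these scores. A small sketch (values copied from the table above; only Duration is better when lower):

```python
# Scores from the data_core results table above.
results = {
    "Batch 1": {"Duration": 7.47, "BLEU": 0.02, "Factual Accuracy": 0.71,
                "Relevance Score": 0.79, "ROUGE": 0.22, "Semantic Similarity": 0.54},
    "Batch 2": {"Duration": 7.08, "BLEU": 0.03, "Factual Accuracy": 0.75,
                "Relevance Score": 0.81, "ROUGE": 0.23, "Semantic Similarity": 0.60},
    "Batch 3": {"Duration": 5.78, "BLEU": 0.02, "Factual Accuracy": 0.71,
                "Relevance Score": 0.81, "ROUGE": 0.21, "Semantic Similarity": 0.56},
    "Batch 4": {"Duration": 6.23, "BLEU": 0.01, "Factual Accuracy": 0.66,
                "Relevance Score": 0.73, "ROUGE": 0.16, "Semantic Similarity": 0.43},
}

for metric in results["Batch 1"]:
    lower_is_better = metric == "Duration"  # latency: smaller is faster
    key = lambda b: results[b][metric]
    best = min(results, key=key) if lower_is_better else max(results, key=key)
    worst = max(results, key=key) if lower_is_better else min(results, key=key)
    print(f"{metric}: best={best}, worst={worst}")
```

Running this reproduces the ranking used in the summary table of Section 6 (Batch 2 best on all quality metrics, Batch 4 worst; Batch 3 fastest).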


📌 4. Key Insights

🔹 Batch 2 – Best Performer

  • Highest factual correctness, BLEU, ROUGE, and semantic alignment.

  • Indicates effective alignment between input, prompt handling, and model output.

  • Balances speed and quality well (7.08 s average duration).

🔹 Batch 4 – Needs Review

  • Lowest across all accuracy-related metrics.

  • Despite acceptable latency (6.23 s), responses suffer from:

    • Hallucinations (low factual accuracy),

    • Irrelevant answers (low relevance score),

    • Poor semantic alignment.

🔹 Efficiency Trend

  • Model becomes faster from batch1 → batch3.

  • Minor slowdown in batch4 could relate to increased token count or failed optimizations.


5. Recommendations

  • 🔍 Audit pipeline for batch4:

    • Review prompt structure, input preprocessing, or token length issues.

  • 🔄 Replicate batch2 configurations:

    • Consider using batch2 as a reference setup (model version, prompt template, decoding settings).

  • 🧪 Conduct ablation tests:

    • Modify only one variable (e.g. context length, sampling temperature) across batches to isolate causes.
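One-variable-at-a-time sweeps like those suggested can be generated from a baseline configuration. A hedged sketch (the parameter names and values are illustrative, not the team's actual settings):

```python
# Baseline generation config -- hypothetical values, for illustration only.
baseline = {"context_length": 2048, "temperature": 0.2, "top_p": 0.9}

# Candidate values to ablate, one parameter at a time.
sweep = {"context_length": [1024, 4096], "temperature": [0.0, 0.7]}

def ablation_configs(baseline, sweep):
    """Yield configs that differ from the baseline in exactly one variable."""
    for param, values in sweep.items():
        for value in values:
            cfg = dict(baseline)  # copy, then change a single knob
            cfg[param] = value
            yield param, cfg

configs = list(ablation_configs(baseline, sweep))
print(len(configs))  # 4 single-change configurations
```

Evaluating each config on the same 100-question batch isolates which variable drives the Batch 4 regression.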


📎 6. Summary Table

| Metric | Best Batch | Worst Batch | Key Action |
| --- | --- | --- | --- |
| Duration | batch3 | batch1 | Maintain speed |
| BLEU | batch2 | batch4 | Tune lexical generation |
| Factual Accuracy | batch2 | batch4 | Prevent hallucinations |
| Relevance Score | batch2 | batch4 | Improve prompt clarity |
| ROUGE | batch2 | batch4 | Boost content coverage |
| Semantic Similarity | batch2 | batch4 | Enhance meaning retention |


📊 7. Visual Comparison of Metrics Across Batches

📌 Chart Highlights:

  • Duration (Top left): Processing time improved steadily from batch1 → batch3, then slightly increased in batch4.

  • BLEU (Top center): batch2 achieved the highest lexical overlap (~0.030); batch4 dropped significantly (~0.010).

  • Factual Accuracy (Top right): Peaked at batch2 (~0.745), indicating strong correctness; batch4 fell to ~0.66.

  • Relevance Score (Bottom left): High and stable in batch1–3 (~0.79–0.81), but batch4 degraded sharply (~0.73).

  • ROUGE (Bottom center): Best in batch2 (~0.23), lowest in batch4 (~0.16) showing poor content retention.

  • Semantic Similarity (Bottom right): Indicates meaningfulness. batch2 leads (~0.60), while batch4 drops significantly (~0.43).


🔚 8. Final Remarks

This evaluation highlights both the potential of the current chatbot pipeline (as seen in batch2) and the need for immediate refinement (as seen in batch4). Continuous batch-wise evaluation like this is essential for:

  • Tracking quality regressions,

  • Diagnosing systematic errors,

  • Ensuring the delivery of consistent, reliable chatbot responses.


User reviews

The table below presents some of the notable comments and feedback the team received from users after they tried the chatbot.

1. M****nil (2025-05-13 10:48:31.642865): "Command execution is still slow and the features are still rudimentary. The /account command sometimes does not work, and the /f command only lists 10 basic questions about the Cardano Constitution; more should be added, or links and cited reference sources in English should be included."

2. Kh*****ng0605 (2025-05-13 11:58:22.30521): "This AI chatbot is very helpful. Through it, I can grasp some brief and basic information about this ecosystem. It also increases my trust in the project. As someone new to learning about Cardano and having a liking for this coin, I hope you will continue to work hard in building and developing this blockchain ecosystem. I wish you the best of luck. See you again when ADA hits $6!"

3. Ha*****ng (2025-05-14 06:00:04.614024): "There is no information about: 4. How can changes be proposed in Cardano? 7. How does Cardano's Treasury fund work? 9. Is there any organization overseeing the Cardano Constitution? 10. Does the Constitution help Cardano resist centralization?"

4. tr*****24 (2025-05-14 13:04:30.959081): "The data vectors need cleaning, because they contain many unnecessary * characters."

5. Ch*****031 (2025-05-15 07:42:06.80212): "I see that the bot's training data is still very limited and needs to be expanded, because this topic is little understood and little known, so the bot needs more data to provide to users. Thank you. I find your bot quite good on blockchain topics, but I think it needs more data, because this market is quite large and users usually don't know much about these issues, so you need to provide more information or explain more about this topic."

6. Other (2025-05-15 14:07:42.586733; contribution received directly via Telegram): "Add a role that receives fund information, or a notification when a new fund is available, etc."

7. tha*****n09 (2025-05-16 15:15:40.896509): "The bot's response speed is quite slow, and its reasoning-based answering mechanism is still limited. For example, there are questions around Cardano that this AI bot has not yet been able to answer."

Note: To protect the data and privacy of the users who participated in the testing process, we have hidden sensitive user information such as Telegram IDs and usernames. We apologize for any inconvenience.

Conclusion

During the testing phase, the core features operated relatively stably and as expected (about 80% of cases). However, we observed some performance issues when handling a large number of concurrent requests, and the chatbot's reasoning capability still requires improvement. In addition, the chatbot's other functionalities need improvements in processing speed, standardized output formatting, and fewer pending states. It is also necessary to add essential notifications, such as those related to funds, Intersect, etc., to provide users with more specific and detailed information.

Overall, the testing phase was successful, demonstrating the Telegram Chatbot System's ability to handle a wide range of interactions (commands, texts, callbacks) and its automated notification features. The main test scenarios produced the expected results under normal load conditions. The RAG (Retrieval-Augmented Generation) feature showed effective data retrieval and response generation based on external sources. However, it is important to address some response time issues in certain features, and feedback from test users will serve as a valuable basis for improvements in future versions.
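The RAG flow mentioned above, retrieving the most relevant passage before generating an answer, can be illustrated with a toy cosine-similarity retriever. This is a minimal sketch with hand-made vectors; the production bot would use a real embedding model and vector store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" for constitution passages -- illustrative vectors only.
passages = {
    "Treasury withdrawals require on-chain governance approval.": [0.9, 0.1, 0.3],
    "DReps vote on governance actions on behalf of delegators.":  [0.2, 0.8, 0.4],
}

def retrieve(query_vec, k=1):
    """Return the k passages most similar to the query embedding."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, passages[p]),
                    reverse=True)
    return ranked[:k]

# A query "embedded" close to the treasury passage.
print(retrieve([0.85, 0.15, 0.25]))
```

The retrieved passage is then placed in the prompt as context, which is why retrieval quality directly bounds factual accuracy in the results above.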

Based on user feedback, we have implemented several feature upgrades and system optimizations, including:

  • Enhanced Interaction Experience: Introduced typing effects to improve user engagement and interactivity on the chatbot interface.

  • System Performance Optimization: Improved data streaming, call handling, and response processing to significantly enhance chatbot response speed.

  • Output Stability Improvements: Resolved issues related to unresponsive outputs, particularly in Telegram output formatting.

  • Output Format Enhancement: Standardized and refined the structure of chatbot responses for better readability and compatibility across platforms.

  • Infrastructure Upgrade: Upgraded the server system to ensure greater scalability and more stable performance under high request loads.
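The typing effect described above can be approximated by editing a placeholder message as chunks of the answer arrive. A simplified sketch with a stubbed client (the real bot would call the Telegram Bot API's sendChatAction and editMessageText methods):

```python
class StubTelegramClient:
    """Minimal stand-in for a Telegram client, for illustration only."""
    def __init__(self):
        self.message = ""
        self.edits = 0

    def send_chat_action(self, action):
        pass  # the real API shows a "typing..." indicator here

    def edit_message(self, text):
        self.message = text
        self.edits += 1

def stream_reply(client, answer, chunk_size=16):
    """Reveal the answer incrementally, producing a typing effect."""
    client.send_chat_action("typing")
    shown = ""
    for i in range(0, len(answer), chunk_size):
        shown += answer[i:i + chunk_size]
        client.edit_message(shown)  # one message edit per chunk
    return shown

client = StubTelegramClient()
final = stream_reply(client, "The Cardano Constitution defines governance principles.")
print(client.edits, final == client.message)
```

Larger chunk sizes mean fewer edit calls, which matters because Telegram rate-limits message edits per chat.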

Our development journey doesn't end here. We will actively leverage user contributions to drive continuous upgrades, maintenance, and optimization. This iterative approach is key to refining the chatbot and bringing it to a polished, complete state in the final phase.
