🎍 Testing Demo
General Introduction
Purpose of the Demo/Testing Session: To publicly release the demo version of the Cardano Constitution Chatbot to the community; to test the core features of the Telegram chatbot before entering the final deployment phase; to evaluate the system's performance under simulated load; and to collect initial feedback from a group of test users.
Scope of the Demo/Testing: Covers all primary interaction flows such as: Commands, Text messages, Callback Queries — with a focus on the chatbot's question-answering functionality, automated notification flows, account management features, bilingual support, and more.
Execution Environment: Local development environment with infrastructure configuration closely resembling the production setup.
Date and Time of Execution: From May 8 to May 16, 2025.
Demo Conductors and Participants:
The Bamboo Team consisting of 6 members.
Community Support: 22 participants.
Demo Scenarios and Results (Testcase)
Demo bot link: @GovernCardanoBot
Demo Results Summary
📍 1. Context & Objective
This report summarizes the performance of a chatbot model evaluated on the data_core dataset. The dataset consists of 400 question-answer pairs, split evenly into 4 batches (batch1 to batch4). Each batch contains:
100 questions,
Corresponding gold-standard answers, and
Model-generated answers from the chatbot.
The objective is to quantitatively assess response quality across batches using multiple industry-standard metrics.
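For reference, the batch structure described above can be sketched in Python. The field names `question`, `gold`, and `model_answer` are illustrative placeholders, not the actual dataset schema:

```python
# Hypothetical shape of the evaluation set: 400 QA pairs split into 4 equal batches.
dataset = [
    {"question": f"q{i}", "gold": f"a{i}", "model_answer": f"m{i}"}
    for i in range(400)
]

# Slice the flat list into 4 batches of 100 pairs each.
batches = [dataset[i:i + 100] for i in range(0, 400, 100)]

assert len(batches) == 4
assert all(len(b) == 100 for b in batches)
```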
📊 2. Evaluation Metrics Explained
| Metric | Description |
| --- | --- |
| Duration | Avg. response time (seconds); measures processing efficiency. |
| BLEU | Measures lexical overlap (n-gram match); useful for fluency/formal similarity. |
| Factual Accuracy | Assesses factual correctness of generated answers. |
| Relevance Score | Measures how well responses match the intent of the question. |
| ROUGE | Evaluates content overlap (focus on recall); important for summarization-style tasks. |
| Semantic Similarity | Measures how close in meaning the chatbot answer is to the reference answer. |
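As a rough illustration of how the lexical metrics above work, the sketch below implements a clipped bigram precision (a single component of BLEU, not the full score) and a unigram ROUGE-style recall in plain Python. A production evaluation would use established libraries; this is only to show the underlying n-gram counting:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_precision(candidate, reference, n=2):
    """Clipped n-gram precision: candidate n-grams also found in the reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    c_counts, r_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
    overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
    return overlap / max(sum(c_counts.values()), 1)

def rouge1_recall(candidate, reference):
    """Unigram recall: fraction of reference words covered by the answer."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

gold = "the constitution defines governance roles on Cardano"
answer = "the constitution defines on-chain governance roles"
print(round(rouge1_recall(answer, gold), 2))  # 0.71
```

Recall rewards covering the reference's words, while clipped precision penalizes wording the reference never used, which is why the two metrics are reported side by side.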
📈 3. Quantitative Results Overview
| data name | Batch | Duration | BLEU | Factual Accuracy | Relevance Score | ROUGE | Semantic Similarity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| data_core | Batch 1 | 7.47 | 0.02 | 0.71 | 0.79 | 0.22 | 0.54 |
| data_core | Batch 2 🔝 | 7.08 | 0.03 | 0.75 | 0.81 | 0.23 | 0.60 |
| data_core | Batch 3 | 5.78 | 0.02 | 0.71 | 0.81 | 0.21 | 0.56 |
| data_core | Batch 4 🔻 | 6.23 | 0.01 | 0.66 | 0.73 | 0.16 | 0.43 |
🔝 Batch 2 consistently outperforms others across all key metrics. 🔻 Batch 4 shows a critical decline in multiple areas.
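The per-batch figures can also be compared programmatically. The sketch below transcribes the results table into a dictionary and picks the best batch per metric (the shortened metric keys are illustrative, not part of any pipeline):

```python
# Per-batch scores transcribed from the results table above.
results = {
    "batch1": {"duration": 7.47, "bleu": 0.02, "factual": 0.71,
               "relevance": 0.79, "rouge": 0.22, "semantic": 0.54},
    "batch2": {"duration": 7.08, "bleu": 0.03, "factual": 0.75,
               "relevance": 0.81, "rouge": 0.23, "semantic": 0.60},
    "batch3": {"duration": 5.78, "bleu": 0.02, "factual": 0.71,
               "relevance": 0.81, "rouge": 0.21, "semantic": 0.56},
    "batch4": {"duration": 6.23, "bleu": 0.01, "factual": 0.66,
               "relevance": 0.73, "rouge": 0.16, "semantic": 0.43},
}

def best_batch(metric, lower_is_better=False):
    """Return the batch name with the best value for the given metric."""
    pick = min if lower_is_better else max
    return pick(results, key=lambda b: results[b][metric])

print(best_batch("duration", lower_is_better=True))  # batch3 (fastest)
print(best_batch("semantic"))                        # batch2
```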
📌 4. Key Insights
🔹 Batch 2 – Best Performer
Highest factual correctness, BLEU, ROUGE, and semantic alignment.
Indicates effective alignment between input, prompt handling, and model output.
Balances speed and quality well (7.08s average duration).
🔹 Batch 4 – Needs Review
Lowest across all accuracy-related metrics.
Despite acceptable latency (6.3s), responses suffer from:
Hallucinations (low factual accuracy),
Irrelevant answers (low relevance score),
Poor semantic alignment.
🔹 Efficiency Trend
Model becomes faster from batch1 → batch3.
Minor slowdown in batch4 could relate to increased token count or failed optimizations.
✅ 5. Recommendations
🔍 Audit pipeline for batch4:
Review prompt structure, input preprocessing, or token length issues.
🔄 Replicate batch2 configurations:
Consider using batch2 as a reference setup (model version, prompt template, decoding settings).
🧪 Conduct ablation tests:
Modify only one variable (e.g. context length, sampling temperature) across batches to isolate causes.
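The ablation recommendation can be sketched as a config generator that varies exactly one setting at a time relative to a fixed baseline. The parameter names and values below are hypothetical, standing in for whatever the batch2 configuration actually uses:

```python
# Hypothetical baseline mirroring a batch2-style configuration.
baseline = {"temperature": 0.2, "context_chunks": 4, "prompt": "v2"}

# Candidate values to try, one variable at a time.
variants = {
    "temperature": [0.0, 0.5, 0.8],
    "context_chunks": [2, 8],
    "prompt": ["v1", "v3"],
}

def ablation_configs(baseline, variants):
    """Yield (varied_key, config) pairs differing from the baseline in one variable only."""
    for key, values in variants.items():
        for value in values:
            cfg = dict(baseline)
            cfg[key] = value
            yield key, cfg

configs = list(ablation_configs(baseline, variants))
print(len(configs))  # 3 + 2 + 2 = 7 single-variable runs
```

Running each config against the same evaluation batch isolates which variable drives the batch4 regression.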
📎 6. Summary Table
| Metric | Best batch | Worst batch | Recommended action |
| --- | --- | --- | --- |
| Duration | batch3 | batch1 | Maintain speed |
| BLEU | batch2 | batch4 | Tune lexical generation |
| Factual Accuracy | batch2 | batch4 | Prevent hallucinations |
| Relevance Score | batch2 | batch4 | Improve prompt clarity |
| ROUGE | batch2 | batch4 | Boost content coverage |
| Semantic Similarity | batch2 | batch4 | Enhance meaning retention |
📊 7. Visual Comparison of Metrics Across Batches

📌 Chart Highlights:
Duration (Top left): Processing time improved steadily from batch1 → batch3, then slightly increased in batch4.
BLEU (Top center): batch2 achieved the highest lexical overlap (~0.030); batch4 dropped significantly (~0.010).
Factual Accuracy (Top right): Peaked at batch2 (~0.745), indicating strong correctness; batch4 fell to ~0.66.
Relevance Score (Bottom left): High and stable in batch1–3 (~0.79–0.81), but batch4 degraded sharply (~0.73).
ROUGE (Bottom center): Best in batch2 (~0.23), lowest in batch4 (~0.16), showing poor content retention.
Semantic Similarity (Bottom right): Indicates meaningfulness. batch2 leads (~0.60), while batch4 drops significantly (~0.43).
🔚 8. Final Remarks
This evaluation highlights both the potential of the current chatbot pipeline (as seen in batch2) and the need for immediate refinement (as seen in batch4). Continuous batch-wise evaluation like this is essential for:
Tracking quality regressions,
Diagnosing systematic errors,
Ensuring delivery of consistent, reliable chatbot responses.
User reviews
The table below lists some notable comments and feedback the team received from users after trying the chatbot.
1
The commands still execute slowly and the features are still rudimentary; the /account command sometimes does not work; the /f command only lists 10 basic questions about the Cardano Constitution, so more should be added, or links and English-language reference sources should be included.
M****nil
2025-05-13 10:48:31.642865
2
This AI chatbot is very helpful. Through it, I can grasp some brief and basic information about this ecosystem. It also increases my trust in the project. As someone new to learning about Cardano and having a liking for this coin, I hope you will continue to work hard in building and developing this blockchain ecosystem. I wish you the best of luck. See you again when ADA hits $6!
Kh*****ng0605
2025-05-13 11:58:22.30521
3
There is no information about: 4. How can changes be proposed in Cardano? 7. How does Cardano's Treasury fund work? 9. Is there any organization overseeing the Cardano Constitution? 10. Does the Constitution help Cardano resist centralization?
Ha*****ng
2025-05-14 06:00:04.614024
4
The data vectors need cleaning, because they contain many unnecessary * characters.
tr*****24
2025-05-14 13:04:30.959081
5
I see that the bot's training data is still very limited and needs expanding, because this topic is little understood and little known, so the bot needs more data to provide to users. Thank you. I find your bot quite good on blockchain topics, but I think it needs more data, because this market is quite large and users usually don't know much about these issues, so you need to provide more information or explain more about this topic.
Ch*****031
2025-05-15 07:42:06.80212
6
Add a role for receiving fund information, or notifications when a new fund is available, etc.
Other
2025-05-15 14:07:42.586733
Contribution received directly via Telegram
7
The bot's response speed is quite slow, and its reasoning-based answering mechanism is still limited. For example, some questions revolving around Cardano could not yet be answered by this AI bot.
tha*****n09
2025-05-16 15:15:40.896509
Note: To protect the data and privacy of the users who participated in the test, we have hidden sensitive user information such as Telegram ID and username. We apologize for any inconvenience.
Conclusion
During the testing phase, the core features operated relatively stably and as expected (80%). However, we observed some performance issues when handling a large number of concurrent requests, and the chatbot's reasoning capability still requires improvement. In addition, other functionalities available on the chatbot need enhancements in processing speed, standardized output formatting, and reduction of pending states. It is also necessary to add essential notifications, such as those related to funds, intersections, etc., to provide users with more specific and detailed information.
Overall, the testing phase was successful, demonstrating the Telegram Chatbot System's ability to handle a wide range of interactions (commands, texts, callbacks) and its automated notification features. The main test scenarios produced the expected results under normal load conditions. The RAG (Retrieval-Augmented Generation) feature showed effective data retrieval and response generation based on external sources. However, it is important to address some response time issues in certain features, and feedback from test users will serve as a valuable basis for improvements in future versions.
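As a minimal illustration of the retrieval half of RAG mentioned above, the sketch below ranks candidate documents against a question using bag-of-words cosine similarity. The real pipeline presumably retrieves over embedding vectors (which is also why the "clean the data vectors" feedback matters); the documents here are illustrative:

```python
import math
from collections import Counter

def cos_sim(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query (crude bag-of-words)."""
    q = Counter(query.lower().split())
    scored = [(cos_sim(q, Counter(d.lower().split())), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

docs = [
    "The Cardano Constitution describes on-chain governance principles.",
    "Staking rewards are distributed every epoch.",
]
print(retrieve("How does governance work in the Cardano Constitution?", docs))
```

The retrieved passage is then fed into the prompt so the model can generate an answer grounded in the source text.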
Based on user feedback, we have implemented several feature upgrades and system optimizations, including:
Enhanced Interaction Experience: Introduced typing effects to improve user engagement and interactivity on the chatbot interface.
System Performance Optimization: Improved data streaming, call handling, and response processing to significantly enhance chatbot response speed.
Output Stability Improvements: Resolved issues related to unresponsive outputs, particularly in Telegram output formatting.
Output Format Enhancement: Standardized and refined the structure of chatbot responses for better readability and compatibility across platforms.
Infrastructure Upgrade: Upgraded the server system to ensure greater scalability and more stable performance under high request loads.
Our development journey does not end here. We will actively leverage user contributions to drive continuous upgrades, maintenance, and optimization. This iterative approach is key to refining the chatbot and bringing it as close to completion as possible in the final phase.