CCAC 2024: The 4th AI-Debater Evaluation

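This year's AI-Debater evaluation has two stages: an evaluation of basic argumentation abilities and an evaluation of overall debating performance. Stage 1 consists of three downstream tasks from computational argumentation, while Stage 2 consists of two more comprehensive tasks. Participants are encouraged to use large language models (LLMs), so instead of providing separate training data for each task, the organizers provide a single argumentation instruction dataset. Every task except the autonomous debate task in Stage 2 has a corresponding test set.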

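Argumentation instruction dataset: a bilingual dataset of argumentation instructions, built with GPT-3.5-turbo using the self-instruct approach and released as formatted JSON files.

Argument mining dataset (supports Tasks 1 and 2 of Stage 1): an English dataset drawn from English Wikipedia, annotated by professional annotators with claims, stances, and evidence, and released as formatted txt files.

Counter-argument generation dataset (supports Task 3 of Stage 1): a dataset drawn from the ChangeMyView (CMV) forum, in which annotators labeled the rebuttal relations in users' exchanges, released as formatted txt files.

Chinese debate competition dataset (supports Task 1 of Stage 2): a Chinese dataset drawn from nearly 700 well-known Chinese-language debate competitions held between 2007 and 2021. The speech was transcribed and manually verified, after which annotators labeled the claim sentences. It is released as formatted txt files.

Click to download the training data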

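Stage 1: Evaluation of Basic LLM Argumentation Abilities

Task 1: Evidence Detection

Introduction

Given a claim and a candidate sentence, decide whether the sentence is evidence supporting the claim. There are two output labels: 1 means the sentence is evidence, and 0 means it is not.

Data sample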

Input: Controlling birth rates allows families to raise the future earnings power of the next generation.<tab>Note that the Millennium Development Goals have been superseded by the Sustainable Development Goals.
Output: 0
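The following is a minimal sketch of how one input line might be split and turned into an LLM prompt. It assumes that the `<tab>` shown in the sample stands for a real tab character in the test file; the function names `parse_pair` and `build_prompt` and the prompt wording are hypothetical and not part of the official task definition.

```python
from typing import Tuple


def parse_pair(line: str) -> Tuple[str, str]:
    """Split one input line into (claim, candidate sentence)."""
    # Assumption: fields are tab-separated. The literal "<tab>" marker
    # shown in the sample is accepted as well.
    line = line.rstrip("\n").replace("<tab>", "\t")
    claim, sentence = line.split("\t", 1)
    return claim.strip(), sentence.strip()


def build_prompt(claim: str, sentence: str) -> str:
    """Build a hypothetical zero-shot prompt; the wording is an assumption."""
    return (
        "Claim: " + claim + "\n"
        "Sentence: " + sentence + "\n"
        "Is the sentence evidence that supports the claim? Answer 1 or 0."
    )


if __name__ == "__main__":
    claim, sentence = parse_pair("Claim text<tab>Candidate sentence")
    print(build_prompt(claim, sentence))
```

Splitting with `maxsplit=1` keeps any further tab characters inside the candidate sentence rather than raising an error.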

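Evaluation metric

Task 2: Stance Classification

Introduction

Given a topic and a claim, decide whether the claim supports or opposes the topic. There are two output labels: 1 means support, and -1 means opposition.

Data sample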

Input: Should we ban the production of generic drugs<tab>One long-standing solution to mitigating costs has been generics, a pharmaceutical drug with the same chemical substance as a drug whose patents have expired.
Output: -1
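Because LLM replies are free-form text, a submission usually needs a step that maps each reply to one of the two labels. The sketch below is one possible way to do this with a regular expression; the function name `parse_stance` is hypothetical, and returning `None` for unparseable replies is a design choice of this example, not a rule of the task.

```python
import re
from typing import Optional

# Matches a standalone "1", "+1" or "-1" that is not part of a longer number.
_LABEL = re.compile(r"(?<!\d)([+-]?1)(?!\d)")


def parse_stance(reply: str) -> Optional[int]:
    """Map a free-form model reply to 1 (support) or -1 (opposition)."""
    match = _LABEL.search(reply)
    if match is None:
        return None  # caller decides on a fallback label
    return -1 if match.group(1).startswith("-") else 1


if __name__ == "__main__":
    for reply in ["-1", "Label: 1", "The stance is -1.", "unsure"]:
        print(repr(reply), "->", parse_stance(reply))
```

Keeping the fallback outside the parser makes it easy to count how often the model failed to produce a valid label.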

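Evaluation metric

Task 3: Counter-Argument Generation

Introduction

Given a topic and an original argument, the participating model automatically generates one sentence that rebuts the original argument (called the counter-argument).

Data sample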

Input: Should the phrase "under God" be retained in the Pledge of Allegiance?<tab>the under God line is actually a relatively new addition and can therefore be easily removed without significant consequences.
Output: Just because something is new doesn't mean it lacks importance or significance, and removing it can have unforeseen consequences.

Evaluation metric

ROUGE-L
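ROUGE-L scores a generated sentence by the longest common subsequence (LCS) it shares with a reference. The sketch below computes a ROUGE-L F1 score for local sanity checks. It assumes lowercased whitespace tokenization and equal weighting of precision and recall; the official scorer may tokenize or weight differently, so absolute values can differ from the leaderboard.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 under the assumptions stated above."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("removing it can have consequences",
                     "removing it can have unforeseen consequences"))
```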

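Stage 2: Evaluation of Overall LLM Debating Performance

Task 1: Topic-Based Claim Generation

Introduction

Given a debate topic, the participating model automatically generates five claims that fit the topic.

Data sample (translated from the Chinese original)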

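Input: Imperfect victims should not be criticized in public incidents
Output:
Turning the criticism onto an imperfect victim further skews the balance of power between victim and perpetrator, which violates media ethics.
Shifts in public opinion may affect the course of a case.
Holding the attitude that imperfect victims should be criticized will undoubtedly make fewer victims dare to speak out publicly in the future.
...

Evaluation metric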

ROUGE-L
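Since this task expects five claims per topic, it can help to normalize the raw model reply before scoring. The sketch below is one possible post-processing step: it assumes the model writes one claim per line, strips common list markers, removes duplicates, and keeps the first five claims. The function name `extract_claims` and the handled markers are assumptions of this example.

```python
import re
from typing import List

# Leading bullet or numbering such as "- ", "* ", "1. ", "2、", "3)".
_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.、)）])\s*")


def extract_claims(reply: str, k: int = 5) -> List[str]:
    """Return up to k distinct claims, one per non-empty reply line."""
    claims: List[str] = []
    for raw in reply.splitlines():
        text = _PREFIX.sub("", raw).strip()
        if text and text not in claims:
            claims.append(text)
    return claims[:k]


if __name__ == "__main__":
    reply = "1. First claim\n2. Second claim\n\n- Third claim\nFirst claim"
    print(extract_claims(reply))
```

If fewer than five claims survive, the caller can re-prompt the model rather than pad the output.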

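Task 2: Autonomous Debate Agent

Introduction

Implement an LLM-based autonomous debate agent that debates a baseline agent on a given topic. The baseline agent argues the affirmative side, and the participating agent argues the negative side.

Debate interaction rules

Opening statement: present your side's claims and evidence to lay the foundation for the debate.
Rebuttal: mainly refute the opponent's arguments and further strengthen your own position.
Closing statement: respond to the opponent's rebuttals and summarize your side's speeches.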
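The three phases above can be sketched as a simple turn loop. This is an illustrative harness only: the `Agent` signature, the side names, and the assumption that the affirmative side speaks first in every phase are choices of this example, not the organizers' actual interface. The stub `echo_agent` stands in for an LLM call.

```python
from typing import Callable, List, Tuple

# One turn in the transcript: (side, phase, text).
Turn = Tuple[str, str, str]
# Hypothetical agent interface: (topic, side, phase, history) -> speech.
Agent = Callable[[str, str, str, List[Turn]], str]

PHASES = ["opening statement", "rebuttal", "closing statement"]


def run_debate(topic: str, affirmative: Agent, negative: Agent) -> List[Turn]:
    """Run the three phases; assumes the affirmative speaks first in each."""
    history: List[Turn] = []
    for phase in PHASES:
        for side, agent in (("affirmative", affirmative), ("negative", negative)):
            # Each agent sees a copy of everything said so far.
            text = agent(topic, side, phase, list(history))
            history.append((side, phase, text))
    return history


def echo_agent(topic: str, side: str, phase: str, history: List[Turn]) -> str:
    """Stub agent; a real agent would prompt an LLM with the history."""
    return f"[{side}] {phase} on '{topic}' after {len(history)} turns"


if __name__ == "__main__":
    for side, phase, text in run_debate("Example topic", echo_agent, echo_agent):
        print(text)
```

Passing the full history to each call lets a real agent quote and answer the opponent's earlier speeches in the rebuttal and closing phases.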

Evaluation metric

Debatrix and human evaluation

References

https://eval.ai/challenge/1449/leaderboard/3606
Jian Yuan, Liying Cheng, Ruidan He, Yinzi Li, Lidong Bing, Zhongyu Wei, Qin Liu, Chenhui Shen, Shuonan Zhang, Changlong Sun, Luo Si, Changjian Jiang and Xuanjing Huang. Overview of Argumentative Text Understanding for AI Debater Challenge. NLPCC 2021.
Lu Ji, Zhongyu Wei, Xiangkun Hu, Yang Liu, Qi Zhang and Xuanjing Huang. Incorporating argument-level interactions for persuasion comments evaluation using co-attention model. COLING 2018.
Lu Ji, Zhongyu Wei, Jing Li, Qi Zhang and Xuanjing Huang. Discrete Argument Representation Learning for Interactive Argument Pair Identification. NAACL 2021.
Jian Yuan, Zhongyu Wei, Donghua Zhao, Qi Zhang and Changjian Jiang. Leveraging Argumentation Knowledge Graph for Interactive Argument Pair Identification. ACL 2021 findings.
Xinyu Hua, Zhe Hu, and Lu Wang. Argument Generation with Retrieval, Planning, and Realization. ACL 2019.
Milad Alshomary, Shahbaz Syed, Arkajit Dhar, Martin Potthast, and Henning Wachsmuth. Counter-Argument Generation by Attacking Weak Premises. ACL 2021 findings.
Liying Cheng, Lidong Bing, Ruidan He, Qian Yu, Yan Zhang, and Luo Si. IAM: A Comprehensive and Large-Scale Dataset for Integrated Argument Mining Tasks. ACL 2022.
Jiayu Lin, Rong Ye, Meng Han, Qi Zhang, Ruofei Lai, Xinyu Zhang, Zhao Cao, Xuanjing Huang, and Zhongyu Wei. Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation. EMNLP 2023.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL 2023.
Jingcong Liang, Rong Ye, Meng Han, Ruofei Lai, Xinyu Zhang, Xuanjing Huang, Zhongyu Wei. Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM. arXiv:2403.08010.
Slonim, N., Bilu, Y., Alzate, C. et al. An autonomous debating system. Nature 591, 379–384 (2021).

For questions, please email the evaluation organizing committee: disclab@fudan.edu.cn. Evaluation website: http://www.fudan-disc.com/sharedtask/AIDebater24/index.html