关于校园图书馆网站建设,地推平台去哪里找,网络运维工作总结,网件路由器app本博客系博主个人理解和整理所得#xff0c;包含内容无法详尽#xff0c;如有补充#xff0c;欢迎讨论。
这里只提供数据集相关介绍和来源出处#xff0c;或者下载地址等#xff0c;因版权原因不提供数据集所含的元数据。如有需要#xff0c;请自行下载。
“Complete d…本博客系博主个人理解和整理所得包含内容无法详尽如有补充欢迎讨论。
这里只提供数据集相关介绍和来源出处或者下载地址等因版权原因不提供数据集所含的元数据。如有需要请自行下载。
“Complete dataset cannot be distributed because of Twitter privacy policies and news publisher copy rights. Social engagements and user information are not disclosed because of Twitter Policy.” 0. 一些数据集合集及相关综述
Fake News Detection | Papers With Code
GitHub - ICTMCG/fake-news-detection: This repo is a collection of AWESOME things about fake news detection, including papers, code, etc.
论文Combating Fake News: A Survey on Identification and Mitigation Techniques
论文TACL2022 A Survey on Automated Fact-Checking
论文Fake news detection: A hybrid CNN-RNN based deep learning approach
论文The Surprising Performance of Simple Baselines for Misinformation Detection该论文的github项目地址GitHub - ComplexData-MILA/misinfo-baselines有本文实验所用的数据集。 一、多模态数据集
1. ExFaux数据集 [23] 2020年 包含263张image和若干text有social metadata和comments英文来自Twitter和Reddit。本文的任务除了判断一个post的真假之外还需要判断是多模态内容中的image or claim导致的虚假这个是别的方法没有考虑过的也因为数据集没有提供这个标签论文中没有提供数据集的下载地址和源码从网上也没有搜索到
2. FauxBuster数据集 [24]2018年
包含917个image及text有social metadata和comments英文来自Twitter和Reddit网站。
3. MR2数据集 [25] 包含14700个image及text类别是rumornon-rumorunverified有social metadata和comments还有从网站上检索的text和image证据中文来自weibo英文来自Twitter。数据集下载地址在谷歌云上weibo有24Gtwitter 有22GGitHub - THU-BPM/MR2 4. MuMiN数据集 [26]2022年 包含984张image和text有social metadata和comments可以构成tweet之间的graph很多语言来自Twitter因为来自twitter根据twitter的规则作者只提供了可以用来从twitter下载所有数据的爬虫代码没有提供数据文件。数据需要大家自行下载。数据集下载地址MuMiN - A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset
5. Weibo数据集 [27]来自att-RNN
包含9528张image和text有social metadata是字段型的social info比如转发数用户名用户的发推数目等没有retweet和comments中文来自Weibo
6. COSMOS数据集 [36] 包含image和text是为了检测out-of-context也就是图文不符的情况。Training (160 K images), Validation (40 K images) and Test (1700 images)。没有social info每一张图片还给出了最多10个bounding box数据集地址https://github.com/shivangi-aneja/COSMOS需要先填一个表格然后作者会给数据集的下载方式。数据集主页COSMOS Dataset — COSMOS: Catching Out-of-Context Misinformation using Self-Supervised Learning 1.0 documentation 7. MM-COVID 数据集[37] 包含imagetextsocial info等内容covid-19相关3981个虚假内容和 7192 个真实内容包含 English, Spanish, Portuguese, Hindi, French and Italian, 6 种不同语言数据集地址https://github.com/bigheiniu/MM-COVID还有一个地址GitHub - bigheiniu/MM-COVID: Cross Linugual COVID-19 Fake News Dataset上述地址中只提供了news和相关tweet的id在谷歌云上。因为来自twitter所以数据集中只提供了tweet的id而tweet的text内容需要大家自行下载数据集中提供了下载的代码。 8. FakeNewsNet [9] 数据集 包含两部分GossipCopPolitiFact是否包含BuzzFeed下载地址如下。作者提供了爬虫代码可以直接运行得到所需的数据。需要twitter API
https://github.com/KaiDMML/FakeNewsNet
下面也是[9]官方提出的数据集但是数据集变成了BuzzFeedPolitiFact不知道为什么。这个数据集可以直接下载。
FakeNewsNet | Kaggle
下面是一个修改后的下载该数据集的项目能够以更快的速度从twitter上下载tweets
GitHub - SaschaStenger/FakeNewsNet_modified
下面这个论文的作者似乎能分享这个数据集的tweet和user元数据。给下面的作者发邮件能得到谷歌云上的数据
GitHub - hwang219/AttackFakeNews
使用该数据集的论文[7]SAFEicassp的知识蒸馏论文spotfake or plus 9.Fakeddit数据集 [21] 有images 数据集的下载地址
GitHub - entitize/Fakeddit: r/Fakeddit New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
GitHub - faiazrahman/Multimodal-Fake-News-Detection: Multi-Modal Fine-Grained Fake News Detection with Dialogue Summarization 10.PHEME数据集[15]; 这个数据集包含twitter完整的post信息包含pic_url项可以自己去爬对应的images。但是很多post该项为空。有个人[18]爬了这个数据集的imagesGitHub - drivsaf/MFAN使用该数据集的文章ICMR-2020-KMGCN, [18], HMCAN 11. ReCOVery数据集 [38] 包含textimage时间信息传播信息textual, visual, temporal, and network information用了三个关键词检索相关articleSARS-CoV-2,COVID-19, Coronavirus爬取每一个news的News Content具体包括12个IdURL发布者发布时间作者标题和正文图片国家政治偏向真假情况NewsGuard score and MBFC factual reporting同时作者还爬取了news在tweet上的传播情况包括许多详细信息the corresponding tweets with detailed information such as their IDs, text, languages of text, times of being created, statistics on retweeted/replied/liked。但是由于twitter的内容限制数据集中没有提供这些tweet的信息只提供了相关的tweet id需要大家自行下载。下载的指导在数据集中有。To comply with Twitter’s Terms of Service, we only publicly release the IDs of the collected data for non-commercial research use, but provide the instructions for obtaining the tweets using the released IDs for user convenience.数据集下载地址http://coronavirus-fakenews.com作者在上述地址中提供了news的url正文textimage-url没有image文件。但是network信息只提供了tweet id需要自己爬tweet的内容。这个image和tweet都需要别人帮忙爬上面那个地址可能打不开还有一个GitHub - apurvamulay/ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research 12.CHECKED数据集来自 [39] 包含textimage还有commentrepost等social info中文来自weibo关于covid-19的数据集这个数据集的地址
https://github.com/cyang03/CHECKED
上面的地址中的数据集提供了所有数据的json文件有comments和reposts这些都有id不知道能不能构成graph。这些东西的text文本都有。其中包含 pic_url 和 video_url而没有image文件。需要大家自己根据image url去下载image和video。 二、纯文本数据集
1. CW-CURE 数据集
来自[30]包含3266条claim text医疗领域的只有text没有social info没有image。英文来自Twitter
2. BioClaims 数据集
来自[31],related to COVID-19, measles, cystic fibrosis, and depression好像只有text英文来自Twitter
4. LIAR [22] 数据集
数据来源是从politifact网站上爬的 数据集下载地址
https://www.cs.ucsb.edu/~william/data/liar_dataset.zip 5. 又一个weibo数据集 [19]在[16]的基础上数据量增加了一倍。 该文章的代码
GitHub - thunlp/CED: source code for TKDE paper “CED: Credible Early Detection of Social Media Rumors”
该数据集所在地址及规模大小
GitHub - thunlp/Chinese_Rumor_Dataset: 中文谣言数据 使用该数据集的文章有[18]; 6. CoAID 数据集 [8] 包含 5,216 条新闻, 296,752 related user engagements, 958 social platform posts about COVID-19, and ground truth labels.数据集地址如下GitHub - cuilimeng/CoAID但是github项目里面只给了news的https以及跟news相关的tweet的id以及retweet的id。所以数据其实还需要自己爬。而且github中没有提供爬虫代码。这个需要别人帮忙有个论文说这个数据集包含image应该是在爬取tweet内容的时候可以把其image也一起爬了。但是本数据集在提出的时候并没有处理image。使用该数据集的论文[7] 7. MC-Fake 数据集[40] 纯文本信息针对news有 tweets, retweets, replies, retweet relations, and replying relations可以构建post的propagation network27,155 news events, 5 million posts, 2 million users and an induced user social graph with 0.2 billion edges.数据集中有五种主题 five topics: Politics, Entertainment, Health, Covid-19 and Syria War,来自Twitter又没有内容需要自己爬诶人生怎么这么艰难 8. 又有名字一样的Weibo和Twitter数据集来自[16] 下面是原始论文提供的地址失效了
http://alt.qcri.org/⇠wgao/data/rumdect.zip
新的数据集下载地址如下
https://www.dropbox.com/s/46r50ctrfa0ur1o/rumdect.zip?dl0
推特数据集没有提供content因为twitter的版权问题只能自己根据id用Twitter API去爬。weibo提供了post的详细content。但是本数据集没有提供source images虽然content里面有post附带的picture的url信息而且大部分post的picture项为null。 该数据集规模大小下面的图片来自[19]. 使用该数据集的文章[19], [2]; 三、事实核查数据集
1. FEVEROUS数据集 [28]; 包含需要验证的claim证据是wikipedia上的pages中的sentence或者tables其他数据集很少有用tables的数据集下载地址https://fever.ai/dataset/feverous.html 2. CHEF数据集 [29]NAACL 2022年; 中文包含完整的text内容和对应的social metadata(e.g. author, domain, URL publication date).数据集下载地址https://github.com/THU-BPM/CHEF 3. MultiFC数据集 [32], EMNLP 2019年; 有36534条claimtextual sources and rich metadata, and labelled for veracity by human expert journalists. 但是metadata如上图是独立特征的不是retweet英文爬取的数据来自26个事实核查网站非人工创造的claimmulti指的是claim来自多个domaine.g., politics, sports, and entertainment。而不是多模态。
4.FEVER 数据集 [1]
dataset seeks to retrieve supporting evidence for single-sentence claims and classify the claims as Supported, Refuted or NotEnoughInfo.
使用该数据集的论文 三、其他还在分析和整理中没有分类
1. 又一个PHEME [33] 2.PolitiFact5 is a website that manually assigns factcheck label to claims, along with the background information.
PolitiFact
使用该数据集的论文[7] 3.Zlatkova et al. (2019) propose a dataset for fact-checking claims about images [3]. 4.TabFact (Chen et al., 2019) presents semi-structural tables for fact verification [4]. 5.The SemEval-2020 shared task (Da San Martino et al., 2020) centers on the detection of propaganda techniques in news articles, which is more linguistically oriented [5]. 6.infosurgeon [6]通过修改news的KG或者ARM graph生成的虚假新闻 7. Constraint 数据集[34];
跟covid-19有关的数据 9.All-in-one: Multi-task Learning for Rumour Verifification. 10. 文章[20]提出了两个数据集Snopes和Politifact有images。
we split the full dataset into two sub datasets called Snopes and Politifact datasets. The former one contains pairs where FC-articles are from snopes.com and the later one contains pairs where FC-articles are from politifact.com. 这是该文章的code和dataset地址还没下载因为在google云上
https://github.com/nguyenvo09/EMNLP2020 11.ANTiVax 数据集[35]
跟covid-19有关来自twitter 12. FakeNews AMT [12]提出了该数据集
使用该数据集的论文[10] [11] 13. Celeb [12]提出了该数据集
使用该数据集的论文[10] [11] 14 PHEME 9 出自[13]
使用该数据集的论文[10] 16.Sentimental-LIAR [14]提出的对LIAR添加了情感信息和emotions。 21.又一个twitter数据集来自[17]
https://www.dropbox.com/s/7ewzdrbelpmrnxu/rumdetect2017.zip?dl0
由于twitter的限制该数据集只提供了source tweet ID 及 source tweet content还有传播树结构。 其他post相关内容需要自己根据id去爬。
这个好像就是text研究中用的很多的twitter15和twitter16没有image内容。 参考文献
[1] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. Fever: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 809–819.
[2] F. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan, “A convolutional approach for misinformation identification,” in Proceedings of IJCAI, 2017.
[3] Dimitrina Zlatkova, Preslav Nakov, and Ivan Koychev. 2019. Fact-checking meets fauxtography: Verifying claims about images. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China. Association for Computational Linguistics.
[4] Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. Tabfact: A largescale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164.
[5] Giovanni Da San Martino, Alberto Barron-Cedeno, ´Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020. Semeval-2020 task 11: Detection of ropaganda techniques in news articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1377–1414.
[6] InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection
[7] Embracing Domain Differences in Fake News: Cross-domain Fake News Detection
using Multimodal Data,AAAI 2021.
[8] Limeng Cui and Dongwon Lee. 2020. CoAID: Covid-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885
[9] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020a. Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big data, 8(3):171–188.
[10] An Emotion-Based Multi-Task Approach to Fake News Detection (Student Abstract)
[11] Saikh, T.; De, A.; Ekbal, A.; and Bhattacharyya, P. 2020. A Deep Learning Approach for Automatic Detection of Fake News. arXiv:2005.04938.
[12] Perez-Rosas, V.; Kleinberg, B.; Lefevre, A.; and Mihalcea, R. ´ 2018. Automatic Detection of Fake News. In Proceedings of the 27th International Conference on Computational Linguistics, 3391–3401. Santa Fe, New Mexico, USA: ACL.
[13] Zubiaga, A.; Liakata, M.; and Procter, R. 2016. Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media. arXiv:1610.07363.
[14] Upadhayay, B.; and Behzadan, V. 2020. Sentimental LIAR: Extended Corpus and Deep Learning Models for Fake Claim Classification. In 2020 IEEE International Conference on Intelligence and Security Informatics (ISI), 1–6. IEEE.
[15] Arkaitz Zubiaga, Maria Liakata, and Rob Procter. 2017. Exploiting context for rumour detection in social media. In International Conference on Social Informatics. Springer, 109–123.
[16] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J Jansen, Kam-Fai Wong, and Meeyoung Cha. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of IJCAI 2016.
[17] Jing Ma, Wei Gao, Kam-Fai Wong. Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning. ACL 2017.
[18] IJCAI 2022, MFAN: Multi-modal Feature-enhanced Attention Networks for Rumor Detection;
[19] Changhe Song, Cheng Yang, Huimin Chen, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. Ced: Credible early detection of social media rumors. IEEE Transactions on Knowledge and Data Engineering, 33(8):3035–3047, 2019
[20] Nguyen Vo and Kyumin Lee. 2020. Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7717–7731, Online. Association for Computational Linguistics.
[21] Kai Nakamura, Sharon Levy, and William Yang Wang. 2020. Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6149–6157, Marseille, France. European Language Resources Association.
[22] ACL 2017“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection
[23] Ziyi Kou, Daniel Yue Zhang, Lanyu Shang, and Dong Wang. 2020. ExFaux: A Weakly Supervised Approach to Explainable Fauxtography Detection. In 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020, Xintao Wu, Chris Jermaine, Li Xiong, Xiaohua Hu, Olivera Kotevska, Siyuan Lu, Weija Xu, Srinivas Aluru, Chengxiang Zhai, Eyhab AlMasri, Zhiyuan Chen, and Jeff Saltz (Eds.). IEEE, 631–636. https://doi.org/10. 1109/BigData50022.2020.9378019
[24] Daniel Yue Zhang, Lanyu Shang, Biao Geng, Shuyue Lai, Ke Li, Hongmin Zhu, Md. Tanvir Al Amin, and Dong Wang. 2018. FauxBuster: A Content-free Fauxtography Detector Using Social Media Comments. In IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10-13, 2018, Naoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen K. Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, and Jeffrey S. Saltz (Eds.). IEEE, 891–900. https://doi.org/10.1109/BigData.2018.8622344
[25] SIGIR 2023, MR2: A Benchmark for Multimodal Retrieval-Augmented Rumor Detection in Social Media
[26] Dan Saattrup Nielsen and Ryan McConville. 2022. MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, Enrique Amigó, Pablo Castells, Julio Gonzalo, Ben Carterette, J. Shane Culpepper, and Gabriella Kazai (Eds.). ACM, 3141–3153. https://doi.org/10.1145/3477495.3531744
[27] Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. 2017. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM international conference on Multimedia. 795–816
[28] Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, Joaquin Vanschoren and Sai-Kit Yeung (Eds.). https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ 68d30a9594728bc39aa24be94b319d21-Abstract-round1.html
[29] Xuming Hu, Zhijiang Guo, Guanyu Wu, Aiwei Liu, Lijie Wen, and Philip S. Yu. 2022. CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (Eds.). Association for Computational Linguistics, 3362–3376. https://doi.org/10.18653/v1/2022.naacl-main.246
[30] Identifying Checkworthy CURE Claims on Twiter, www 2023;
[31] Amelie Wührl and Roman Klinger. 2021. Claim Detection in Biomedical Twitter Posts. In Proceedings of the 20th Workshop on Biomedical Language Processing. 131–142.
[32] MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, EMNLP 2019,
[33] Cody Buntain and Jennifer Golbeck. 2017. Automatically identifying fake news in popular twitter threads.In 2017 IEEE International Conference on Smart Cloud (SmartCloud), pages 208–215. IEEE.
[34] Parth Patwa, Shivam Sharma, Srinivas Pykl, Vineeth Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2021. Fighting an infodemic: Covid-19 fake news dataset. In International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, pages 21–29. Springer.
[35] Kadhim Hayawi, Sakib Shahriar, Mohamed Adel Serhani, Ikbal Taleb, and Sujith Samuel Mathew. 2022. Anti-vax: a novel twitter dataset for covid-19 vaccine misinformation detection. Public health, 203:23–30.
[36] COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning,
[37] Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A Multilingual and Multidimensional Data Repository for CombatingCOVID-19 Fake New. arXiv e-prints (2020), arXiv–2011.
[38] X. Zhou, A. Mulay, E. Ferrara, and R. Zafarani, “Recovery: A multimodal repository for covid-19 news credibility research,” 2020.
[39] Social Network Analysis and Mining (2021), CHECKED: Chinese COVID‑19 fake news dataset;
[40] Erxue Min, Yu Rong, Yatao Bian, Tingyang Xu, Peilin Zhao, Junzhou Huang, and Sophia Ananiadou. 2022. Divide-and-Conquer: Post-User Interaction Network for Fake News Detection on Social Media. In Proceedings of the ACM Web Conference 2022 (Virtual Event, Lyon, France) (WWW ’22). Association for Computing Machinery, New York, NY, USA, 1148–1158. https://doi.org/10.1145/3485447.3512163 总结
博主目前暂时收集和整理了这些数据集“其他”类别中的数据集还没有进行详细分析同时还有许多不完善的地方。
如有补充或者错漏敬请大家在评论区指正。
会持续更新对数据集进行详细介绍提供相应的数据集下载地址。