Large Language Models (LLMs) encode world knowledge through pre-training on massive datasets,[1] making them the backbone of knowledge extraction applications. However, their reliability degrades on long-tail knowledge: low-popularity knowledge that occurs infrequently in pre-training data.[2] Popularity is not neutral: pre-training datasets are predominantly web-crawled and, as such, are generalist, English-centric, and mostly produced over the past 30 years by Western, High-income, Educated, Liberal, Male-dominated (WHELM)[3] communities. This raises the risk of models underperforming on specialized domains, non-English languages, non-contemporary sources, and knowledge belonging to marginalized social groups. Retrieval-Augmented Generation (RAG) has been proposed as a mitigation, but the corpora used for retrieval may still be biased. Knowledge Graphs (KGs) provide a more transparent and deterministic alternative, yet open-domain KGs such as Wikidata exhibit coverage gaps along the same dimensions.[4] The X-TAIL workshop aims to advance research on extracting, exploiting, and ultimately preserving long-tail knowledge, blending the strengths of LLMs and KGs.
The previous edition drew strong audience engagement and insightful discussion on the challenges of dealing with long-tail knowledge. Papers were published in the Joint Proceedings of Posters, Demos, Workshops, and Tutorials of EKAW 2024. The invited talk, "What do Large Language Models know about the World?", was delivered by Jan-Christoph Kalo (University of Amsterdam). The invited talk notes, accepted papers, and slides can be accessed and downloaded from the previous edition's webpage.
[1] Petroni, Fabio, et al. "Language Models as Knowledge Bases?" In EMNLP-IJCNLP, Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/D19-1250
[2] Kandpal, Nikhil, et al. "Large Language Models Struggle to Learn Long-Tail Knowledge." Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA), ICML’23, vol. 202 (July 2023): 15696–707. https://dl.acm.org/doi/10.5555/3618408.3619049
[3] Daryani, Yalda, et al. "The Homogenizing Engine: AI’s Role in Standardizing Culture and the Path to Policy." Policy Insights from the Behavioral and Brain Sciences 13, no. 1 (2026): 14–27. https://doi.org/10.1177/23727322251406591
[4] Kraft, Angelie, and Soulier, Eloïse. "Knowledge-Enhanced Language Models Are Not Bias-Proof: Situated Knowledge and Epistemic Injustice in AI." Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (New York, NY, USA), FAccT ’24, June 5, 2024, 1433–45. https://doi.org/10.1145/3630106.3658981