1. ABOUT THE DATASET
--------------------

Title:

Creator(s): [Fangjun Li [1], David Hogg [2], Anthony Cohn [3]]

Organisation(s): [1. University of Leeds. 2. University of Leeds. 3. University of Leeds, Alan Turing Institute.]

Rights-holder(s): Unless otherwise stated, Copyright 2023 University of Leeds

Publication Year:

Description: [This dataset, associated with the AAAI-24 paper "Advancing Spatial Reasoning in Large Language Models: An In-depth Evaluation and Enhancement Using the StepGame Benchmark", aims to improve the evaluation of spatial reasoning in language models. It rectifies template errors in the StepGame benchmark and provides a refined version in the bAbI format, supporting more accurate evaluation of language models' capabilities on spatial reasoning tasks.]

Cite as: [Fangjun Li, David Hogg and Anthony Cohn (2024): SpatialLM-StepGame. [Dataset]. https://doi.org/10.5518/1468]

Related publication: [Fangjun Li, David Hogg, and Anthony Cohn, "Advancing Spatial Reasoning in Large Language Models: An In-depth Evaluation and Enhancement Using the StepGame Benchmark", Proceedings of the AAAI Conference on Artificial Intelligence, 2024 (Accepted).]

Contact: [lifangjun95@gmail.com]


2. TERMS OF USE
---------------

[This dataset is licensed under the MIT license.]


3. PROJECT AND FUNDING INFORMATION
----------------------------------

[This work has been partially supported by the Microsoft Research Accelerating Foundation Models Research program, with the provision of Azure resources to access GPT. This work was also partially supported by the Turing’s Defence and Security programme through a partnership with the UK government in accordance with the framework agreement between GCHQ and The Alan Turing Institute.]


4. CONTENTS
-----------

File listing

[Within the 'data' folder of the repository, there are two subdirectories: 'correct_clean' and 'correct_noise'. Each subdirectory contains a set of '.txt' files formatted according to the bAbI dataset format.
The files are named using the pattern 'qaK_type.txt', where 'K' is the K-hop reasoning level, ranging from 1 to 10, and 'type' indicates the file's purpose: testing ('test'), training ('train'), or validation ('valid').

The .txt files are structured plain text containing a series of examples. The number of examples depends on the file type: each test file contains 10,000 examples, each validation file contains 1,000 examples, and each training file contains 30,000 examples. Each example consists of sentences that describe a spatial configuration, together with related questions and answers. The lines within each example are numbered sequentially.]


5. METHODS
----------

[The dataset was generated by rectifying template errors in the StepGame benchmark (accessible at https://github.com/ZhengxiangShi/StepGame/tree/main). A template error was identified when the meaning conveyed by a sentence did not match the relationship intended when the stories and labels were created. The incorrect sentence templates are listed in our associated paper.

To clean the StepGame data, a Python script named 'correct.py' was used. This script produces the 'correct_clean' and 'correct_noise' data samples by removing examples that contain sentences with incorrect templates. The process is as follows:

(1) Download the StepGame data in bAbI format, which is accessible at https://github.com/ZhengxiangShi/StepGame/tree/main/Code/babi_format.

(2) Run the Python script 'correct.py' to perform the cleaning.

The cleaning removes the identified template errors so that the dataset accurately reflects the intended relationships, resulting in a refined dataset suitable for evaluation and analysis.]
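
As an illustration of the file layout and the cleaning step described above, the following is a minimal Python sketch. It is not the 'correct.py' script, whose implementation is not reproduced here. It assumes the usual bAbI conventions: every line starts with an index followed by a space, the index restarts at 1 for each new example, and a question line holds the question, the answer, and optional supporting-fact indices separated by tabs. The helper names ('babi_path', 'parse_babi', 'drop_flagged'), the sample sentences, and the flagged phrase "is somewhere near" are all invented for illustration; the actual incorrect templates are those listed in the associated paper.

```python
from typing import Iterable, List, NamedTuple


class Example(NamedTuple):
    story: List[str]  # context sentences, in file order
    question: str
    answer: str


def babi_path(k: int, split: str, subset: str = "correct_clean") -> str:
    """Build a path following the 'qaK_type.txt' naming pattern."""
    return f"data/{subset}/qa{k}_{split}.txt"


def parse_babi(lines: Iterable[str]) -> List[Example]:
    """Split bAbI-style lines into examples (assumed layout, see above)."""
    examples: List[Example] = []
    story: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        index, _, text = line.partition(" ")
        if int(index) == 1:
            story = []  # index 1 marks the start of a new example
        if "\t" in text:
            # Question line: question <TAB> answer [<TAB> supporting facts]
            fields = text.split("\t")
            examples.append(
                Example(list(story), fields[0].strip(), fields[1].strip())
            )
        else:
            story.append(text)
    return examples


def drop_flagged(examples: List[Example], flagged: List[str]) -> List[Example]:
    """Keep only examples whose story contains none of the flagged phrases."""
    return [
        ex for ex in examples
        if not any(phrase in s for s in ex.story for phrase in flagged)
    ]


# Invented sample data; the flagged phrase is a placeholder, not a real template.
sample = [
    "1 A is to the left of B.",
    "2 What is the relation of the agent A to the agent B?\tleft\t1",
    "1 C is somewhere near D.",
    "2 What is the relation of the agent C to the agent D?\tabove\t1",
]
examples = parse_babi(sample)
kept = drop_flagged(examples, ["is somewhere near"])
```

To apply the same parsing to a released file, the lines of, for example, babi_path(3, "test") can be passed to parse_babi in place of the sample list. Removing whole examples rather than rewriting individual sentences mirrors the approach described in the Methods section.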