1. ABOUT THE DATASET
--------------------

Title:	Named Experts in the Getty Provenance Index - German Sales

Creator(s): Mathew Henrickson

Organisation(s):  University of Leeds 

Rights-holder(s): Unless otherwise stated, Copyright 2025 University of Leeds.
The original data were released by the Getty Research Institute as part of the Getty Provenance Index raw data release in 2021:
Getty Research Institute®, CSV exports of the Getty Provenance Index® (June 1, 2022), https://github.com/thegetty/provenance-index-csv/releases/tag/v3

Publication Year: 2025

Description: The dataset contains records from the Getty Provenance Index - German Sales (1900-1945) art market auction archive and facilitates research of named experts in the art market in Germany. 

Cite as: Henrickson, Mathew (2025) Dataset for 'Named Experts in the Getty Provenance Index – German Sales'. University of Leeds. https://doi.org/10.5518/1781


2. TERMS OF USE
---------------


3. PROJECT AND FUNDING INFORMATION
----------------------------------

Title: A Computational Linguistic Analysis of the Language in the Getty Provenance Index - German Sales (1900-1945)

Dates: May 2023 - June 2026

Funding organisation: NA

Grant no.: NA


4. CONTENTS
-----------
File listing

named_experts_getty_prov_index_german_sales.json

The dataset was developed by combining SQL keyword screening with LLM-based data mining. It records the names and institutional titles of art experts who are cited as authenticating works by artists in the unstructured text of the Getty Provenance Index - German Sales.

Data Dictionary: Named Experts in the Getty Provenance Index – German Sales

{
  "record_id": {
    "type": "integer",
    "description": "Unique identifier for the auction record in the internal research database."
  },
  "sale_date": {
    "type": "string",
    "format": "date",
    "description": "Date of the auction sale (YYYY-MM-DD)."
  },
  "artist_name_1": {
    "type": "string",
    "description": "Name of the primary artist associated with the auctioned object."
  },
  "object_type": {
    "type": "string",
    "description": "Type of object being auctioned (e.g., painting, sculpture)."
  },
  "auction_house_1": {
    "type": "string",
    "description": "Name of the auction house responsible for the sale."
  },
  "gpi_auction_entry": {
    "type": "string",
    "description": "Full text of the auction entry as recorded in the Getty Provenance Index."
  },
  "extracted_text": {
    "type": "string",
    "description": "Extracted snippet from the auction entry that includes reference to an expert or expert opinion."
  },
  "expert_names": {
    "type": "array",
    "items": { "type": "string" },
    "description": "List of expert surnames identified in the auction entry."
  },
  "titles_in_text": {
    "type": "array",
    "items": { "type": "string" },
    "description": "List of honorifics or titles (e.g., 'Geh.-Rat') associated with experts in the text."
  },
  "heidelberg_url": {
    "type": "string",
    "format": "uri",
    "description": "URL linking to the digitized auction catalogue page on the Heidelberg University Library website."
  }
}
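
Reading the data (illustrative sketch): the dictionary above can be turned into a simple type check when loading the file. The Python sketch below is not part of the dataset or of the project's own scripts. It assumes that the JSON file holds a top-level array of record objects and that empty values may be stored as null; neither point is stated in this readme. The helper names check_record and load_records are hypothetical.

```python
import json

# Expected Python type for each field listed in the data dictionary.
EXPECTED_TYPES = {
    "record_id": int,
    "sale_date": str,
    "artist_name_1": str,
    "object_type": str,
    "auction_house_1": str,
    "gpi_auction_entry": str,
    "extracted_text": str,
    "expert_names": list,
    "titles_in_text": list,
    "heidelberg_url": str,
}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        # Assumption: a null (None) value is tolerated for any field.
        elif record[field] is not None and not isinstance(record[field], expected):
            problems.append(f"unexpected type for {field}")
    return problems


def load_records(path):
    """Load the dataset, assuming a top-level JSON array of objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, calling load_records("named_experts_getty_prov_index_german_sales.json") and then check_record on each item would list any fields that are missing or of an unexpected type.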

An Excel (XLSX) version of the data is also available.

5. METHODS
----------


The dataset was created by extracting auction records from the Getty Provenance Index – German Sales (1900–1945) using SQL queries. Records were filtered for references to expert opinions, identified through keywords such as academic and honorific titles (e.g., Dr., Prof., Hofrat) and terms such as Gutachten (expert opinion). From each matching entry, a 150-character snippet was extracted from the latter half of the auction text and stored in a new extracted_text field.

These snippets were then processed with the open-source large language model Qwen3-8B to extract clean expert names and titles. The resulting structured data were exported in JSON and CSV/XLSX formats. Quality control included manual review and automated checks using Python scripts. The dataset was produced as part of a PhD project at the University of Leeds.
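
Keyword screening and snippet extraction (illustrative sketch): the first two steps described above can be approximated as follows. This is not the project's actual query. It uses an in-memory SQLite database with a hypothetical table named entries, matches only the four example keywords quoted above as case-sensitive substrings, and assumes that the snippet starts at the midpoint of the entry text. The readme states only that the snippet comes from the latter half of the text, so the exact starting offset is a guess.

```python
import sqlite3

# Example keywords quoted in the methods text; the full list is not given here.
KEYWORDS = ["Dr.", "Prof.", "Hofrat", "Gutachten"]


def screen_entries(rows, keywords=KEYWORDS, snippet_len=150):
    """Filter (record_id, entry_text) rows by keyword and return snippets.

    Returns a list of (record_id, snippet) tuples ordered by record_id.
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, used only for this sketch.
    conn.execute(
        "CREATE TABLE entries (record_id INTEGER, gpi_auction_entry TEXT)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)

    # One case-sensitive substring test per keyword, joined with OR.
    where = " OR ".join("instr(gpi_auction_entry, ?) > 0" for _ in keywords)
    query = (
        "SELECT record_id, "
        # Assumption: the snippet begins at the midpoint of the entry text.
        "substr(gpi_auction_entry, length(gpi_auction_entry) / 2 + 1, ?) "
        f"FROM entries WHERE {where} ORDER BY record_id"
    )
    result = conn.execute(query, [snippet_len, *keywords]).fetchall()
    conn.close()
    return result
```

Entries with no keyword match are dropped, and each remaining entry yields at most snippet_len characters, which would then be passed to the language model for name and title extraction.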