Track 1 (MSA 3.390 – cap: 20) Machine Learning to Read Yesterday’s News. How semantic enrichments enhance the study of digitised historical newspapers
Marten During, Estelle Bunout and Lars Wieneke
Newspapers count among the most attractive sources for historical research. Following mass digitisation efforts over the past decades, researchers now face an overabundance of materials that can no longer be managed with keyword search and basic content filtering alone, even though only a fraction of the overall archival record has actually been made available. This poses challenges for the contextualisation and critical assessment of these sources, which can be effectively addressed with semantic enrichments produced by natural language processing techniques.
In this workshop we will discuss how these challenges can be addressed with respect to the idiosyncrasies of semantically enriched historical newspapers. We will cover epistemological problems as well as opportunities for data-driven exploration in terms of source criticism and content exploration, based on the impresso interface.
Track 2 (MSA 4.330 – cap: 20) Creating Data and Workflows for Humanities Research
Lorella Viola and Sean Takats
Thanks to funding from the Luxembourg National Research Fund (FNR), we are happy to announce that for our workshop participants we will be able to cover the conference fee (onsite only) and 1 night accommodation. For further information, please contact the workshop organisers Lorella Viola (email@example.com) and Sean Takats (firstname.lastname@example.org).
The Digital History Advanced Research Projects Accelerator (DHARPA) at the Centre for Contemporary and Digital History (C²DH) proposes a workshop to explore humanities research data and workflows. During a two-hour session Lorella Viola and Sean Takats will guide participants through a structured discussion of the underlying sources and methods participants use in their own research. We will ask potential participants to complete a brief questionnaire prior to the conference in order to promote a mixed group that includes researchers from a range of career stages and disciplines. Workshop participants will delve into the challenges of data creation (e.g. the transformations from artifact to source to computationally-accessible data) and reflect critically on their research practices, considering them as potentially modular workflows. Workshop organizers will gather valuable feedback in the ongoing development of their new open-source data orchestration software kiara, and invite participants to join further discussions and beta testing, should they so desire.
Track 3 (MSH DHLab – cap: 20) Automatically extract text, layout and metadata information from XML-files of OCR-ed historical texts (full day)
In the domain of digital humanities, researchers are often interested in analyzing large amounts of historical texts. Most of these texts are digitized with Optical Character Recognition (OCR) software, and some of them are manually corrected or enriched. Digital heritage institutions often store these texts in a variety of Extensible Markup Language (XML) formats. For most types of analysis, the plain text first needs to be extracted from these XML files. XML files can also contain important information regarding reading order, style, layout, recognition confidence metrics and the OCR software that was used. Furthermore, XML files can contain metadata about full issues of, for example, newspapers. These metadata files contain information such as the title of the paper, name of the publisher, date of publication, and type of text (e.g. article, advertisement or image). Researchers can use this information to select specific texts from large amounts of data based on these characteristics.
To successfully use the data stored within XML files, specific knowledge is needed. Participants of this workshop will get hands-on experience and guidance while learning how to extract relevant information from these various types of XML formats with the use of Jupyter Notebooks. Furthermore, they learn how to restructure this information and how to process the extracted data for future use.
In this workshop, we will work with the XML formats as commonly provided by the KB, the national library of the Netherlands. We will work with the formats ALTO, TEI and PAGE to learn how to extract plain text, metadata and additional information regarding these texts. We will also work with the Didl format to see how we can use metadata to select specific newspaper articles. The participants will get access to ready-to-go Python scripts which can be re-used after the workshop.
This workshop will address the following questions:
What are XML files and how are they structured?
What types of information can be stored in XML files?
Which Python packages can be used to process XML files?
How can we automatically select and extract relevant information?
How can we automatically restructure the information into a more readable format?
How can we automate this selection and extraction process for batches of XML files?
How can we store the extracted information into other file formats for future research?
The data for this workshop will be provided by the KB. The data will contain XML files with metadata of newspaper issues, and digitized texts of newspapers, books and periodicals in various XML formats.
The target audience for this workshop is textual scholars, digital humanists and other students or researchers interested in working with textual data stored in XML files. Although the Jupyter Notebooks used in the workshop are self-explanatory, basic knowledge of Python is a plus. Experience with XML files is not necessary.
To be able to attend this workshop, participants need to have an instance of Python 3 and Jupyter Notebooks installed on their laptop. To be able to follow the instructions of the workshop, we advise participants to use Anaconda to install these requirements. Furthermore, the participants need to bring their own laptop to the workshop.
First part – theoretical background
Depending on the skill level of the participants, we will start with a short diversion into Python and Jupyter Notebooks, which will be used in the practical second part of the workshop.
Then, the workshop will concentrate on the theoretical background of XML files. We will explain the rationale behind XML files and how they are structured. We will unravel the XML tree with its root, elements, and attributes using various real-life examples. We will also talk about the importance of namespaces. Through these explorations, the differences and similarities in use and function of different styles of XML files/structures will be explored. We will compare ALTO, TEI, PAGE and Didl formats.
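As a minimal sketch of what the root, elements, attributes and namespaces look like in practice, the example below parses a small XML document with Python's standard ElementTree module. The sample document, its namespace URI and its element names are invented for illustration and do not correspond to any of the KB formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: the namespace URI and the element
# names are invented for illustration only.
SAMPLE = """<doc:issue xmlns:doc="http://example.org/doc" date="1900-01-01">
  <doc:article id="a1" type="article">
    <doc:title>Example headline</doc:title>
  </doc:article>
  <doc:article id="a2" type="advertisement">
    <doc:title>Example advert</doc:title>
  </doc:article>
</doc:issue>"""

# Prefix-to-URI mapping used when searching. The prefix chosen here does not
# have to match the one in the file; only the URI has to match.
NS = {"doc": "http://example.org/doc"}


def summarise(xml_text):
    """Return the root tag, the root attributes and one dict per article."""
    root = ET.fromstring(xml_text)
    articles = []
    # Without the namespace mapping, findall("article") would find nothing.
    for article in root.findall("doc:article", NS):
        title = article.find("doc:title", NS)
        articles.append(
            {
                "id": article.get("id"),      # attribute lookup
                "type": article.get("type"),
                "title": title.text,          # text content of the element
            }
        )
    return root.tag, dict(root.attrib), articles


tag, attrs, articles = summarise(SAMPLE)
print(tag)    # ElementTree writes the namespace as {uri}localname
print(attrs)
for entry in articles:
    print(entry)
```

Note that ElementTree expands the prefix into the full namespace URI in every tag name, which is why searches that ignore the namespace silently return no results.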
After the theoretical background about the structure, we will demonstrate which information is stored in the various types of XML files and how they are stored in the XML tree. This information is important in order to be able to later on correctly extract the relevant information from these files.
Finally, we will delve into the different packages that can be used to explore XML files and extract information from them. We will introduce the packages lxml, ElementTree and Beautiful Soup. The pros and cons of each package introduced in this workshop will be explained and illustrated with examples.
Second part – practical session
During this second part of the workshop, we will dive more deeply into the practical execution of working with XML files in Jupyter Notebooks. On the basis of prepared Notebooks, participants will be guided through the different steps needed for processing XML-files.
First, participants are shown how to install the various packages and how to import them into their Notebooks. Then, we will use these packages to import XML files into the Notebooks and inspect their structure. This will be done using the various XML formats as provided by the KB.
With a few assignments, participants will then learn how the obtained information can be used to extract various types of information from the XML files. These assignments will include extracting plain text and, if applicable, reading order. Furthermore, we will extract various types of metadata. We will compare the methods and results of the different packages. To be able to use this information for further analyses, we will show various ways of restructuring the data in Jupyter Notebook and how to store it in different formats for future use (e.g. text files or comma-separated files).
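The extraction and storage steps could look roughly like the sketch below, which is not taken from the workshop notebooks. It assumes an ALTO-style file in which words are stored in the CONTENT attribute of String elements inside TextLine and TextBlock elements. The sample file is invented, the function names are hypothetical, and document order is used as a stand-in for reading order, which may not hold for every file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented ALTO-style sample; real files carry far more layout attributes.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine><String CONTENT="Old"/><SP/><String CONTENT="news"/></TextLine>
      <TextLine><String CONTENT="from"/><SP/><String CONTENT="Luxembourg"/></TextLine>
    </TextBlock>
    <TextBlock ID="TB2">
      <TextLine><String CONTENT="Second"/><SP/><String CONTENT="block"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' part so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_blocks(xml_text):
    """Return (block_id, text) pairs, one per TextBlock, in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        words = []
        for line in block:
            if local_name(line.tag) != "TextLine":
                continue
            for item in line:
                if local_name(item.tag) == "String":
                    words.append(item.get("CONTENT", ""))
        rows.append((block.get("ID"), " ".join(words)))
    return rows


def to_csv(rows):
    """Serialise the rows as comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["block_id", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(alto_blocks(ALTO_SAMPLE)))
```

Matching on the local element name rather than a fixed namespace is a design choice made here because ALTO files from different periods declare different namespace versions.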
Since research in digital humanities often relies on large amounts of data, the workshop will conclude with automating the previous steps. This makes it possible to select and extract information from large batches of XML files automatically, saving a lot of time compared to performing this task manually.
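One possible shape for such a batch step is sketched below, under the assumption that all input files sit in one folder and follow an ALTO-style layout with words in the CONTENT attribute of String elements. The function names and the folder names in the usage comment are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def alto_text(xml_path):
    """Return the words of one ALTO-style file in document order."""
    words = []
    for element in ET.parse(xml_path).iter():
        # Compare on the local name so the namespace version does not matter.
        if element.tag.rsplit("}", 1)[-1] == "String":
            words.append(element.get("CONTENT", ""))
    return " ".join(words)


def extract_batch(input_dir, output_dir):
    """Write one .txt file per .xml file and report what happened.

    Returns two lists: the names of the text files written and the names
    of the XML files that could not be parsed.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for xml_path in sorted(input_dir.glob("*.xml")):
        try:
            text = alto_text(xml_path)
        except ET.ParseError:
            # Keep going: one malformed file should not stop the whole batch.
            failed.append(xml_path.name)
            continue
        target = output_dir / (xml_path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target.name)
    return written, failed


# Example call with hypothetical folder names:
# written, failed = extract_batch("alto_files", "plain_text")
```

Collecting the names of unparseable files instead of stopping at the first error makes it easier to inspect problem files afterwards.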
Track 1 (MSA 2.220 – cap: 30) Greening Digital Humanities
James Baker, Jo Lindsay Walton, Lisa Otty, Christopher Ohge and Lea Beiermann
The digital is material; it also requires infrastructures many of which rely on fossil fuels. As digital humanists, every project we create, every software application we use, every piece of hardware we purchase impacts our environment. In this workshop we aim to surface the ecological impacts of our work while learning with and from our DH community about how to reduce harm to the environment and to the people most impacted by environmental injustices.
This half-day workshop, run under the auspices of the Digital Humanities Climate Coalition (DHCC), will include three activities.
First, facilitated brainstorming and rapid co-production will refine questions, challenges, and opportunities around climate change and the digital humanities that are relevant in your local contexts. This part of the workshop will draw on the workshop model developed for the November 2021 “Greening DH summit”. This model balances the knowledge that time is fleeting and there is an impetus to act, with an awareness that participants will have variable expertise regarding the climate crisis, the energy/resource costs of digital technologies, and ‘green computing’ practices.
Second, participants will be asked to respond to sections of the DHCC’s work-in-progress Greening DH Toolkit. Specifically, participants will work in small groups to evaluate prototype sections of the toolkit, and to design their own implementation strategies for these sections. In keeping with the “RE-MIX” theme of the conference, we will discuss how agencies, funding bodies, and institutions in the Benelux region can be leveraged to enable implementation, as well as the barriers they might create.
Third, participants will be asked to vocalise their next steps, the commitments they make to their future DH work, so as to create both an individual and collective impetus to act. Participants who would like to continue to collaborate on the Toolkit after the workshop will be invited to join the DHCC Toolkit Action Group. Throughout the workshop, the organisers will underscore that as humanities researchers it is our role to probe the values, the power structures, and the future imaginaries that underpin sustainable solutions. Moreover, given the immense and monopolistic power wielded by the global tech sector, and the critiques of this power that are part of the Digital Humanities, this community is well positioned to create change and demonstrate to our colleagues and collaborators how change can happen. Our use of technology and infrastructure should be informed by the ways corporate economic, cultural, and scientific power perpetuates and exacerbates the crisis. Choosing a hardware or hosting provider, for example, should mean considering direct environmental impacts, broader environmental policies and record of the provider, and more broadly still, the kinds of collective future that such a collaborative encounter presupposes. We should be able to candidly explore the complex and sometimes contradictory nature of our ecological impact: we should be able to measure and model where possible, while also creating context around our measurements, flagging uncertainties, and advocating for transforming wider conditions. This workshop then aims to create an encounter space where these issues, dynamics, and possibilities are introduced, shared, and acted upon.
Track 2 (MSA 4.330 – cap: 20) Writing multi-layered articles – the example of the Journal of Digital History
Frédéric Clavert and Elisabeth Guérard
During this workshop, we will explain and debate the concept of the ‘multi-layered article’, using the Journal of Digital History as an example. We will first give some historical insights on multi-layered books and articles. We will then show how the Journal of Digital History functions, including a demo of the JDH writing environment (Jupyter Notebook, link to a Zotero account, setting up a GitHub repository). A third part will focus on how to write and preview an article for the Journal. Throughout the workshop, we will alternate ‘hands-on’ interactive sessions and debate with the audience.
Track 4 (MSA 3.390 – cap: 20) How are we working with digitised newspapers?
About ten years ago, Robert B. Allen and his co-authors asked, “what to do with a million pages of digitized historical newspapers?” in the context of an already massive effort to digitize these materials. The paper laid out tools ranging from text extraction to image clustering to deal with the advertisements published in historical newspapers, a rich source that is arduous to exploit with the dominant text-based search tool, namely keyword search. Several initiatives have shown many of the proposed tools to be efficient, but they have hardly been implemented in existing collections or in their publicly available interfaces.
Based on a workshop gathering experienced researchers, we propose to discuss a series of challenges posed by the availability of these rich collections: which workflows are technically possible, and how can epistemological issues be integrated into the discussion of the findings?