Applying Open Science in Text Recognition Projects
Learn how to apply open science standards and choose the right export formats for automatic text recognition projects in this tutorial series finale.
Automatic Text Recognition (ATR) - Video 6: End Formats and Reusability
Added on 01/27/2025

Speaker 1: Hi, my name is Sarah Andraschek. This is a tutorial series on automatic text recognition. You're watching the last video, dedicated to applying open science standards to your projects, as well as choosing the right end formats.

Once your transcriptions are done, you probably want to keep working with them. To do so, you will have to export them, and beforehand you need to choose an end format. Your choice depends on what information you want to keep and what you want to export. There are various reasons to export, and not every transcription needs to be exported. The most common reason is to create backups of your data, especially because digital tools and servers are not always 100% reliable. An export can be done even with unfinished transcriptions, while your work is still in progress. In the video about text recognition basics, we mentioned the need to consider all software options before picking one, as it is not always possible to change mid-project. In cases where changing is possible, you will want to move your data from the software you are currently using to the one you want to migrate to. Similarly, by exporting the transcription you can feed the data into another tool, which is one of the reuse options. An export can also be made to publish finished transcriptions, whether the whole corpus or simply a sample. Finally, if you want to transform your transcribed corpus, an export is necessary first. Some kinds of export already include the transformation, so it is essential to choose your output correctly.

Depending on the software used for ATR, the output can vary. The two main types, plain text and layout, are found in every tool, whether in one form or in multiple formats. Plain text is the simplest form of export, as it only provides the text itself. There are two formats for plain text: simple .txt files or the .docx version. The other type of export, the layout export, preserves all the information gathered during automatic text recognition: metadata, regions, lines or masks from the segmentation or text recognition, and sometimes additional information such as annotations. The layout export provides an encoded version of the transcription, but the markup language differs by format. There are two kinds: a layout encoded in HTML, called hOCR, and layouts encoded in XML with specific vocabularies, called PAGE XML or ALTO XML. Sometimes a third type of export is also available, the PDF format, which represents plain text with layers: the image is available in the PDF, with information from segmentation or text recognition embedded directly as layers.

There are numerous export formats for various types of usage, and it can happen that you realize at a later stage that your initial choice was not the right one, and that you can no longer retrieve your transcription from the software you used. For such situations, members of the community have created a helpful repository that lists the available tools for converting one format into another. You can find the link in the video description. As we said earlier in the video, transcriptions are often made to be reused. You might want to work on them privately, without prying eyes, but it is more likely that you are working on your transcription as part of a collaborative project, destined to be made available for later reuse.
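To make the difference between plain-text and layout exports more concrete, here is a minimal sketch of pulling the text back out of a layout export with plain Python. It assumes an ALTO v4 file; the namespace URI and the file name page_0001.xml are assumptions you would adapt to your own export, and the community-maintained converters mentioned above are usually the more robust choice.

```python
import xml.etree.ElementTree as ET

# ALTO v4 namespace; older exports may use a v2 or v3 namespace instead.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}


def alto_to_text(path: str) -> str:
    """Collect the CONTENT of every String element, one text line per TextLine."""
    tree = ET.parse(path)
    lines = []
    for text_line in tree.iter("{http://www.loc.gov/standards/alto/ns-v4#}TextLine"):
        words = [s.get("CONTENT", "") for s in text_line.findall("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    # "page_0001.xml" is a placeholder for one page of your own ALTO export.
    print(alto_to_text("page_0001.xml"))
```

The same idea applies to PAGE XML or hOCR: the text content is there, just wrapped in format-specific elements, which is why converting between layout formats is largely a matter of mapping one vocabulary onto another.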
If you are working on such a collaborative project, you will need to know how to guarantee its openness, and be aware of the open source tools that can be used to exploit the data. Once you have the output of your data, you will need some additional information to guarantee its openness. To do so, it is essential to follow the FAIR principles. FAIR stands for Findable, Accessible, Interoperable and Reusable. Simply providing documents with layout information will not be enough, because they would still be missing content information crucial for FAIRness. Firstly, there should be metadata, that is, the key information about your corpus, such as title, authors, sources, etc. That way the corpus is not a lost piece, and people who consult it have an easier way of understanding it. Secondly, and this might be one of the most important steps for being able to say that your data is open, you need to provide documentation about the methods and tools that you used for your transcriptions. Above all, for the data to be interoperable, you need to mention the software used to create the transcription, as each has its particularities. The documentation should also outline the transcription rules you decided to use, especially in cases where irregularities appeared. It is important to document all the choices you made so that people reusing your data know what to do. Lastly, to make sure that your project is open and reusable, it should be in an open-source format. Be careful: some end formats that we mentioned at the beginning of this video, such as DOCX, are not open.

With your newly opened data, many reuse options emerge. In this section of the video I present a few, but this is not an exhaustive list, and other methods are also possible. One reuse option is to archive and/or share your work. Doing that gives you a separate, often online, backup. It is also a way to share your data with others, as it could be useful to people in the community. Here you could use Zenodo or GitHub, and for sharing more than archiving, you could use HTR United. Another reuse option is analysis. Large ATR corpora can be explored through lexical analysis, text analysis or statistics. To do so, there are open source tools that you can easily download, such as TXM or R. With your data, you might also want to obtain a richer format than the one you exported. There are various markup languages that you could use, such as the Text Encoding Initiative, TEI, to have your corpus follow the standard for the representation of texts in digital form, or an HTML transformation, to have it directly ready for publication. Finally, you might want to publish your data, in plain text or in the format you chose for your transformation. There are several open source tools to do this, such as Omeka or TEI Publisher. You see, with great transcriptions comes great reusability.

This was it for the tutorial series on automatic text recognition. We sure hope you learned a lot. Are you eager to learn even more? Make sure to check out the links we put in the description of each video. Bye.
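As a small illustration of the analysis reuse option mentioned above, here is a hedged sketch of a word-frequency count over an exported plain-text transcription. It is not a substitute for dedicated tools such as TXM or R, and the file name corpus.txt is only a placeholder for your own export.

```python
import re
from collections import Counter


def word_frequencies(path: str, top: int = 20) -> list[tuple[str, int]]:
    """Return the most frequent word forms in a plain-text export."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read().lower()
    # \w+ matches word characters, including accented letters, in Python 3.
    tokens = re.findall(r"\w+", text)
    return Counter(tokens).most_common(top)


if __name__ == "__main__":
    # "corpus.txt" is a placeholder for your own plain-text export.
    for word, count in word_frequencies("corpus.txt"):
        print(f"{count:6d}  {word}")
```

Counting raw word forms is the simplest possible lexical statistic; dedicated tools offer much richer analyses, but a quick count like this is often enough to sanity-check an export before sharing it.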
