Appendix: The XML Structure and Technical Details

Donald J. Mastronarde

XML and TEI

The base form of the digital edition of the scholia is an XML document. XML (eXtensible Markup Language) is an international standard for markup, allowing the creation of computer data structures that are easily reprocessed and do not depend on particular operating systems or applications. XML documents are encoded in Unicode, the international standard for encoding the world’s various language scripts and other systems of symbols. This allows for the use of polytonic Greek as well as roman characters, plus metrical and other symbols in the edition.

TEI is the acronym of the Text Encoding Initiative, a non-profit project providing a standard for sophisticated markup of complex textual documents. TEI originated with the precursor to XML, SGML (Standard Generalized Markup Language), but in recent years TEI definitions have been rewritten in XML. The version of the TEI structure that has been adopted for this edition is known as P5. TEI has been and is being used in a number of projects (for example, EPIDOC) and is looked upon with favor by the U.S. National Endowment for the Humanities in relation to its support of digital projects in the humanities.

A Structure for the Euripides Scholia

TEI allows a vast range of possibilities for markup, but each project is entitled to use whatever subset seems most appropriate. The level of detail in the markup may vary justifiably according to the purposes of the edition and the time available. In a TEI digital edition, various metadata, background information, and declarations of particular usages are included in a teiHeader element that precedes the text element of the document. Within the text element, there are elements for front, body, and back. So far, I have created content within the XML edition itself only for the body element (much of the content of this web site could be converted to parts of the front and back). The structure of this edition is based on the use of four levels of the TEI division-type element, from the largest, div1, to the smallest needed here, div4. Every division element can be given an attribute called ‘type’ (attribute names are conventionally shown as follows: @type), and this attribute is essential to differentiating various structures in the edition.

The div1 element serves to enclose all the material that relates to one tragedy. So far, therefore, there is just one div1, its @type is ‘subdivisionByPlay’ and it also has another attribute, @xml:id, ‘Orestes’. The div1 for Hecuba will have the same value for @type but @xml:id will be ‘Hecuba’. At a later point, there will also be a div1 with @type of ‘preliminaryTexts’ to contain the versions of the Life of Euripides found in the manuscripts of the tragedies and any other prefatory items related to the whole corpus (for instance, epigrams on Euripides).

The div1 element encloses one or two div2 elements. If there is any prefatory material in the manuscript tradition of a play, then the first div2 contains this (@type is ‘hypotheseis’ and @xml:id is ‘hypOrestes’). There will always be a div2 containing the scholia on the play (@type is ‘scholia’ and @xml:id is ‘schOrestes’).

Here I will first describe in detail the scholia division. Each item that I have decided to treat as a separate scholion is contained in its own division of the next level, div3. In the structure adopted here, div3 always has three required attributes and occasionally has an optional fourth attribute. The first two required attributes provide classification of the scholia. @type is used to classify the scholia as older or younger or connected to a named Palaeologan scholar, and in some cases this category has to have a mixed value (as when the same item is both old and Moschopulean). In Release 1, the possible values of @type have been expanded to seventeen in number, namely: vet, rec, mosch, thom, tri, plan, pllgn, vetMosch, vetThom, vetMoschThom, vetTri, recMosch, recThom, recMoschThom, recTri, moschThom, pllgnTri. These are described in the Preface. @subtype is used for a rough classification of the content and in Release 1 takes a value from the following ten possibilities: exeg, paraphr, metr, wdord, diagr, rhet, gram, gloss, artGloss, etaGloss. These are described in the Preface. The lists of possible values can be expanded further if that seems desirable, or if there is time to make finer distinctions among the exegetic scholia. In designing this structure, I hesitated for a while over when to use the value gloss. Many glosses provide synonyms of the lemma word, but some other one-word notations are in a sense exegetical, supplying an understood verb form or clarifying a possessive. These short annotations, whether synonyms or not, reflect the same kind of pedagogical activity or intellectual practice, so I have adopted the wider definition, except for glosses that are potentially variant readings and a few that are related to a controversy in the discursive scholia. Using the broader sense of the term means that suppressing the display of glosses removes the distraction of almost all the short and usually elementary annotations.
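The two closed lists of values lend themselves to a mechanical check. The following is only a minimal Python sketch of such a check, not the project's own validation (which is done by the RelaxNG schema); the function name classification_errors is invented for illustration, while the value lists are copied from the paragraph above.

```python
# Allowed values copied from the lists given in the text (Release 1).
ALLOWED_TYPES = {
    "vet", "rec", "mosch", "thom", "tri", "plan", "pllgn",
    "vetMosch", "vetThom", "vetMoschThom", "vetTri",
    "recMosch", "recThom", "recMoschThom", "recTri",
    "moschThom", "pllgnTri",
}
ALLOWED_SUBTYPES = {
    "exeg", "paraphr", "metr", "wdord", "diagr",
    "rhet", "gram", "gloss", "artGloss", "etaGloss",
}


def classification_errors(attrs):
    """Return a list of problems found in a div3's @type and @subtype.

    `attrs` is a plain dict of attribute names to values (hypothetical
    helper; the real edition relies on schema validation instead).
    """
    errors = []
    if attrs.get("type") not in ALLOWED_TYPES:
        errors.append(f"bad @type: {attrs.get('type')!r}")
    if attrs.get("subtype") not in ALLOWED_SUBTYPES:
        errors.append(f"bad @subtype: {attrs.get('subtype')!r}")
    return errors
```

A mistyped value such as ‘vetMosh’ would be reported, while a valid pair such as ‘vetMosch’ with ‘gloss’ yields an empty list.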

The third required attribute is @xml:id, which must be unique for each div3. The unique value is built as follows: the first two letters of the Latin title of the play (He, Or, Ph, Me, Hi, Al, An, Tr, Rh); the line number of the only line to which the scholion applies or of the first line of a range of lines to which the scholion applies, expanded with leading zeroes to make a four-digit number (0003, 0046, 0589, 1532); a decimal point; and two digits representing the sequence in which I have decided to arrange the notes under a single line number, from 01 to (theoretically) 99. This system will suffice for the initial compilation, but there must also be a mechanism for adding new scholia at an appropriate point within the sequence. If a new item needs to be placed after the item with @xml:id of Or0014.06 and before Or0014.07, it will be Or0014.06a (and if more than one, then Or0014.06b and so forth).
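The construction of the identifier can be illustrated with a short Python sketch. The function names and the regular expression are my own assumptions for the purpose of illustration; only the format itself (two-letter play prefix, four-digit line number, decimal point, two-digit sequence, optional letter for later insertions) comes from the description above.

```python
import re

# Two-letter play prefix, four-digit line, ".", two-digit sequence,
# and an optional lowercase letter for items inserted later.
ID_PATTERN = re.compile(r"^([A-Z][a-z])(\d{4})\.(\d{2})([a-z]?)$")


def make_scholion_id(play, line, seq, suffix=""):
    """Build an identifier such as 'Or0014.06' or 'Or0014.06a'."""
    return f"{play}{line:04d}.{seq:02d}{suffix}"


def parse_scholion_id(xml_id):
    """Split an identifier into (play, line, sequence, suffix)."""
    match = ID_PATTERN.match(xml_id)
    if match is None:
        raise ValueError(f"not a scholion id: {xml_id!r}")
    play, line, seq, suffix = match.groups()
    return play, int(line), int(seq), suffix


def sort_key(xml_id):
    """Key that places 'Or0014.06a' between 'Or0014.06' and 'Or0014.07'."""
    return parse_scholion_id(xml_id)
```

Because an empty suffix sorts before any letter, sorting with sort_key keeps inserted items such as Or0014.06a in the intended position.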

The optional attribute of each scholion div3 is @n. This is necessary only for a scholion that applies to a range of lines, and it provides the explicit value to be displayed in the HTML version. When a scholion belongs to a single line, the line number to be displayed is generated instead by a function in the processing instructions that extracts it from the @xml:id.
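In the edition this logic lives in the XSLT, which is not reproduced here. As a rough stand-in, the rule can be sketched in Python as follows; the function name display_line_label is hypothetical, and the sketch assumes that the identifier always follows the pattern described above.

```python
def display_line_label(xml_id, n=None):
    """Return the line reference to show for a scholion.

    If @n is present (a scholion on a range of lines), it is used as is.
    Otherwise the four digits after the two-letter play prefix are read
    from @xml:id and the leading zeroes are dropped.
    """
    if n:
        return n
    return str(int(xml_id[2:6]))
```

For example, an item with @xml:id of Or0046.03 and no @n would be labelled 46, while an item carrying @n of ‘1–3’ would be labelled 1–3 regardless of its identifier.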

The kernel of the structuring of the information, and what makes possible the optional inclusion of different kinds of information and the display of various levels of detail to different users, is the sequence of div4 elements that are the children of each scholion div3. The only one of these that is mandatory is the one with @type of ‘schText’, enclosing the text of a single scholion with its lemma (if any) and its witness list. TEI requires the use of child element p (paragraph) here, but forbids giving it a @type, so this p element does not contribute usefully to the tagging of content or the processing. Before the text of the scholion there may or may not be an element seg (segment) with @type of ‘lemma’ and @subtype either ‘inMS’ or ‘added’ to reflect whether an explicit lemma is present in at least one of the witnesses or has been added by the editor. Added lemmas are processed to be displayed between angle brackets, which are U+27E8 and U+27E9, not the less-than and greater-than signs, U+003C, U+003E (it would be straightforward to reprogram the XSLT to use instead the alternative system, the use of a dicolon after a lemma that is transmitted in the manuscripts versus a right square bracket after a lemma supplied by the editor). This segment is optional because occasionally it does not seem justified to supply a lemma (as when a scholion applies to a whole line). If the text of the scholion is more than one sentence (or more than one substantial phrasal unit), then the sentences (or units) are tagged as the s element with an attribute @n to provide sentence numbers. These numbers are needed to make the references in the apparatus criticus easier. The lineation of a digital edition is not fixed, so it is impossible to key an apparatus item to a line number. Anchoring each apparatus item to a single word or phrase is possible, but the markup would be far too time-consuming and in my opinion out of proportion to any possible gain for this edition. In Release 1, I have added an @type attribute to each s element. This almost always has a value of ‘default’ for sentences to be run together as prose, but in the instances where a verse passage of more than two lines is quoted in a scholion, the @type has the value ‘verse’ or ‘verseIntro’ (for the sentence that introduces the verse quotation) or ‘verseFinal’ (for the final verse of a quotation unless it is also the last unit of the whole scholion). These different values allow a verse quotation to be processed into HTML that will be laid out as verse and not simply run in with the surrounding prose. After the text of the scholion, a required seg with @type of ‘witnesses’ contains the sigla of the manuscripts that contain the scholion. Again, to ensure making (even slow) progress on my edition, I have treated the list of witnesses as plain text and declined to use the TEI’s option for tagging each witness. (For the information conveyed by superscripts after a siglum in the HTML display, see the discussion below concerning the div4 for lemma and position.)
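To make the nesting just described concrete, here is a small Python sketch that parses an invented div3 and renders it roughly as described. The sample content (LEMMA, the two sentences, the sigla MBC, and the identifier) is placeholder text, the function render_scholion is hypothetical, and the sketch assumes a multi-sentence scholion with no default namespace; the actual display is produced by the project's XSLT, not by code like this.

```python
import xml.etree.ElementTree as ET

# Key under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample following the element and attribute names in the text.
SAMPLE = """
<div3 type="vet" subtype="exeg" xml:id="Or0001.01">
  <div4 type="schText">
    <p>
      <seg type="lemma" subtype="added">LEMMA</seg>
      <s n="1" type="default">First sentence.</s>
      <s n="2" type="default">Second sentence.</s>
      <seg type="witnesses">MBC</seg>
    </p>
  </div4>
</div3>
""".strip()


def render_scholion(div3):
    """Render lemma, numbered sentences, and witness list as one string."""
    p = div3.find("./div4[@type='schText']/p")
    parts = []
    lemma = p.find("./seg[@type='lemma']")
    if lemma is not None:
        text = lemma.text or ""
        if lemma.get("subtype") == "added":
            # Editorial lemmas go between U+27E8 and U+27E9.
            text = "\u27e8" + text + "\u27e9"
        parts.append(text)
    for s in p.findall("./s"):
        parts.append(f"({s.get('n')}) {s.text}")
    parts.append(p.find("./seg[@type='witnesses']").text)
    return " ".join(parts)


root = ET.fromstring(SAMPLE)
print(render_scholion(root))
```

The sentence numbers printed here are the same values of @n to which the apparatus criticus refers.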

There are eight other kinds of div4 that may or may not follow the text of each scholion. In order, the @type of these is drawn from the following list: engTrans, lemmaPosNote, appCrit, appCrit2, prevEditions, commentSim, collNotes, keywords. These are explained in some detail in the Preface. Here I describe their XML structure.

The div4 for the translation contains nothing but a p for the text of the translation.

The div4 for lemma and position contains a p with one to three seg elements: values for @type of these segments are ‘lemmaNote’, ‘refSymb’, ‘pos’. The lemma segment tells which of the witnesses have a lemma and provides the variants in the lemma. With ‘refSymb’ the use of symbols linking a line or word of the poetic text to a particular scholion is recorded. The position segment has two kinds of information: first, it records whether items are above the line, marginal, or intermarginal (all as opposed to being part of a recognizable block of scholia); second, it tells about variations in the ordering of scholia with respect to each other or if a scholion is continued from a previous item without apparent separation. Some editors of scholia suppress information about location, and there may be justification for that in some circumstances. This information seems to have some value, however, in that this edition is intended to be expandable and to provide details that may turn out to be useful to someone who later collates a witness never used before. One might have wanted simply to list the witnesses with superscript indications of position, as done in printed editions. But XML does not handle such modifications easily, and for practical reasons I have therefore kept the use of items needing to be displayed as superscripts to a minimum. Therefore, instead of listing after a gloss shared by Moschopulean and Thoman witnesses the sequence X^sXa^sXb^sT^sY^sGr^sZ^sZa^sZm^s, I have preferred to list the witnesses as XXaXbTYGrZZaZm and to enter the note ‘s.l.’ in the position segment. This does not mean that superscript modifications of sigla do not occur at all: they are still necessary to distinguish different hands (1, 2, 3), or different versions of the same note at different locations in the same witness (for instance, R^a for scholia in the margins of the text of R, but R^b for the scholia written in a continuous block after the end of the text of Orestes). To handle such cases, I use a seg with @type of ‘witMod’ (witness modification), and such a segment can occur within the witness list, in remarks about lemma or position, in the apparatus criticus and in other div4 elements except the translation and keywords.

The div4 for the apparatus criticus (@type ‘appCrit’) contains a p with one or more seg with @type of ‘appItem’. For scholia of more than one sentence, an untagged number is added to the first item of the apparatus located in a particular sentence. The apparatus criticus is another area in which I have decided not to use the more elaborate TEI mechanisms for apparatus criticus readings and variants, because in a project of this kind it seems to me that it would involve an unjustifiably large overhead of markup. I believe the information familiar to those who know how to read the apparatus criticus of a classical text can be adequately provided in textual segments. This means that one will not be able to take my XML document and process it to produce a text that reflects the textual choices and errors of a particular witness, which might be possible with a more elaborate markup of readings and witnesses with pointers to specific words in the text. Such a project would require more personnel and a much larger budget, and I do not think the benefit would be worth the cost, in comparison with the value of editing more scholia. The secondary apparatus, for orthographica and minor curiosities (@type ‘appCrit2’) that need not take up space in the main apparatus but may be useful to collators or others, has a similar structure, except that its segments have @type of ‘orthogr’.

The div4 for Previous Editions (@type ‘previousEditions’) contains a p with one seg with @type of ‘prevEd’, which contains the page and line reference for Schwartz and/or Dindorf (and occasionally Matthiae or de Faveri).

Both the div4 for the comment and similia and the div4 for the collation notes contain a single p with one or more seg elements with @type of ‘other’.

The div4 for the keywords contains a p with one or more seg elements with @type of ‘keywds’. Each such seg contains a word or phrase.

The vast majority of the scholia have markup as described so far. There is an alternative pattern of markup for the metrical scholia that describe the metrical form colon by colon. In this case, the first div4 element has @type of ‘schTextMetrAna’; this is structured as for regular scholia, but any part of the note that precedes the description of the first colon is tagged as a single s with @n of 0, so that the sentence describing the first colon will have @n of 1; also, if Triclinius describes two successive cola as the same, then that s will have a range for @n (for example, 5–6 if he says the fifth and sixth cola have the same pattern). When a div4 of this type occurs, it is always followed by another div4 with @type of ‘metrScheme’. This contains one p enclosing s elements with @n corresponding to the numbering of the sentences in the scholion itself. Each s has within it two seg elements, the first to contain the metrical scheme in symbols for long, short, etc., the second to contain the Greek text of the colon as it appears in Triclinius. The two @type values are ‘metrScheme’ and ‘triColon’ (despite the latter name, the same value can be used when an anonymous metrical scholion is marked up: the author of the scholion is conveyed by the tagging of @type at the level of the div3 parent). After this, the other div4 possibilities are identical to those available for the other scholia. By treating the metrical scholia with a different tagging, it becomes possible to process the XML into a modified display so that the metrical scheme and actual text of Triclinius are seen side by side with the scholion (rather than separately at the back of the book, as in de Faveri’s printed edition).
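The pairing of the two div4 elements by sentence number can be sketched in Python as follows. Everything in the sample is placeholder content, including the identifier, and the sketch assumes that a range such as 2–3 is written identically in both divisions and that no default namespace is in force; the function side_by_side is hypothetical and merely imitates what the XSLT does when it lays the scheme beside the scholion.

```python
import xml.etree.ElementTree as ET

# Invented sample; element and attribute names follow the description.
SAMPLE = """<div3 type="tri" subtype="metr" xml:id="Or0140.01">
  <div4 type="schTextMetrAna"><p>
    <s n="0">INTRO</s>
    <s n="1">DESCRIPTION OF COLON 1</s>
    <s n="2–3">DESCRIPTION OF COLA 2 AND 3</s>
  </p></div4>
  <div4 type="metrScheme"><p>
    <s n="1"><seg type="metrScheme">SCHEME 1</seg><seg type="triColon">COLON 1</seg></s>
    <s n="2–3"><seg type="metrScheme">SCHEME 2-3</seg><seg type="triColon">COLA 2-3</seg></s>
  </p></div4>
</div3>"""


def side_by_side(div3):
    """Pair each sentence of the scholion with its scheme and colon text."""
    schemes = {
        s.get("n"): (
            s.findtext("seg[@type='metrScheme']"),
            s.findtext("seg[@type='triColon']"),
        )
        for s in div3.findall("div4[@type='metrScheme']/p/s")
    }
    rows = []
    for s in div3.findall("div4[@type='schTextMetrAna']/p/s"):
        n = s.get("n")
        # The introductory sentence (n="0") has no scheme of its own.
        scheme, colon = schemes.get(n, (None, None))
        rows.append((n, s.text, scheme, colon))
    return rows


for row in side_by_side(ET.fromstring(SAMPLE)):
    print(row)
```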

The argumenta or prefatory material have a very similar structure to the scholia. Recall that the relevant div2 has @type of ‘hypotheseis’. Each prefatory item is tagged as a div3, with @type classifying the different sorts. The possible values are: epitome, AristByz, misc, argThom (the long Thoman argument), Thoman (miscellaneous notes in Thoman witnesses), dramatisPersonae. There is also an attribute @n that supplies a numeration of the prefatory items. The first div4 then contains the Greek text of the item, and further div4 elements can be added for apparatus criticus and the other types discussed above.

The structure used for the Triclinian metrical treatises is analogous to that of the prefatory items. The Manuscript page is also generated from an XML document with elements corresponding to the labels of the sections of each entry.

To learn more about the XML markup, you may examine the .rng file or the .xml files themselves, which are among the items in Source Files, linked here.

XML Validation

XML editing for this project has been performed with the Oxygen XML Editor, a Java application that I run under macOS. It is a commercial product, but has an affordable academic license. In working with XML it is normal to have the document validated against some template or schema to ensure that all elements and attributes are being used in the correct fashion. TEI P5 offers an array of modules for different kinds of content and structures, and so far the scholia edition uses only a limited range of modules. One can create a validation document using the Roma tool on the TEI site. Very early in the project, I used a fairly complete schema generated by Roma. In Oxygen, one associates the validation document with the XML file being worked on, and the program continuously checks and flags errors if any are found. It soon became apparent that it would be a great advantage to have a more specific validation document. Therefore, I created from scratch a RelaxNG (XML format) schema document (and Oxygen’s built-in tools and validation mechanism helped greatly with this). This contains precise information about the logical structure and specifies the allowable values for all attributes. Because of this, Oxygen is able to automatically supply or complete some parts of what is being typed as well as to flag any mistakes in typing the markup, mistakes that might not be caught by the non-specific Roma-generated schema and that would result in omissions or odd display at a later stage of the project.

XSLT

XSLT is an acronym for eXtensible Stylesheet Language: Transformations. It is an XML-based programming language that can be used to process XML into other formats (such as differently tagged XML or XHTML or HTML or PDF). XSL documents can be written and validated in Oxygen, and Oxygen also has the capacity to apply the transformation to a document in an environment for debugging. After reading much of a large book on XSLT, I built up a stylesheet gradually, partly by trial and error, and eventually arrived at the ones used in the current version of the project. The first task was to generate an HTML file containing everything in the body element of the TEI structure (and this means the text, since there is not yet any content in front or back). This is partly a matter of processing each element in the right way, and partly a matter of deciding how to tag for HTML formatting (see next, under CSS). The most confusing problem I encountered in the process was dealing with what are known as namespaces. When I used the Roma validation and declared the TEI namespace in my XML edition, it was necessary to use the namespace prefix ‘tei:’ in front of every element in the stylesheet instructions; when I switched to my more specific validation document, it was necessary to remove all those prefixes. Namespace prefixes still seem somewhat troublesome, since the transformation to HTML inserts namespace attributes into some tags, and those are in turn flagged as not allowed when the HTML is validated with Bare Bones BBEdit. I do not quite understand what is involved here, but it does not seem to matter. In practice I do a global removal of those namespace attributes in the HTML document with BBEdit (see below).
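The global removal of namespace attributes is done in BBEdit with saved search patterns that are not reproduced here. As an illustration of the kind of pattern involved, the following Python sketch uses a regular expression of my own devising; the class name in the usage example is hypothetical. Note that such a pattern would also remove a namespace declaration on the root html element, which may or may not be wanted.

```python
import re

# Matches xmlns="..." and xmlns:prefix="..." together with the
# whitespace that precedes the attribute inside a tag.
XMLNS_ATTR = re.compile(r'\s+xmlns(?::[A-Za-z_][\w.-]*)?="[^"]*"')


def strip_xmlns(html):
    """Remove all namespace declarations from a string of HTML."""
    return XMLNS_ATTR.sub("", html)


sample = '<p xmlns="http://www.tei-c.org/ns/1.0" class="schText">text</p>'
print(strip_xmlns(sample))
```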

Processing the XML file with the XSLT file requires the use of a processing program. The free open-source program Saxon-HE 9.x is used internally in the debugging process in Oxygen, but once debugging is finished, it is much faster to download the Java archive of Saxon-HE and run it from the command line in Terminal.

Once a stylesheet that generated the full data was tested out and found successful, it involved only a few minor edits of the stylesheet to cause it to generate instead some subset of the data (old scholia only, scholia without glosses, and the like). These stylesheets have undergone several revisions as the schema for the XML was modified to make room for additional attribute values and for an additional section recording previous editions for those scholia published in the past. A few further tweaks to these files were needed as the revised page design of the 2020 site was being finalized.

CSS

Almost every element in the HTML code that is generated has a ‘class’ attribute, and thus the formatting of the browser display can be handled through yet another document, in the language known as CSS (Cascading Style Sheets). Margins, indentation, font-family, font-size, superscript position, colors, backgrounds, etc. can all be modified by adjustments to the CSS stylesheet. The pages of the Edition in Release 1 have alternative stylesheets in which different items have the CSS instruction display: none;, which causes the paragraph with that style to be suppressed, that is, skipped in the display. The stylesheet to be applied is set through simple JavaScript programming. This functionality works with client-side JavaScript in the browser, and thus it is possible to set up a test site on one’s own computer without running a web server and to check the operation of all the files and their relative links before uploading to the web server.

From Collation to XML

For published scholia the basis of collation began as digital files (.rtf) of the edition of Schwartz kindly provided to me by the TLG. These files required some massaging through a sequence of search-and-replace commands, sometimes carried out by research assistants and sometimes by myself. The TLG, as a favor to this project, subsequently added the Dindorf edition to its database to make the scholia recentiora in it part of the database, and again provided me with digital files. These also required some massaging.

For the triad plays, collations are recorded in a group of files for each play, each file covering 100 lines. For the select plays the collation files cover 400 or 500 lines each, except for Rhesus, where a single file suffices. Collation is carried out by having a window with a collation file occupy one side of the (iMac) screen and the image occupy the rest of the screen, whether displayed from a local image file (I use Preview) or in a browser window, as is necessary when the library’s manuscript viewer does not allow downloads or allows downloads that are at too low a resolution for one to decipher some scholia accurately. The collation files are synced in the cloud, so when I travel to inspect manuscripts, I incorporate the results of autopsy inspection directly in the files on my laptop.

With the development of the Library of Digital Latin Texts, it is now apparent that if I were beginning now, it would be advisable to collate in Excel files rather than Word files, since some of the conversion to XML could then be automated with Python scripts. At this point, since some collation has been done for select plays as well as the triad plays, it is too late to change over.

The portions of Orestes present in the sample released in 2010 (1–25, 401–425, plus a few others) differ in that the many collations of additional witnesses since then have been entered directly into the XML file created back then. The same will now be true of the entire span Orestes 1–500 when additional witnesses are collated. The question thus arises whether it would have been prudent to collate directly into an XML file in the first place. Perhaps, but when I began, Oxygen XML Editor was somewhat sluggish in dealing with large files (it has since improved greatly in this regard). Secondly, editing within the XML is clumsier and slower than in a Word file. Thirdly, I have found there are actually benefits in the process of moving the information from a Word file into the XML file: it is much easier to get an overview of the notes on a particular line in Word and to reconsider the order in which they should be presented in final form and to spot duplications or near duplications that can be consolidated. Also, during final revision and proofreading, when one discovers something confusing or unclear in the XML version, it is helpful to look back at the Word files to figure out how to clarify the matter. (The fallback, if such checking does not help, is to recheck the images of all the witnesses.)

In Oxygen XML Editor, I have created a number of code templates that can be entered from a contextual menu or (for those most commonly invoked) a keyboard shortcut. For instance, one template for a discursive scholion contains the skeleton tagging for all the elements, while another for glosses contains the tagging only for the lemma word, the gloss, and witnesses and the position element already filled in with s.l. In the former case, elements that are not needed are deleted; in the latter, elements that are needed (such as for an apparatus criticus when there are variants) are added with a keyboard command. The lemma and content of the scholion and witnesses are moved from the Word document by drag and drop into the appropriate places. (Any accidental error in placement receives the immediate feedback of the validation mark changing from green to red.) Apparatus items can similarly be dragged over singly, but for the longer scholia with a lengthy list of entries, each in its own paragraph in Word, my usual practice is to copy the entire sequence of apparatus paragraphs from Word into a new BBEdit window and apply a saved search-and-replace pattern to interpose the correct closing and opening tags (appItem or orthogr) at each line break, add the opening tag, and then drag the entire block of lines from BBEdit into the XML. I had one research assistant who was provided with a copy of Oxygen XML Editor and who performed the preliminary conversion for more than 100 lines, but the rest I have done myself, in the process reconfirming the classification of @type and @subtype, adding translations, comments, and keywords where appropriate, and bringing the style into greater consistency. After the transfer of all the notes on a particular line, a count was made of how many were present in the Word version so that this could be compared to the number indicated by the two digits after the decimal point in the @xml:id. This guards against accidental omissions or duplications and against mistyping the numbers, for, as it turns out, one disadvantage of the specific schema against which the XML is validated is that with this schema Oxygen does not flag an error when two @xml:id attributes are the same.
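Since the schema does not catch repeated identifiers, a supplementary check could in principle be scripted. The following Python sketch is not part of my workflow (the count described above was done by hand); the function names are invented, the sample is placeholder markup, and the sketch assumes that the div3 elements are in no namespace and that each carries an @xml:id.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Key under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample with one deliberately repeated identifier.
SAMPLE = """<div2 type="scholia" xml:id="schOrestes">
  <div3 type="vet" subtype="exeg" xml:id="Or0014.01"/>
  <div3 type="vet" subtype="gloss" xml:id="Or0014.02"/>
  <div3 type="rec" subtype="gloss" xml:id="Or0014.02"/>
  <div3 type="vet" subtype="exeg" xml:id="Or0015.01"/>
</div2>"""


def collect_ids(xml_text):
    """Return the @xml:id of every div3, in document order."""
    root = ET.fromstring(xml_text)
    return [div3.get(XML_ID) for div3 in root.iter("div3")]


def duplicate_ids(ids):
    """Return the identifiers that occur more than once."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)


def counts_per_line(ids):
    """Count scholia per play-and-line prefix, e.g. 'Or0014'."""
    return Counter(i.split(".")[0] for i in ids)


ids = collect_ids(SAMPLE)
print(duplicate_ids(ids))
print(counts_per_line(ids))
```

The per-line counts can then be compared with the number of notes in the corresponding collation file.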

Once the information is in the XML file, the powerful search capabilities of BBEdit and Oxygen XML Editor are important during revision, copy editing, and proofreading. Perhaps the greatest weakness of the incredibly bloated MS Word is that it nevertheless lacks multifile searching and searching using GREP, both of which are possible in BBEdit and Oxygen XML Editor.

From XML to HTML

After conversion from the Word documents, the XML file contains the scholia of all kinds collated so far as well as the arguments. It contains about 680,000 words in over 116,000 lines, and is about 5.5MB in size. To produce the nine current HTML versions of the scholia (one with the whole set and eight with various subsets) as well as the HTML of the Triclinian metrical treatises, I have developed a short shell script to be run in Terminal on my iMac (processingScript_batch.sh). I have this script, the large XML file (with a name of the form OrestesScholia20200406.xml, for the version resaved under that name on April 6), all the XSLT files, and the XSLT processor saxonhe9.jar (download link for the free home edition to be found at saxonica.com) all located in a local folder that mirrors this site (2020schHtml). In Terminal I change directory (cd) to this folder. The single argument of the shell script is the name of the large XML file. The ten commands in the script each invoke the appropriate XSLT file and direct the resulting file (with appropriate name) to a folder called Output. The processing (on an iMac, Retina 4K, 21.5 inch, 2019, with 3.6 GHz Quad-Core Intel Core i3 processor) takes about fifteen seconds, producing ten HTML 5 files. The files initially range in size from 13MB for the complete set to 572KB for the Triclinian set (and only 49KB for the Triclinian treatises). These files contain over 170,000 xmlns declarations within the HTML that BBEdit’s validator says should not be there. These are removed by a series of saved search patterns in BBEdit using the multifile search dialog. This takes about 20 seconds. The scholia files then range from 10.8MB down to 350KB in size. Then all the files are opened in BBEdit and the word ‘selected’ is pasted into the proper option of the select element for ‘Set to display:’, and the BBEdit validation is checked. In the Triclinian treatises an additional paragraph division is inserted in the translation of the third text. All this takes about 3 minutes. The scholia files in the folder Edition, a subfolder of 2020schHtml, are then placed in a ZIP archive with the date in the title, and then the latest HTML files are transferred from the Output folder to Edition. They are checked briefly in one or more browsers on the local machine before being uploaded to the Edition folder on the web server.
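The batch step can be pictured with the following Python sketch, which only builds the command lines and does not run them. It is a stand-in for the shell script, whose contents are not reproduced here: the stylesheet and output names in JOBS are invented placeholders (and only two of the ten jobs are shown), while the jar name, the Output folder, and the single XML-file argument come from the description above. The options -s:, -xsl:, and -o: are Saxon's standard command-line options for source, stylesheet, and output.

```python
SAXON_JAR = "saxonhe9.jar"  # processor file name as given in the text

# Hypothetical (stylesheet, output file) pairs; the real script has ten.
JOBS = [
    ("full.xsl", "full.html"),
    ("oldOnly.xsl", "oldOnly.html"),
]


def build_commands(xml_file, jobs=JOBS, out_dir="Output"):
    """Return one Saxon command (as an argument list) per stylesheet."""
    return [
        [
            "java", "-jar", SAXON_JAR,
            f"-s:{xml_file}",          # source XML document
            f"-xsl:{xsl}",             # stylesheet to apply
            f"-o:{out_dir}/{html}",    # where to write the result
        ]
        for xsl, html in jobs
    ]


for command in build_commands("OrestesScholia20200406.xml"):
    print(" ".join(command))
```

Each argument list could be passed to subprocess.run from the folder that holds the jar, the XML file, and the stylesheets.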

This rapid process will allow convenient creation of revised versions of the online Edition as typographical or other errors are reported. After the official launch of Release 1, new uploads will be listed and explained in the online Revision History.

License

Euripides Scholia: Scholia on Orestes 1–500 Copyright © 2020 by Donald J. Mastronarde is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.