Transcript formats – Endangered Languages and Cultures of Siberia

Transcripts can take one of two formats:

XML format has been used for Tundra Nenets, Northern Khanty, and Udihe.

Below is an example of what a sentence looks like in XML:

<S id="n01">

<AUDIO start="0.0000" end="6.7429"/>

<A>S′em′°ya°nan° mən′° əb yəŋk°n′am°h.</A>

<G>family.LOC.POSS.1SG PRON.1SG one excessive.PTC.IMPF.1SG</G>

<T>In my family I’m the eleventh.</T>

</S>

If the <AUDIO/> tag is missing, the text will still appear, but the transcript will not advance through the text as the recording plays.

For Tundra Nenets, the transcribed line is enclosed in <A></A> tags.

For Northern Khanty, the transcribed line is enclosed in <OST></OST> tags.

For Udihe, the transcribed line is enclosed in <UDI></UDI> tags.

Any transcripts for new languages must use <A></A> tags.
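As a rough illustration of the XML layout described above, here is a minimal Python sketch that reads one <S> unit with the standard library. The function name read_sentence and its text_tag parameter are assumptions made for this example, not part of the site's own code, and the transcribed line in the sample is a placeholder.

```python
import xml.etree.ElementTree as ET

# Sample unit modelled on the example above; the <A> content is a placeholder.
SAMPLE = """<S id="n01">
<AUDIO start="0.0000" end="6.7429"/>
<A>transcribed line goes here</A>
<G>family.LOC.POSS.1SG PRON.1SG one excessive.PTC.IMPF.1SG</G>
<T>In my family I'm the eleventh.</T>
</S>"""


def read_sentence(xml_text, text_tag="A"):
    """Parse one <S> element (hypothetical helper, for illustration only).

    text_tag is the tag holding the transcribed line: "A" for new languages
    and Tundra Nenets, "OST" for Northern Khanty, "UDI" for Udihe.
    """
    s = ET.fromstring(xml_text)
    audio = s.find("AUDIO")
    span = None
    if audio is not None:
        # Start and end times of the unit, in seconds.
        span = (float(audio.get("start")), float(audio.get("end")))
    return {
        "id": s.get("id"),
        "audio": span,  # None when the <AUDIO/> tag is missing
        "text": s.findtext(text_tag),
        "gloss": s.findtext("G"),
        "translation": s.findtext("T"),
    }


if __name__ == "__main__":
    print(read_sentence(SAMPLE))
```

A unit without an <AUDIO/> tag still parses; its "audio" value is simply None, which mirrors the behaviour described above where the text appears but has no timing.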

Toolbox TXT files, as exported from ELAN

TXT format has been used for Even, Forest Enets, and Tundra Yukaghir.

If you opt for this route, your Toolbox files need \id tags to signal the start of each transcription unit (\ref will be ignored). You must also use \tx for your source language text.
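To make the \id and \tx requirement concrete, here is a minimal Python sketch that splits a Toolbox TXT export into transcription units. The function read_toolbox, its parameters, and the sample file content are all hypothetical; this is not the site's actual importer, only an illustration of the convention.

```python
# Hypothetical sample: two units, each opened by \id; \ref is present but ignored.
SAMPLE = r"""\ref ignored.001
\id unit.001
\tx first source-language line
\ft first free translation

\id unit.002
\tx second source-language line
"""


def read_toolbox(text, unit_marker="\\id", text_marker="\\tx"):
    """Collect transcription units from Toolbox-style text (illustrative only)."""
    units = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a marker line
        parts = line.split(None, 1)
        marker = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if marker == unit_marker:
            # A unit marker opens a new transcription unit.
            current = {"id": value, "tx": None}
            units.append(current)
        elif marker == text_marker and current is not None:
            current["tx"] = value
        # Every other marker (including \ref by default) is ignored.
    return units


if __name__ == "__main__":
    for unit in read_toolbox(SAMPLE):
        print(unit["id"], "->", unit["tx"])
```

Markers are compared as whole tokens, so a longer marker that merely begins with \tx is not mistaken for \tx.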

To replace a transcript, you now simply select and upload a new file when editing an audio or video post. Don’t forget to click Update to save the changes that you made to the post. Your changes won’t be visible immediately: the Solr server now runs multiple checks per day to reindex content for changed posts.

Note that while the conventions above (<A> tags for XML; \id and \tx for Toolbox) hold for new languages, for existing languages you should keep the transcript format as it is. For example, Udihe transcripts use <UDI> rather than <A>; keep doing that. Similarly, Forest Enets uses \ref instead of \id, and \tx_lat_for_toolbox instead of \tx; keep doing that.
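One way to keep these exceptions straight is a small lookup table. The Python sketch below is only an illustration: the names DEFAULTS, LEGACY, and conventions_for are invented for this example, and it records only the exceptions stated on this page. Any language not listed falls back to the new-language defaults, which is an assumption for existing languages whose markers are not spelled out here.

```python
# Conventions for new languages, as stated above.
DEFAULTS = {
    "xml_text_tag": "A",      # tag around the transcribed line in XML
    "unit_marker": "\\id",    # Toolbox marker that opens a transcription unit
    "text_marker": "\\tx",    # Toolbox marker for the source language text
}

# Existing languages that keep their older conventions.
LEGACY = {
    "Northern Khanty": {"xml_text_tag": "OST"},
    "Udihe": {"xml_text_tag": "UDI"},
    "Forest Enets": {
        "unit_marker": "\\ref",
        "text_marker": "\\tx_lat_for_toolbox",
    },
}


def conventions_for(language):
    """Return the defaults with any legacy overrides for this language applied."""
    return {**DEFAULTS, **LEGACY.get(language, {})}


if __name__ == "__main__":
    print(conventions_for("Udihe"))
    print(conventions_for("Forest Enets"))
```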

It might seem a little complicated, but hopefully you’ll get the hang of it. I didn’t want the migration to cause existing functionality to be lost, and this set of solutions preserves existing functionality while also allowing you to add new languages and data.