Sharing texts better, part 1: Austrian Newspapers

Apr 19 2022

It's not very hard to get individual texts in digital form. But working with grad students in the humanities who are looking for large sets of texts to do analysis across, I find that larger corpora are so hodgepodge as to be almost completely unusable. For humanists and ordinary people to work with large textual collections, they need to be distributed in ways that are actually accessible, not just open access.

That means:

  • Downloading

  • Reasonable file sizes (rarely more than a gigabyte).

  • Reasonable numbers of files (don't make people download more than a dozen for some analysis tasks).

This isn't happening right now. The hurdles to working with digital texts are overwhelming to almost anyone. I don't usually write up a simple process story about what it's like to get collections of texts, but I want to do so a few times here.

What follows here is, I should be clear, a sort of infomercial. Over the last year or so I've started formalizing a much better way to distribute texts than any cultural heritage institution currently uses.

I'll share texts using it. I want to start by looking at some collections I encounter to make clear just how high the barriers are to working with text the way we're distributing it now.

Part one: newspapers. Newspapers should be, in theory, a pretty easy type of text to distribute. In an ideal world, a newspaper is divided up into articles. But most of the open-access newspaper collections I've seen instead chop papers up into pages. That's the case for the first archive I'm going to look at in this series: newspapers from the Austrian National Library hosted on Europeana.

I can't completely remember the details of why I'm looking at this collection, but in short: a graduate student in my Working with Data class was interested in doing text analysis for their class project on newspapers from there. We decided that the Neue Freie Presse would be an especially useful paper, and identified digitized versions both on Europeana and at ANNO, hosted by the Österreichische Nationalbibliothek. (If you visit the Wikipedia page for the NFP, it takes you to a dead Columbia link.) ANNO has a nice online interface including well-formatted links like https://anno.onb.ac.at/cgi-content/annoshow?text=nfp|18970610|20 for full text: this seems like a possible route for getting data, although decades of data would take an extremely long time to download in R. Looking for other copies, I first check the Atlas of Digitized Newspapers from the Oceanic Exchanges project, because I know that they have decent information about accessibility. (Despite the name, it is not an atlas in any normal sense, but instead a bibliography, registry, or catalog.) It suggests that access will be to XML files through Europeana, and does not list any access through ANNO beyond what I've been able to find.
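Just to make that ANNO pattern concrete, here is a small sketch (not the route we ended up taking) of how you might build those full-text links yourself, assuming the three pipe-separated fields are paper code, date as YYYYMMDD, and page number:

def anno_text_url(paper, date, page):
    # e.g. anno_text_url("nfp", "18970610", 20)
    return f"https://anno.onb.ac.at/cgi-content/annoshow?text={paper}|{date}|{page}"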

But it also links to a bulk download site at Europeana. Looking at the Europeana site during a Zoom call, we discover that there are a number of full-text downloads identified by opaque numbers: 9200300 is the first one.

Here's where we hit the first snag. What are these numbers? Looking at the site for one of the NFP pages in the Europeana browser, we see that it, too, starts with 9200300. Perhaps this is just what we want? But the file is unthinkably large: 116 GB, zipped, for the page-level full text. This is too large for the grad student to download, but I click on it to see what will happen. It spins, and spins, long past the end of office hours. The student has to wait.

A week passes. While looking for a completely different file on my computer, I encounter a 63 GB zip file in my downloads. I dimly remember downloading this earlier, and think about opening it. To just unzip a 63 GB file would be crazy; this is another place where most researchers will be stymied. I know that one can access a zipfile randomly, though, and fire it up in Python to read.

This is a second place where most researchers would be lost: 63 GB is just too big. There should never be a single file that large unless it's completely necessary; in this case, that's clearly not so. The idea that you can extract single files without unpacking the whole archive is simply not obvious, so many people will try to extract everything. I don't know exactly how big that 63 GB file would be uncompressed, but probably large enough to clobber most hard drives.

I've named the zipfile NFP.zip now, because I'm hoping it has the Neue Freie Presse. Now I can read the list of filenames.

import zipfile
import html

# Open the archive without extracting it; filelist gives one ZipInfo per member.
f = zipfile.ZipFile("NFP.zip")
fnames = f.filelist

It turns out to have 1.6 million little files bundled in there, with names like 9200300/BibliographicResource_3000116292697/3.xml. Hmm. Well, the end is clearly the page number, and perhaps the bibliographic resource is the individual issue?

I read in a single document, the one-millionth, to see.
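With the zipfile handle from above, that's just a direct read of one member, roughly:

# Random access: read a single member straight out of the archive, no extraction.
doc = f.read(fnames[1_000_000]).decode("utf-8")
print(doc[:500])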

<TextLine HEIGHT="61" WIDTH="703" VPOS="25" HPOS="166"><String WC="0.5249999762" CONTENT="rung" HEIGHT="29" WIDTH="68" VPOS="37" HPOS="166"/><SP WIDTH="19" VPOS="32" HPOS="234"/><String WC="0.5199999809" CONTENT="des" HEIGHT="29" WIDTH="46" VPOS="33" HPOS="253"/><SP WIDTH="10" VPOS="35" HPOS="299"/><String WC="0.4877777696" CONTENT="höchstens" HEIGHT="43" WIDTH="140" VPOS="30" HPOS="309"/><SP WIDTH="17" VPOS="38" HPOS="449"/><String WC="0.625" CONTENT="ui" HEIGHT="22" WIDTH="28" VPOS="45" HPOS="466"/><SP WIDTH="17" VPOS="45" HPOS="494"/><String WC="0.275000006" CONTENT="emem" HEIGHT="27" WIDTH="84" VPOS="45" HPOS="511"/><SP WIDTH="10" VPOS="42" HPOS="595"/><String WC="0.4562500119" CONTENT="fncvüchm" HEIGHT="40" WIDTH="149" VPOS="42" HPOS="605"/><SP WIDTH="9" VPOS="48" HPOS="754"/><String WC="0.3616666794" CONTENT="Zustan" HEIGHT="36" WIDTH="96" VPOS="48" HPOS="763"/><HYP CONTENT="­"/></TextLine>

So: it's XML of the scans, including exactly the position in pixels of each word. I consider parsing the TextLine elements out and deconstructing the XML properly, but XML parsing is a pain and always tediously, tediously slow. And I don't care about any of this stuff; I'm doing text mining, so I just want the words. A quick check back at the Europeana site confirms that I have the smallest file on offer.

So let's do the quick and dirty approach. The letters I want follow the word CONTENT in the XML, so I'll just write something quick and dirty that splits on that string and grabs everything up to the next quotation mark. This is how people actually use XML, I tell myself; no one is enough of a sucker to use Python's XML parsing libraries, so let's just munge it out. split is so much faster.

import pyarrow as pa
from pyarrow import parquet

i = 0
while i < len(fnames):
    pages = []
    ids = []
    for j in range(5000):
        if i >= len(fnames):
            break
        print(i, end = "\r")
        r = f.open(fnames[i])
        words = []
        # Everything between CONTENT=" and the next quotation mark is a word.
        for word in r.read().decode("utf-8").split('CONTENT="')[1:]:
            words.append(word.split('"', 1)[0])
        page = html.unescape(" ".join(words))
        pages.append(page)
        ids.append(fnames[i].filename.replace(".xml", ""))
        i += 1
    out = pa.table({"ids": ids, "pages": pages})
    parquet.write_table(out, f"{i}.parquet", compression = "zstd", compression_level = 5)
    print(f"{i}/{len(fnames)}")

This is code that pulls the text out of the XML into something better: a parquet file, written by pyarrow, for each group of 5,000 pages. I check one to be sure; it looks like German. There will surely be mistakes, perhaps involving quotation marks inside words. But with low-quality OCR, it's enough to start.
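A quick spot check on the first batch (5000.parquet, from the loop above):

check = parquet.read_table("5000.parquet")
# Eyeball the start of the first page.
print(check["pages"][0].as_py()[:300])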

Arzt der k. k. prio. THÄßbahn, anö den frischen Blätter» des Enca» lyptiis Globnlus. eines ans Anstratten stammende» BaiimcS, i» dem ««oratorwin des Apothekers ^»»>i Sdl»»»»»» Wien. JÄche», - Haupistraze Nr. 16, einzig und allein zukereiteie rmd stets «orrStbig

Rewriting with compression.

I wrote them into a folder with level 5 zstd compression. The new directory, with parquet files and ids, is a tenth the size: 6.4 GB vs 63 GB for the zipfile I downloaded. Why on earth have I downloaded massive XML files when I just want text? Who really wants this positional text, anyway? I've used it a few times over the years, but most people want text, not XML. Zipfiles at least are nice, because I can grab the specific files I want. But they're also slow in their own right. I start parsing at 22:21 and leave my computer open; looking at the timestamps, I don't finish the last file until more than two hours later, at 00:31.

This is bonkers. Mediocre zip compression and uselessly XML-encoded data mean that it takes two hours just to look at the data in the most cursory way. It's important to distribute things in a complete format, but it's also important not to waste resources making things too hard to parse. With the parquet-formatted versions of the data, it takes not two hours but 55 seconds to parse through every file in this set. That's a major improvement: 100 times faster to read, and one-tenth the size. Both of those are big enough differences that they actually affect whether this data is usable or not.
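For instance, scanning every page in the dump for a phrase like "Gustav Mahler" now finishes in well under a minute: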

from pathlib import Path
from pyarrow import compute as pc

matches = []
for p in Path("parquet_files").glob("*.parquet"):
    a = parquet.read_table(p)
    # Keep only the pages that mention the phrase.
    which = pc.match_substring(a['pages'], "Gustav Mahler")
    matches.append(a.filter(which))

So now we've got a huge set of text in a fairly navigable form. But we don't know what the records are. The identifiers are all things like 9200300/BibliographicResource_3000123565676/4; aside from the page number, it's not clear what any of those mean. My working theory to this point was that 9200300 means the Neue Freie Presse and BibliographicResource_3000123565676 means the individual issue; but I need to know for sure.

Sorting is information

At this point, I start putting the identifiers into the website and figuring out the layout of the metadata here. It turns out that this is not just one newspaper, but lots: probably everything contributed from the ÖNB to Europeana. And, stunningly, the order seems to be completely random? I call the web-based Europeana API and get a dcTitle field in this order:

["Der Humorist - 1847-01-29"]
["Blätter für Musik, Theater und Kunst - 1871-09-19"]
["Wiener Zeitung - 1841-10-18"]
["Der Humorist - 1841-03-10"]
["Neue Freie Presse - 1871-10-22"]
["Innsbrucker Nachrichten - 1859-11-25"]
["Die Presse - 1867-06-25"]
["Das Vaterland - 1862-09-26"]
["Wiener Zeitung - 1705-02-28"]
["Wiener Zeitung - 1868-12-04"]

There are a couple of things weird here. One is the random order. I suppose that this could be my fault, because I just used the filenames from the zipfile in the order they appeared, rather than sorting. But that itself is a problem: the zipfile should have more of an inherent order. It is an underappreciated fact that good sorting is good compression; the more natural the order information appears in, the better it will compress (see the toy sketch below). And, of course, the fewer files people will have to download. The other weird thing is that the title is wrapped in an array: apparently in the EDM, things can have multiple titles. OK, that's something I can work with.
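Here is that toy sketch, entirely my own example rather than the Europeana data: identical records compress far better when like sits next to like.

import random
import zlib

# The same date strings, compressed once sorted and once shuffled.
dates = [f"1870-{m:02d}-{d:02d}" for m in range(1, 13) for d in range(1, 29)] * 50
sorted_size = len(zlib.compress("\n".join(sorted(dates)).encode()))
random.shuffle(dates)
shuffled_size = len(zlib.compress("\n".join(dates).encode()))
print(sorted_size, shuffled_size)  # the sorted blob comes out several times smaller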

So now I have a clear plan.

  1. Get metadata for every record.

  2. Match it to the papers.

  3. Write out each newspaper in chronological order.

To get the metadata, I have to find it: there is no metadata in the data dumps. First I do it using the API, at https://api.europeana.eu/record/v2/{id}.json?wskey={api_key}. But it quickly becomes clear this won't scale: running overnight, I've only downloaded 35,000 of 1.3 million records. So I go back to the Europeana page and download another enormous zipfile, a 4 gigabyte one with records for the entire set. How this manages to be so large isn't initially clear to me; perhaps, I think, they've bundled the full text into it?
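For the record, that first, too-slow API pass looked roughly like this (my own reconstruction; the id is the {collection}/{resource} pair from the filenames, and wskey is your own API key):

import requests

def fetch_record(record_id, api_key):
    # record_id is e.g. "9200300/BibliographicResource_3000123565676"
    url = f"https://api.europeana.eu/record/v2/{record_id}.json?wskey={api_key}"
    return requests.get(url, timeout=30).json()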

The answer turns out to be that there are massive amounts of text for each record because, chiefly, every record repeats an extremely long definition of "newspaper" in many different languages. That this balloons the size so much is a failure of an over-literal use of linked data. Perhaps there would be a way to reference it as an element in a single HTML file, but really, no one cares. This part of the data model will never be used outside a Europeana site; there is some base-covering in distributing it, but it's a massive inconvenience for researchers to have the following block of text (and something vaguely equivalent in Latvian, Arabic, Russian, etc.) repeated 1.6 million times in a file that's supposed to be a metadata dump about newspaper issues:

Many newspapers, besides employing journalists on their own payrolls, also subscribe to news agencies (wire services) (such as the Associated Press, Reuters, or Agence France-Presse), which employ journalists to find, assemble, and report the news, then sell the content to the various newspapers. This is a way to avoid duplicating the expense of reporting.

Now, I understand the need for clear URIs for concepts and the benefits of linked open data. But the nature of linked open data is that any individual record can be ballooned indefinitely. Why is there a definition of "newspaper" at such tedious length and not, say, a full expansion of the geographic definition of Graz where it appears? I am sure there is a reason, but I'm equally sure it's not really a good one.


So now I've got to parse these monster XML blobs 1.3 million times. And this time I can't resort to regex. Ugh. Again, this is something that most researchers will abandon quickly. I increasingly see XML referred to in the past tense online, as a data format/data movement that failed. Evangelists will surely disagree, and certainly a great deal has been lost. But for my purposes, I need something tabular that can be joined, and XML and tables play extremely poorly together.

But I'll try. The first step will be to get into JSON-LD format, which is a linked data format that actually works inside of programming languages for non-evangelist humans. It turns out to be something of a pain: maybe ten minutes of vaguely recalling terms before I precisely figure out how to use Harold Solbrig's rdflib-jsonld extension to the rdflib library to squeeze the data into JSON. Solbrig, thank goodness, has provided a code example. With everything but the format to put in, the transformation is obvious.

from rdflib import Graph, plugin
from rdflib.serializer import Serializer

# `demo` holds the raw EDM/XML for a single record.
g = Graph().parse(data=demo, format="xml") #<-took a while to figure this line out!
print(g.serialize(format='json-ld', indent=1))

OK. So all I really need here is the newspaper title and the date, so let's see how to parse it out. Once again, the JSON-LD is massively large. After wasting 40 minutes trying to figure out if I can implement a general solution to parse out all the various @type entries using a JSON context into a flatter document, and coming up against the difficulties of inferring the many contexts, I decide to just take a quick-and-dirty route that will lose most of the JSON-LD data here. First, filter to only the proxies:

proxies = [f for f in json.loads(d) if 'http://www.openarchives.org/ore/terms/Proxy' in f['@type']]

And then reduce to a dict where we grab the first @value (or @id) field for anything that looks like a Dublin Core term.

Again, this requires a completely different set of skills from the data wrangling above. If I knew a lot about LOD, I could do much better here. But the Python libraries I'm finding don't make this especially easy, so I'm giving up on the LOD dream of being able to put it back together in a multilingual frame.

import json

def parse_row(d):
    # d is the JSON-LD serialization of one record; keep only the ORE proxies.
    proxies = [f for f in json.loads(d) if 'http://www.openarchives.org/ore/terms/Proxy' in f['@type']]
    out = {}
    for k, v in proxies[1].items():
        # Grab the first @value (or @id) of anything that looks like a Dublin Core term.
        if "purl.org/dc" in k:
            try:
                out['dc:' + k.split("/")[-1]] = v[0]['@value']
            except KeyError:
                out['dc:' + k.split("/")[-1]] = v[0]['@id']
    return out

# For a single record, this yields something like:
{'dc:identifier': 'oai:fue.onb.at:EuropeanaNewspapers_Delivery_3:ONB_00286/1875/ONB_00286_18750610.zip',
 'dc:language': 'deu',
 'dc:relation': 'http://de.wikipedia.org/wiki/Neuigkeits-Welt-Blatt',
 'dc:source': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=nwb&datum=18750610',
 'dc:subject': 'http://d-nb.info/gnd/4067510-5',
 'dc:title': 'Neuigkeits-Welt-Blatt - 1875-06-10',
 'dc:type': 'http://schema.org/PublicationIssue',
 'dc:extent': 'Pages: 4',
 'dc:isPartOf': 'http://data.europeana.eu/item/9200300/BibliographicResource_3000095610170',
 'dc:issued': '1875-06-10',
 'dc:spatial': 'http://d-nb.info/gnd/4066009-6'
 }

This whole process can parse about 40 lines a second. That sounds kind of fast, maybe. But with 1.3 million metadata items it would take nine hours to run, single-threaded in Python on my laptop. That is obscene. We can reduce this by batching by issue and getting it down to about an hour; there are only 154,000 records in here. But a good metadata format should be able to load a million rows of structured data in under a second, not in nine hours. This data could probably have been released in CSV on the Web, or JSON-LD, or some other format where this process would take a minute or two.
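By way of comparison, once the parsed rows exist, bundling them into a single parquet file (a hypothetical metadata.parquet; rows here is just the list of dicts that parse_row produces) makes reloading the whole table nearly instantaneous:

meta = pa.Table.from_pylist(rows)
parquet.write_table(meta, "metadata.parquet", compression="zstd")
meta = parquet.read_table("metadata.parquet")  # this read is nearly instant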

Anyhow, nine hours is too long for me because it's the morning. I'll split this up into multiple processes that work on batches of 25,000 at a time, and set it running in a loop.
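Something like the following sketch, where the two placeholders (a record_xml() helper that returns the raw XML for record i, and a total count n_records) stand in for however the dump actually gets iterated:

from multiprocessing import Pool
from rdflib import Graph

BATCH = 25_000

def parse_batch(start):
    # One batch of records: XML -> JSON-LD -> flat dict via parse_row() above.
    rows = []
    for i in range(start, min(start + BATCH, n_records)):
        g = Graph().parse(data=record_xml(i), format="xml")
        rows.append(parse_row(g.serialize(format="json-ld")))
    return rows

if __name__ == "__main__":
    with Pool(8) as pool:
        batches = pool.map(parse_batch, range(0, n_records, BATCH))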


And I'm back! So now I've got data and I've got texts. Joining these together is pretty easy: I just pull apart the IIIF ID and merge them in (a sketch below). Now I need to figure out how to distribute these to the student. These are big; too big, probably, to simply slap into an e-mail.
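The join itself, in outline (texts standing for the page-level table read back from the parquet files, meta for the parsed metadata with a matching issue column pulled out of dc:isPartOf; both names are my own):

from pyarrow import compute as pc

# Strip the trailing page number off each page id to get the issue-level
# resource, then join the text table to the metadata on it.
texts = texts.append_column(
    "issue", pc.replace_substring_regex(texts["ids"], pattern=r"/\d+$", replacement="")
)
joined = texts.join(meta, keys="issue")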

But luckily, I set up a static hosting service on Google a few months ago, so I can just upload them there. I've created files for all of these newspapers now. So we've got one for the student, but also for you.

file | start date | end date | issues | pages | compressed size | link
Figaro | 1857-01-04 | 1875-12-25 | 574 | 5374 | 9.4 MB | download
Tages-Post | 1865-01-18 | 1875-12-31 | 2082 | 10089 | 51.0 MB | download
Salzburger Volksblatt: die unabhängige Tageszeitung für Stadt und Land Salzburg | 1871-01-03 | 1875-12-24 | 636 | 3170 | 10.2 MB | download
Nasa Sloga | 1870-06-01 | 1875-11-16 | 79 | 322 | 0.9 MB | download
Wienerische Kirchenzeitung | 1784-01-24 | 1789-12-24 | 214 | 1788 | 2.4 MB | download
Feldkircher Zeitung | 1861-08-03 | 1875-12-29 | 960 | 3987 | 11.8 MB | download
Österreichische Buchhändler-Correspondenz | 1860-02-01 | 1875-12-25 | 421 | 4154 | 7.8 MB | download
Volksblatt für Stadt und Land | 1871-11-09 | 1875-12-31 | 319 | 4405 | 20.9 MB | download
Teplitz-Schönauer Anzeiger | 1861-05-01 | 1875-12-18 | 536 | 6744 | 13.9 MB | download
Linzer Volksblatt | 1870-01-03 | 1875-12-29 | 1190 | 5256 | 22.1 MB | download
Extract-Schreiben oder Europaeische Zeitung | 1700-12-01 | 1700-12-04 | 2 | 16 | 0.0 MB | download
Grazer Volksblatt | 1868-01-02 | 1875-12-30 | 1495 | 13692 | 49.1 MB | download
Nordböhmisches Volksblatt | 1873-10-04 | 1873-12-13 | 7 | 42 | 0.2 MB | download
Agramer Zeitung | 1841-01-06 | 1858-06-30 | 1286 | 6943 | 21.7 MB | download
Neuigkeits-Welt-Blatt | 1874-01-06 | 1875-12-31 | 425 | 7104 | 29.2 MB | download
Die Neuzeit | 1861-09-13 | 1872-12-20 | 339 | 4012 | 9.3 MB | download
Eideseis dia ta anatolika mere | 1811-07-05 | 1811-11-19 | 27 | 216 | 0.2 MB | download
Die Debatte | 1864-11-13 | 1869-09-30 | 1073 | 5260 | 52.5 MB | download
Die Bombe | 1871-01-08 | 1875-12-31 | 163 | 1512 | 4.1 MB | download
Znaimer Wochenblatt | 1858-01-17 | 1875-12-24 | 569 | 4986 | 14.2 MB | download
Zeitschrift für Notariat und freiwillige Gerichtsbarkeit in Österreich | 1868-01-08 | 1875-12-29 | 260 | 1368 | 3.0 MB | download
Frauenblätter | 1872-01-01 | 1872-12-15 | 17 | 285 | 0.5 MB | download
Populäre österreichische Gesundheits-Zeitung | 1830-05-26 | 1840-12-31 | 685 | 4337 | 5.2 MB | download
Union | 1872-01-07 | 1874-11-15 | 83 | 342 | 2.6 MB | download
Prager Abendblatt | 1867-01-02 | 1875-12-22 | 1697 | 9432 | 28.4 MB | download
Kikeriki | 1861-11-14 | 1875-12-30 | 592 | 3442 | 7.9 MB | download
Vorarlberger Landes-Zeitung | 1863-08-11 | 1875-12-28 | 1219 | 5402 | 15.9 MB | download
Hermes ho logios | 1811-02-01 | 1819-12-15 | 114 | 2791 | 3.4 MB | download
Philologikos telegraphos | 1817-01-01 | 1820-12-15 | 84 | 400 | 0.9 MB | download
Oesterreichisches Journal | 1870-08-06 | 1875-12-15 | 305 | 2854 | 12.4 MB | download
Weltausstellung: Wiener Weltausstellungs-Zeitung | 1871-08-18 | 1875-11-19 | 233 | 1446 | 5.0 MB | download
Der Floh | 1869-01-01 | 1875-12-19 | 193 | 1893 | 6.3 MB | download
Wiener Abendzeitung | 1848-03-28 | 1848-10-24 | 106 | 438 | 0.6 MB | download
Feldkircher Anzeiger | 1866-01-02 | 1875-12-21 | 239 | 1498 | 1.0 MB | download
Allgemeine Österreichische Gerichtszeitung | 1851-01-03 | 1875-12-31 | 2233 | 9182 | 22.1 MB | download
Leitmeritzer Zeitung | 1871-07-08 | 1875-12-31 | 285 | 2530 | 7.3 MB | download
Feldkircher Wochenblatt | 1810-02-13 | 1857-12-22 | 743 | 3762 | 2.9 MB | download
Politische Frauen-Zeitung | 1869-10-17 | 1871-12-31 | 69 | 568 | 1.8 MB | download
Militär-Zeitung | 1849-07-03 | 1875-12-08 | 1628 | 12170 | 35.3 MB | download
Ellēnikos tēlegraphos: ētoi eidēseis dia ta anatolika mere | 1812-01-03 | 1836-12-27 | 1182 | 5343 | 10.9 MB | download
Blätter für Musik, Theater und Kunst | 1855-02-02 | 1873-12-27 | 1196 | 4840 | 16.8 MB | download
Cur-Liste Bad Ischl | 1842-06-02 | 1875-09-11 | 646 | 3998 | 2.7 MB | download
Innsbrucker Nachrichten | 1854-01-26 | 1875-12-31 | 4330 | 42010 | 36.4 MB | download
Der Humorist | 1837-01-02 | 1862-05-03 | 4430 | 18850 | 55.3 MB | download
Bregenzer Wochenblatt | 1793-03-15 | 1863-07-28 | 1725 | 8739 | 9.4 MB | download
Ephemeris | 1791-01-03 | 1797-12-11 | 311 | 2774 | 2.7 MB | download
Wiener Sonntags-Zeitung | 1867-01-01 | 1875-12-26 | 589 | 4326 | 20.5 MB | download
Österreichische Zeitschrift für Verwaltung | 1868-01-02 | 1875-12-30 | 280 | 1130 | 2.6 MB | download
Vorarlberger Zeitung | 1849-04-06 | 1850-03-22 | 67 | 272 | 0.6 MB | download
Die Gartenlaube für Österreich | 1867-01-28 | 1869-04-19 | 67 | 937 | 2.5 MB | download
Allgemeine land- und forstwirthschaftliche Zeitung | 1851-07-05 | 1867-12-27 | 301 | 3742 | 7.1 MB | download
Wiener Vororte-Zeitung | 1875-02-15 | 1875-11-01 | 13 | 52 | 0.3 MB | download
Siebenbürgisch-deutsches Wochenblatt | 1868-06-10 | 1873-12-31 | 193 | 3182 | 7.3 MB | download
Neue Wiener Musik-Zeitung | 1852-01-15 | 1860-12-29 | 312 | 1289 | 3.8 MB | download
Österreichische Badezeitung | 1872-04-14 | 1875-08-22 | 54 | 600 | 1.6 MB | download
Deutsche Zeitung | 1872-04-02 | 1874-12-29 | 604 | 9284 | 63.3 MB | download
Internationale Ausstellungs-Zeitung | 1873-05-02 | 1873-09-30 | 79 | 492 | 3.1 MB | download
Janus | 1818-10-10 | 1819-06-30 | 52 | 236 | 0.4 MB | download
Wiener Moden-Zeitung | 1862-01-01 | 1863-07-15 | 13 | 126 | 0.3 MB | download
Die Emancipation | 1875-04-22 | 1875-05-25 | 8 | 64 | 0.1 MB | download
Die Vedette | 1869-11-01 | 1875-12-19 | 187 | 3253 | 5.8 MB | download
Salzburger Chronik | 1873-07-01 | 1875-12-30 | 238 | 986 | 3.1 MB | download
Wiener Feuerwehr-Zeitung | 1871-01-01 | 1875-12-15 | 78 | 336 | 0.7 MB | download
Gerichtshalle | 1857-03-30 | 1875-12-23 | 1005 | 6132 | 14.6 MB | download
Illustrirtes Wiener Extrablatt | 1872-03-24 | 1875-12-31 | 662 | 6354 | 29.7 MB | download
Wiener Salonblatt | 1870-03-13 | 1875-12-24 | 138 | 2170 | 5.0 MB | download
Sonntagsblätter | 1842-01-16 | 1848-09-17 | 227 | 5277 | 6.1 MB | download
Wiener Theater-Zeitung | 1806-07-15 | 1838-12-29 | 3110 | 14345 | 33.5 MB | download
Wiener Landwirtschaftliche Zeitung | 1868-01-03 | 1869-12-18 | 76 | 746 | 2.3 MB | download
Vorarlberger Volks-Blatt | 1866-06-15 | 1875-12-31 | 644 | 4143 | 10.0 MB | download
Marburger Zeitung | 1862-04-13 | 1870-11-30 | 104 | 447 | 1.6 MB | download
Vaterländische Blätter für den österreichischen Kaiserstaat | 1808-05-10 | 1820-12-27 | 816 | 5861 | 9.0 MB | download
Freie Pädagogische Blätter | 1867-01-19 | 1875-12-25 | 316 | 5136 | 7.0 MB | download
Jörgel Briefe | 1852-01-02 | 1875-12-06 | 757 | 14086 | 13.0 MB | download
Österreichische Feuerwehrzeitung | 1865-08-15 | 1872-06-02 | 95 | 430 | 1.2 MB | download
Österreichische Buchdrucker-Zeitung | 1873-02-11 | 1875-12-30 | 96 | 675 | 1.9 MB | download