Let Them Speak

Simple Search

You can search for a single word or a sequence of words in the testimony transcripts that this edition contains. If you enter you or you went into the search box, you will be given a filterable concordance of their occurrences; by clicking on a result, you can read it in context, i.e. in the interview transcript (listening in context is not available; interviews are played from the beginning). Using the box on the left side, search results can also be filtered by the following metadata on interviews and interviewees:

- Collection
- Gender
- Ghetto
- Camp
- Interviewee Name
- Recording Year

In the Methodology section, you can read more about how the metadata provided by each institution was processed and made available here, and about how transcripts were processed and incorporated into this edition.

Corpus Search

In addition to simple search, you can also search the collection of Holocaust interviews as a linguistic corpus. For this, you need to use the Corpus Query Language (henceforth, CQL). This is a pattern-matching language used to extract information from large bodies of linguistically annotated text. Below you can find a short explanation of what linguistic annotation and a linguistic corpus are.

By clicking here, you can download my comprehensive tutorial about the use of CQL.

Below you can also find short use case examples that highlight how CQL can offer insights into Holocaust experiences. The use case examples have the following structure. First, a Holocaust-related research problem that traditional word search cannot resolve is outlined. Second, how CQL can help to tackle the research problem is explained. Third, examples of CQL queries are given.

Use case examples are organized according to three core layers of textuality: individual words, sequences of words, and sentences. You can use CQL to search these three layers; moreover, you can combine them into complex queries and include paratextual events such as crying and pauses by interviewees. Finally, you can filter the search results produced by CQL through the box that appears on the left side following a search.

The list of paratextual event codes provided by USC can be downloaded from here.

The list of paratextual event codes provided by the Fortunoff Archive can be downloaded from here.

What is a linguistic corpus?

In a linguistic corpus, the grammatical category of each word (called its part-of-speech category and abbreviated here as pos) and its dictionary form (called its lemma) are identified and stored in a dedicated database. The computer-assisted process of identifying the lemma and the pos of every word in a collection of texts is called linguistic annotation. (How the linguistic corpus was constructed from interview transcripts is explained in the Methodology section.) The grammatical categories and their abbreviated forms used to annotate the interview corpus can be found here.
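As a rough illustration of what an annotated corpus stores, the following Python sketch represents a short phrase as a list of tokens, each carrying a word, a lemma, and a pos attribute. The Token class and the hand-assigned annotations are assumptions made for this example (the tags follow the style of abbreviations such as VBG used in this edition); they are not the internal format of the corpus engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # the form as it appears in the transcript
    lemma: str  # the dictionary form
    pos: str    # the part-of-speech tag


# Hand-annotated example phrase (hypothetical, for illustration only).
phrase = [
    Token("mothers", "mother", "NNS"),
    Token("were", "be", "VBD"),
    Token("crying", "cry", "VBG"),
]

# Print one token per line: word, lemma, pos.
for token in phrase:
    print(f"{token.word}\t{token.lemma}\t{token.pos}")
```

Every CQL query described on this page is, in effect, a condition on one or more of these three attributes.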

The database that stores the results of linguistic annotation is a corpus engine. It facilitates not only the storage of millions of words but also efficient search through them. The corpus engine that powers this project is the BlackLab engine developed by the Dutch Language Institute.

Word Level Search

1. Find all possible forms of a verb with the lemma attribute: flee, flees, fleeing, and fled

Research Problem:

Readers searching for moments in interviews when a victim is recalling the experience of fleeing face a difficulty: a simple search for flee would not find other forms such as fleeing, flees, and fled.

Solution:

The corpus engine stores the lemma of every word in the 2700 transcripts; in more technical terms, each word in the corpus has a lemma attribute. As a result, readers can use the lemma attribute as a search criterion to find all possible forms of a noun or a verb such as flee. In CQL, attributes used as search criteria have to be placed between a pair of square brackets, which will then match individual words.

[lemma="flee"]

This pattern matches all words the dictionary form or lemma of which is flee; the engine will return a concordance of sentences where different forms of flee occur.

On the other hand, CQL can also find the occurrences of one given word form with the help of the word attribute. For instance, readers might want to find all occurrences of fleeing. In this case they can formulate their CQL query in the following way:

[word="fleeing"]

Note that matching by the word attribute is the same as a simple word search.
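The difference between the lemma attribute and the word attribute can be sketched in a few lines of Python. The token list below is hand-annotated for this example, and the two helper functions are hypothetical stand-ins for what the corpus engine does; they are not part of CQL or of the engine itself.

```python
# Each token is a (word, lemma) pair; the annotations are hand-written for this example.
tokens = [
    ("We", "we"), ("fled", "flee"), ("while", "while"),
    ("others", "other"), ("were", "be"), ("fleeing", "flee"),
]


def match_lemma(tokens, lemma):
    """Return the word forms whose lemma equals `lemma`, in the spirit of [lemma="..."]."""
    return [word for word, token_lemma in tokens if token_lemma == lemma]


def match_word(tokens, form):
    """Return the word forms equal to `form`, ignoring case, in the spirit of [word="..."]."""
    return [word for word, _ in tokens if word.lower() == form.lower()]


print(match_lemma(tokens, "flee"))    # ['fled', 'fleeing']
print(match_word(tokens, "fleeing"))  # ['fleeing']
```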

2. Disambiguate with part-of-speech information: fly (meaning insect) versus fly (meaning travel through air)

Research Problem:

Readers want to find textual contexts where victims talk about the experience of being bothered by flies. By entering fly or flies into the search box, they are also given textual contexts where fly means traveling through air.

Solution:

CQL enables the combination of lemma and grammatical category, defined through the pos attribute, which can be used for disambiguation.

The following query finds all occurrences of fly, including its suffixed forms, where fly is used as a noun.

[lemma="fly" & pos="N.*"]

The following query finds all occurrences of fly, including its suffixed forms, where fly is used as a verb.

[lemma="fly" & pos="V.*"]

This example highlights two very important features of CQL. First, attributes can be connected with the & operator; this expresses the logical relationship that natural languages express with and. In other words, the pattern above matches a given word if its lemma is fly and if it is used as a verb.

Second, when defining the content of an attribute, CQL enables character-level pattern matching, also known as regular expressions. In the example above, the pos attribute, standing for grammatical category, is defined by the sequence of V, a dot, and an asterisk: V.* In this list, you will find the abbreviations of all grammatical categories used to annotate interview transcripts. You will not find V there; instead you will find, for instance, VB (base form of a verb) or VBN (past participle of a verb). V.* still matches all verb tags thanks to character-level pattern matching. The dot followed by an asterisk indicates that V can be followed by any number of additional characters. In more technical terms, the dot is a wildcard that stands for any single character; the asterisk, known as a quantifier, says that the wildcard can occur zero or more times. Hence, V.* covers both VB and VBN. In CQL, just as in regular expressions, other quantifiers besides the asterisk are also available (see my tutorial).
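The behaviour of V.* can be tried out with Python's re module, whose basic regular expression syntax is the same for this pattern. The sketch assumes that the pattern has to cover the entire tag (hence fullmatch), and the tag list is a small illustrative selection rather than the full set used in this edition.

```python
import re

# A few part-of-speech tags; an illustrative selection, not the full list.
tags = ["VB", "VBD", "VBG", "VBN", "NN", "NNS", "IN"]

verb_pattern = re.compile(r"V.*")

# fullmatch requires the pattern to cover the entire tag.
verb_tags = [tag for tag in tags if verb_pattern.fullmatch(tag)]
print(verb_tags)  # ['VB', 'VBD', 'VBG', 'VBN']
```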

3. Search for synonyms with multiple lemmas: mother, mummy, etc.

Research Problem:

Readers want to find textual contexts where victims speak about the experience of mothers, which can be expressed through a number of synonyms (mum, mummy, mom, etc.). A simple word search does not include synonyms.

Solution:

With CQL readers can search for multiple lemmas at the same time; they can thus define an entire synonym set within one search.

[lemma="mother" | lemma="mum" | lemma = "mummy" | lemma = "mom" | lemma = "mommy"]

In this pattern, the elements of the mother synonym set are connected with |, which is the or operator in CQL. The pattern finds terms the dictionary form of which is mother, mum, mummy, mom, or mommy.

4. Find terms with spelling variants: capo and kapo.

Research Problem:

The same term can be present in the data with different spellings. For instance, you can find both capo and kapo in the transcripts.

Solution:

[lemma = "(c|k)apo"]

This pattern will match a word if the first character of its lemma is either c or k and if this first character is followed by the sequence apo. The first character is isolated from the remaining characters by means of parentheses, which act as a grouping operator; the either-or relation is expressed by |.
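The grouping and alternation used in this pattern can be checked with Python's re module. This is a sketch that assumes the pattern must cover the whole lemma; the candidate list is made up for the example.

```python
import re

pattern = re.compile(r"(c|k)apo")

# Made-up candidate lemmas; only the first two start with c or k followed by "apo".
candidates = ["capo", "kapo", "apo", "kapos"]

matches = [lemma for lemma in candidates if pattern.fullmatch(lemma)]
print(matches)  # ['capo', 'kapo']
```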

5. Find terms with both British and American spelling: labour and labor

Research Problem:

Transcripts sometimes follow British and sometimes American spelling. For instance, both labour and labor are present in the transcripts. It is therefore recommended to run searches that cover both spelling systems.

Solution:

[lemma="labo(u?)r"]

This pattern will match both labour and labor. With the help of the ? operator, the presence of u becomes optional. In other words, u, isolated as a group by means of parentheses, can be absent or present. The query above can also be expressed with 0,1 surrounded by curly brackets, which explicitly states the minimum and maximum number of times a character can be present.

[lemma="labo(u{0,1})r"]
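That the two ways of writing the optional u are equivalent can be verified with Python's re module, which uses the same ? and {0,1} quantifiers. The sketch assumes that the pattern must cover the whole lemma; the third test string is a deliberately wrong spelling.

```python
import re

optional_u = re.compile(r"labo(u?)r")
counted_u = re.compile(r"labo(u{0,1})r")

# For each candidate, record whether each of the two patterns matches it.
results = {}
for lemma in ["labor", "labour", "labouur"]:
    results[lemma] = (
        bool(optional_u.fullmatch(lemma)),
        bool(counted_u.fullmatch(lemma)),
    )
    print(lemma, results[lemma])
```

Both patterns accept labor and labour and reject the form with two u characters.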

6. Differentiate homonymic terms with the help of case sensitivity: Joint (The American Joint Distribution Committee) versus joint (body part)

Research Problem:

Our default search is case-insensitive. By searching for joint or Joint, readers will be given occurrences where joint refers either to the colloquial name of The American Joint Distribution Committee or to a body part. One thus needs to differentiate the two meanings of joint.

Solution:

Since Joint as the colloquial name of The American Joint Distribution Committee always begins with a capital letter, case sensitivity can be used to force CQL to find only those instances where the first letter is capitalized.

[word="(?-i)Joint"]

Case sensitivity is enforced by means of (?-i). At the same time, the pattern above still matches Joint as a body part if it is at the beginning of a sentence.
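The effect of switching case-insensitivity off can be imitated in Python. Note the difference in notation, which is an assumption of this sketch rather than something CQL requires: Python's re module does not accept a bare (?-i) prefix and instead uses the scoped form (?-i:...). The default case-insensitive search is modelled here with the re.IGNORECASE flag.

```python
import re

words = ["Joint", "joint", "JOINT"]

# Default behaviour assumed here: matching ignores case.
insensitive = [w for w in words if re.fullmatch(r"joint", w, re.IGNORECASE)]

# (?-i:...) switches case-insensitivity off for the enclosed part of the pattern.
sensitive = [w for w in words if re.fullmatch(r"(?-i:Joint)", w, re.IGNORECASE)]

print(insensitive)  # ['Joint', 'joint', 'JOINT']
print(sensitive)    # ['Joint']
```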

Sequence Matching

1. Search for possible word sequences: mothers were crying, mother cried, mother started to cry

Research Problem:

The retrieval of moments when an interviewee is speaking about mothers crying is difficult. This can be expressed in a variety of ways and between mother and cry there can be multiple terms.

Solution:

[lemma="mother"] []{0,3} [lemma="cry"]

This pattern matches sequences where a term, the dictionary form of which is mother, is followed by another term, the dictionary form of which is cry, within a window of at most three words. {0,3} indicates that between mother and cry there can be between zero and three terms; [] indicates that a term in between can be any word.
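The logic of such a window can be sketched in Python over a list of lemmas. The function find_within_window and the hand-annotated lemma list are hypothetical and serve only to illustrate the idea; the corpus engine's own implementation is not shown here.

```python
def find_within_window(lemmas, first, second, max_gap):
    """Return (start, end) index pairs where `first` is followed by `second`
    with at most `max_gap` tokens in between."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        # The second lemma may sit 1 to max_gap + 1 positions after the first.
        for j in range(i + 1, min(i + max_gap + 2, len(lemmas))):
            if lemmas[j] == second:
                hits.append((i, j))
    return hits


# Lemmas of "the mother started to cry and the mother cried" (hand-annotated).
lemmas = ["the", "mother", "start", "to", "cry", "and", "the", "mother", "cry"]
print(find_within_window(lemmas, "mother", "cry", max_gap=3))  # [(1, 4), (7, 8)]
```

With max_gap set to 0, only the second occurrence, where cry directly follows mother, would be found.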

2. Match sequences with similar meaning through grouping operation: I will never forget and I will always remember

Research Problem:

A key moment in an interview is when a victim utters the phrase I will never forget. But this can also be expressed as I will always remember or I couldn’t forget.

Solution:

First, one needs to write two sequences: one in which I is followed by never or n’t (both of which express negation) and forget, and another in which I is followed by always and remember.

[word="I"] []{0,5}[word="never" | word = "n't" ] [lemma="forget"]

[word="I"] []{0,5}[word="always"] [lemma="remember"]

Second, the two sequences need to be connected as groups with the or (|) operator; grouping is done with the help of parentheses.

([word="I"] []{0,5}[word="never" | word = "n't" ] [lemma="forget"]) | ([word="I"] []{0,5}[word="always"] [lemma="remember"])

3. Find repetitive sequences: why, why, why

Research Problem:

The repetition of the same term can signal moments when traumatic memories are recalled. Finding repetitive uses of words is not possible with traditional word search.

Solution:

We need to create a sequence and declare that the elements of the sequence are the same. In the example below, we create a sequence of three terms separated by commas, and in the last step we declare that the elements of the sequence are the same word. When using this pattern, you need to insert your query between the < and > signs.

<A:[] [word=","] B:[] [word=","] C:[] :: A.word = B.word & B.word = C.word>
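The idea behind this query, three identical words separated by commas, can be sketched in Python over a list of word tokens. The function name and the sample sentence are made up for this illustration, and the comparison ignores case on the assumption that the default search does so as well.

```python
def find_triple_repetitions(words):
    """Return start indexes where the same word occurs three times, separated by commas."""
    hits = []
    for i in range(len(words) - 4):
        a, sep1, b, sep2, c = words[i:i + 5]
        same_word = a.lower() == b.lower() == c.lower()
        if sep1 == "," and sep2 == "," and same_word and a != ",":
            hits.append(i)
    return hits


words = ["Why", ",", "why", ",", "why", "did", "it", "happen", "?"]
print(find_triple_repetitions(words))  # [0]
```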

Matching Sentences

1. Find rhetorical questions: why should I fear death?

Research Problem:

Interesting moments in testimonies occur when survivors ask themselves rhetorical questions; sometimes these rhetorical questions address the reason why something happened in the past.

Solution:

CQL can match complete sentences that contain certain patterns. In the example below, we are looking for a sentence in which why is followed by I (with one to three words in between) and, at most ten words later, by a question mark. When searching for sentences, you need to put your query between the < and > signs.

<<s/> containing ([word="why"] []{1,3} [word="I"] []{0,10} [word="?"] )>

2. Find analogies: like animals

Research Problem:

Survivors often use analogies to describe their experiences. In English, like is a preposition that can be used to express analogies. However, like can be either a preposition used to draw comparisons or a verb expressing a wish or affection.

Solution:

CQL can be used to distinguish like as a verb from like as a preposition. Furthermore, with CQL we can also form a sequence in which like as a preposition is followed by a noun.

[lemma="like" & pos="IN"] [pos="N.*"]
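How the combination of lemma and pos separates the two uses of like can be sketched in Python. The token triples below are hand-annotated for this example and the function find_like_analogies is hypothetical; the sketch only mirrors the logic of the query above.

```python
import re

# (word, lemma, pos) triples, hand-annotated for this example.
tokens = [
    ("They", "they", "PRP"), ("treated", "treat", "VBD"), ("us", "we", "PRP"),
    ("like", "like", "IN"), ("animals", "animal", "NNS"), (".", ".", "."),
    ("I", "I", "PRP"), ("like", "like", "VBP"), ("music", "music", "NN"), (".", ".", "."),
]


def find_like_analogies(tokens):
    """Return two-word hits where like tagged IN is directly followed by a noun."""
    hits = []
    for (word1, lemma1, pos1), (word2, _, pos2) in zip(tokens, tokens[1:]):
        if lemma1 == "like" and pos1 == "IN" and re.fullmatch(r"N.*", pos2):
            hits.append(f"{word1} {word2}")
    return hits


print(find_like_analogies(tokens))  # ['like animals']
```

The second like is tagged as a verb and is therefore skipped, even though it is also followed by a noun.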

3. Find possibilities: if I march and I fall, they will shoot me

Research Problem:

Survivors often recall troubling possibilities they faced during persecutions; retrieving these possibilities is almost impossible since they are often not expressed directly as possibilities.

Solution:

Conditional sentences, or if sentences, often convey possibilities experienced in the past; through the sequence of if and I followed by a verb, we can find examples of such possibilities.

[word="if"] [word="i"] [pos="V.*"]

4. Find moments of survivor guilt: I should have died, they should have lived

Research Problem:

Survivors' guilt is a leitmotif in testimonies, though survivors do not always express it explicitly.

Solution:

We can form a sequence consisting of I followed by should, have, and the past participle of a verb.

[word="i"] [word="should"] [word="have"] [pos="VBN"]

Finding Paratextual Events

1. Find moments of silence

Research Problem:

Testimonies are often interrupted by moments when survivors are unable to carry on recalling their memories. (See above links to resources containing the paratextual event codes.)

Solution:

Since these moments are marked by the term PAUSES in uppercase, you can search for them with the following query.

[word="(?-i)PAUSES"]

2. Find moments of crying

Research Problem:

Testimonies are often interrupted by crying. (See above links to resources containing the paratextual event codes.)

Solution:

Since these moments are marked by the term CRYING in uppercase, you can search for them with the following query.

[word="(?-i)CRYING"]