====== Songbird Vocal Segmentation ======

==== Introduction ====

Here we share a database of complete day-long audio recording sets of single male zebra finches (//Taeniopygia guttata//) at different developmental stages, recorded in sound-proof isolation chambers. Recording was triggered by vocalizations (or other sounds); recordings are therefore unevenly spaced in time, depending on the activity of the bird, and each recording/file contains vocalizations with some silence before and after them. The data were recorded between 2011 and 2016 as part of scientific projects conducted in the labs of Richard Hahnloser (see Details of subset 1) and Dina Lipkind (see Details of subset 2). For a fraction of each day-long set, vocal segments (not further classified into vocalization types) were later manually annotated to benchmark vocal segmentation algorithms for zebra finches across different developmental stages (Tomka et al., in prep).

{{:2.png?600|}}

**Table 1: Dataset overview.** The age of the birds is specified in days post-hatch (dph). The last four columns specify how many minutes of the day-long recordings have been annotated, the number of annotated vocalizations, the fraction of time with vocal activity in the annotated recordings (“vocal density”), and the range of vocalization durations, respectively.

==== Data Description ====

The database is divided into two subsets: adult male zebra finch vocalizations (subset 1) and juvenile male zebra finch vocalizations (subset 2). Table 1 shows the metadata for each recorded day (specified by the age) of a given bird. In the following, we first describe general information about the structure and usage of the data, then detail the experimental settings used for each subset, and lastly give a detailed description of the annotation conventions we have used.

**Subset 1**

//Data Description//

Day-long recordings of 4 adult zebra finches, including an annotated gold-standard recording subset.

//Methods//

All birds were raised in the animal facility of the University of Zurich. During recording, birds were housed in single cages in custom-made sound-proof recording chambers equipped with a wall microphone (Audio-Technica Pro42) and a loudspeaker. The day/night cycle was 14/10 h. Vocalizations were saved using custom song-recording software (LabVIEW, National Instruments Inc.). Sounds were recorded with the wall microphone and digitized at 32 kHz. Vocal activity in all birds was recorded for at least 3 days before the recorded day published here.

G17y2 had been recorded in an isolation chamber for another experiment prior to the experiment from which these data are taken. The bird was isolated again on 16.9.2015 in FUR10 (wooden isolation chamber) and recorded without any further manipulation before the day taken for this data set (27.10.2015).

G4p5 was isolated for the first time on 16.4.2013 in a metal isolation chamber ISO9 and recorded without any manipulation before the day taken for this data set (21.4.2013).

G19o3 was isolated for the first time on 6.4.2016 in a metal isolation chamber ISO10 and recorded without any manipulation before the day taken for this data set (14.4.2016).

G19o10 was isolated for the first time on 29.3.2016 in a wooden isolation chamber FUR10 and then moved to a metal isolation chamber ISO2 on 6.4.2016. It was recorded continuously until the day taken for this data set (24.5.2016). Before this day, we recorded some directed songs by exposing the bird to a female placed in front of a window in the isolation chamber (days: 15.4.2016 and 20.4.2016). Furthermore, we trained the bird to decrease its pitch by playing back a white noise burst (50 ms) whenever the pitch of a manually selected target syllable was above a manually set threshold (white noise playback by an automated system). The bird was subjected to this contingent white noise feedback from 1.5.2016 to 12.5.2016, after which the syllable’s pitch was allowed to recover back towards baseline. There was no further manipulation after 12.5.2016 until the day taken for this data set (24.5.2016).
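Both subsets share the triggered, file-per-recording structure described in the introduction. As a minimal sketch of how one might iterate over a day-long set, consider the following; the directory layout and file naming are hypothetical, the ''soundfile'' package is just one possible reader, and the 32 kHz sampling rate reflects the subset 1 recording settings described above:

<code python>
# Minimal sketch: iterate over one day-long set of triggered recordings.
# The directory layout and file naming are hypothetical assumptions;
# only the 32 kHz digitization is stated in the methods (subset 1).
from pathlib import Path
import soundfile as sf

def load_day(day_dir):
    """Yield (filename, samples, sample_rate) for each recording of one day."""
    for wav_path in sorted(Path(day_dir).glob("*.wav")):
        samples, sr = sf.read(str(wav_path))
        yield wav_path.name, samples, sr

for name, x, sr in load_day("G17y2/2015-10-27"):   # hypothetical path
    assert sr == 32000  # subset 1 recordings were digitized at 32 kHz
    print(f"{name}: {len(x) / sr:.2f} s")
</code>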
**Subset 2**

//Data Description//

Day-long recordings of 4 juvenile zebra finches, including an annotated gold-standard recording subset.

//Methods//

Animal care and experimental procedures were conducted in accordance with the guidelines of the US National Institutes of Health and were reviewed and approved by the Institutional Animal Care and Use Committee of Hunter College. Male zebra finches were bred at Hunter College and reared in the absence of adult males between days 7–30 post-hatch. Afterwards, birds were kept singly in sound-attenuation chambers and continuously recorded. From day 33–39 until day 43, birds were passively exposed to 20 playbacks per day of the source song, occurring at random with a probability of 0.005 per second. On day 43, each bird was trained to press a key to hear song playbacks, with a daily quota of 20. Once birds learned the source song, we switched to playbacks of the target song. Learning of the source was assessed by quantifying the percentage similarity (Sound Analysis Pro) between the bird’s song motifs and the source model motif in 10 randomly chosen song bouts per day. We considered the source song as learned when the similarity to the model was at least 70%. Since the sensitive period for song learning in zebra finches ends around day 90–100 post-hatch, we had to select for relatively fast learners of the source song. Therefore, in tasks 1–3 and 5, we used only birds that learned the source before day 68 (mean switch day 62.0 ± 0.8; n = 20; 39% of the total birds trained with the source). Because the source song models in task 4 were more complex than in the other tasks, we extended the switch threshold to day 84 for this task (mean switch day 72.0 ± 2.5; n = 7; 12% of the total birds trained with the source). Recording and training were done using Sound Analysis Pro, and continued until birds reached adulthood (day 99–158 post-hatch). At these ages, males are sexually mature and perform a crystallized song motif, which remains unchanged for the remainder of their lives.

Source and target song models were synthetically composed of natural syllables. Each model included either one or two harmonic syllables, which we used to generate pitch mismatches between source and target syllables (GoldWave v. 5.68, www.goldwave.com). Each playback of a model included two motif renditions. To control for model-specific effects, we varied baseline pitch as well as pitch-shift size and direction across experimental birds. The juvenile birds of this data set were exposed first to playback of one tutor song and, after successful learning of this first song, to playback of a second song. The second song was highly similar to the first song, differing only in the pitch of certain syllables. An overview of the birds’ tasks is given in Table 2.

{{:3.png?600|}}

**Table 2: Experimental task of juveniles featured in the dataset.** Syllables are enumerated with capital letters (A, B, C) according to their order in the song motif. Pitch changes in the second song are indicated with a +p or -p trailing the manipulated syllable, where p is the change in semitones.
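The +p/-p notation maps onto a frequency ratio via standard equal-temperament arithmetic. A minimal illustration (this is not code from the original study, and the baseline fundamental below is a hypothetical example value):

<code python>
# Frequency ratio implied by a pitch change of p semitones, as in the
# +p/-p notation of Table 2. Standard equal-temperament arithmetic.
def semitone_ratio(p: float) -> float:
    """Multiplicative frequency change for a shift of p semitones."""
    return 2.0 ** (p / 12.0)

base_pitch_hz = 600.0                        # hypothetical baseline fundamental
print(base_pitch_hz * semitone_ratio(2.0))   # syllable shifted up 2 semitones: ~673.5 Hz
print(base_pitch_hz * semitone_ratio(-2.0))  # shifted down 2 semitones: ~534.5 Hz
</code>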
==== Data Processing ====

**Vocal segmentation conventions for microphone recordings of single birds**

Vocal signals tend to arise from discrete acoustic units, a characteristic shared across the polymorphic landscape of vocalizing species (1,2). Animal studies in monkeys, dogs, chickens, and songbirds have shown that animal calls can be used to communicate semantically meaningful information such as detection of predators, discovery of food, or attraction of mates (3–13). Nevertheless, the functions of animal vocalizations are generally unknown for most calls and species (1,14). To advance our understanding of vocal communication in animals, we need to study large and well-annotated data sets.

Here we address the problem of how to segment audio recordings of a given species. The segmentation problem is to distinguish the times at which an animal vocalizes from the times at which it does not. One of the simplest methods of segmenting vocalizations from continuous recordings is to consider sound amplitude and to define as vocalizations all sounds above a given threshold (a minimal sketch of this baseline is given below). However, this procedure will misclassify certain noises as vocalizations, which is why more refined approaches are needed that potentially make use of the statistics of the individual (15). In the extreme case, we need to inspect every single potential vocalization and decide based on expert knowledge where to draw the dividing line between vocalization and noise.

To standardize the segmentation task, we have created this set of guidelines based on two decision boundaries for a vocalization: a) the decision whether there is a silent period between two sounds, which we take by inspecting spectrograms (Figure 1, left); b) the decision whether a sound is vocal or non-vocal (Figure 1, right; Figures 2–3).

Birds, especially when young, tend to vary the gap between vocalizations. An example is shown in Figure 1 (yellow dotted box): this sequence of three vocal elements looks like a precursor of syllable C that the juvenile tries to imitate, but the elements appear with sufficiently large gaps, which is why we sometimes classify them as 3 distinct syllables. Thus, for a) we infer a gap wherever we can visually detect one, irrespective of other singing attempts of the animal.

The second decision boundary (b) is harder to define universally from single-microphone recordings; ideally, we would have simultaneous recordings from the trachea to measure sounds and air flow there. In practice, it is a human expert who judges whether a sound is vocal or non-vocal by listening to examples and inspecting the corresponding spectrograms. Again, this task is relatively simple for highly stereotyped vocalizations, but more difficult for the faint, short, and variable vocalizations of juveniles (Figure 1, right; Figure 2, left; Figure 3). A special case consists of faint sounds (usually at around 6 kHz) that frequently occur after (or, less frequently, before) vocalizations (Figure 2, left). These marginally vocal sounds might be inhalation sounds (15,16) and we exclude them from the vocal dataset (default setting). We have annotated those sounds only in juvenile birds, where they were often more prominent and diverse than in adults.
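As a reference point for the amplitude-threshold baseline mentioned above, here is a minimal Python sketch. The threshold, smoothing window, and minimum-gap values are illustrative assumptions, not the parameters of our gold-standard annotation:

<code python>
# Minimal sketch of the amplitude-threshold baseline: mark as "vocal"
# every region whose smoothed amplitude envelope exceeds a fixed
# threshold. All parameter values below are hypothetical.
import numpy as np

def threshold_segments(x, sr, thresh=0.01, win_s=0.005, min_gap_s=0.010):
    """Return (onset, offset) times in seconds of supra-threshold regions."""
    win = max(1, int(win_s * sr))
    # smoothed amplitude envelope via moving average of |x|
    envelope = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
    above = envelope > thresh
    # rising/falling edges of the boolean mask
    edges = np.diff(above.astype(int))
    onsets = np.where(edges == 1)[0] + 1
    offsets = np.where(edges == -1)[0] + 1
    if above.size and above[0]:
        onsets = np.insert(onsets, 0, 0)
    if above.size and above[-1]:
        offsets = np.append(offsets, len(x))
    # merge segments separated by gaps shorter than min_gap_s
    merged = []
    for on, off in zip(onsets, offsets):
        if merged and on - merged[-1][1] < min_gap_s * sr:
            merged[-1][1] = off
        else:
            merged.append([on, off])
    return [(on / sr, off / sr) for on, off in merged]
</code>

As the text notes, such a baseline cannot by itself separate vocal from non-vocal sounds; it only implements decision boundary a) approximately, and misses everything covered by decision boundary b).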
{{:4.png?600|}}

**Figure 1: Definition of vocal segments as continuous intervals of vocal activity.** (left) Zebra finch song examples at 59 days post-hatch, aligned to notes that resemble the beginning of syllable C. At this stage, syllable C is surrounded by clear gaps most of the time (top 6 examples). However, in a minority of cases, no silent gap is visible between the preceding syllable B and the first note of syllable C (bottom 6 examples, boundary case indicated with magenta arrow). Gold-standard segmentation labels of syllable-C notes (yellow) and of other vocalizations (orange, purple) are indicated by bars below the spectrograms. (right) Vocalizations recorded at 49 days post-hatch (red bars), aligned to examples that resemble syllable C. Short noisy sounds within syllable precursors (green arrow) have not been classified as vocal activity based on isolated visual inspection, but likely would be if the context were taken into account. The yellow dotted box marks three vocal elements that could potentially be interpreted as a unitary precursor of syllable C if the developmental endpoint were taken into account. Bars as on the left.

{{:5.png?600|}}

**Figure 2: Decision boundary between vocal and non-vocal sounds.** (left) Spectrogram examples of marginally vocal sounds (indicated with purple bars) observed in a zebra finch at 59 days post-hatch (excluded from the gold standard by default). (right) Examples of non-vocal noises, which may include prominent tones (green arrows), wide-band noise (blue arrows), or very faint signals (magenta arrows).

{{:6.png?600|}}

**Figure 3: Detailed decision boundary between vocal sounds and wing flaps.** Spectrogram examples of short noises. Wing flaps are easy to detect on spectrograms when occurring in serial repetition (i.e., when the bird is flying; magenta arrows). For short sounds, indicators of vocal activity can be harmonics (green arrow) or a strong skew in the spectral density towards certain frequencies (low-frequency sounds indicated with blue arrows).
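All of the decisions illustrated in Figures 1–3 are taken by visually inspecting spectrograms. For readers reproducing this inspection step, a minimal sketch using standard tools follows; the window length and overlap are illustrative choices, not the settings used to produce the published figures:

<code python>
# Sketch: log-power spectrogram for visual inspection of candidate
# vocal segments. Parameters are illustrative, not those of Figures 1-3.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

def plot_spectrogram(x, sr=32000):
    """Plot a log-power spectrogram of a microphone recording."""
    f, t, sxx = spectrogram(x, fs=sr, nperseg=512, noverlap=384)
    plt.pcolormesh(t, f / 1000.0, 10 * np.log10(sxx + 1e-12), shading="auto")
    plt.xlabel("time (s)")
    plt.ylabel("frequency (kHz)")
    plt.show()
</code>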
==== Discussion ====

The examples we provided illustrate our decision boundaries and the difficulties of segmentation approaches. In summary, we advocate defining vocal segments as tightly restricted intervals of continuous vocal activity. These segments should be defined independently of functional considerations. How to extract functional units from vocal segments is an open question; the answer may depend on whether the vocal units are assessed in the domain of perception (receiver) or production (sender). Still, it is regarded as ideal to validate chosen segmentations based on the functional roles of the vocal signals (1,14,19). However, recent work in songbirds suggests that “syllables may not be perceptual units for songbirds as opposed to common assumption” (18).

===== Bibliography =====

  - Kershenbaum A, Blumstein DT, Roch MA, Akçay Ç, Backus G, Bee MA, et al. Acoustic sequences in non-human animals: a tutorial review and prospectus. Biol Rev Camb Philos Soc. 2016 Feb;91(1):13–52.
  - Hauser MD, Chomsky N, Fitch WT. The faculty of language: what is it, who has it, and how did it evolve? Science. 2002 Nov 22;298(5598):1569–1579.
  - Slobodchikoff CN, Kiriazis J, Fischer C, Creef E. Semantic information distinguishing individual predators in the alarm calls of Gunnison’s prairie dogs. Anim Behav. 1991 Nov;42(5):713–719.
  - Dittus WPJ. Toque macaque food calls: Semantic communication concerning food distribution in the environment. Anim Behav. 1984 May;32(2):470–477.
  - Hauser M. Functional referents and acoustic similarity: field playback experiments with rhesus monkeys. Anim Behav. 1998 Jun;55(6):1647–1658.
  - Seyfarth RM, Cheney DL, Marler P. Monkey responses to three different alarm calls: evidence of predator classification and semantic communication. Science. 1980 Nov 14;210(4471):801–803.
  - Fischer J. Barbary macaques categorize shrill barks into two call types. Anim Behav. 1998 Apr;55(4):799–807.
  - Gouzoules S, Gouzoules H, Marler P. Rhesus monkey (Macaca mulatta) screams: Representational signalling in the recruitment of agonistic aid. Anim Behav. 1984 Feb;32(1):182–193.
  - Zuberbühler K, Cheney DL, Seyfarth RM. Conceptual semantics in a nonhuman primate. J Comp Psychol. 1999;113(1):33–42.
  - Marler P, Dufty A, Pickert R. Vocal communication in the domestic chicken: II. Is a sender sensitive to the presence and nature of a receiver? Anim Behav. 1986 Feb;34:194–198.
  - Marler P, Dufty A, Pickert R. Vocal communication in the domestic chicken: I. Does a sender communicate information about the quality of a food referent to a receiver? Anim Behav. 1986 Feb;34:188–193.
  - Suzuki TN. Semantic communication in birds: evidence from field research over the past two decades. Ecol Res. 2016 May;31(3):307–319.
  - Gill SA, Bierema AM-K. On the meaning of alarm calls: A review of functional reference in avian alarm calling. Ethology. 2013 Jun;119(6):449–461.
  - Sainburg T, Gentner TQ. Toward a computational neuroethology of vocal communication: from bioacoustics to neurophysiology, emerging tools and future directions. Front Behav Neurosci. 2021 Dec 20;15.
  - Tchernichovski O, Nottebohm F, Ho CE, Pesaran B, Mitra PP. A procedure for an automated measurement of song similarity. Anim Behav. 2000 Jun;59(6):1167–1176.
  - Goller F, Daley MA. Novel motor gestures for phonation during inspiration enhance the acoustic complexity of birdsong. Proc Biol Sci. 2001;268(1483):2301–2305.
  - Riede T, Schilling N, Goller F. The acoustic effect of vocal tract adjustments in zebra finches. J Comp Physiol A Neuroethol Sens Neural Behav Physiol. 2013;199(1):57–69.
  - Mizuhara T, Okanoya K. Do songbirds hear songs syllable by syllable? Behav Processes. 2020 Feb 24;104089.
  - Suzuki R, Buck JR, Tyack PL. Information entropy of humpback whale songs. J Acoust Soc Am. 2006 Mar;119(3):1849–1866.