A review of big data analysis methods for baleen whale passive acoustic monitoring

Correspondence: Katie Kowarski, 202–32 Troop Avenue, Dartmouth, Nova Scotia B3B 1Z1, Canada. Email: katie.kowarski@gmail.com

Received: 17 April 2020; Accepted: 29 September 2020

Abstract

Many organizations collect large passive acoustic monitoring (PAM) data sets that need to be efficiently and reliably analyzed. To determine appropriate methods for effective analysis of big PAM data sets, we undertook a literature review of baleen whale PAM analysis methods. Methodologies from 166 studies (published between 2000 and 2019) were summarized, and a detailed review was performed on the 94 studies that recorded more than 1,000 hr of acoustic data ("big data"). Analysis techniques for extracting baleen whale information from PAM data sets varied depending on the research question. A spectrum of methodologies was used, ranging from manual analysis of all acoustic data by human experts to completely automated techniques with no manual validation. Based on this assessment, recommendations are provided to encourage robust research methods that are comparable across studies and sectors, achievable across research groups, and consistent with previous work. These include using automated techniques when possible to increase efficiency and repeatability, supplementing automation with manual review to calculate automated detector performance, and increasing consistency in terminology and presentation of results. This work can be used to facilitate discussion of minimum standards and best practices to be implemented in the field of marine mammal PAM.

Automated detector performance is commonly evaluated using counts such as true positives, false positives, and false negatives (Knight et al., 2017; Sokolova & Lapalme, 2009). The evaluation metrics utilized depend on the nature of the data and the question at hand, with multiple measures often used (Sokolova & Lapalme, 2009). Emerging and experienced researchers must determine the appropriate methods to employ for effective analysis of big PAM data sets.
The present study undertakes a literature review on the reported methods of analysis in baleen whale PAM studies. Other marine animals, such as odontocetes, seals, and fish, were excluded from the review to keep the literature reviewed to a manageable size. First, we describe how the amount of PAM data varies by platform (e.g., surface or near-surface buoys, over-the-side hydrophones, bottom-mounted recorders). Then, we explore how big PAM data analysis methods vary (in terms of when manual, automated, and combined techniques are employed) depending on how many species are targeted and the research question of interest. We consider the metrics used to describe automated signal detector performance and the terminology applied to methods. This work aims to summarize past efforts for baleen whale PAM analyses and present recommendations for ways forward to standardize how results of PAM studies are reported to encourage application of robust, comparable, and reliable methodologies.

| METHODS
We used Web of Science, EndNote, and Covidence to perform a systematic literature search for peer-reviewed publications on baleen whale PAM. Detailed methods on the literature search and subsequent screening are provided in Supplementary A. The initial search resulted in 1,557 studies. Through title, abstract, and full text screening, the review was ultimately reduced to 166 studies on the topic of baleen whale PAM that were published in the years 2000-2019, presented original research (were not review or overview articles), did not implement real-time or near real-time analysis or aim to localize animals, and were not entirely based on describing automated detectors (Table S1). Real-time and localization studies were excluded to keep the review to a manageable size, as these topics have specific methodological requirements and the literature on these topics is large.
Data from the 166 studies (Table S2) were collected in two stages. In stage one, the overall data analysis method, data volume (in hours recorded), and the data collection platform (e.g., bottom mounted, over-the-side, tag) were recorded for each article (Table S3). The overall data analysis method was categorized as full manual (no automated detector, manually reviewed all acoustic data collected), partial manual (no automated detector, manually reviewed a subset of the acoustic data collected), automated (employed one or more automatic detectors, did not manually review any data), or manual and automated (employed some combination of automated and manual analysis). In the second stage of the literature review, questions specific to big data were investigated, where we defined big data as studies with 1,000 hr or more of acoustic data. Big data articles were subjected to a methodology-based review that considered how much data were manually analyzed, how many baleen whale species were targeted, how data for manual analysis were selected, and what metrics were used to report automated detector performance (Table S3). Each study's main research question was categorized as one of the following: occurrence (the spatial, temporal, and/or diel presence of vocalizations); characterization (the description of vocalization characteristics that can vary from frequency-time characteristics of discrete non-song vocalizations to characterizing changing patterns over time and space of songs); a combination of occurrence and characterization; or, other (e.g., anthropogenic impacts).
Data extracted from the literature reviewed (in stages one and two) were summarized by data volume (seven categories ranging from 0-10 hr to 100,000-860,000 hr), number of species targeted (studies targeting 1-2 species or studies targeting all baleen whale species possibly acoustically present in the PAM data set), and the overarching research question (as defined above) to determine commonalities and differences in data analysis methods across studies. Similarities and differences among studies were examined to determine if there are general best practices commonly used and to develop recommendations for future studies. Some variables, such as data volume and percent of data manually analyzed, were not always explicitly provided and were calculated or estimated based on available information where possible.

| RESULTS
The systematic literature review resulted in 166 articles (from 110 different first authors) on the topic of PAM and baleen whales. Detailed tabular data by study can be found in the Supplementary Results. With increasing data volume included in a study, we observed a shift in both the data collection platform and the data analysis method. Studies with relatively small amounts of acoustic data were associated with varied data collection platforms, including tags directly attached to animals (Videsen et al., 2017), drifting buoys (Širović et al., 2006), and even snorkelers (Zoidis et al., 2008; Figure 1). In contrast, studies using big data predominantly collected data from autonomous bottom, or near-bottom, mounted systems (e.g., Burnham & Duffus, 2019; Soldevilla et al., 2014; Wright et al., 2019; Figure 1).
With increasing data volume, the predominant data analysis method switched from full manual analysis to a combination of manual and automated analysis (Figure 2).

| Big PAM data
The following results are restricted to articles with 1,000 hr or more of data (94 articles) and are considered big PAM data. Studies of big PAM data sets for baleen whales covered a range of species and research questions. The majority (71 out of 94 papers) targeted blue (Balaenoptera musculus), humpback, fin, or right whales, with proportionally fewer studies (24%) on bowhead (Balaena mysticetus), Bryde's (Balaenoptera brydei), gray (Eschrichtius robustus), or minke whales (Balaenoptera acutorostrata, Balaenoptera bonaerensis; Figure 3). Most research questions focused on vocalization characterization (16 articles), occurrence (46 articles), or a combination of the two (16 articles), and the frequency of questions varied across target species (Figures 3 and 4). Studies on humpback whales were more focused on vocalization characterization than studies on blue, right, and fin whales, which more commonly described occurrence (Figure 3). Six studies investigated less common research questions and were considered "other" (Figure 3).

FIGURE 1 Percent of baleen whale PAM papers (N = 166) utilizing each data collection platform for different data volumes (hours of acoustic data collected). Some papers had insufficient information to confidently determine data volume (unclear).

FIGURE 2 Percent of baleen whale PAM papers (N = 166) utilizing each data analysis method for different data volumes (hours of acoustic data collected). Some papers had insufficient information to confidently determine data volume (unclear).
These studies investigated the impacts of anthropogenic activities on baleen whale acoustics (Castellote et al., 2012a; Cerchio et al., 2014; Melcon et al., 2012; Risch et al., 2012), the acoustic response of baleen whales to predation (Burnham & Duffus, 2019), and baleen whale density (Marques et al., 2011). Studies with "other" research questions were too varied to observe patterns and draw meaningful conclusions or recommendations; therefore, they will not be described further, limiting the total number of papers described below to 88 of the 94 big PAM data studies.

| Studies targeting 1-2 baleen whale species
Of the 88 articles that investigated characterization and/or occurrence, 75 targeted one species, 4 targeted two species (fin and blue whales), and 9 targeted all baleen whale species acoustically present in the acoustic recordings (Figure 3). In this section we consider studies targeting 1 or 2 species (N = 79) and how data analysis methods varied across different research questions (Tables S5-S10). Studies were separated based on number of species because those looking at all species will face challenges (e.g., many different vocalization types requiring analysis of a potentially broader frequency range, more automated detectors, greater breadth of human expertise) beyond those faced by studies focused on 1 or 2 species.

FIGURE 3 Distribution of big PAM data (1,000+ hr) baleen whale papers (N = 94) by research question and species targeted, where "all present" indicates that all acoustically active species in the acoustic data were targeted.

FIGURE 4 Number of big PAM data baleen whale papers that used each data analysis technique for research questions involving characterization, occurrence, or both (N = 88). Papers investigating both occurrence and characterization were duplicated to show how each question was addressed separately within the same study.

| Characterization studies
Studies focused solely on characterization of vocalizations encompassed 19% (15/79) of the big PAM data papers reviewed that targeted one or two species (Figure 4). Ten of the papers focused on aspects of baleen whale song, including how songs vary in space, time, between populations, and in different behavioral contexts, while the remaining papers described characteristics of non-song vocalizations. Analysis methods included full manual analysis, partial manual analysis, and a combination of automated and manual analysis, with specific analysis protocols varying across studies and the percent of data manually reviewed ranging from 3% to 100% (Table S5). The most common methodology for characterizing song was the full manual analysis of all acoustic data (Table S5). Most of these studies had data sets less than 2,000 hr in size (Cholewiak et al., 2018; Johnson et al., 2015; Mercado, 2018; Miksis-Olds et al., 2008; Miksis-Olds et al., 2018), though one study spanned 10 years and still seemed to have employed full manual review (Garland et al., 2015). Partial manual analysis was the most commonly employed methodology for the characterization of non-song vocalizations, with protocols for selecting data varying across studies (Table S5).

| Occurrence studies
Understanding the occurrence of a vocal species was the most common research question for big PAM data set analysis, encompassing 59% (47/79) of the papers reviewed that targeted one or two species (Tables S6 to S8). The temporal breadth of occurrence studies ranged from 64 days to 10 years, with the majority (35/47) ranging from 1-4 years. All four analysis methods were used to determine occurrence: full manual (seven papers), partial manual (four papers), fully automated (four papers), and manual and automated (32 papers; Figure 4).
Manual analysis methods to determine species occurrence varied greatly across studies both for full manual and partial manual techniques (Table S6). For partial manual analysis studies, both the amount reviewed (10%-44% of data) and manner in which reviewed data were selected (randomly versus systematically) varied (Table S6). Similarly, full manual studies differed in how data were reviewed, with some authors interested in finding presence over some timeframe (e.g., manually reviewed every day until a vocalization was observed to determine daily presence), while others determined the occurrence of every individual vocalization (Table S6).
All studies that solely used automated techniques to determine species occurrence targeted fin and/or blue whales (Gavrilov et al., 2018;Leroy et al., 2016;Leroy et al., 2018a;Matsuo et al., 2013). For three out of four of the studies, automated detector performance was either not calculated or not reported. Gavrilov et al. (2018) reported both missed and false detection rates that had been previously calculated in earlier publications.
A combination of manual and automated analysis was the most common methodology applied to determine species occurrence (Figure 4). Factors that were inconsistent across studies included: how much data were manually analyzed, how files for manual analysis were selected, if automated detector performance was determined, and how automated detector performance was determined. These studies were broadly categorized into those that manually reviewed every automated detection and those that did not (Table S7). Only two studies could not be categorized, as they applied different analysis techniques to different portions of their data sets (Buchan et al., 2015;Kerosky et al., 2012).
Most manual and automated occurrence studies (21/32) completed some form of manual review of every automated detection or automated detection event (e.g., days with detections). Nine studies applied automated detectors and then manually checked all automated detections and removed false positives, but never determined how often their detector missed vocalizations (Mellinger, Nieukirk, et al., 2007; Morano, Rice, et al., 2012; Risch et al., 2014; Salisbury et al., 2016; Širović et al., 2009; Stafford et al., 2004, 2005, 2011; Whitt et al., 2013). In addition to manually checking all detections and removing false positives, seven studies analyzed a subset of data without automated detections to determine the missed detection rate of their automated detector (e.g., Buchan et al., 2018; Davis et al., 2017; Hodge et al., 2015; Lammers et al., 2011; Mussoline et al., 2012). The amount of data reviewed and the data selection protocol to determine missed detection rate varied greatly across these studies, with <1%-33% of data reviewed, selected randomly, based on ambient sound levels, or systematically (e.g., every third day; Table S7).
Some studies that reviewed every automated detection undertook manual analysis independent of the automated detection results and did not use the analysis to determine detector performance. In addition to checking every detection, Bort et al. (2015) carried out a systematic partial manual analysis (reviewed every third day). Munger et al. (2008) carried out full manual analysis via long-term spectral averages where successive fast Fourier transforms are averaged into long-term spectrograms for review.
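The long-term spectral average technique referenced above can be sketched in a few lines: successive FFT power spectra are averaged into coarse time bins so that long recordings can be scanned quickly as a single spectrogram. The function below is an illustrative sketch only; the function name and all parameter choices are ours, not those of Munger et al. (2008):

```python
import numpy as np

def long_term_spectral_average(x, fs, nfft=256, frames_per_bin=100):
    """Average successive FFT power spectra into coarse time bins,
    yielding a long-term spectrogram for rapid visual review."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // nfft
    frames = x[: n_frames * nfft].reshape(n_frames, nfft)
    # Power spectrum of each Hann-windowed frame
    spectra = np.abs(np.fft.rfft(frames * np.hanning(nfft), axis=1)) ** 2
    # Average consecutive frames into long-term time bins
    n_bins = n_frames // frames_per_bin
    ltsa = spectra[: n_bins * frames_per_bin].reshape(
        n_bins, frames_per_bin, -1).mean(axis=1)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return ltsa, freqs  # ltsa shape: (time bins, frequency bins)
```

With these defaults, each long-term bin summarizes 100 frames of 256 samples, so the time resolution of the review spectrogram is coarsened by a factor of 100 relative to the underlying FFT frames.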
Ten occurrence studies that utilized both automated and manual techniques did so without checking every automated detection (see references in Table S7). Instead, they applied automated detectors and subsequently completed variable amounts of manual data review (<1%-7% or unclear to reader) to determine automated detector performance, optimize automated detectors, or create ROC curves (Table S7). The subset of acoustic data for manual review were randomly selected, spread over time, spread over locations, spread over number of detections, or not specified, depending on the study (Table S7).
The presentation of baleen whale occurrence results in big PAM data varied across studies, with results generally presented as one of two measures: number of vocalizations/detections or presence (Table S8). Half of studies (23/47) presented results as number of vocalizations or number of automated detections over some duration (e.g., number of automated detections per week; Table S8). In contrast, 18/47 studies presented results as presence within a timeframe, commonly over some larger timeframe, e.g., number of days with presence per week (Davis et al., 2017; Table S8). Risch et al. (2014), Morano, Rice et al. (2012), and Wright et al. (2018) presented results using both measures. A few studies presented results using completely different measures such as a daily index (Leroy, Samaran, et al., 2018;Nieukirk et al., 2012;Simon et al., 2010). The most common descriptors of occurrence were number of vocalizations (or automated detections) per month (nine studies) or per day (six studies) and number of hours (or percent of hours) with presence per day (four studies) or per month (three studies; Table S8).
Diel patterns were explored in 14 of the 47 occurrence papers, with an additional two of the combined occurrence and characterization papers also commenting on diel patterns. Most diel studies incorporated all days where vocalizations occurred, while some limited diel analysis to a portion of the data (e.g., only months where the vocalizations were most common; Table S9). The number of light regimes used to divide the 24-hr day varied across studies, the most common being three light regimes (light, dark, and twilight; four studies) or four light regimes (light, dark, dusk, and dawn; three studies; Table S9). Light regimes were most frequently defined using nautical twilight determined by the angle of the sun (11/15 studies; Table S9). The most common unit for comparing vocalizations across light regimes was the hourly mean adjusted number of vocalizations (or presence of vocalizations) per hr (7/15 studies; Table S9).
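As a simple illustration of regime assignment, the sun's altitude can be mapped to three light regimes using the nautical-twilight threshold of -12°; the function below is a hypothetical sketch rather than a protocol from any reviewed study, and obtaining the solar altitude itself would require an ephemeris calculation not shown here:

```python
def light_regime(sun_altitude_deg):
    """Classify a moment into one of three light regimes from the sun's
    altitude above the horizon (degrees). Twilight is bounded below by
    the nautical-twilight threshold of -12 degrees."""
    if sun_altitude_deg >= 0.0:
        return "light"      # sun above the horizon
    if sun_altitude_deg >= -12.0:
        return "twilight"   # between horizon and nautical twilight
    return "dark"           # sun below nautical twilight
```

Studies using four regimes would additionally split "twilight" into dawn and dusk based on whether the sun is rising or setting.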

| Combined occurrence and characterization studies
A number of studies (22%; 17/79) explored the acoustic occurrence of baleen whale species and characterized some aspect(s) of the recorded vocalizations (Figure 4). While the overarching aim of the research was often focused on either occurrence or characterization, both questions were addressed throughout each study. The characterization aspect of these studies largely targeted non-song vocalizations (12/17). Similar to when these research questions were explored independently, studies exploring both characterization and occurrence had a wide range of methodologies with few commonalities between them (Table S10). Some studies (7/17) undertook the same analysis methodology for both research questions (e.g., Leroy et al., 2017; Stafford et al., 2007), but the majority utilized two largely unrelated methodologies (Table S10). For the question of occurrence, the most common method was a combination of manual and automated detection (8/17), though specific protocols varied across studies. For example, Clark et al. (2002) manually validated 1 hr every 18 days of data, whereas Vu et al. (2012) manually reviewed all automated detections. For the question of characterization, the most common method was partial manual (11/17) with an emphasis on high signal-to-noise ratio (SNR) vocalizations, where what was considered "high" SNR varied across studies. Again, specific analysis protocols varied, from a selection of 1 hr per day to a random selection of 6 days per month (Vu et al., 2012; Table S10). The percent of acoustic data manually reviewed was often unclear, as it was guided by automated detectors, but values ranged from <1% to 100% for occurrence and from 1.9% to 11% for characterization analysis (Table S10).

| Studies targeting all species present
Of the 94 big PAM data baleen whale papers reviewed, only nine targeted all species present in the acoustic data, and of those, only five described the occurrence of three or more species (Table S11; Figure 4). Multi-species studies used data ranging in timespan from 1 to 6 years, with the majority (6/9 studies) spanning 1 to 2 years. Occurrence was investigated in every all-species study, with McDonald (2006) also investigating density and Nieukirk et al. (2004) also characterizing vocalizations. Methodologies to determine occurrence varied across studies, with the percent of data manually reviewed ranging from 5% to 100% (Table S11). The most commonly used method to investigate occurrence of all baleen whale species was a combination of manual and automated techniques (3/9 studies; Table S11), where protocols included manually reviewing a portion of each file (Hannay et al., 2013).

| Describing automated detector performance
The description of automated detector performance in baleen whale big PAM studies was variable across the literature reviewed. Of the 53 studies that incorporated automated detectors, 25 made no commentary on automated detector performance. Three studies commented on detector performance (e.g., noted that automated detector(s) may have missed some vocalizations or performed poorly), but performance was not calculated, or, if it was, the results were not presented (e.g., Stafford et al., 2005, 2011). Five studies commented on how the automated detector(s) performed in previous studies but did not calculate performance for their data set (Table 1). Some aspect of automated detector performance was calculated and presented in only 20 of the 53 studies that used automated detectors (Table 1).
The majority of studies that incorporated automation (with or without manual review) investigated acoustic occurrence (Figure 4), with the exception of three that investigated characterization only (Delarue, 2008;Magnúsdóttir et al., 2015;Rekdahl et al., 2017). None of the characterization studies reported automated detector performance, which is unsurprising given that automated detector performance would not impact the reliability of their results as they simply used the automated detectors to find signals that were then manually characterized. Different terminology was used across studies to describe the same automated detector performance metric (see performance metric definitions and formulas summarized in Figure 5); in some cases, the definition was assumed when equations were not provided. False negative detection errors, missed call rate, false negative percent, misdetection rate, rate of missed detections, and missed detection rate were all assumed to refer to FNR. Rate of false detections, false detection rate, negative positive rate, false alarm rate, and false positive percent were all interpreted as FPR. Correct detection rate, percent identified, percent of calls found, true positive percent, and percent probability of detection were interpreted as R.
The performance metrics varied across studies. FNR, FPR, and R were the most commonly reported metrics (Table 1). P was reported in three instances, and ACC was only described on one occasion (Table 1). For some studies, describing FPR, R, or P was unnecessary, as every automated detection was manually checked and the false positives removed (Table S7). Some authors plotted true positive rate against false positive rate to create ROC curves (Español Jiménez & van der Schaar, 2018; Hodge et al., 2015; Tsujii et al., 2016; Vu et al., 2012). The values of performance metrics varied greatly across metrics and species (Table 1). With an ideal value of 0%, FNR ranged from 0% to 54% and FPR from 0% to 100%. Conversely, with an ideal value of 100%, P and R ranged from 20% to 95% and 15% to 100%, respectively (Table 1). Wide-ranging values were in some cases reported within studies. For example, Balcazar et al. (2015) concluded that right whale automated detector performance varies by location, with R values of 46%-64% and FNRs of 36%-54%. Risch et al. (2013) found that minke whale FNR for automated detection ranged from 0% to 51%, depending on vocalization quality in the recorded data. No species consistently had better automated detector performance than others, though such comparisons are challenging given the range in metrics employed, automated detectors applied, vocalization types, and variability in number of studies per species (Table 1).

TABLE 1 Summary of metrics used to describe automated detector performance, where "-" indicates the metric was not described. Metrics include false negative rate (FNR), false positive rate (FPR), precision (P), recall (R), and accuracy (ACC). This summary only includes the 25 studies that provided information on automated detector performance (28/53 studies did not and are not included here).
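The metrics discussed above reduce to simple ratios of confusion-matrix counts (true positives, false positives, false negatives, and, where defined, true negatives). A minimal sketch, with function and variable names of our own choosing:

```python
def detector_metrics(tp, fp, fn, tn=None):
    """Compute common binary detector performance metrics from counts
    of true positives (tp), false positives (fp), false negatives (fn),
    and optionally true negatives (tn) per analysis unit
    (e.g., acoustic file or time bin)."""
    recall = tp / (tp + fn) if (tp + fn) else float("nan")     # R
    fnr = fn / (tp + fn) if (tp + fn) else float("nan")        # FNR = 1 - R
    precision = tp / (tp + fp) if (tp + fp) else float("nan")  # P
    metrics = {"R": recall, "FNR": fnr, "P": precision}
    if tn is not None:
        # FPR and ACC require true negatives, which are generally
        # undefined when only detection counts (not a binary per-unit
        # output) are available
        metrics["FPR"] = fp / (fp + tn) if (fp + tn) else float("nan")
        metrics["ACC"] = (tp + tn) / (tp + fp + fn + tn)
    return metrics
```

Making `tn` optional mirrors the situation described for Figure 5: when a study reports only numbers of detections rather than a binary result per unit of time, TN (and hence FPR and ACC) cannot be computed, while R, FNR, and P still can.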

| Terminology associated with data analysis methods
An inconsistency in terminology related to analysis methods in big PAM data was apparent throughout this literature review. For example, one study may describe false positives while another describes incorrect detections, two terms for the same detector performance metric. Human manual analysis of data was referred to in a variety of ways throughout the literature, combining the adverbs "manually," "visually," and "aurally" with verbs such as "examined," "viewed," "logged," "inspected," "analyzed," and "detected." Automated analysis was described as "autodetection," "automated detection algorithms," "automatically detected," and "search algorithm." The terms "manual," "visual," and "aural" most commonly referred to some manner of human analysis, while the term "automated" indicated the use of an algorithm. "Detected," "detector," and "detection" were used interchangeably between manual and automated methods.

| DISCUSSION
This systematic literature review of baleen whale PAM studies revealed a shift in methodologies applied with increasing volumes of data collected. With the use of modern technologies allowing for the remote autonomous collection of large volumes of acoustic data, the data analysis methods have appropriately shifted from predominantly relying on human analysts to methods incorporating both automated and manual techniques (Figure 2), a trend similarly observed in other fields tackling big data (Lewis et al., 2013; Sivarajah et al., 2017).

FIGURE 5 Detector performance metric definitions and formulas for a binary detector/classifier. A detector (automated or manual) is binary if a positive or negative result is defined for each unit (e.g., acoustic file or period of time). Where number of detections is given rather than a binary output, TN is generally not defined. When considering automated detector performance, the manual analysis results (Truth) are compared to automated detector results (Detector).
Big PAM baleen whale studies were largely limited to humpback, blue, fin, and right whales, with a small proportion of research investigating more than one species at a time. The bias in reporting toward these species likely reflects a combination of factors, including their conservation status, distributions that overlap or conflict with human activity, propensity towards being vocally active and thus captured on recorders, and public interest. For example, the acoustically prolific humpback whale song first became famous in the 1970s and now attracts a great deal of public and scientific interest (Payne & McVay, 1971). Conversely, other baleen species occur in locations where data collection is more difficult, such as the bowhead whale in the Arctic (Blackwell, Richardson, Greene, & Streever, 2007; Charif et al., 2013; Stafford et al., 2012), or have vocalizations that are rarer, more poorly understood, or more difficult to study acoustically, such as the Omura's whale (Balaenoptera omurai; Cerchio et al., 2015). As more work describes the sounds of these species, their presence in the literature will likely increase.
One of the great benefits of PAM data collection is that it captures signals from all species acoustically active in the area that vocalize within the parameters of the recording system. Therefore, by restricting research efforts to only describing the occurrence of a single species, a practice that likely reflects many factors including budget, time, access to different automated detectors, and priorities, researchers are not fully utilizing the investment they have made in data collection. Thus far, most multispecies studies involve fin and blue whales and low sampling rate data. This is unsurprising given that these species produce regular, predictable, species-specific acoustic signals for which reliable automated detectors can be more easily developed relative to other baleen whales such as humpback whales, which produce more varied signals that can change considerably over time (Rekdahl et al., 2017). Further, blue and fin whales produce low-frequency signals that can be captured by recorders sampling at 100 or 250 Hz, requiring less power and memory storage in recording devices. However, acoustic recorders are now sufficiently powerful to sample at higher rates for long durations, allowing for a greater range of signals to be captured and more species to be described (Moloney et al., 2018). While few studies in the present review explored all species acoustically present, high sampling rates in new generation recorders, an increasing knowledge of marine mammal vocalizations allowing more species to be identified, and an impetus towards ecosystem studies (Sherman et al., 2005) should lead towards more multispecies studies. The development of more automated detectors, or suites of automated detectors that capture and differentiate many types of acoustic signals could encourage more multispecies studies. 
It is worth noting that the present review could not effectively differentiate between studies that investigated and reported on a single species and those that investigated all species but reported on each species separately across publications. Therefore, the practice of multispecies analysis may be more common than is perceived here. As research groups include more species, putting greater strain on analysis efforts, the shift to utilizing a combination of automated and manual analysis techniques is expected to continue, as was observed in studies that investigated all acoustically active species in this review.
While a trend toward incorporating automation into big PAM baleen whale analysis methods was observed, protocols varied greatly across studies, likely reflecting differences in research goals, time, budget, acoustic signal(s) of interest, and the experience of research groups. Some variation may be due to differing requirements of the journal chosen for publication, with some potentially requiring more stringently described methodologies (e.g., automated detector performance evaluation) than others. The acoustic research community must work to standardize methods among different projects to increase comparability and consistency across studies (Mellinger, Stafford, et al., 2007). Therefore, we provide the following recommendations to encourage future research towards this goal.

| Recommendations
Recommendations for future work on baleen whale PAM are based on the present literature review as well as recommendations from other fields that utilize big data. Suggestions seek to find a balance between (1) being reasonable and achievable for researchers, regardless of budget and time restraints; (2) maintaining consistency with previous baleen whale PAM literature; and (3) achieving standards utilized beyond the marine PAM field to encourage interdisciplinary consistency (e.g., PAM in the terrestrial realm). In many instances, our recommendations are limited by a lack of consistency observed in the literature (in big PAM data or other big data studies). These should therefore not be considered hard and fast rules, but rather a starting point to a larger methodological discussion within the PAM data community.

| Recommendation 1: Methodology for characterization studies
For researchers seeking to characterize baleen whale acoustic signals or acoustic behavior, we recommend analysis driven by the SNR of acoustic signals. This was commonly employed in studies that investigated both occurrence and characterization in the present literature review. By restricting characterization to signals of high SNR, researchers reduce the chance that they misidentify, misclassify, or inaccurately characterize the signal (Mellinger, 2004). The method of calculating SNR and what is considered "high" must be clearly defined in the study.
One approach is to compare signals to the sound level of the appropriate frequency band (Castellote et al., 2012b; Gavrilov et al., 2012), which can be made more specific by using the ambient sound immediately before the signal (for the same duration as the signal; Kowarski et al., 2019). There is arguably value in additionally characterizing low SNR signals if the quality and context are such that the researcher can reliably identify species. If the characteristics of lower SNR vocalizations are shared with the acoustics community, future researchers will be better able to classify faint signals in their own data, which can be particularly important for endangered or rare species. Where researchers include varying SNR vocalizations in characterizations, they should clearly specify and differentiate these in their results. Automated methods can be utilized to identify high SNR signals for manual review, though effective automated detectors may not be available before a signal has been adequately characterized. Where automation is not feasible to guide manual review, we recommend employing partial manual review to identify acoustic signals of interest with high SNR.
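The before-signal ambient comparison described above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the band-filtered sample segments and the 10 dB "high SNR" cutoff are assumptions, and any study must define its own calculation and threshold.

```python
import math

def band_snr_db(signal_samples, noise_samples):
    """Estimate the SNR (dB) of a detected call against the ambient sound
    recorded immediately before it, within the same frequency band.
    Both inputs are band-filtered waveform segments of equal duration."""
    sig_power = sum(x * x for x in signal_samples) / len(signal_samples)
    noise_power = sum(x * x for x in noise_samples) / len(noise_samples)
    # The signal segment contains signal plus noise; subtract the noise
    # estimate, flooring at a small positive value to avoid log of zero.
    excess = max(sig_power - noise_power, 1e-12)
    return 10.0 * math.log10(excess / noise_power)

def is_high_snr(signal_samples, noise_samples, threshold_db=10.0):
    # The 10 dB threshold is an arbitrary example, not a recommendation.
    return band_snr_db(signal_samples, noise_samples) >= threshold_db
```

A segment with four times the power of the preceding ambient would score 10·log10(3) ≈ 4.8 dB after the noise contribution is removed.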
Partial manual review was the most common characterization method in the present review for non-song vocalizations and is arguably the most appropriate method when effective automation is unavailable and the quantity of data is such that full manual review is unachievable. Both the protocols used to select data for partial manual review and the amount of data reviewed varied across the literature, likely reflecting different research questions, goals, acoustic signals of interest, time, and budget. While we recommend that protocols incorporate SNR, one must then decide how to ensure the sample is appropriate for the research question. For example, if a study claims to describe the vocalizations of a species, selecting high SNR vocalizations from three days is arguably not representative, as vocal behavior may vary across space and time. Rather, vocalizations should be analyzed across time of day, day of year, and locations to increase the chance of a representative sample. Ancillary information, such as seasonal ice cover, known migration times, and information from visual surveys, can also be considered when determining analysis protocols.
Studies that incorporated characterization of songs commonly employed full manual review, but these were almost entirely limited to relatively small studies (2,000 hr or less of acoustic data) and may become increasingly unrealistic for research groups to undertake in the future as data sets continue to grow. Furthermore, where a full manual review does not limit its characterization to high SNR events, it risks mischaracterization. Therefore, when full manual review is unachievable, we argue partial manual review driven by SNR (or vocalization quality) is sufficient for song characterization research.
Our recommended guidelines for characterization methodology are summarized as follows:
• Select vocalizations for analysis based on SNR. At minimum, the analysis should focus on high SNR vocalizations; however, describing vocalizations at all signal-to-noise ratios is recommended.
• If available, use automated techniques to identify high SNR vocalizations.
• Where automated methods are inapplicable, undertake a partial manual review, at minimum, to identify signals of interest. The manual review protocol must capture a representative sample and should consider ancillary information.
• Include a wide temporal window (over season, day, hour) and diverse spatial locations where possible to fully capture the variability in vocalization types.

| Recommendation 2: Methodology for occurrence studies
For the question of baleen whale spatial or temporal acoustic occurrence, we recommend utilizing a combination of manual and automated methods, the most common technique observed in the present literature review. This recommendation applies to studies solely investigating occurrence and those that include additional research questions.
The sole use of manual review (full or partial) would only be necessary where automated detectors are ineffective (e.g., the signal is too variable, or other sounds interfere with automated detection) or where automated detectors cannot be developed because too little is understood about the acoustic signal of interest. Full manual review would be unrealistic for many research groups, and partial manual review inevitably restricts findings to only a portion of the data. Automated detection without the verification of manual review was rare in the literature and should be avoided. Indeed, the present review revealed that no species has consistently had automated detector performance reported high enough that one could argue no manual review is required. Conclusions drawn from research studies solely reliant on automated detectors, which are known to vary in performance depending on factors such as location, season, and other ocean sounds, should be viewed with caution (Erbs et al., 2017; Hodge et al., 2015; Širović et al., 2015).
The question then remains: what protocol should be followed when implementing a combination of manual and automated techniques to determine baleen whale occurrence, and how can researchers effectively determine automated detector performance? Many researchers in the present literature review opted to manually review every automated detection. This technique can result in an intensive workload, where the time and cost involved will vary with the experience level of analysts. Such effort is likely beyond the reach of some research groups and could become completely unattainable when investigating more than one or two species. Indeed, the present literature review found that no study investigating more than two species checked every automated detection. We therefore argue that, unless the number of automated detections is reasonably small or the research requirements are such that not a single false automated detection is acceptable in the results, the manual review of each automated detection or automated detection event is unnecessary. Rather, researchers should follow the trend of many big data studies and manually review a subset of data to determine automated detector performance metrics (Knight et al., 2017; Lewis et al., 2013; Ofli et al., 2016; see Recommendation 4). The topic of automated detector performance metrics is discussed further in Recommendation 3.
To effectively compare occurrence results between studies, researchers must aim not only for consistent methodologies but also for consistent presentation of results, especially within species. The present literature review observed two main categories of units used in results: number of vocalizations or automated detections, and presence of vocalizations or automated detections, where presence is given the same weight whether one detection occurred within that timeframe or 200. Number of detections can allow a finer comparison of relative occurrence between, for example, two recording stations, but is susceptible to misrepresenting findings if detector performance (manual or automated) is variable across space or time. Presence is arguably more comparable across studies. The same automated detector or human analyst is rarely applied to multiple studies, and if they were, they would be expected to perform differently if the acoustic environment between the two studies is not comparable (Erbs et al., 2017). When there is a high density of vocalizations, this difference in performance is reduced by looking at presence rather than the number of vocalizations (e.g., Hodge et al., 2015 found that missed detections were high when assessed per vocalization but low when assessed per day). Moreover, presence allows true negatives (TN) to be defined and calculated, which is not the case when considering number of detections (Figure 5). We therefore argue that presence is the preferred unit for occurrence results. In the present review, presence results were most commonly reported as hours per day or per month. Maintaining this trend would increase comparability between studies, though the appropriate unit of time analyzed and presented will depend on the specific research questions being addressed, and, with data sets growing ever larger, days per week, month, season, or year may become more achievable and common units.
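The presence convention can be sketched as follows: an hour counts once whether it contains 1 detection or 200. This is a minimal illustration with hypothetical detection timestamps, not a prescribed implementation.

```python
from collections import defaultdict
from datetime import datetime

def hours_present_per_day(detection_times):
    """Collapse detection timestamps into presence: the number of distinct
    hours per day in which at least one detection occurred."""
    hours_by_day = defaultdict(set)
    for t in detection_times:
        hours_by_day[t.date()].add(t.hour)
    return {day: len(hours) for day, hours in hours_by_day.items()}

# Three detections, two of which fall in the same hour, yield 2 hours of
# presence for that day regardless of how many calls each hour contained.
detections = [
    datetime(2019, 5, 1, 6, 10),
    datetime(2019, 5, 1, 6, 45),
    datetime(2019, 5, 1, 14, 0),
]
```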
In contrast to most methodologies observed in the present literature review, those investigating diel occurrence showed some consistency between studies that should continue to be upheld in future research where authors seek to directly compare studies. Diel studies should incorporate all days where vocalizations were deemed present, define light regimes based on nautical twilight (angle of the sun), and present results as the hourly mean adjusted number of vocalizations (or presence of vocalizations) per hour for each light regime. There is some discrepancy in how many light regimes are used. Most studies reviewed here used three or four, where those with three had one regime for twilight, while those with four separated twilight into dawn and dusk. Given that acoustic occurrence did vary between dawn and dusk in some studies (Kowarski et al., 2019; Mussoline et al., 2012), it seems reasonable to keep them separate and to use four light regimes. These recommendations based on trends in the literature may not be applicable in all regions of the world (e.g., in high latitudes where using nautical twilight would be inappropriate) and should be adapted appropriately. Incorporating novel approaches to diel analysis could be useful, though we argue there is value in additionally presenting results as per the typical methods to allow comparison and discussion.
Our recommended guidelines for occurrence methodology are summarized as follows:
• Use a combination of automated and manual methods where the automated detector performance is determined from the manual review of a subset of data.
• Present results as presence (e.g., as hours per day or days per month).
• Define light regimes for diel analysis based on nautical twilight (where appropriate), and compare the mean adjusted number of vocalizations per hour for each light regime.
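The four-regime diel comparison can be sketched as below. The twilight boundaries here are hypothetical decimal hours for a single site and date; in practice they must be computed from the sun angle for each date and location (e.g., with an ephemeris tool), and would be inappropriate at high latitudes as noted above.

```python
# Hypothetical nautical-twilight boundaries (decimal hours) for one
# site and day; in practice derive these from sun angle per date/location.
BOUNDS = {"dawn": (4.5, 6.0), "day": (6.0, 20.0), "dusk": (20.0, 21.5)}

def regime_of(hour):
    """Assign a detection (decimal hour of day) to a light regime."""
    for name, (start, end) in BOUNDS.items():
        if start <= hour < end:
            return name
    return "night"

def mean_adjusted_rate(detection_hours):
    """Vocalizations per hour within each light regime: counts are
    normalized by regime duration so that short regimes (dawn, dusk)
    are comparable with long ones (day, night)."""
    durations = {name: end - start for name, (start, end) in BOUNDS.items()}
    durations["night"] = 24.0 - sum(durations.values())
    counts = {name: 0 for name in durations}
    for h in detection_hours:
        counts[regime_of(h)] += 1
    return {name: counts[name] / durations[name] for name in durations}
```

For example, five detections at 02:00, 05:00, 05:30, 12:00, and 20:30 give a dawn rate of 2/1.5 ≈ 1.33 per hour, despite dawn holding fewer total detections than night.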

| Recommendation 3: Automated detector metrics
This review highlights previous findings that automated detectors in PAM vary in performance depending on location and timeframe (Erbs et al., 2017; Hodge et al., 2015; Širović et al., 2015). This is unsurprising given the wide range of acoustic conditions encountered in the marine realm. For example, the performance of an automated right whale upcall detector would be different in a month where right whales were the only species present than in a month where humpback whales were singing, because humpbacks produce similar signals to right whales (Parks et al., 2011). We therefore propose that any application of automated techniques must be paired with manual review to determine automated detector performance, a conclusion similarly reached by Knight et al. (2017) after reviewing PAM avian literature. Where studies span numerous locations and seasons, researchers should further consider calculating automated detector performance for each circumstance as was completed by Erbs et al. (2017).
Our review revealed that automated detector performance metrics are underrepresented in the baleen whale PAM literature, with more than half of the studies that used automated detectors not describing performance. Studies that did describe performance used varying metrics that were often poorly defined in terms of formula and the unit of time being considered. This variation makes it difficult to compare results across studies, species, locations, and seasons. While FNR and FPR were the most commonly reported metrics in the present review, Knight et al. (2017) argue that precision (P), recall (R), F-score (Figure 5), and area under the (P and R) curve (AUC) are required for PAM studies to maintain standards with other disciplines. We concur that researchers should strive to present these metrics, particularly P and R, as F-score and AUC are aggregate metrics representing both: P gives the proportion of automated detections (or detection events per unit time) that were correct, and R gives the proportion of true detections (or detection events per unit time) that were captured by the automated detector (Figure 5). Another, less common, metric to summarize automated detector performance is the Matthews correlation coefficient (MCC; Figure 5), which has the benefit of incorporating TN, a quantity not represented in P, R, F-score, or AUC, and is appropriate for the imbalanced data often seen in PAM (Boughorbel et al., 2017). However, TN can only be determined where automated detector output is binary (e.g., presence/absence). A formula must always be presented alongside any performance metric, as well as the unit of performance (e.g., it must be clear whether the metric describes automated detector performance per vocalization or per unit time, relative to the manual analysis). Indeed, the variation in performance metric values in the present review may be partially due to studies evaluating different units.
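These metrics can be computed from a manually validated subset as follows. This is a generic sketch consistent with the standard definitions; the counts may be per vocalization or per time bin, whichever unit the study defines, and TN is supplied only when the detector output is binary.

```python
import math

def detector_metrics(tp, fp, fn, tn=None):
    """Precision (P), recall (R), F-score, and, where true negatives are
    defined (binary presence/absence output), the Matthews correlation
    coefficient (MCC), from counts of true/false positives and negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    metrics = {"P": p, "R": r, "F": f}
    if tn is not None:
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        metrics["MCC"] = (tp * tn - fp * fn) / denom if denom else 0.0
    return metrics
```

For example, a detector validated over 1,000 time bins with 80 true positives, 20 false positives, 20 missed bins, and 880 true negative bins scores P = R = F = 0.80 but a lower MCC of about 0.78, reflecting the class imbalance.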
The significance of automated detector metrics will vary between studies as there will almost always be a tradeoff between P and R. For some studies, achieving high R will be important because the acoustic signals of interest are rare or the species is endangered, but this will likely result in a lower P (Buchan et al., 2018;Kerosky et al., 2012;Mellinger, Stafford, et al., 2007). Studies on species that are more vocally prolific may allow a lower R in favor of optimizing P (Mellinger, Stafford, et al., 2007). Authors should indicate whether they have developed automated detectors with the intent to optimize P, R, or some metric that balances both such as F-score, AUC, or MCC.
Once performance metrics are calculated, researchers must critically interpret them and present results in a meaningful way. Where automated detector performance is found to be low, the automated detector may be deemed ineffective, and either the detector must be improved or an appropriate manual analysis protocol must be undertaken. The level at which automated detector performance is deemed too poor will depend on the research goals. Researchers may also improve or optimize automated detector results through postprocessing. For example, some automated detectors can have confidence thresholds imposed, or a minimum number of automated detections per timeframe can be required (Delarue et al., 2018; Mouy et al., 2016). Presenting automated detector results alongside performance metrics will help ensure results are interpreted correctly. For example, an R of 0.50 tells the reader that the results represent only half of the true occurrence of the acoustic signal. Further, where authors understand what factors impact automated detector performance, they should present such information along with their acoustic occurrence results (e.g., anthropogenic activities, weather events, ambient sound levels).
In addition to evaluating automated detectors, researchers should strive to validate the effectiveness of the manual review, given the potential for discrepancies between the human analysts who produce the "truth data." This can be done by having multiple humans review the same data (at least for a subset) to calculate the agreement between them (e.g., Leroy, Thomisch, et al., 2018). Where it is only possible to employ a single analyst (given the limited resources of many research groups and the scarcity of qualified individuals), papers should provide sufficient evidence that the analyst's experience is such that the truth data can be considered reliable.
Our recommended guidelines for automated detector metrics are summarized as follows:
• Pair automated detector implementation with manual review to determine automated detector performance metrics.
• Evaluate automated detector performance metrics separately for different acoustic circumstances (e.g., each season and location).
• At minimum, evaluate P and R for each automated detector.
• Include a formula and the unit investigated for all automated detector performance metrics reported.
• Evaluate the reliability of the human created "truth data."

| Recommendation 4: Data selection to calculate automated detector metrics
Regardless of the research question, one of the most variable protocols observed in the present literature review was how data were selected for manual analysis to validate automated detectors. The amount of acoustic data reviewed ranged greatly, with little to no consistency between studies. The manner in which data were selected for manual review was also inconsistent. Depending on the study, data for manual validation of automated detections were selected completely at random, or selection was based on automated detector counts, on time (to capture the entire recording period), on location (to capture all recording sites), on ambient sound, or on suspicion that automated detections were incorrect. Given the known variability in automated detector performance across these variables, we recommend that researchers account for all of them when selecting data for manual validation. Proceeding otherwise could result in performance metrics that are not truly representative of the results.
The question then remains: how much data should be manually validated? This will likely be determined by the constraining factors of time, budget, and resources, but the amount needs to be high enough to ensure that a representative sample is manually analyzed. One way to check whether the sample data set is sufficiently large is to calculate the P and R of the automated detectors based on the sample, which may be 5% of the acoustic data, then recalculate P and R as if only 2.5% of the data had been reviewed. If P and R do not vary between the two sample sizes, the plateau of performance has likely been reached, and the performance metric values are reliable.
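The halving check just described can be sketched as follows. This is an illustrative implementation under assumptions not specified in the text: the validated sample is represented as per-time-bin (automated, truth) labels, and the 0.05 agreement tolerance is an arbitrary example.

```python
import random

def precision_recall(bins):
    """bins: iterable of (automated, truth) booleans per validated time bin."""
    tp = sum(a and t for a, t in bins)
    fp = sum(a and not t for a, t in bins)
    fn = sum(t and not a for a, t in bins)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def sample_size_is_sufficient(bins, tolerance=0.05, seed=0):
    """Compare P and R on the full validated sample against P and R on a
    random half; close agreement suggests performance has plateaued and
    the validation sample is large enough."""
    p_full, r_full = precision_recall(bins)
    half = random.Random(seed).sample(list(bins), len(bins) // 2)
    p_half, r_half = precision_recall(half)
    return (abs(p_full - p_half) <= tolerance
            and abs(r_full - r_half) <= tolerance)
```

If the half-sample estimates drift outside the tolerance, the validation subset should be enlarged before the reported metrics are trusted.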
Our recommendation for samples of data selected to validate automated detectors is:
• Samples must represent the entire breadth of conditions present in the data set of interest to effectively report on automated detector performance.

| Recommendation 5: Terminology
Terminology used to describe methods to analyze big PAM baleen whale data varied across studies, resulting in a lack of clarity when attempting to compare methods and results during the present review. For example, detections can refer to the outcome of human manual analysis, automated detectors, or both, depending on the study. Further, studies incorporating sighting surveys also have visual detections (a term also applied to manual PAM analysis). Knight et al. (2017) similarly found a lack of consistency in avian PAM literature terminology, particularly regarding automated detector performance metrics. Future research should strive to maintain consistency and clarity. To this end, we have compiled the following guidelines recommended for future studies:
• Clearly define the terminology that will be used in the paper. For example, if using an algorithm that first detects a signal, then classifies it as a vocalization, the author should clarify that the term "automated detector" refers to an automated detector-classifier. A small glossary section would add great clarity to PAM articles.
• Qualify any use of the term "detection." For example, "manual detection" describes a human-detected signal, "automated detection" describes an algorithm-detected signal, and "manually validated detection" describes a signal that was first detected automatically then validated by a human.
• Use the term "manual" or "human" to qualify detections from human analysts. These are more broadly applicable beyond the field of acoustics than "aural" or "visual," for example.
• Use the term "automated" to qualify results from a computer algorithm.

| Conclusions
We present a literature review of baleen whale PAM methods and propose guidelines for future work to enhance the consistency and rigor of methods. The PAM community should work towards implementing consistent standards while considering the sometimes-limited resources of research groups and how requirements vary across research questions. Data sets will likely continue to grow as technology matures, and analysis methodologies must balance adapting appropriately with maintaining consistency. Results from PAM marine mammal research, especially those investigating occurrence, can reliably inform management decisions. However, standards or best practices are required and must be upheld for findings to be considered reliable and taken into consideration as appropriate.

ACKNOWLEDGMENTS
We are grateful for the efforts of many who provided feedback on the present paper, including Hal Whitehead, Andy Horn, and David Barclay of Dalhousie University, Salvatore Cerchio of the New England Aquarium, Joy Stanistreet of the Bedford Institute of Oceanography, and Bruce Martin, Julien Delarue, Briand Gaudet, and Karen Scanlon of JASCO Applied Sciences. Thank you Laura Joan Feyrer for your guidance and advice prior to this literature review.
This work was supported by Mitacs through the Mitacs Accelerate program. We sincerely thank all authors who contributed to the wonderful body of literature that we had the pleasure of reviewing. Thank you to the anonymous reviewers for your thoughtful feedback that contributed to this paper.

AUTHOR CONTRIBUTIONS
Hilary Moors-Murphy: Conceptualization; funding acquisition; methodology; project administration; supervision; writing-review and editing.