Chapter 19 Validation

As explained in the Overview, LTabundR was developed , among other reasons, to replace and build upon the FORTRAN software ABUND. To demonstrate that this purpose has been achieved, and to validate the results that are generated with LTabundR, this chapter is dedicated to comparing and contrasting the results of the two software packages.

We have tried to develop LTabundR with the flexibility either to replicate ABUND results or to produce customizable results that could potentially vary from ABUND quite significantly (e.g., formatted for habitat modeling). However, even when we use LTabundR settings intended to replicate ABUND results, there are some intentional updates the processing routine that will lead to some small differences.

This chapter requires the following packages:

library(LTabundR)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)

Data processing

To validate the data processing functions within LTabundR, we can compare its output to that of ABUND9, written by Jay Barlow (NOAA Fisheries). First, we bring in the ABUND9 output files for the same DAS data:

# Local paths to these files
SIGHTS <- read.csv('data/SIGHTS.csv')
EFFORT <- read.csv('data/EFFORT.csv')

You may download these files here: SIGHTS.csv and EFFORT.csv.

Sightings

Pivot and format the ABUND SIGHTS data…

abund <-
  SIGHTS %>%
  pivot_longer(cols = 31:101,
                      names_to = 'species',
                      values_to = 'best') %>%
  filter(best > 0) %>%
  mutate(Region = gsub(' ','',Region)) %>%
  mutate(DateTime = paste0(Yr,'-',Mo,'-',Da,' ',Hr,':',Min))

…then summarize counts of species within each cruise:

abund_summ <-
  abund %>%
  group_by(cruise = CruzNo, species) %>%
  summarize(ntot_abund = n(),
            nsys_abund = length(which(! Region %in% c('NONE', 
                                                      'Off-Transect') & 
                                        EffortSeg > 0))) %>%
  mutate(species = gsub('SP','',species))

Then do the same for LTabundR output:

load('whiceas_cruz.RData')

(This cruz object was produced in the Data Processing chapter.)

ltabundr <- 
  cruz$cohorts$all$sightings %>% 
  # Filter out species that ABUND ignored based on its INP file
  filter(!species %in% c('CU', 'PU'))
  
ltabundr_summ <-
  ltabundr %>%
  filter(OnEffort == TRUE) %>% 
  group_by(cruise = Cruise, species) %>%
  summarize(ntot_ltabundr = n(),
            nsys_ltabundr = length(which(included == TRUE & 
                                           EffType %in% c('S','F'))))

Now join these two datasets by cruise and species code:

mr <- full_join(abund_summ, ltabundr_summ, by=c('cruise', 'species'))

mr %>% head
# A tibble: 6 × 6
# Groups:   cruise [1]
  cruise species ntot_abund nsys_abund ntot_ltabundr nsys_ltabundr
   <dbl> <chr>        <int>      <int>         <int>         <int>
1    901 002              3          2             3             2
2    901 013              4          4             4             4
3    901 015              2          1             2             1
4    901 018              2          1             2             1
5    901 031              1          1             1             1
6    901 032              2          0             2             0

Compare the total On-Effort sightings in both outputs:

mr$ntot_abund %>% sum(na.rm=TRUE)
[1] 3223
mr$ntot_ltabundr %>% sum(na.rm=TRUE)
[1] 3227

Compare total sightings valid for use in density estimation (EffType "S" or "F" only, as well as other criteria such as Bft 0 - 6):

mr$nsys_abund %>% sum(na.rm=TRUE)
[1] 2478
mr$nsys_ltabundr %>% sum(na.rm=TRUE)
[1] 2478

Let’s find the rows with discrepancies in sighting counts:

bads <- which(mr$nsys_abund != mr$nsys_ltabundr | 
              mr$ntot_abund != mr$ntot_ltabundr | 
              is.na(mr$ntot_abund) | 
              is.na(mr$ntot_ltabundr) | 
              is.na(mr$nsys_abund) | 
              is.na(mr$nsys_ltabundr))
bads %>% length
[1] 7

Let’s look at those rows in the joined dataframe:

mr[bads, ]
# A tibble: 7 × 6
# Groups:   cruise [5]
  cruise species ntot_abund nsys_abund ntot_ltabundr nsys_ltabundr
   <dbl> <chr>        <int>      <int>         <int>         <int>
1   1203 015              1          0             2             0
2   1203 049              1          1             2             1
3   1621 073              5          4             5             3
4   2001 051              2          2             1             1
5   2001 059              1          0             2             1
6   1165 047             NA         NA             1             1
7   1631 003             NA         NA             1             0

To investigate these 5 discrepancies, we will write a helper function that returns sightings details from both outputs for a given cruise-species:

sight_compare <- function(abund, ltabundr, cruise, spp){
  message('ABUND:')
  abund %>%
    filter(CruzNo == cruise, species == paste0('SP',spp)) %>%
    select(5, 34, 26, 29, 33, 3) %>%
    mutate(use_sit = EffortSeg != 0) %>%
    select(-EffortSeg) %>%
    arrange(desc(use_sit)) %>%
    print

  abund %>%
    filter(CruzNo == 1631) %>% pull(species) %>% table
  
  
  message('\nLTabundR:')
  ltabundr %>%
    filter(Cruise == cruise, species == spp) %>%
    select(Cruise, DateTime, Bft, mixed, best, OnEffort, EffType, use, included) %>%
    rename(use_sit = included, use_effort = use) %>%
    arrange(desc(use_effort)) %>%
    tibble %>% print
}

Discrepancies

Below we investigate each discrepancy identified above. Note that there are a few intentional design features that may produce differences in the sightings counts between ABUND9 and LTabundR. Here are a few:

(1) In ABUND9, only sightings that occur while OnEffort == TRUE are returned; in contrast, LTabundR does not remove any sightings (it just flags them differently, using the included column variable). This will be evident in one of the discrepancies below, in which a sighting that occurred during a Beaufort Sea State of 7 is not returned by ABUND9, since the INP file for the ABUND routine specified that only sightings within Bft 0-6 should be used in analysis. Note, however, that one can readily filter LTabundR sightings to emulate ABUND9 output if needed.

(2) LTabundR includes one additional criterion for inclusion in analysis: the sighting must occur at or forward of the beam (note that this restriction can be deactivated in load_cohort_settings().

(3) Since point-in-polygon calculations are very different in the two programs, it is possible that sightings occurring very near geostratum margins may be included/excluded differently.

Cruise 1203, Species 015

In this case, LTabundR has a non-systematic sighting that ABUND has ignored.

sight_compare(abund, ltabundr, 1203, '015')
ABUND:
# A tibble: 1 × 6
  CruzNo DateTime        Beauf Mixed  best use_sit
   <int> <chr>           <int> <chr> <dbl> <lgl>  
1   1203 2012-5-16 11:58     4 " F"   74.9 TRUE   

LTabundR:
# A tibble: 2 × 9
  Cruise DateTime              Bft mixed  best OnEffort EffType use_effort
   <dbl> <dttm>              <dbl> <lgl> <dbl> <lgl>    <chr>   <lgl>     
1   1203 2012-05-13 13:19:44     5 TRUE  NaN   TRUE     N       TRUE      
2   1203 2012-05-16 11:58:30     4 FALSE  74.9 TRUE     N       TRUE      
# ℹ 1 more variable: use_sit <lgl>

Looking at the sighting details from LTabundR

(ltabundr %>% filter(Cruise == 1203, species == '015'))[1,]
    Event            DateTime      Lat       Lon OnEffort Cruise Mode OffsetGMT
270     S 2012-05-13 13:19:44 11.96733 -161.1727     TRUE   1203    C       -10
    EffType ESWsides Course SpdKt Bft SwellHght WindSpdKt RainFog HorizSun
270       N        2     18   9.2   5         6        21       5       12
    VertSun Glare Vis ObsL Rec ObsR ObsInd EffortDot EventNum
270      12 FALSE   6  073 235  280   <NA>      TRUE      268
                                file_das line_num stratum_HI_EEZ
270 CenPac1986-2020_Final_alb_edited.das   493027          FALSE
    stratum_OtherCNP stratum_WHICEAS year month day yday km_valid km_int
270             TRUE           FALSE 2012     5  13  134     TRUE      0
      km_cum ship  stratum  use eff_bloc seg_id SightNo Subgroup SightNoDaily
270 191053.4  OES OtherCNP TRUE     57-2   2330     069     <NA>  20120513_20
    Obs ObsStd Bearing Reticle DistNm Cue Method Photos Birds CalibSchool
270 235   TRUE      90      NA    0.1   3      1      Y     Y           N
    PhotosAerial Biopsy CourseSchool TurtleSp TurtleGs TurtleJFR TurtleAge
270            N      N           NA     <NA>       NA      <NA>      <NA>
    TurtleCapt PinnipedSp PinnipedGs BoatType BoatGs PerpDistKm species best
270       <NA>       <NA>         NA     <NA>     NA     0.1852     015  NaN
    low high  prob mixed   ss_tot lnsstot ss_percent n_sp n_obs n_best n_low
270 NaN  NaN FALSE  TRUE 6.956522 1.93968        NaN    2     1      1     1
    n_high calibr ss_valid mixed_max spp_max included
270      1   TRUE    FALSE     FALSE    <NA>     TRUE

According to ABUND, this sighting is not mixed-species, but LTabundR says it is. Looking at the raw DAS

das_file <- 'data/surveys/CenPac1986-2020_Final_alb.das'
das <- das_readtext(das_file)
i <- which(substr(das$das, 6, 18) == '131944 051312')
das$das[i]
[1] "268S.131944 051312 N11:58.04 W161:10.36  069  235    3    1  090      0.10              N    N    N"                                             
[2] "269A.131944 051312 N11:58.04 W161:10.36  069         Y    Y  033  015    "                                                                       
[3] "269C.131944 051312 N11:58.04 W161:10.36  Overall estimate for full group- never saw all at once.  -EMO"                                          
[4] "268G.131944 051312 N11:58.04 W161:10.36  069    A  235    1  090      0.10 "                                                                     
[5] "269A.131944 051312 N11:58.04 W161:10.36  069         Y    Y  033  015"                                                                           
[6] "269C.131944 051312 N11:58.04 W161:10.36  begin PC protool, first sighting is subgroup 'A', acoustics already tracking        "                   
[7] "269C.131944 051312 N11:58.04 W161:10.36  photos taken during sighting indicate Steno (015) present. Not seen or estimated during sighting.  -EMO"

We see that this was a sighting of false killer whales during which species 015 was picked up during photo-ID. In the absence of a percent composition estimate for this sighting, it was ignored by ABUND.

Cruise 1203, Species 049

In this case, LTabundR has a non-systematic sighting that ABUND has ignored.

sight_compare(abund, ltabundr, 1203, '049')
ABUND:
# A tibble: 1 × 6
  CruzNo DateTime       Beauf Mixed  best use_sit
   <int> <chr>          <int> <chr> <dbl> <lgl>  
1   1203 2012-5-3 14:21     6 " F"   1.16 TRUE   

LTabundR:
# A tibble: 2 × 9
  Cruise DateTime              Bft mixed  best OnEffort EffType use_effort
   <dbl> <dttm>              <dbl> <lgl> <dbl> <lgl>    <chr>   <lgl>     
1   1203 2012-05-03 14:21:40     6 FALSE  1.16 TRUE     S       TRUE      
2   1203 2012-05-07 10:54:09     7 FALSE  1    TRUE     N       FALSE     
# ℹ 1 more variable: use_sit <lgl>

Looking at the sighting details from LTabundR

(ltabundr %>% filter(Cruise == 1203, species == '049'))[2,]
     Event            DateTime      Lat       Lon OnEffort Cruise Mode
1544     S 2012-05-07 10:54:09 5.960333 -162.1255     TRUE   1203    C
     OffsetGMT EffType ESWsides Course SpdKt Bft SwellHght WindSpdKt RainFog
1544       -10       N        2    269   8.8   7         9        30       5
     HorizSun VertSun Glare Vis ObsL Rec ObsR ObsInd EffortDot EventNum
1544        5       1 FALSE 4.5  238 328  073   <NA>      TRUE      145
                                 file_das line_num stratum_HI_EEZ
1544 CenPac1986-2020_Final_alb_edited.das   489960          FALSE
     stratum_OtherCNP stratum_WHICEAS year month day yday km_valid km_int
1544             TRUE           FALSE 2012     5   7  128     TRUE      0
       km_cum ship  stratum   use eff_bloc seg_id SightNo Subgroup SightNoDaily
1544 189739.1  OES OtherCNP FALSE     54-1   2309     052     <NA>   20120507_9
     Obs ObsStd Bearing Reticle DistNm Cue Method Photos Birds CalibSchool
1544 238   TRUE     310      10   0.52   3      4      N     N           N
     PhotosAerial Biopsy CourseSchool TurtleSp TurtleGs TurtleJFR TurtleAge
1544            N      N           NA     <NA>       NA      <NA>      <NA>
     TurtleCapt PinnipedSp PinnipedGs BoatType BoatGs PerpDistKm species best
1544       <NA>       <NA>         NA     <NA>     NA  0.7377314     049    1
     low high  prob mixed ss_tot lnsstot ss_percent n_sp n_obs n_best n_low
1544   1    1 FALSE FALSE      1       0          1    1     1      1     1
     n_high calibr ss_valid mixed_max spp_max included
1544      1   TRUE     TRUE      TRUE     049    FALSE

This is a sighting of a single Ziphiid whale. It appears to be within the geostratum:

cruzi <- filter_cruz(cruz, spp='049', years = 1988)
map_cruz(cruzi)

Loooking at the raw DAS data …

i <- which(substr(das$das, 6, 18) == '105409 050712')
das$das[(i[1] - 10):(i[1] + 3)]
 [1] "135V.104143 050712 N05:57.61 W162:05.66    7   09  070      30.0"                                   
 [2] "136N.104143 050712 N05:57.61 W162:05.66  272 09.3"                                                  
 [3] "137W.104143 050712 N05:57.61 W162:05.66    5   05   01  050  4.5"                                   
 [4] "138*.104234 050712 N05:57.62 W162:05.79"                                                            
 [5] "139*.104434 050712 N05:57.62 W162:06.09"                                                            
 [6] "140*.104634 050712 N05:57.63 W162:06.39"                                                            
 [7] "141N.104713 050712 N05:57.63 W162:06.49  269 08.8"                                                  
 [8] "142*.104834 050712 N05:57.63 W162:06.69"                                                            
 [9] "143*.105034 050712 N05:57.63 W162:06.99"                                                            
[10] "144*.105234 050712 N05:57.63 W162:07.29"                                                            
[11] "145S.105409 050712 N05:57.62 W162:07.53  052  238    3    4  310 10.0 0.52              N    N    N"
[12] "146A.105409 050712 N05:57.62 W162:07.53  052         N    N  049               "                    
[13] "   1                                     238    1    1    1  100"                                   
[14] "147*.105434 050712 N05:57.62 W162:07.59"                                                            

Note that the Beaufort sea state for this sighting is 7. When the ABUND routine was run, the INP file instructed it to disregard any sightings beyond Bft 6. This is why ABUND has no record of this sighting.

Cruise 1165, Species 047

In this case, LTabundR has a systematic sighting of a pygmy sperm whale that ABUND has ignored.

sight_compare(abund, ltabundr, 1165, '047')
ABUND:
# A tibble: 0 × 6
# ℹ 6 variables: CruzNo <int>, DateTime <chr>, Beauf <int>, Mixed <chr>,
#   best <dbl>, use_sit <lgl>

LTabundR:
# A tibble: 1 × 9
  Cruise DateTime              Bft mixed  best OnEffort EffType use_effort
   <dbl> <dttm>              <dbl> <lgl> <dbl> <lgl>    <chr>   <lgl>     
1   1165 1988-07-30 19:28:00     0 FALSE  1.16 TRUE     S       TRUE      
# ℹ 1 more variable: use_sit <lgl>
(ltabundr %>% filter(Cruise == 1165, species == '047'))[1,]
     Event            DateTime  Lat       Lon OnEffort Cruise Mode OffsetGMT
1390     S 1988-07-30 19:28:00 26.3 -121.1167     TRUE   1165    C        NA
     EffType ESWsides Course SpdKt Bft SwellHght WindSpdKt RainFog HorizSun
1390       S        2    163  10.5   0        NA        NA       1       NA
     VertSun Glare Vis ObsL Rec ObsR ObsInd EffortDot EventNum
1390      NA    NA  NA  038 068  051   <NA>      TRUE       92
                                 file_das line_num stratum_HI_EEZ
1390 CenPac1986-2020_Final_alb_edited.das    50173          FALSE
     stratum_OtherCNP stratum_WHICEAS year month day yday km_valid km_int
1390             TRUE           FALSE 1988     7  30  212     TRUE      0
       km_cum ship  stratum  use eff_bloc seg_id SightNo Subgroup SightNoDaily
1390 15232.08  MAC OtherCNP TRUE     51-0   2279      07     <NA>  19880730_13
     Obs ObsStd Bearing Reticle DistNm Cue Method Photos Birds CalibSchool
1390 051   TRUE      45    3.38    0.9   3      4   <NA>     N        <NA>
     PhotosAerial Biopsy CourseSchool TurtleSp TurtleGs TurtleJFR TurtleAge
1390         <NA>   <NA>           NA     <NA>       NA      <NA>      <NA>
     TurtleCapt PinnipedSp PinnipedGs BoatType BoatGs PerpDistKm species
1390       <NA>       <NA>         NA     <NA>     NA   1.178606     047
        best low     high  prob mixed  ss_tot   lnsstot ss_percent n_sp n_obs
1390 1.15942   1 1.259921 FALSE FALSE 1.15942 0.1479201          1    1     3
     n_best n_low n_high calibr ss_valid mixed_max spp_max included
1390      3     3      3   TRUE     TRUE      TRUE     047     TRUE

To investigate this sighting, we can filter our cruz object and take a look at a map of this sighting:

cruzi <- filter_cruz(cruz, spp='047', years = 2012)
map_cruz(cruzi)

Using that map we see that this sighting occurred just inside of the OtherCNP geostratum. It is likely that the point-in-polygon subroutines inside ABUND9 decided that this sighting was out of the study area, and therefore excluded it. The subroutines used by LTabundR, which are based in the R package sf, should not be wrong in this case.

Cruise 1621-073

In this case, both LTabundR and ABUND logged the same total number of sightings, but ABUND determined that that one was not valid for for density estimation, whereas LTabundR determined that two of them were not valid.

sight_compare(abund, ltabundr, 1621, '073')
ABUND:
# A tibble: 5 × 6
  CruzNo DateTime         Beauf Mixed  best use_sit
   <int> <chr>            <int> <chr> <dbl> <lgl>  
1   1621 2002-11-8 14:19      3 " F"   3.48 TRUE   
2   1621 2002-11-20 16:2      4 " F"   3.97 TRUE   
3   1621 2002-11-22 10:27     4 " F"   4.96 TRUE   
4   1621 2002-11-22 12:20     4 " F"   1.16 TRUE   
5   1621 2002-11-29 13:48     5 " F"   2.32 TRUE   

LTabundR:
# A tibble: 7 × 9
  Cruise DateTime              Bft mixed  best OnEffort EffType use_effort
   <dbl> <dttm>              <dbl> <lgl> <dbl> <lgl>    <chr>   <lgl>     
1   1621 2002-11-08 14:19:04     3 FALSE  3.48 TRUE     S       TRUE      
2   1621 2002-11-20 16:02:58     4 FALSE  3.97 TRUE     S       TRUE      
3   1621 2002-11-22 10:27:28     4 FALSE  4.96 TRUE     S       TRUE      
4   1621 2002-11-22 12:20:33     4 FALSE  1.16 TRUE     S       TRUE      
5   1621 2002-11-29 13:48:49     5 FALSE  2.32 TRUE     N       TRUE      
6   1621 2002-11-20 12:59:22     4 FALSE  1.16 FALSE    S       FALSE     
7   1621 2002-11-21 11:57:55     4 FALSE  1.16 FALSE    S       FALSE     
# ℹ 1 more variable: use_sit <lgl>

The discrepancy is in the November 8, 2002 sighting at 14:19:04. Looking at the sighting details from LTabundR

(ltabundr %>% filter(Cruise == 1621, species == '073'))[1,]
      Event            DateTime    Lat       Lon OnEffort Cruise Mode OffsetGMT
11736     S 2002-11-08 14:19:04 22.539 -171.7638     TRUE   1621    C        10
      EffType ESWsides Course SpdKt Bft SwellHght WindSpdKt RainFog HorizSun
11736       S        2    108   9.4   3         5        10       1        3
      VertSun Glare Vis ObsL Rec ObsR ObsInd EffortDot EventNum
11736       2 FALSE   6  200 073  208   <NA>      TRUE      129
                                  file_das line_num stratum_HI_EEZ
11736 CenPac1986-2020_Final_alb_edited.das   273857           TRUE
      stratum_OtherCNP stratum_WHICEAS year month day yday km_valid km_int
11736             TRUE           FALSE 2002    11   8  312     TRUE      0
       km_cum ship stratum  use eff_bloc seg_id SightNo Subgroup SightNoDaily
11736 91181.6  DSJ  HI_EEZ TRUE    155-7    543     275     <NA>  20021108_57
      Obs ObsStd Bearing Reticle DistNm Cue Method Photos Birds CalibSchool
11736 208   TRUE      93     0.5   2.75   6      4      Y     N        <NA>
      PhotosAerial Biopsy CourseSchool TurtleSp TurtleGs TurtleJFR TurtleAge
11736         <NA>   <NA>           NA     <NA>       NA      <NA>      <NA>
      TurtleCapt PinnipedSp PinnipedGs BoatType BoatGs PerpDistKm species
11736       <NA>       <NA>         NA     <NA>     NA    5.08602     073
          best low high  prob mixed   ss_tot  lnsstot ss_percent n_sp n_obs
11736 3.478261   3    3 FALSE FALSE 3.478261 1.246532          1    1     6
      n_best n_low n_high calibr ss_valid mixed_max spp_max included
11736      6     6      6   TRUE     TRUE      TRUE     073    FALSE

LTabundR was correct to exclude this sighting because the bearing was past the beam (93 degrees). The bearing of 93 is also given in the raw DAS data…

das_file <- 'data/surveys/CenPac1986-2020_Final_alb.das'
das <- das_readtext(das_file)
i <- which(substr(das$das, 6, 18) == '141904 110802')
das$das[i]
[1] "129S.141904 110802 N22:32.34 W171:45.83  275  208    6    4  093  0.5 2.75"
[2] "130A.141904 110802 N22:32.34 W171:45.83  275         Y    N  073"          

… so it is not clear why ABUND did not exclude this sighting as invalid as well. It may be that ABUND was not expecting bearings above 90 degrees in the on-effort data.

Cruise 1631, Species 003

In this case, there was a non-systematic sighting of species 003 that was found by LTabundR but not by ABUND.

sight_compare(abund, ltabundr, 1631, '003')
ABUND:
# A tibble: 0 × 6
# ℹ 6 variables: CruzNo <int>, DateTime <chr>, Beauf <int>, Mixed <chr>,
#   best <dbl>, use_sit <lgl>

LTabundR:
# A tibble: 1 × 9
  Cruise DateTime              Bft mixed  best OnEffort EffType use_effort
   <dbl> <dttm>              <dbl> <lgl> <dbl> <lgl>    <chr>   <lgl>     
1   1631 2006-09-02 07:35:14     5 TRUE    NaN TRUE     N       TRUE      
# ℹ 1 more variable: use_sit <lgl>
(ltabundr %>% filter(Cruise == 1631, species == '003'))
     Event            DateTime      Lat       Lon OnEffort Cruise Mode
2289     S 2006-09-02 07:35:14 19.28883 -156.8227     TRUE   1631    P
     OffsetGMT EffType ESWsides Course SpdKt Bft SwellHght WindSpdKt RainFog
2289        10       N        2    155    10   5         4        18       1
     HorizSun VertSun Glare Vis ObsL Rec ObsR ObsInd EffortDot EventNum
2289       10       2 FALSE   7  073 196  197   <NA>      TRUE      026
                                 file_das line_num stratum_HI_EEZ
2289 CenPac1986-2020_Final_alb_edited.das   405222           TRUE
     stratum_OtherCNP stratum_WHICEAS year month day yday km_valid km_int
2289             TRUE            TRUE 2006     9   2  245     TRUE      0
       km_cum ship stratum  use eff_bloc seg_id SightNo Subgroup SightNoDaily
2289 139591.4  Mc2 WHICEAS TRUE    217-1   1149     090     <NA>  20060902_51
     Obs ObsStd Bearing Reticle DistNm Cue Method Photos Birds CalibSchool
2289 197   TRUE      59     0.3   4.12   2      4      N     N        <NA>
     PhotosAerial Biopsy CourseSchool TurtleSp TurtleGs TurtleJFR TurtleAge
2289         <NA>   <NA>           NA     <NA>       NA      <NA>      <NA>
     TurtleCapt PinnipedSp PinnipedGs BoatType BoatGs PerpDistKm species best
2289       <NA>       <NA>         NA     <NA>     NA   6.540392     003  NaN
     low high  prob mixed   ss_tot  lnsstot ss_percent n_sp n_obs n_best n_low
2289 NaN   NA FALSE  TRUE 3.478261 1.246532        NaN    2     1      1     1
     n_high calibr ss_valid mixed_max spp_max included
2289      0   TRUE    FALSE     FALSE    <NA>     TRUE

The map indicates that this is not a geostratum boundary issue:

cruzi <- filter_cruz(cruz, spp='003', years = 2006)
map_cruz(cruzi)

Looking at the raw DAS data …

das$das[405190:405220]
 [1] "   C 120000 090106                      In transit to study area Sep 1. A.J. 10/18/06."
 [2] "001B.061813 090206 N19:28.97 W156:55.51 1631    p   10    Y"                           
 [3] "002R.061813 090206 N19:28.97 W156:55.51    N"                                          
 [4] "003P.061813 090206 N19:28.97 W156:55.51  197  007  277"                                
 [5] "004V.061813 090206 N19:28.97 W156:55.51    5   04  150      18.0"                      
 [6] "005N.061813 090206 N19:28.97 W156:55.51  154 10.4"                                     
 [7] "006W.061813 090206 N19:28.97 W156:55.51    1   10   03  032  7.0"                      
 [8] "007C.061823 090206 N19:28.94 W156:55.50"                                               
 [9] "008*.062019 090206 N19:28.63 W156:55.33"                                               
[10] "009*.063019 090206 N19:27.12 W156:54.53"                                               
[11] "010P.063950 090206 N19:25.68 W156:53.75  196  197  007"                                
[12] "011C.063950 090206 N19:25.68 W156:53.75"                                               
[13] "012V.063950 090206 N19:25.68 W156:53.75    5   04  150      18.0"                      
[14] "013N.063950 090206 N19:25.68 W156:53.75  151 10.1"                                     
[15] "014W.063950 090206 N19:25.68 W156:53.75    1   10   03  032  7.0"                      
[16] "015*.064019 090206 N19:25.61 W156:53.71"                                               
[17] "016N.064253 090206 N19:25.22 W156:53.50  153 10.3"                                     
[18] "017*.065019 090206 N19:24.11 W156:52.90"                                               
[19] "018*.070019 090206 N19:22.62 W156:52.11"                                               
[20] "019*.071019 090206 N19:21.10 W156:51.30"                                               
[21] "020*.072019 090206 N19:19.59 W156:50.53"                                               
[22] "021P.072113 090206 N19:19.46 W156:50.45  073  196  197"                                
[23] "022V.072113 090206 N19:19.46 W156:50.45    5   04  150      18.0"                      
[24] "023N.072113 090206 N19:19.46 W156:50.45  155 10.0"                                     
[25] "024W.072113 090206 N19:19.46 W156:50.45    1   10   02  032  7.0"                      
[26] "025*.073019 090206 N19:18.09 W156:49.75"                                               
[27] "026S.073514 090206 N19:17.33 W156:49.36  090  197    2    4  059  0.3 4.12  013"       
[28] "027A.073514 090206 N19:17.33 W156:49.36  090         N    N  177  003"                 
[29] "   1                                     197              3"                           
[30] "028*.074019 090206 N19:16.53 W156:48.94"                                               
[31] "029V.074337 090206 N19:16.01 W156:48.67    4   04  150      12.0"                      

It appears that this was a multi-species sighting, but no species percentages were provided. In the absence of a percent composition estimate for this sighting, it was ignored by ABUND.

Group sizes

This plot compares the group size estimates returned by ABUND9 and LTabundR. The results should be identical:

# Format ABUND
abund <- 
  SIGHTS %>% 
  tidyr::pivot_longer(cols = 31:101,
                      names_to = 'species',
                      values_to = 'best') %>%
  mutate(Region = gsub(' ','',Region)) %>%
  filter(best > 0,
         ! Region %in% c('NONE', 'Off-Transect'),
         EffortSeg > 0) %>% 
  select(Cruise = CruzNo, TotSS, LnTotSS, species, best) %>% 
  mutate(Software='ABUND9')

# Format LTabundR
ltabundr <- 
  cruz$cohorts$all$sightings %>% 
  filter(OnEffort == TRUE, 
         included == TRUE, 
         EffType %in% c('S', 'F')) %>% 
  select(Cruise, TotSS = ss_tot, LnTotSS = lnsstot, species, best) %>% 
  mutate(Software='LTabundR')

# Combine the datasets
ss <- rbind(abund, ltabundr)

# Plot the datasets
ggplot(ss,
       aes(x=best, 
           y=factor(Cruise), 
           col=Software, 
           pch=Software)) + 
      geom_point(position=ggstance::position_dodgev(height=0.5),
                 alpha=.6) +
      scale_x_continuous(trans='log', breaks=c(1,2, 5,10,25,50,100,500,1000,2500,5000)) +
      xlab('log Estimated School Size') +
      ylab('Cruise') + 
  theme_light()

Note two important differences in how the two programs calibrate group size:

(1) If an observer is not included in the Group Size Calibration Coefficients .DAT file, ABUND applies a default coefficient (0.8625) to scale group size estimates; however, it applies this calibration to groups of *all** sizes, including solo animals or small groups of 2-3. In LTabundR, this is also the default, but users can choose to restrict calibrations for unknown observers to group size estimates of any size (see load_cohort_settings()).

(2) Note that ABUND9 calibrates school sizes slightly differently than earlier versions of the software. The ABUND9 release notes mention a bug in previous versions that incorrectly calibrated school size. LTabundR corresponds perfectly with ABUND9 school size calibrations, but not with ABUND8 or earlier.

Effort

Perform basic formatting before doing any filtering , first for ABUND data…

# Format ABUND
abund <- 
  EFFORT %>% 
  mutate(Region = gsub(' ','',Region)) %>%
  mutate(dt = paste0(Yr,
                     stringr::str_pad(Mo, width=2, pad='0', side='left'),
                     stringr::str_pad(Da, width=2, pad='0', side='left'))) %>% 
  rename(Cruise = CruzNo, Date = dt, km = length) %>% 
  mutate(effort = ifelse(Region == 'Off-Transect', 'N or F', 'S')) %>% 
  mutate(OnEffort = 'TRUE') %>% 
  mutate(software = 'ABUND 9') %>% 
  select(Region:Cruise, Date, effort, software)

…then for LTabundR data. This helper function will be used below to format our cruz object:

ltabundr_prep <- function(cruz){
  cruz$cohorts$all$das %>% 
    mutate(dt = gsub('-','',substr(DateTime, 1,10))) %>% 
    rename(Cruise = Cruise, Date = dt, km = km_int) %>% 
  mutate(effort = ifelse(EffType == 'S', 'S', 'N or F')) %>%
  mutate(software = 'LTabundR') %>% 
  filter(OnEffort == TRUE) %>%
  filter(!is.na(effort)==TRUE) %>% 
  group_by(software, Cruise, Date, effort) %>% 
  summarize(km = sum(km))
}

Here is one more helper function to generate an interactive plot that compares systematic effort for each cruise in ABUND vs LTabundR:

eff_plot <- function(abund, ltabundr){
  # Join datasets
  eff <- 
    rbind(abund %>% dplyr::select(Cruise, Date, km, effort, software), 
          ltabundr %>% dplyr::select(Cruise, Date, km, effort, software)) %>% 
    filter(effort == 'S') %>% 
    group_by(Cruise) %>% 
    summarize(km_abund = sum(km[software == 'ABUND 9']) %>% round,
                     km_ltabundr = sum(km[software == 'LTabundR']) %>% round)
  
  # Prepare plot
  p <- 
    ggplot(eff, 
           aes(x = km_abund, 
               y = km_ltabundr, 
               col=factor(Cruise))) +
    geom_abline(slope=1, intercept=0, lty=3) + 
    geom_point(alpha=.6) + 
    ylab('LTabundR') + 
    xlab('ABUND9') + 
    scale_x_continuous(breaks = seq(0, 15000, by=2500)) + 
    scale_y_continuous(breaks = seq(0, 15000, by=2500)) + 
    labs(title='Effort per cruise', col='Cruise') + 
    theme_light()
  
  print(p)
}

This helper function is a tabular version of the comparison plot:

error_table <- function(df){
  dfi <- 
    df %>% 
  filter(effort == 'S') %>% 
  group_by(Cruise) %>% 
  summarize(year = lubridate::year(lubridate::ymd(Date))[1],
            km_abund = sum(km[software == 'ABUND 9']) %>% round,
            km_ltabundr = sum(km[software == 'LTabundR']) %>% round) %>% 
  mutate(diff = km_abund - km_ltabundr) %>% 
  mutate(prop = ((abs(diff) / km_abund)) %>% round(5)) %>% 
  arrange(desc(prop)) %>% 
  as.data.frame()
  
  lm(km_ltabundr ~ km_abund, data=dfi) %>% summary %>% print
  
  return(dfi)
}

Now we format the LTabundR data and produce the comparison plot, which shows near-perfect correspondence:

# Format LTabundR data for comparison
ltabundr <- ltabundr_prep(cruz)

# Interactive plot
eff_plot(abund, ltabundr)

And here is the tabular version of that plot, sorted by the difference in KM, relative to total length of the Cruise (according to ABUND 9). The statistical summary of a linear regression is also returned:

# Combine software data
df <- rbind(abund %>% select(software, Cruise, Date, effort, km), 
            ltabundr %>% select(software, Cruise, Date, effort, km))

# Produce comparison table
error_table(df)

Call:
lm(formula = km_ltabundr ~ km_abund, data = dfi)

Residuals:
   Min     1Q Median     3Q    Max 
-8.659 -6.692 -2.933  3.094 34.505 

Coefficients:
             Estimate Std. Error  t value            Pr(>|t|)    
(Intercept) 5.6871663  2.3871073    2.382              0.0229 *  
km_abund    1.0023296  0.0004711 2127.734 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.337 on 34 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.527e+06 on 1 and 34 DF,  p-value: < 0.00000000000000022
   Cruise year km_abund km_ltabundr diff    prop
1    1426 1991       11          28  -17 1.54545
2    1628 2005        2           1    1 0.50000
3    1508 1993        3           2    1 0.33333
4    1081 1987       50          60  -10 0.20000
5    1108 2011     2493        2539  -46 0.01845
6    1203 2012     1492        1507  -15 0.01005
7    1623 2003     4970        5011  -41 0.00825
8    1604 2016     1557        1568  -11 0.00706
9    1303 2013     3814        3838  -24 0.00629
10   2001 2020     4417        4442  -25 0.00566
11    901 2009     2004        2013   -9 0.00449
12   1631 2006     4040        4058  -18 0.00446
13   1370 1990     5425        5447  -22 0.00406
14   1616 2000     4960        4979  -19 0.00383
15   1165 1988     3433        3446  -13 0.00379
16   1706 2017     8965        8999  -34 0.00379
17   1004 2010     1123        1127   -4 0.00356
18   1705 2017     7345        7370  -25 0.00340
19   1080 1987     4367        4381  -14 0.00321
20   1629 2005    11072       11106  -34 0.00307
21   1610 1998     3163        3172   -9 0.00285
22   1642 2010     6347        6365  -18 0.00284
23   1614 1999     5883        5899  -16 0.00272
24   1164 1988      804         806   -2 0.00249
25   1607 1997     6669        6685  -16 0.00240
26   1622 2002     4599        4610  -11 0.00239
27   1641 2010    11721       11746  -25 0.00213
28   1621 2002    12436       12462  -26 0.00209
29   1611 1998     4055        4063   -8 0.00197
30   1268 1989     3750        3757   -7 0.00187
31    990 1986     3793        3800   -7 0.00185
32    989 1986      747         748   -1 0.00134
33   1369 1990      923         924   -1 0.00108
34   1001 2010     1134        1135   -1 0.00088
35   1617 2001        1           1    0 0.00000
36   1624 2003      763         763    0 0.00000

Note that there are a few intentional design features that may produce differences in effort calculation and segment divisions between the ABUND9 and LTabundR programs. Here are a few:

(1) LTabundR works with DAS data that are loaded and formatted using swfscDAS:das_read() and das_process(). It is possible that these functions categorize events as On- or Off-Effort slightly differently than ABUND, or apply other differences that would be difficult for us to know or track.

(2) After loading the data, LTabundR removes rows with invalid Cruise numbers, invalid times, and invalid coordinates. As far as we can tell, ABUND does not remove such missing data. This is a relatively minor point; in processing the 1986-2020 data (623,640 rows), 287 rows are missing Cruise info; 1,430 are missing valid times; and 556 are missing valid coordinates, for a total of 2,273 rows removed out of more than 625,000 (0.3% of rows). Many of these rows with missing data have the exact same coordinates and timestamps as complete rows nearby, since WinCruz can sometimes produce multiple lines at the same time when setting up metadata for the research day, which means that this row removal will rarely, if ever, effect total effort calculated.

(3) In ABUND, custom functions are used to calculate whether DAS coordinates occur within geostrata are difficult to validate, and it is possible that they differ from the functions used in R for the same purpose. LTabundR uses functions within the well-established sf package to do these same calculations.

(4) Both ABUND and LTabundR calculate the distance surveyed based on the sum of distances between adjacent rows in the DAS file. They do this differently, which may yield minor differences in total effort segment track lengths, but the default inputs for the load_survey_settings() function were selected to come as close to replicating the ABUND9 routine as possible. The ABUND9 routine (and therefore the LTabundR defaults) allow for large gaps (as much as 100 km) between subsequent rows within a single day of effort. The ABUND9 subroutine prints a warning message when the gap is greater than 30 km, but does not modify its estimate of distance traveled. This allows for the possibility that, in rare cases, estimates of distance surveyed will be spuriously large.

(5) While ABUND uses a minimum length threshold to create segments, such that full-length segments are never less than that threshold and small remainder segments always occur at the end of a continuous period of effort, LTabundR uses an approach more similar to the effort-chopping functions in swfscDAS: it looks at continuous blocs of effort, determines how many full-length segments can be defined in each bloc, then randomly places the remainder within that bloc according to a set of user-defined settings (see load_survey_settings(). This process produces full-length segments whose distribution of exact lengths is centered about the target length, rather than always being greater than the target length.

(6) To control the particularities of segmentizing, LTabundR uses settings such as segment_max_interval, which controls how discontinuous effort is allowed to be pooled into the same segment. These rules may produce slight differences in segment lengths.

(7) Note that, since ABUND is a loop-based routine while LTabundR is modular, segments identified by the two program will never be exactly identical, and a 1:1 comparison of segments produced by the two programs is not possible.