Google Cloud Platform (GCP) Data Upload

Google Cloud Platform Bulk Upload Overview

Instructions for bulk upload of PIFSC acoustic data files to GCP (Google Cloud Platform) using gsutil.

For more detail on this overall effort see the PAM-Cloud Cloud Storage page.

  • gcloud guides/references can be found here.
  • gsutil guides/references can be found here.

Install the Google Cloud SDK using Windows PowerShell

Install using Windows PowerShell. Paste in the two commands below and hit Enter. Use all defaults during installation. The first time it runs it should ask you to log in with your NOAA Google account. When prompted to select the cloud project, choose ggn-nmfs-pamdata-prod-1.

Run this:

(New-Object Net.WebClient).DownloadFile("https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe", "$env:Temp\GoogleCloudSDKInstaller.exe")
& $env:Temp\GoogleCloudSDKInstaller.exe
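
If the installer does not prompt you to select a project, or you need to switch projects later, the active project can be set and checked from a command prompt (standard gcloud commands; the project ID is the one above):

gcloud config set project ggn-nmfs-pamdata-prod-1
gcloud config list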

GCP authorization

Use the Windows Command Prompt to interact with gsutil once it is installed.

To check which accounts are authorized

gcloud auth list

To log in to an account

gcloud auth login

A web browser window will open where you login with your NOAA Google Account email address and password.

To log out of all accounts (e.g., when done on a shared computer)

gcloud auth revoke --all
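
As a quick check that authorization worked (assuming your account has access to the pifsc-1 bucket used below), list the top level of the bucket:

gsutil ls gs://pifsc-1/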

Manually upload a folder (cp)

Use the below command to copy an entire folder. This will upload everything regardless of whether it is already present on the cloud and will upload the contents of all subdirectories.

Change <local_source> to the path to the local folder and <cloud_destination> to the GCP parent folder that you want the local folder to be copied into. If that folder doesn’t yet exist on the cloud it will be created.

gsutil -m cp -c -r <local_source> <cloud_destination>

Example:

The below command will copy the entire wav folder into the recordings folder in the GCP pifsc-1 bucket.

gsutil -m cp -c -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/
  • The -m flag will use multi-threading/multi-processing to process multiple files at once to speed things up. But a command using -m cannot be exited out of using Ctrl + C
  • The cp argument will copy all files from the local location to the cloud regardless of whether they are already present on the cloud
  • The -c flag is for cp and means that the copying will continue if an error occurs
  • The -r flag is for recursive, meaning it will operate on folders (rather than a single file) and will include any subfolders (a single-file example is shown below)
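
For comparison, a single file can be copied by dropping the -r flag and naming the file directly. The file name here is hypothetical; substitute a real file on the server:

gsutil cp //PICCRPNAS/CRP4/recordings/wav/example_file.wav gs://pifsc-1/glider/recordings/wav/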

Include a log file

Optionally, a log file can be created and saved:

gsutil -m cp -c -r -L <logfile> <local_source> <cloud_destination>
  • The -L flag outputs a log to the location at <logfile>. Enter the full path and desired logfile name, such as C:/users/user.name/desktop/log_file.txt

If an existing log file is loaded as <logfile>, gsutil will compare the contents of the cloud folder with the files listed in the log file and skip any files that have already been uploaded. It will append new uploads to the specified log file.
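
For example, re-running the earlier wav upload with the same log file (using the log path from the bullet above) will skip anything already recorded in the log as uploaded:

gsutil -m cp -c -r -L C:/users/user.name/desktop/log_file.txt //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/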

Restarting an upload can lead to unexpected issues with nested subdirectories - see the trailing slashes note below.

If you want to copy an entire folder and preserve its original name (e.g., wav), the source folder should not have a trailing slash (/) and the cloud destination should have a trailing slash. The cloud destination must be the parent folder where you want the local source folder to be copied into. In the above example, all the files will be copied to a wav folder within the gs://pifsc-1/glider/recordings/ folder. If wav doesn’t exist it will be created. If wav does exist, the local contents will be added to it.

If you want to copy an entire folder but change its name in the process (i.e., just copy the contents of a folder to a new location), then neither the source folder nor the destination should have a trailing slash, and the cloud destination should be the new name. Following the above example, if you wanted to rename wav to sound_files (but still have it within the recordings folder) use gsutil -m cp -c -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/sound_files. Be careful: this can lead to nesting issues if the cloud directory already exists!

These issues are exacerbated if an upload gets interrupted and is restarted.

If the local source folder is being uploaded in its entirety to the destination folder (e.g., uploading ‘Wake’ into the ‘200kHz’ folder where the parent folder is specified as the <cloud_destination>, as is being done for HARP data), then this cp process can be restarted with the log and files should appear where expected.

If the folder being copied is being renamed (as is being done for towed array data), this can cause unexpected nesting issues - you cannot restart with the original command or the source directory will be created again, nested inside a folder with the original source name. For example, if a recordings folder was being copied to a year_cruise_number folder and the process gets restarted, the remaining files will be put in a year_cruise_number/recordings/recordings folder. To avoid this, restart with the following (a concrete example follows the flag notes below):

gsutil -m cp -c -r -n -L <logfile> //local/path/* gs://cloud_bucket/path/
  • The -n flag will check if a file already exists and skip it if it does. It only checks file names and will not check dates or sizes (as rsync does)
  • The * indicates to copy the contents of the path folder rather than the folder itself
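
Applied to the recordings to year_cruise_number example above, the restart command looks like this (local and bucket paths are kept as placeholders):

gsutil -m cp -c -r -n -L <logfile> //local/path/recordings/* gs://cloud_bucket/path/year_cruise_number/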

Manually sync a folder (rsync) - preferred method

Use the below command to sync the contents of a folder. This will check what files are already present and only upload new or changed files (no log file required). The syncing function rsync operates on contents rather than folders, so the destination path is slightly different than when using cp - specify the actual target folder rather than the parent folder. Nested subfolders are less of a problem.

Running rsync after cp

If a folder was previously copied with cp and you run rsync on it to update it, it will need to run a series of checks and update timestamps, which will take a while (e.g., ~24 hours for 8 TB). But once that has completed, future rsync runs will go much faster.

Set <local_source> to the path to the local folder that you want to upload and <cloud_destination> to the target GCP folder (not the parent folder). If that folder doesn’t yet exist on the cloud it will be created. rsync will copy the contents of <local_source> into <cloud_destination>.

gsutil -m rsync -r <local_source> <cloud_destination>

Example:

gsutil -m rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/
  • The -m flag will use multi-threading/multi-processing to process multiple files at once to speed things up. But a command using -m cannot be exited out of using Ctrl + C
  • The rsync argument will sync files across the local and cloud folders, so it will check what is on each first and only copy what is new/updated (by default it compares file sizes and modification times)
  • The -r flag is for recursive meaning it will operate on folders (rather than a single file) and will include any subfolders

Optional flags

  • The -n flag can be placed after rsync and can be used to do a ‘dry run’: it reports what would happen if you ran without the -n flag but won’t actually copy/delete/modify anything. Useful for testing! See the example below
  • The -d flag can be placed after rsync and means it will delete files from the cloud that aren’t on the source location any more. Be careful with this because you could accidentally delete things if the local/cloud paths get switched!
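
For example, a dry run of the earlier wav sync (nothing is actually copied, deleted, or modified):

gsutil -m rsync -n -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/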

gsutil may suggest using parallel composite uploads to speed up transfer times. To do this, add -o GSUtil:parallel_composite_upload_threshold=100M after the -m flag and before the rsync command. This temporarily overrides a config setting and enables parallel composite uploads for files larger than 100 MB, meaning each large file is split into chunks that are uploaded in parallel to go faster.

gsutil -m -o GSUtil:parallel_composite_upload_threshold=100M rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/

You will see evidence of this on the GCP pifsc-1 bucket - folders with long numbers or short phrases may show up in the main bucket during the sync. These are the temporary files and they will be automatically deleted after the run is complete. However, if the process gets interrupted, temporary folders may be left behind. If so, I recommend making sure the folder is completely synced (re-runs of rsync will pick up and re-use some of these pieces) and then manually deleting those folders on the GCP site.

rsync is not as sensitive to trailing slashes on the source path because it will always operate on contents, not the folder itself. That being said, it can still be finicky with trailing slashes on the destination path, so it is best to specify the whole destination path with a trailing slash, as in the example above.

Write to a log file

You can print out the command window contents to a log file when running rsync. Specify a log file as a final argument: > log_file.log 2>&1. This will capture both standard output and error output. To append to the log instead of overwriting it, use >> instead of >.

gsutil -m rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/ > C:/users/user.name/desktop/logfile.log 2>&1

Compare number of files

As a sanity check, you can compare the number of files locally and on GCP.

Locally in a command prompt, run:

dir /s /b /a-d "//piccrpnas/local/data" | find /c /v ""
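
If you prefer PowerShell (used above for the SDK install), an equivalent local count for the same share (written as a UNC path) is:

(Get-ChildItem -Path "\\piccrpnas\local\data" -Recurse -File | Measure-Object).Count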

In the GCP Cloud Shell, run:

gsutil ls -r gs://pifsc-1/data/path/** | grep -v '/$' | wc -l

Bulk upload via R script

Use one of the PIFSC-specific R scripts to upload different types of data in smaller batches.

Functions and dependencies

These scripts/functions require the here and openxlsx R packages. Make sure those are installed before running for the first time. This only has to be done once.

install.packages("here")
install.packages("openxlsx")

All CRP-specific scripts and functions are in the CRPTools GitHub repo in the GCP folder.

General functions are defined in the GCP_functions.r file and the two HARP-specific functions are GCP_initial_upload_HARP.r and GCP_check_HARP.r. The run scripts will source these functions at the start of the script. Alternatively, manually source them using source(<function_file_name>). The full path to the function file will need to be defined, e.g., C:/users/user.name/github/CRPTools/GCP/GCP_functions.r, or use the here package.
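
For example, if your R session is rooted at a local clone of the CRPTools repo, the functions could be sourced with the here package (a minimal sketch; adjust to your own setup):

library(here)
source(here("GCP", "GCP_functions.r"))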

Generic bulk upload: GCP_generic_upload.r

Basic/generic script for uploading a single folder with no checks for weekend/night (a rough sketch of the underlying call is shown after this list).

  • The user must define the path to the local folder and the path to the cloud bucket and choose the method (either cp for a full copy or rsync to sync only new/changed files)
  • It will copy recursively so will include any subdirectories and will copy all file types
  • The cp method will also create a log file (saved where the user specifies)
  • The rsync method will not delete files from the cloud if they are no longer in the local folder, but this can be changed with the -d flag.
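
This is not the script's actual code, but a rough sketch of the kind of gsutil call it builds, assuming the rsync method and illustrative paths:

# illustrative only: paths and variable names are placeholders
local_source      <- "//PICCRPNAS/CRP4/recordings/wav"
cloud_destination <- "gs://pifsc-1/glider/recordings/wav/"
cmd <- paste("gsutil -m rsync -r", local_source, cloud_destination)
system(cmd)  # run the assembled gsutil command from R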

HARP bulk upload: run_GCP_upload_HARPs.r

HARP-specific script that should be used to upload single frequency/site combinations of HARP data. It will loop through all deployments and disks for that site and copy them to GCP, preserving the server’s folder structure. It includes an initial upload step and a checking step.

  • Specify the broad target cloud location (e.g., pifsc-1/bottom_mounted/HARP), the local server drive (e.g., //piccrp4nas/indopctus), the sampling frequency (e.g., '200kHz'), and the location (e.g., 'Wake') - example settings are shown after this list
  • Set offHoursCopy to TRUE to check that it is night or a weekend before uploading each disk. This will limit a single upload to no more than ~1.5 TB/8 hours and avoid starting any new 8-hour processes during the workday
  • This uses the cp method so all files are copied regardless of what is already on the server
    • It is possible to re-run using an old log and it will only upload the items that did not properly upload originally. This is pretty fast!
  • A log is created and appended to as it loops through each deployment and disk and at the end the log is ‘cleaned up’ and saved as a more readable xlsx file that identifies any errors
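
The settings below mirror the examples in this list; aside from offHoursCopy, the variable names are placeholders and may not match the script exactly:

# example settings (names other than offHoursCopy are placeholders)
cloud_target <- "gs://pifsc-1/bottom_mounted/HARP"
server_drive <- "//piccrp4nas/indopctus"
sample_rate  <- "200kHz"
site         <- "Wake"
offHoursCopy <- TRUE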

Reorganizing files in the GCP buckets

Individual files can be moved by clicking the three vertical dots to the right of each file…but this is tedious for lots of files! You also cannot manually rename folders via the web.

The Cloud Shell Terminal at the bottom of the pifsc-1 bucket can be used to do more major reorganization. This should pop up when you first open the GCP buckets as a half screen that asks you to ‘Authorize’ or ‘Reject’. If it does not (or you have previously closed it), re-open it using the square icon with >_ in the top right.

The basic format uses the mv command to move folders or files:

gsutil mv <originalpath> <newpath>

Some examples:

To move an individual file, specify the path and filename. This example would move just the test.wav file from the recordings folder to the archive folder.

gsutil mv gs://pifsc-1/glider/sg680/recordings/test.wav gs://pifsc-1/glider/archive/test.wav

If you want to just move all files of a certain type, use a wildcard *. This example will move all the .wav files out of the recordings folder and into a recordings_new folder.

gsutil mv gs://pifsc-1/glider/sg680/recordings/*.wav gs://pifsc-1/glider/sg680/recordings_new/

If you want to rename a folder, just specify the folder names as <originalpath> and <newpath>. Do not put any trailing slashes on the paths. This will delete the original folder.

gsutil mv gs://pifsc-1/glider/sg680/recordings gs://pifsc-1/glider/sg680/recordings_new 

To move an entire folder inside of a different folder, put a trailing slash on the <newpath>. In this example, the wav folder will be moved out of the recordings folder and into the main sg680 folder. Whatever else is in recordings will remain there.

gsutil mv gs://pifsc-1/glider/sg680/recordings/wav gs://pifsc-1/glider/sg680/ 

If you want to move all contents out of a subfolder into the directory above it, use * for the input path and specify the upper directory for the output path, with a trailing slash. This example will move all the contents of the recordings folder into the sg680 folder. The original recordings folder will still exist; it will just now be empty. It can be deleted on the web with the three vertical dots menu.

gsutil mv gs://pifsc-1/glider/sg680/recordings/* gs://pifsc-1/glider/sg680/
