Google Cloud Platform (GCP) Data Upload
Google Cloud Platform Bulk Upload Overview
Instructions for bulk upload of PIFSC acoustic data files to GCP (Google Cloud Platform) using gsutil.
For more detail on this overall effort see the PAM-Cloud Cloud Storage page.
Install the Google Cloud SDK using Windows PowerShell
Install using Windows PowerShell: paste in the two commands below and hit Enter, using all defaults during installation. The first time it runs it should ask you to log in with your NOAA Google account. When prompted to select the cloud project, choose ggn-nmfs-pamdata-prod-1.
Run this:
(New-Object Net.WebClient).DownloadFile("https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe", "$env:Temp\GoogleCloudSDKInstaller.exe")
& $env:Temp\GoogleCloudSDKInstaller.exe
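If the installer does not prompt you to log in, or you need to authenticate or switch projects later, you can do it manually from a command prompt. This is a sketch using standard gcloud commands; your prompts may look slightly different:
gcloud auth login
gcloud config set project ggn-nmfs-pamdata-prod-1
gsutil version
The last command just confirms that gsutil is installed and available on your path.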
Manually upload a folder (cp)
Use the below command to copy an entire folder. This will upload everything regardless of whether it is already present on the cloud and will upload the contents of all subdirectories. Change <local_source> to the path to the local folder and <cloud_destination> to the GCP parent folder that you want the local folder to be copied into. If that folder doesn't yet exist on the cloud it will be created.
gsutil -m cp -c -r <local_source> <cloud_destination>
Example:
The below command will copy the entire wav folder into the recordings folder in the GCP pifsc-1 bucket.
gsutil -m cp -c -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/
- The -m flag will use multi-threading/multi-processing to process multiple files at once to speed things up, but a command using -m cannot be exited out of using Ctrl + C
- The cp argument will copy all files from the local location to the cloud regardless of whether they are already present on the cloud
- The -c flag is for cp and means that the copying will continue if an error occurs
- The -r flag is for recursive, meaning it will operate on folders (rather than a single file) and will include any subfolders
Include a log file
Optionally, a log file can be created and saved:
gsutil -m cp -c -r -L <logfile> <local_source> <cloud_destination>
- The -L flag outputs a log to the location at <logfile>. Enter the full path and desired log file name, such as C:/users/user.name/desktop/log_file.txt
If an existing log file is loaded as <logfile>, gsutil will compare the contents of the cloud folder with the files listed in the log file and skip any files that have already been uploaded. It will append new uploads to the specified log file.
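For example, to resume the wav upload from above using the same log file (this just reuses the example paths from this page):
gsutil -m cp -c -r -L C:/users/user.name/desktop/log_file.txt //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/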
Re-starting things can lead to unexpected issues with nested subdirectories - see the trailing slashes note below.
If you want to copy an entire folder and preserve its original name (e.g., wav), the source folder should not have a trailing slash (/) and the cloud destination should have a trailing slash. The cloud destination must be the parent folder that you want the local source folder to be copied into. In the above example, all the files will be copied to a wav folder within the gs://pifsc-1/glider/recordings/ folder. If wav doesn't exist it will be created. If wav does exist, the local contents will be added to it.
If you want to copy an entire folder but change its name in the process (i.e., just copy the contents of a folder to a new location), then neither the source folder nor the destination should have a trailing slash, and the cloud destination should be the new name. Following the above example, if you wanted to rename wav to sound_files (but still have it within the recordings folder) use gsutil -m cp -c -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/sound_files. Be careful: this can lead to nesting issues if the cloud directory already exists! These issues are exacerbated if an upload gets interrupted and is restarted.
If the local source folder is being uploaded in its entirety to the destination folder (e.g., uploading 'Wake' into the '200kHz' folder, where the parent folder is specified as the <cloud_destination>, as is being done for HARP data), then this cp process can be restarted with the log and files should appear where expected.
If the folder being copied is being renamed (as is being done for towed array data) then this could cause unexpected nesting issues - you cannot restart with the original command or the source directory will be created again, nested within the original source directory name. For example, if a recordings folder was being copied to a year_cruise_number folder and the process gets restarted, the remaining files will be put in a year_cruise_number/recordings/recordings folder. To avoid this, restart with the following:
gsutil -m cp -c -r -n -L <logfile> //local/path/* gs://cloud_bucket/path/
- The -n flag will check if a file already exists and skip it if it does. It only checks file names and will not check dates or sizes (the way rsync does)
- The * indicates to copy the contents of the path folder rather than the folder itself
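For instance, for the year_cruise_number scenario above, the restart would look something like this (placeholder paths built on the generic command; substitute your own source, log file, and cruise folder):
gsutil -m cp -c -r -n -L <logfile> //local/path/recordings/* gs://cloud_bucket/path/year_cruise_number/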
Manually sync a folder (rsync) - preferred method
Use the below command to sync the contents of a folder. This will check what files are already present and only upload new or changed files (without a log). The syncing function rsync operates on contents rather than folders, so the destination path is slightly different than when using cp - specify the actual target folder rather than the parent folder. Nesting subfolders is less of a problem.
rsync after cp
If a folder was previously copied with cp and you run rsync on it to update it, it will need to run a series of checks and update timestamps, which will take a while (i.e., 24 hours for 8 TB). But once that has completed, future rsync runs will go much faster.
Set <local_source> to the path to the local folder that you want to upload and <cloud_destination> to the target GCP folder (not the parent folder). If that folder doesn't yet exist on the cloud it will be created. rsync will copy the contents of <local_source> into <cloud_destination>.
gsutil -m rsync -r <local_source> <cloud_destination>
Example:
gsutil -m rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/
- The -m flag will use multi-threading/multi-processing to process multiple files at once to speed things up, but a command using -m cannot be exited out of using Ctrl + C
- The rsync argument will sync files across the local and cloud folders, so it will check what is on each first and only copy what is new/updated (it checks modification times)
- The -r flag is for recursive, meaning it will operate on folders (rather than a single file) and will include any subfolders
Optional flags
- The -n flag can be placed after rsync and can be used to do a 'dry run' to check what would actually happen if you ran without the -n flag, but it won't actually copy/delete/modify anything. Useful for testing! See the example below.
- The -d flag can be placed after rsync and means it will delete files from the cloud that aren't on the source location any more. Be careful with this because you could accidentally delete stuff if local/cloud get switched!
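For example, a dry run of the glider sync from the example above would be:
gsutil -m rsync -r -n //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/
This prints what would be copied or deleted without transferring anything.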
gsutil may suggest using parallel composite uploads to speed up transfer times. To do this, add -o GSUtil:parallel_composite_upload_threshold=100M after the -m flag and before the rsync argument. This temporarily overrides a config setting and turns on parallel composite uploads for files larger than 100 MB, meaning gsutil will split each large file into chunks and upload the chunks in parallel to go faster.
gsutil -m -o GSUtil:parallel_composite_upload_threshold=100M rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/
You will see evidence of this in the GCP pifsc-1 bucket - folders with long numbers or short phrases may show up in the main bucket during the sync. These are the temporary files and they will be automatically deleted after the run is complete. However, if the process gets interrupted, temporary folders may be left behind. If so, I recommend making sure the folder is completely synced (re-runs of rsync will pick up and re-use some of these pieces) and then manually deleting those folders on the GCP site.
rsync is not as sensitive to trailing slashes on the source path because it will always operate on contents, not the folder itself. That being said, it can still be finicky with trailing slashes on the destination path, so it is best to specify the whole path with a trailing slash as in the above example.
Write to a log file
You can print out the command window contents to a log file when running rsync. Specify a log file as a final argument: > log_file.log 2>&1. This will capture both standard output and error output. To append to the log instead of overwriting it, use >> instead of >.
gsutil -m rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/ > C:/users/user.name/desktop/logfile.log 2>&1
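To append to the same log on a later run, switch > to >>:
gsutil -m rsync -r //PICCRPNAS/CRP4/recordings/wav gs://pifsc-1/glider/recordings/wav/ >> C:/users/user.name/desktop/logfile.log 2>&1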
Compare number of files
As a sanity check you can compare the number of files locally and in the GCP bucket.
Locally in a command prompt, run:
dir /s /b /a-d "//piccrpnas/local/data" | find /c /v ""
In the GCP Cloud Shell, run:
gsutil ls -r gs://pifsc-1/data/path/** | grep -v '/$' | wc -l
Bulk upload via R script
Use one of the PIFSC-specific R scripts to upload different types of data in smaller batches.
Functions and dependencies
These scripts/functions require the here and openxlsx R packages. Make sure those are installed before running for the first time. This only has to be done once.
install.packages("here") install.packages("openxlsx")
All CRP-specific scripts and functions are in the CRPTools GitHub repo in the GCP folder.
General functions are defined in the GCP_functions.r file and the two HARP-specific functions are GCP_initial_upload_HARP.r and GCP_check_HARP.r. The run scripts will source these functions at the start of the script. Alternatively, manually source them using source(<function_file_name>). The full path to the function file will need to be defined, e.g., C:/users/user.name/github/CRPTools/GCP/GCP_functions.r, or built using here().
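For example, sourcing the general functions could look like this (assuming the here() project root is the CRPTools repo; adjust the path to match your setup):
library(here)
source(here("GCP", "GCP_functions.r"))
# or with a full path:
# source("C:/users/user.name/github/CRPTools/GCP/GCP_functions.r")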
Generic bulk upload: GCP_generic_upload.r
Basic/generic script for uploading a single folder with no checks for weekend/night
- The user must define the path to the local folder and the path to the cloud bucket, and choose the method (either cp for a full copy or rsync to sync only new/changed files) - see the sketch after this list
- It will copy recursively, so it will include any subdirectories and will copy all file types
- The cp method will also create a log file (to be saved where the user specifies)
- The rsync method will not delete files from the cloud if they are no longer in the local folder, but this can be changed with a -d flag
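The user-defined settings at the top of the script look something like the sketch below. The variable names here are illustrative only (check GCP_generic_upload.r itself for the actual names); the paths reuse the examples from this page.
# Illustrative settings - actual variable names in GCP_generic_upload.r may differ
local_source      <- "//PICCRPNAS/CRP4/recordings/wav"          # local folder to upload
cloud_destination <- "gs://pifsc-1/glider/recordings/wav/"      # target GCP folder
method            <- "rsync"                                    # "cp" for a full copy, "rsync" to sync
log_file          <- "C:/users/user.name/desktop/log_file.txt"  # used by the cp method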
HARP bulk upload: run_GCP_upload_HARPs.r
HARP-specific script that should be used to upload single frequency/site combinations of HARP data. It will loop through all deployments and disks for that site and copy to GCP preserving the server’s folder structure. It includes an initial upload step and a checking step.
- Specify the broad target cloud location (e.g., pifsc-1/bottom_mounted/HARP), the local server drive (e.g., //piccrp4nas/indopctus), the sampling frequency (e.g., '200kHz'), and the location (e.g., 'Wake') - see the sketch after this list
- Set offHoursCopy to TRUE to check that it is night or a weekend before uploading each disk. This will limit a single upload to no more than ~1.5 TB/8 hours and not start any new 8-hour processes during the workday
- This uses the cp method, so all files are copied regardless of whether they are already present on the cloud
  - It is possible to re-run using an old log and it will only upload the items that did not properly upload originally. This is pretty fast!
- A log is created and appended to as it loops through each deployment and disk, and at the end the log is 'cleaned up' and saved as a more readable xlsx file that identifies any errors
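As a sketch, the values set at the top of run_GCP_upload_HARPs.r would be along these lines (offHoursCopy is the name used above; the other variable names are illustrative, so check the script itself):
# Illustrative only - check run_GCP_upload_HARPs.r for the actual variable names
cloud_location <- "pifsc-1/bottom_mounted/HARP"   # broad target cloud location
server_drive   <- "//piccrp4nas/indopctus"        # local server drive
frequency      <- "200kHz"                        # sampling frequency
location       <- "Wake"                          # site
offHoursCopy   <- TRUE                            # only start new disks at night or on weekends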
Reorganizing files in the GCP buckets
Individual files can be moved by clicking the three vertical dots to the right of each file…but this is tedious for lots of files! You also cannot manually rename folders via the web.
The Cloud Shell Terminal at the bottom of the pifsc-1 bucket can be used to do more major reorganization. This should pop up when you first open the GCP buckets as a half screen that asks you to 'Authorize' or 'Reject'. If it does not (or you have previously closed it), re-open it using the square icon with >_ in the top right.
The basic format uses the mv command to move folders or files:
gsutil mv <originalpath> <newpath>
Some examples:
To move an individual file, specify the path and filename. This example would move just the test.wav file from the recordings folder to the archive folder.
gsutil mv gs://pifsc-1/glider/sg680/recordings/test.wav gs://pifsc-1/glider/archive/test.wav
If you want to just move all files of a certain type, use a wildcard *. This example will move all the .wav files out of the recordings folder and into a recordings_new folder.
gsutil mv gs://pifsc-1/glider/sg680/recordings/*.wav gs://pifsc-1/glider/sg680/recordings_new/
If you want to rename a folder, just specify folder names as <originalpath> and <newpath>. Do not put any trailing slashes on the paths. This will delete the original folder.
gsutil mv gs://pifsc-1/glider/sg680/recordings gs://pifsc-1/glider/sg680/recordings_new
To move an entire folder inside of a different folder, put a trailing slash on the <newpath>. In this example, the wav folder will be moved out of the recordings folder and into the main sg680 folder. Whatever else is in recordings will remain there.
gsutil mv gs://pifsc-1/glider/sg680/recordings/wav gs://pifsc-1/glider/sg680/
If you want to move all contents out of a subfolder into the directory above it, use * for the input path and specify the upper directory for the output path, with a trailing slash. This example will move all the contents of the recordings folder into the sg680 folder. The original recordings folder will still exist; it will just now be empty. It can be deleted on the web with the three vertical dots menu.
gsutil mv gs://pifsc-1/glider/sg680/recordings/* gs://pifsc-1/glider/sg680/
GCP_HARP_upload.r provides a HARP-specific script for uploading batches of HARP data.
- The user specifies the broad target cloud location (e.g., pifsc-1/bottom_mounted/HARP), the local server drive, the sampling frequency, and the location (e.g., 200 kHz, Wake), and the code will loop through all deployments and disks for that site and copy to GCP using the same folder structure
- A check for nights and weekends will occur before running each disk, hopefully limiting a single upload to no more than ~1.5 TB/8 hours and not starting any new 8-hour processes during the workday
- This uses the cp method, so all files are copied regardless of whether they are already present on the cloud
  - It may be possible to re-run using an old log so that it only uploads the items that did not properly upload originally. This needs more investigation
- A log is created and appended to as it loops through each deployment and disk, and at the end the log is 'cleaned up' and saved as a more readable xlsx file that identifies any errors