GCSFilesystem provides an unified interface for mounting Google Cloud Storage buckets to your local system. After a bucket has been mounted, you can view and access the files and folders in the bucket using your file browser as if they are stored locally. You must has GCSDokan on Windows or gcsfuse on Linux or macOs prior to using this package. You can find documents on how to install the dependencies by clicking the links above.
The package uses Google application default credentials to authenticate with Google. In most case, you need to provide a service account JSON file to verify your identity. For automatically finding your credentials when using the package, the path to the JSON file can be stored at the environment variable
GOOGLE_APPLICATION_CREDENTIALS. You can also manually provide the JSON file to the package by specifying the argument
key_file to the function
gcs_mount. The document of how to create a service account can be found here.
Note that there are some buckets which allow anonymous access. For windows, these buckets can be directly mounted without a credentials. For Linux and macOs, the current version of
gcsfuse does not support anonymous access and you still need to provide a credentials file to access the buckets.
You can use
gcs_mount to mount a bucket on your machine. In the example, we will mount the bucket
genomics-public-data to a temporary directory in R
Note that the function can also be used to mount a directory inside a bucket. For example, you can use
gcs_mount("genomics-public-data/clinvar", temp_dir) to mount the folder
clinvar to your temporary directory. After mounting the package, you can browse the files in your file explore. Here we can list all files in R using
list.files(temp_dir) #>  "1000-genomes" "1000-genomes-phase-3" #>  "clinvar" "cwl-examples" #>  "ftp-trace.ncbi.nih.gov" "gatk-examples" #>  "linkage-disequilibrium" "NA12878.chr20.sample.bam" #>  "NA12878.chr20.sample.DeepVariant-0.7.2.vcf" "platinum-genomes" #>  "precision-fda" "README" #>  "references" "resources" #>  "simons-genome-diversity-project" "test-data" #>  "ucsc"
You can find all mount points by
Finally, after using the bucket, you can unmount it via
Some buckets have billing project enabled, which means you are responsible for all the charges that occurs when accessing the bucket. For avoiding unintentional cost, the billing project is not enabled by default. If you want to access this type of buckets, you must specify your project Id in the argument
billing when calling
gcs_mount("bucket", "mount-point", billing = "my-project-Id")). Otherwise, you would not be able to see the files in the bucket. Please note that it is not recommended to add the argument
billing for all the call to
gcs_mount. If you did it, you WILL be changed by Google even if you are trying to access a bucket without billing project enabled.
Since accessing remote files are relatively expensive, the information of files and folders in the mounted bucket will be cached for a certain period of time. The changes to the remote bucket will not be immediately visible until the local information has been expired. By default, the refresh rate is 60 seconds. You can change the refresh rate via the argument
refresh when mounting a bucket.
Certain optimization can be made to facilitate your access to the Google Bucket. Since accessing remote files has much higher latency than accessing local files, using local cache can greatly reduce the number of remote requests and reuse the data that has been downloaded before. The cache will be enabled by default and the cache data will be stored on disk. Only Windows users are allowed to change the cache setting, the available cache types are
memory. They can be changed via the argument
cache_arg in the function