Understanding Resumable Upload in Google Cloud Storage, with a cURL example.
First of all we must understand what “Resumable Upload” is and how it works. I don’t know how old you are or how much time you have spent in the “land of the internet”, but if you are over 30 years old and started using the internet in the 90’s or 00’s, you surely remember a useful program for downloading programs/videos/music over a dial-up connection: GetRight! I think it was the first resumable download/upload tool I ever came across.
According to GCP’s definition: “a Resumable Upload allows you to resume data transfer operations to Cloud Storage after a communication failure has interrupted the flow of data. Resumable uploads work by sending multiple requests, each of which contains a portion of the object you’re uploading. This is different from a simple upload, which contains all of the object’s data in a single request and must restart from the beginning if it fails part way through.”
We can think of this GCP functionality as a “multi-part upload”: you don’t need to send all the data in a single package, you can split it into several small packages and send them one at a time. Using this approach, you can work around network problems such as slow connections, unstable 4G connections, bandwidth limits, upload failures, etc.
The Resumable Upload flow is shown in the image below: you take the original object, split it into multiple parts, and send each part in its own request.
Note: even though GCP Resumable Upload allows you to upload the file using multiple requests/packages, you can still send all the data in a single request. By the way, GCP still recommends the single-request approach as the simpler option when the file is small and the connection is reliable.
You may be asking yourself now: how does GCP know the package size or when the upload is complete? What HTTP response status do I receive in each request? I’ll explain in a special chapter.
Special chapter: GCP Resumable Upload “magic”
To control Resumable Upload and know whether the upload is completed or not, GCP uses 2 headers: Content-Length and Content-Range.
Content-Length: The Content-Length header indicates the size of the data you send in the body. For example, Content-Length: 4096 means you are sending 4096 bytes (4 KiB) in the body. According to RFC 2616:
The Content-Length entity-header field indicates the size of the entity-body, in decimal number of OCTETs, sent to the recipient or, in the case of the HEAD method, the size of the entity-body that would have been sent had the request been a GET.
Content-Range: The Content-Range header indicates the range of bytes you are sending in the body. For example, Content-Range: bytes 2000-3000/4096 means you are sending bytes 2000 through 3000 of the file (the range is inclusive), so bytes 3001 through 4095 (1095 bytes) still remain to complete the upload. According to RFC 7233:
The “Content-Range” header field is sent in a single part 206 (Partial Content) response to indicate the partial range of the selected representation enclosed as the message payload, sent in each part of a multipart 206 response to indicate the range enclosed within each body part, and sent in 416 (Range Not Satisfiable) responses to provide information about the selected representation.
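To make the two headers concrete, here is a sketch of what sending one chunk could look like with cURL. The numbers mirror the 4096-byte example above and are purely illustrative (against real GCS, every chunk except the last must be a multiple of 256 KiB), chunk-3.bin is a hypothetical 1001-byte slice of the object, and $UPLOADURL is the resumable session URI that we’ll create in the hands-on part of this article:
curl -X PUT -H "Content-Range: bytes 2000-3000/4096" --upload-file chunk-3.bin "$UPLOADURL"
cURL fills in the Content-Length header automatically from the size of chunk-3.bin.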
Now that you are aware of the headers, you must also be aware of the HTTP response status codes. They are what you need to monitor when performing a Resumable Upload.
GCP uses the following response codes to report the Resumable Upload status:
- HTTP 200/201: These response codes mean that your upload is complete and all bytes were received.
- HTTP 308: This response code means that your upload is not complete and you must send more bytes to finish it. Note that you’ll receive a Range header stating the byte offsets already persisted.
- HTTP 499: This response code means that you cancelled a Resumable Upload successfully.
- HTTP 500/503: These response codes mean that your upload was interrupted and you must resume it. In this case, check the upload status first (I’ll show how below).
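For example, a status check against an incomplete upload comes back with headers like the sketch below (the byte offset here is hypothetical); the end of the range is the last byte GCS has persisted, so it tells you exactly where to resume from. The actual status-check command is shown in the hands-on section.
HTTP/2 308
range: bytes=0-4194303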
For more information about headers and response codes, please read the official documentation.
Hands-on:
Ok, now you know what GCP Resumable Upload is and how it works, so let’s go hands-on. To follow this lab you will need a GCP project, the Cloud SDK (gcloud and gsutil) installed, and cURL.
Part 1: Uploading using cURL
1. Go to your GCP Console, create a Service Account, create a new key (JSON format) and download it. In my case, I added this account as Owner, but you should keep the principle of least privilege in mind.
2. Configure your gsutil to use this service account.
Run the command:
gcloud auth activate-service-account ACCOUNTNAME --key-file=ACCOUNTFILE.json
The ACCOUNTNAME parameter must be equal to the key “client_email” in the JSON file.
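If you don’t want to open the JSON file by hand, one way to pull that value out (assuming you have jq installed) is:
jq -r '.client_email' ACCOUNTFILE.json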
3. Create your bucket
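If you don’t have a bucket yet, something like the command below should work. The name themediumarticle is the bucket used in the rest of this article, so replace it with a globally unique name of your own:
gsutil mb gs://themediumarticle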
4. To demonstrate the Resumable Upload, let’s create a random file with ~1GB of data.
base64 /dev/urandom | head -c 1000000000 > example-file.txt
5. Now that our environment is set up and the file/bucket created, let’s start our Resumable Upload. GCP allows Resumable Uploads through Signed URLs, and we must create one before starting the upload.
gsutil signurl -c "text/plain" -m RESUMABLE my-auth.json gs://themediumarticle/example-file.txt
Explaining the command:
- “gsutil signurl”: gsutil command and signurl action
- “-c text/plain”: content type for the signed url
- “-m RESUMABLE”: defines that the upload is a Resumable Upload
- “my-auth.json”: your service account key file. You can omit it, but I like to include it in every request, so I can control the account that I’m using.
- “gs://themediumarticle/example-file.txt”: the bucket name and the object name (file name).
This command will generate a Signed URL to use in the upload, and you must copy the value from the “Signed URL” column of the output. Look at the response:
In this case the Signed URL is: https://storage.googleapis.com/themediumarticle/example-file.txt?x…
6. Execute the cURL command to start the Resumable Upload and get the Location URI used to upload the file:
export SIGNEDURL="<SignedURL>"
curl -v -X "POST" -H "content-type: text/plain" -H "x-goog-resumable:start" -d '' $SIGNEDURL
This command will return a Location URI to upload our file. Look at the response and copy the “Location” header value.
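If you prefer to grab the header directly instead of scrolling through the verbose output, a small variation of the same request should work (a sketch, assuming a Unix shell with grep available):
curl -s -i -X POST -H "content-type: text/plain" -H "x-goog-resumable:start" -d '' "$SIGNEDURL" | grep -i '^location:'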
7. Now we have the final Location URI to upload the file in a single request:
export UPLOADURL="<Location header>"
Upload the file in a single request. This command should return HTTP 200 and finish the upload.
curl -v -X PUT --upload-file example-file.txt $UPLOADURL
8. Uploading the file using the Content-Range header to track the bytes already sent:
Let’s repeat steps 6 and 7 to get a new Location URI for the upload. This time I’ll simulate a network problem by stopping the transfer about 3 seconds after it starts:
curl -v -X PUT --upload-file example-file.txt $UPLOADURL
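One way to reproduce that interruption in a single command (a sketch, assuming GNU coreutils’ timeout is available) is:
timeout 3 curl -v -X PUT --upload-file example-file.txt "$UPLOADURL"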
Now we have a new command to check the upload status:
curl -i -X PUT -H "Content-Length: 0" -H "Content-Range: bytes */1000000000" -d "" $UPLOADURL
We receive a Range header (bytes=0-222298111) informing us that the first 222298112 bytes have been persisted, so 777701888 bytes are still missing. The response code was HTTP/2 308.
Let’s split the file and extract the missing part. You could also split the file before uploading and send each part, or do the splitting with programmatic tools; this is just a demonstration.
dd skip=222298112 if=example-file.txt of=remains.txt ibs=1
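Side note: dd with ibs=1 copies one byte at a time, which is quite slow for a ~1 GB file; an equivalent and much faster alternative (assuming GNU tail, where -c +N starts output at byte N, counting from 1) would be:
tail -c +222298113 example-file.txt > remains.txt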
curl -v -X PUT --upload-file remains.txt -H "Content-Range: bytes 222298112-999999999/1000000000" $UPLOADURL
This request should return HTTP 200 because you have finished the upload!
So, that’s it folks! I hope you enjoyed this post and I’ll update it with Python and Go examples :)
Thanks to my friend Computer15776 for his contribution :D
Thank you for reading this!