Storage copies have been slowing for the last few days

I'm posting this here instead of submitting a ticket, because making my full case in the support ticketing system takes a lot of back and forth. To summarize, I'm concerned that a degradation of Cloud Storage over the last few days might highlight a deeper problem worth looking into.

I'll try to describe the issue I'm experiencing and some of the effects it has on my operations.
My VM startup scripts are divided into a number of small sections. At the end of each section I record the time that section took into Cloud Logging (and create a monitoring metric). For load balancing to function well during startups, the cooldown period must exceed the total time it takes for the app to become ready to accept connections from the load balancer.
One startup script section pulls configuration objects and files from Cloud Storage onto each VM. This section typically averages about 20.4 seconds with very low deviation from that average. Occasionally the value shoots up to 50, 70 or 80 seconds, pushing the total startup time above the cooldown period. These are outliers, and rare enough not to create a runaway autoscaler trying to catch up because the app didn't respond to health checks from the LB in time.
But lately (the last few days) these outliers are becoming more common. Network usage during these periods is not at its peak, so it seems to be an independent variable.
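
For context, the bracketing in each section looks roughly like this (the log name and field names below are illustrative, not the exact ones I use):

SECTION_START=$(date +%s)
# ... work for this section (e.g. the Cloud Storage pull) ...
SECTION_END=$(date +%s)
DURATION=$((SECTION_END - SECTION_START))
# Write the duration to Cloud Logging; a log-based metric sits on top of entries like this
gcloud logging write startup-script-timings \
  "{\"section\": \"gcs_config_pull\", \"duration_seconds\": ${DURATION}}" \
  --payload-type=json --severity=INFO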

This is more of a report than a question, but I wonder if anyone else is experiencing a Cloud Storage slowdown in us-central1.


If you have a support contract with Google Cloud, I do suggest that you submit a ticket. That way, support can also investigate your specific environment to ensure that there isn't something amiss with your particular configuration. I had a look at the Google status report here and, as of this moment, everything appears green across the board, including Cloud Storage and Compute Engine. Given the relatively common practice of Compute Engine startups pulling data from Cloud Storage, if there were a systematic 2x to 4x degradation in access and download rates, I have to believe there would be a serious enough problem to result in a status report of an outage (or at least an issue).

In your report, you mention that you are downloading objects from GCS in a particular script section, that you have bracketed the section with logging to record its duration, and that you are seeing it take longer than expected. What else is in that section? What are you doing with the downloaded objects? Are they being unpacked, installed, or otherwise processed beyond laying them down in your local filesystem? What OS is your Compute Engine running? How are you obtaining the objects from GCS? Are you using gsutil or GCS Fuse or some other mechanism?

If you are copying the files from GCS into your local filesystem, I imagine that activity being composed of multiple parts:

  • Local compute to run the gsutil command and send the request
  • GCS processing the request and streaming the data
  • Local filesystem I/O as it writes the data to local disk

This might imply that the issue is other than GCS processing.  For example, what is the CPU load on the machine (is it being starved of CPU)?  What is the state of the filesystem and I/O subsystem ... might you need to defrag the disk or is there some new/concurrent I/O going on?
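
One quick way to separate those parts, purely as a sketch (the bucket path below is a placeholder for wherever you pull from), would be to bracket just the copy and capture the load average around it:

# Rough sketch only: isolate the gsutil copy from local CPU/disk effects
echo "load before copy: $(cat /proc/loadavg)"
COPY_START=$(date +%s%N)
gsutil -m cp -r "gs://YOUR_CONFIG_BUCKET/path/*" /some/local/dir/
COPY_END=$(date +%s%N)
echo "gsutil copy took $(( (COPY_END - COPY_START) / 1000000 )) ms"
echo "load after copy: $(cat /proc/loadavg)"

If the copy itself stays fast while the section total grows, the extra time is being spent on local work rather than in GCS.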

I'm also wondering if something might have changed in the GCS buckets/objects you are downloading. For example, have their sizes changed, or are there now more objects than before while you are downloading folders or wildcards?
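
A quick sanity check on that (bucket name and prefixes below are placeholders for your own layout) might be:

# Total bytes under each config prefix (summary only)
gsutil du -s "gs://YOUR_CONFIG_BUCKET/usr/local/bin" "gs://YOUR_CONFIG_BUCKET/usr/local/lib"
# Rough object count across the tree
gsutil ls "gs://YOUR_CONFIG_BUCKET/usr/local/**" | wc -l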

Thank you, @kolban, for the response. I'll put this information together into a ticket.

To answer most of your questions, I'll paste the bash code directly:

echo "Grabbing latest /usr/local/bin scripts..."

echo "Setting up /usr/local/bin..."
BUCKET="$(curl -H 'Metadata-Flavor: Google' $METADATA/project/attributes/CONFIG_BUCKET)"

mkdir -p /usr/local/bin
gsutil -m cp -r $BUCKET/usr/local/bin/* /usr/local/bin/
find /usr/local/ -name "*.sh" -exec chmod 0755 {} \;

echo "Setting up /usr/local/sbin..."
gsutil -m cp -r $BUCKET/usr/local/sbin/* /usr/local/sbin/

echo "Setting up /usr/local/lib..."
mkdir -p /usr/local/lib
gsutil -m cp -r $BUCKET/usr/local/lib/* /usr/local/lib/
find /usr/local/lib -type f -exec chmod 0644 {} \;

echo "Setting up systemd services...(getting latest files)"
gsutil -m cp $BUCKET/lib/systemd/system/* /lib/systemd/system/
systemctl daemon-reload

So the bullet points are correct.
This is a startup script, so the state of the system is whatever it is at the point the startup-script-url code is invoked. That's somewhat variable in terms of what else is happening on the system at the time, but as I've mentioned, the rate at which this section goes beyond 20 seconds total is low, and beyond 25 seconds, very very low. If there is anything OS-level interfering, it has only surfaced in the last week.

The file sizes/numbers are essentially unchanged. Some edits do go in, but probably on the order of <0.05% byte growth overall over the last several months. The contents of the buckets are tracked with git and tagged; the repository is reset to the tagged state (git checkout <tag>) and untracked files are removed before the tree is pushed to the bucket.
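
For what it's worth, the publish flow is roughly as follows ($TAG and the rsync step below are illustrative stand-ins for my actual commands):

# Reset the working tree to the release tag and drop anything untracked
git checkout "$TAG"
git clean -fd
# Push the config tree to the bucket (upload step shown here as rsync for illustration)
gsutil -m rsync -r ./usr "$BUCKET/usr"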

It may be a typo, but I noticed that on line 8 of your script, you are executing a find against /usr/local.  Might you have meant /usr/local/bin?   Scanning the whole of /usr/local might increase execution time in unexpected ways.   I also notice the systemctl command.  Maybe take a log before and after that to see what it contributes to the time.
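
Something as simple as this around the reload would tell you whether it is a meaningful contributor:

# Bracket the reload the same way the section itself is bracketed
RELOAD_START=$(date +%s%N)
systemctl daemon-reload
RELOAD_END=$(date +%s%N)
echo "systemctl daemon-reload took $(( (RELOAD_END - RELOAD_START) / 1000000 )) ms"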

Another thought in my mind is to look at gsutil perfdiag. While I haven't used it myself, it feels like it might be a useful thing to insert into the script during debugging to get a handle on the underlying Cloud Storage access times. Maybe run it within your environment, from Cloud Shell, and (maybe) for good measure in a newly created Compute Engine instance. See if the numbers are consistent.
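
A minimal invocation against your config bucket might look like the following (check gsutil help perfdiag for the exact flags; the test selection and output file here are just examples):

# Run latency and read-throughput tests against the bucket, saving results for comparison
gsutil perfdiag -t lat,rthru -o perfdiag_results.json "$BUCKET"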

Thanks, Kolban.
It's not a typo. The intent was to cover /usr/local/sbin as well. The find traversal is limited to files with a .sh extension, and the only variability I've seen in descendants of that directory has been in Python libs. The total scan count for that find is around 230 files, with about 40% of them matching the -name filter.
The find command takes about 0.004 seconds with the file count around 230. With 20,230 files in that tree, it takes about 0.033 seconds. It would take a huge file count and/or some really heavy CPU/IO pressure to fully account for the delays I'm seeing. Delays have jumped to 10x in the last few days. Still relatively rare, though, and it's not creating autoscaler havoc.
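
For reference, the numbers above can be reproduced with something as simple as:

# Count the files the traversal visits and the ones matching the filter
find /usr/local/ -type f | wc -l
find /usr/local/ -name "*.sh" | wc -l
# Time the actual chmod pass from the startup script
time find /usr/local/ -name "*.sh" -exec chmod 0755 {} \;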

Thanks for reminding me about perfdiag... I've run it before, but a long time ago, when I didn't need it for any real troubleshooting. I'll try it out.
The suggestion of running it from Cloud Shell and from a fresh Compute Engine instance as a baseline is a good one.