Cloud SQL Postgres instance automatically restarted for no apparent reason.

Hi there. At around 11:00 UTC, our main managed Postgres (14.9) instance was restarted with no apparent reason after spiking to almost 100% cpu usage. This is one of the quieter times of the day for us traffic-wise. The cpu spike was not reflected on the database's replica either.

The only relevant entry in the logs seems to be: 
ALERT 2024-02-09T10:11:26.176283Z 2024-02-09 10:11:15.368 UTC [365597]: [1-1] db=[...],user=dbdatastream FATAL: the database system is shutting down

We are using the DataStream service to exfiltrate data from the instance. Can DataStream cause a database instance restart?

0 3 188
3 REPLIES 3

Cloud SQL instances, including PostgreSQL, can experience restarts due to various reasons, ranging from system maintenance, resource limits being hit, to issues with connected services. The DataStream service, which you're using to exfiltrate data from your Cloud SQL instance, is designed to continuously capture data changes and stream them to a destination such as GCS, BigQuery, or another Cloud SQL database. While DataStream itself is intended to operate without causing disruptions to the source database, certain conditions or misconfigurations could potentially lead to issues.

Given the scenario you described, where the CPU usage spiked to nearly 100% before the instance restarted and the log indicated a shutdown, there are a few possibilities to consider:

  • Resource Overload: If DataStream or any other process causes a significant increase in workload (e.g., a large volume of data changes needing to be captured and streamed), it could lead to resource constraints such as CPU spikes. If the system deems it necessary to protect the integrity of the database or underlying infrastructure, it might initiate a restart.
  • Maintenance or Automated Health Checks: Google Cloud occasionally performs maintenance tasks or automated health checks that can cause instances to restart. These should typically be scheduled or communicated in advance, but unexpected issues might trigger automated recovery actions, including restarts.
  • Configuration or Operational Limits: If there are specific operational limits set within Cloud SQL or DataStream that were exceeded (e.g., connection limits, memory usage, etc.), it might lead to automated actions to ensure the stability of the service.
  • Underlying Infrastructure Issue: Sometimes, the problem might not be directly related to your workload or configurations but due to an issue with the underlying Google Cloud infrastructure. This could trigger automated failover mechanisms or restarts.
  • Change Data Capture (CDC) Specificity: DataStream relies on Change Data Capture to detect modifications in your database. Some CDC methods (like logical decoding in PostgreSQL) are less intrusive than others. Investigate the specific CDC mechanism in use, as it influences potential impact on the database.

To troubleshoot and prevent such occurrences:

  • Review DataStream Configuration: Ensure that the DataStream configuration is optimized for your workload. Check for any settings that might lead to excessive load or conflicts with your database's operational parameters.
  • Monitor Logs and Metrics: Keep a close eye on both Cloud SQL and DataStream logs and metrics. Look for patterns or anomalies that precede the restart. Google Cloud's operations suite can be instrumental in this.
  • Transient vs. Persistent Load: Determine if the CPU spike was a temporary event or reflects a consistently high load on the database. Understanding this will guide you towards addressing temporary stressors vs. needing more fundamental configuration or capacity changes.
  • Resource and Quota Management: Review your Cloud SQL instance's resource usage and quotas. Consider scaling up resources if you're consistently hitting limits.
  • Google Cloud Support: If the issue persists and you're unable to identify the cause, reaching out to Google Cloud Support can provide more insights. They can help analyze logs and system behavior that might not be visible to end-users.

While DataStream is not typically known to cause database restarts directly, the increased load or misconfigurations could lead to situations requiring a restart. It's crucial to ensure that both your Cloud SQL instance and DataStream configurations are aligned with your operational needs and resource capacities.

The automatic restart happened some more times due to a DDoS attack and this last time it won't restart with the error: 

2024-02-09 18:45:01.275 UTC [97382]: [1-1] db= [...],user=dbuser PANIC: stuck spinlock detected at WaitBufHdrUnlocked, third_party/postgres/servers/postgres14gce/src/backend/storage/buffer/bufmgr.c:4645

Can someone help here? There's nothing we can do apparently.

The "stuck spinlock" error is indicative of a concurrency issue within PostgreSQL, where multiple processes are contending for the same resource, leading to a deadlock. This is a rare condition and often suggests a bug in PostgreSQL itself, a system-level issue, or extreme conditions such as those caused by DDoS attacks.

Given the complexity and potential severity of this issue, direct intervention from Google Cloud Support is the most effective path forward. They can provide specific guidance, potential workarounds, or fixes that are not available in public documentation.