18 FinOps Lessons Across Multiple Cloud Use Cases

· By Tech With Mohamed · 10 min read

Cloud cost overruns are one of the top headaches across startups and cloud teams. After working on dozens of cloud cost audits and architecture reviews — especially in GCP — we've seen the same patterns repeat.

This post isn’t theory or vendor fluff. It’s a collection of real, actionable FinOps lessons learned from multiple experiences, presented in a clear format:

Use Case / Solution / Impact / Effort — written for cloud engineers, architects, and data teams.

At the end, you’ll find a summary table and helpful links to official docs.

The scenarios and insights in this article are based on anonymized patterns observed across multiple cloud architecture reviews and public best practices. No confidential, proprietary, or client-specific information is shared. All examples have been generalized to protect the privacy and interests of any organizations involved. The views expressed here are general observations and do not represent the official position of any employer or any cloud vendor.

Legend

Effort Tags

  • ✅ Quick Win — Minimal effort, fast ROI
  • 🔁 Strategic — Requires architecture refactor or longer implementation

Role Tags

  • 🛠️ Infra Tips — For Cloud Engineers & DevOps
  • 📊 Data Tips — For Data Engineers & Analysts
  • 🔐 Security + Cost — For SecOps and Platform Owners

Infra Tips — Idle VMs for Internal Tools ✅ Quick Win

Use Case:
A company’s internal reporting dashboard ran on a GCE VM that was online 24/7—yet only used a few hours each week by finance and ops teams.

Solution:
We containerized the app and deployed it to Cloud Run, GCP’s serverless container platform. With scale-to-zero and on-demand startup, the app now runs only when needed. Zero manual scaling, no OS patching, and no VM sprawl.

💰 Impact:

  • ~60% monthly cost reduction
  • Improved security via minimized surface area
  • DevOps time saved from no longer maintaining a full VM

⏱️ Migration Effort: ~2–4 hours (including Dockerization and testing)

Learn more about Cloud Run

# Sample Dockerfile for Cloud Run
FROM python:3.9-slim
WORKDIR /app
COPY app.py .
# Cloud Run injects the PORT env var (default 8080); app.py must listen on it
CMD ["python", "app.py"]

Pro Tip: Running Cloud Run with min-instances=0 is great for cutting idle costs, but it can introduce cold starts. If your service is used at predictable times, combine Cloud Scheduler + Pub/Sub to trigger a warm-up request before users arrive—keeping performance snappy without paying for always-on instances.
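
For predictable morning traffic, the warm-up can be a single Cloud Scheduler job. A minimal sketch, assuming a hypothetical service URL, schedule, and time zone, and hitting the service URL directly rather than going through Pub/Sub:

# Hypothetical warm-up: ping the Cloud Run URL shortly before the workday starts (weekdays only)
gcloud scheduler jobs create http reporting-warmup \
  --location=us-central1 \
  --schedule="50 7 * * 1-5" \
  --time-zone="Europe/Paris" \
  --uri="https://reporting-dashboard-xxxxx-uc.a.run.app/" \
  --http-method=GET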

Data Tips — Expensive BigQuery Queries 🔁 Strategic

Use Case:
A SaaS analytics company had dashboards powered by weekly BigQuery queries—each scanning entire datasets (>10 TB) even for narrow date filters. The result? High costs and sluggish performance.

Solution:
We partitioned tables by event_date and clustered by customer_id, dramatically reducing the amount of data scanned for time- or customer-specific queries.

Here's an example of how to create a partitioned and clustered BigQuery table using SQL, based on event_date for partitioning and customer_id for clustering:

CREATE OR REPLACE TABLE my_dataset.analytics_events
PARTITION BY DATE(event_date)
CLUSTER BY customer_id
AS
SELECT
  event_date,
  customer_id,
  event_type,
  metadata
FROM
  my_dataset.raw_events;

💰 Impact:

  • ~60% reduction in query costs
  • 2–3× faster dashboard loads
  • Lower data warehouse pressure during peak hours

⏱️ Implementation Effort: ~1–2 days including testing & redeploy


Pro Tip: Use a dry run (bq query --dry_run) or the console's bytes-processed estimate to catch unbounded scans early, and enforce date filters in Looker or your BI layer to avoid full-table reads.
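
A dry run is the cheapest way to verify this: it reports the bytes a query would scan without running it. A quick sketch against the table created above (the filter values are placeholders):

# Prints the estimated bytes processed; nothing is executed or billed
bq query --use_legacy_sql=false --dry_run \
  'SELECT event_type, COUNT(*) AS n
   FROM my_dataset.analytics_events
   WHERE DATE(event_date) = DATE("2025-01-01") AND customer_id = "c_123"
   GROUP BY event_type'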

Table partitioning in BigQuery
Table clustering in BigQuery


Infra Tips — Dev Environments Always On ✅ Quick Win

Use Case:
A company had full dev and staging environments running 24/7—despite being unused outside standard business hours. This included compute instances, databases, and other non-prod resources.

Solution:
We implemented a Cloud Scheduler job that automatically triggers a Cloud Function to shut down resources each evening and bring them back online each morning—based on the team’s working hours.

💰 Impact:

  • ~35% reduction in compute spend
  • No disruption to developer workflows
  • Simple automation, high ROI

⏱️ Effort: ~1–2 hours to set up

Pro Tip: Extend this pattern with labels (e.g., env:dev) and scripts that dynamically stop/start all non-critical resources across projects.
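
A minimal sketch of the shutdown half, assuming instances carry an env=dev label (the label key, filter, and trigger are placeholders; in the setup above the equivalent logic lived in a Cloud Function invoked by Cloud Scheduler, with a mirror job issuing start in the morning):

# Stop every running instance labeled env=dev
gcloud compute instances list \
  --filter="labels.env=dev AND status=RUNNING" \
  --format="value(name,zone.basename())" |
while read -r name zone; do
  gcloud compute instances stop "$name" --zone="$zone" --quiet
done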

Cloud Scheduler Docs


Infra Tips — Old Data on Hot Storage ✅ Quick Win

Use Case:
A company was storing years of logs and historical reports in GCS Standard storage—despite rarely accessing them after the first few weeks.

Solution:
We set up GCS lifecycle rules to automatically move older files to Coldline after 30 days, and Archive after 90 days.
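
A minimal sketch of such a policy, with a placeholder bucket name (the same rules can be applied via Terraform or the console):

# lifecycle.json: Coldline after 30 days, Archive after 90 days
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"}, "condition": {"age": 30}},
    {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"}, "condition": {"age": 90}}
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-log-archive-bucket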

💰 Impact:

  • ~70% storage cost savings within 3 months
  • No impact on access or retention policies
  • Set-it-and-forget-it automation

⏱️ Effort: ~30 minutes via console or IaC

GCS Lifecycle Management

Pro Tip: Audit buckets regularly—many teams treat GCS like a backup drive, until the invoice shows up.

Infra Tips — Overprovisioned Kubernetes Clusters 🔁 Strategic

Use Case:
A company's GKE cluster was running at ~10% average node utilization—provisioned for peak load, but rarely hitting it.

Solution:
We enabled Cluster Autoscaler and right-sized node pools (smaller instance types, preemptibles for dev workloads) to match actual usage patterns.
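
A sketch of the autoscaler piece, with cluster, zone, node pool, and bounds as placeholders (right-sizing machine types and adding preemptible pools for dev were separate changes):

# Enable the Cluster Autoscaler on an existing node pool
gcloud container clusters update prod-cluster \
  --zone=us-central1-a \
  --node-pool=default-pool \
  --enable-autoscaling --min-nodes=1 --max-nodes=5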

💰 Impact:

  • ~45% infrastructure cost savings
  • Reduced idle capacity without impacting reliability
  • More predictable spend with less manual scaling

⏱️ Effort: ~1 day including testing and rollout

GKE Cluster Autoscaler

Pro Tip: Use kubectl top and GKE Monitoring to visualize waste—and consider node auto-provisioning to automate the rest.

Infra Tips — Unused Reserved IPs ✅ Quick Win

Use Case:
During a routine GCP audit, we discovered several static external IPs still reserved—despite the underlying VMs and load balancers having been deleted weeks ago. They were quietly accruing charges.

Solution:
Ran a quick gcloud compute addresses list to identify and release unused IPs. Documented the cleanup as part of our monthly cloud hygiene checklist.
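
The check itself is two commands; the address name and region below are placeholders:

# RESERVED means allocated but not attached to anything, and still billed
gcloud compute addresses list --filter="status=RESERVED"
# Release an orphaned address once you've confirmed nothing references it
gcloud compute addresses delete old-lb-ip --region=us-central1 --quiet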

💰 Impact:

  • Eliminated waste from “orphaned” resources
  • Cleaner billing and tighter environment governance
  • Zero risk, high ROI in under 15 minutes

Release static IPs

Pro Tip: Automate this check monthly—or better, tag all IPs with ownership metadata for easier tracking.

Data Tips — BigQuery Scheduled Queries Left Running ✅ Quick Win

Use Case:
A company had multiple scheduled queries still running weekly—powering dashboards that had been deprecated months ago. No alerts, no errors, just silent, unnecessary spend.

Solution:
Audited scheduled queries via the BigQuery UI and bq ls --transfer_config to identify and disable unused jobs. Also tagged active ones for ownership going forward.
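
A sketch of the CLI side of that audit, with the location and transfer-config resource name as placeholders (configs can also be disabled rather than deleted from the BigQuery UI):

# List scheduled queries / transfer configs in a location
bq ls --transfer_config --transfer_location=us
# Remove a config nobody owns anymore (copy the resource name from the listing)
bq rm --transfer_config projects/123456/locations/us/transferConfigs/abcd-1234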

💰 Impact:

  • Saved $800/month in avoidable query costs
  • Reduced warehouse load and improved billing clarity
  • Took just an hour, no code changes needed

BigQuery Scheduled Queries

Pro Tip: Build a quarterly cleanup cadence into your data ops—especially after product or team sunsets.

Infra Tips — Overloaded Cloud NAT 🔁 Strategic

Use Case:
A company funneled over 100 VMs across multiple regions through a single Cloud NAT gateway. This caused intermittent egress failures, NAT IP pool exhaustion, and degraded performance for services relying on external APIs.

Solution:
We restructured the network layout by implementing per-region Cloud NAT configurations, each with its own IP pools and scaling rules based on local demand. This better aligned with GCP's NAT best practices and removed a major single point of failure.
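
A sketch of one per-region gateway, with the router, region, and port settings as placeholders (repeat per region, sized to local demand):

# One NAT gateway per region, attached to that region's Cloud Router
gcloud compute routers nats create nat-us-central1 \
  --router=router-us-central1 --region=us-central1 \
  --nat-all-subnet-ip-ranges \
  --auto-allocate-nat-external-ips \
  --min-ports-per-vm=128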

💰 Impact:

  • Improved egress reliability and throughput
  • Reduced unexpected NAT usage spikes and costs
  • Better regional fault isolation and scalability

⏱️ Effort: ~2 days including testing and rollout

Cloud NAT overview

Pro Tip: If you’re running GKE or autoscaled VM groups, monitor NAT IP utilization closely. Set min ports per VM and use multiple IPs per region for scale resilience.

Security + Cost — Misused Stackdriver Logs ✅ Quick Win

Use Case:
A fast-scaling company was exporting all Stackdriver logs—INFO, DEBUG, even health checks—directly to BigQuery for long-term analysis. Over time, this led to ballooning log volumes and a surprise spike in logging costs.

Solution:
We audited their log sinks and applied exclusion filters to drop noisy, low-value logs (like routine 200s, gRPC pings, and verbose library output). Only actionable logs (WARN+, custom app logs, errors) were retained for export.
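
A minimal sketch, assuming the export runs through a dedicated BigQuery sink (the sink name is a placeholder, and the real filter also kept selected custom app logs):

# Restrict what the BigQuery export sink forwards: warnings and above only
gcloud logging sinks update bq-log-export \
  --log-filter='severity>=WARNING'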

💰 Impact:

  • ~50% reduction in log storage and BigQuery analysis costs
  • Leaner log pipelines with no loss in observability or compliance
  • Took just 1 hour, no app changes required

Logging exclusions

Pro Tip: Use Log Analytics in Cloud Logging instead of BigQuery for many use cases—it’s cheaper, faster, and purpose-built. Always tag sinks with owners to avoid silent overspending.

Infra Tips — Network Egress via Multi-Region Buckets 🔁 Strategic

Use Case:
A data science team was pulling large datasets from multi-region GCS buckets into compute resources running in specific regions (e.g., us-central1). The result? Cross-region egress fees that added up silently over time.

Solution:
We migrated hot datasets to region-specific GCS buckets colocated with the compute workloads—without disrupting access patterns or breaking pipelines.
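
A sketch of the move, with bucket names and region as placeholders (for very large datasets, the Storage Transfer Service is usually a better fit than a one-off copy):

# Create a regional bucket colocated with the compute workloads
gcloud storage buckets create gs://ml-hot-data-us-central1 --location=us-central1
# Copy the hot datasets, then repoint pipelines and IAM before retiring the old path
gcloud storage cp --recursive gs://ml-data-multiregion/hot/ gs://ml-hot-data-us-central1/hot/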

💰 Impact:

  • Saved ~$900/month in unnecessary network egress fees
  • Improved data transfer speeds and pipeline efficiency
  • Took ~2 days including planning, IAM updates, and bucket rewiring

Bucket locations

Pro Tip: Always co-locate storage and compute. Multi-region is great for global redundancy, but it can become a costly default for regional workloads.

Data Tips — Flat vs Nested BQ Schemas 🔁 Strategic

Use Case:
A company’s BigQuery warehouse relied heavily on flat, wide tables—treating all data as top-level columns. Even simple queries scanned millions of unnecessary rows, especially when working with repeated or structured data.

Solution:
We redesigned key tables to use nested and repeated fields (e.g., arrays for events, structs for dimensions), aligning with BigQuery’s columnar storage and storage-aware compute model.
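
A sketch of what that looks like, with hypothetical dataset, table, and field names: events become a repeated STRUCT inside one row per session, and UNNEST expands them only when a query actually needs event-level detail.

# Nested/repeated schema: one row per session, events as an ARRAY of STRUCTs
bq query --use_legacy_sql=false '
CREATE OR REPLACE TABLE my_dataset.sessions (
  session_id  STRING,
  customer_id STRING,
  events      ARRAY<STRUCT<event_type STRING, event_ts TIMESTAMP>>
)'
# Expand the array only where event-level detail is needed
bq query --use_legacy_sql=false '
SELECT s.customer_id, e.event_type, COUNT(*) AS n
FROM my_dataset.sessions AS s, UNNEST(s.events) AS e
GROUP BY 1, 2'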

💰 Impact:

  • 30–40% drop in query costs
  • More flexible analytics with better schema evolution support
  • ~3 days of effort including modeling, migration, and downstream adjustments

Nested and repeated fields

Pro Tip: Use TABLESAMPLE SYSTEM, bytes_processed, and query plans to pinpoint waste. BigQuery works best when your schema matches the shape of your data.

Security + Cost — Cloud Armor Default Rules ✅ Quick Win

Use Case:
A company had a Cloud Armor policy with default rate limiting enabled, but without customizing rules or IP allowlists. This caused false positives that blocked legitimate traffic and generated unnecessary bandwidth charges.

Solution:
We disabled the default rate limiting rules and implemented a targeted IP-based allowlist tailored to trusted traffic sources, reducing noise and improving user experience.
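
A sketch of the change, with the policy name, priorities, and CIDR as placeholders (each rule was reviewed with the app team before rollout):

# Drop the default rate-limiting rule that was blocking legitimate traffic
gcloud compute security-policies rules delete 1000 --security-policy=edge-policy --quiet
# Explicitly allow the known-good ranges at a higher priority
gcloud compute security-policies rules create 500 \
  --security-policy=edge-policy \
  --src-ip-ranges="203.0.113.0/24" \
  --action=allow \
  --description="Trusted partner/office range"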

💰 Impact:

  • Fewer false positives and support tickets
  • Lower bandwidth and security costs
  • Quick fix completed in about 1 hour

Cloud Armor rules

Pro Tip: Always review and customize Cloud Armor policies before rollout—defaults are a good start but rarely fit production traffic patterns.

Infra Tips — Cloud SQL Idle Read Replicas ✅ Quick Win

Use Case:
A company had a Cloud SQL read replica running 24/7, primarily to support a weekly ETL job—but outside that window the replica sat unused, quietly accruing charges.

Solution:
We disabled the read replica by default and automated its startup only during the ETL window using Cloud Scheduler and Cloud Functions.

💰 Impact:

  • Saved ~$400/month in unnecessary instance charges
  • Maintained ETL reliability with no manual intervention
  • Setup completed in 30 minutes

Cloud SQL Replicas

Pro Tip: Regularly audit read replicas and other auxiliary resources—many become “always on” by default, quietly inflating your bill. Automate start/stop wherever possible.

Data Tips — Data Studio + BigQuery Caching ✅ Quick Win

Use Case:
A company’s Data Studio dashboards were triggering full BigQuery table scans every time users refreshed or interacted with charts—leading to unexpectedly high query costs.

Solution:
We enabled BigQuery result caching and applied strict time window filters on dashboard controls to limit the data scanned per query.

💰 Impact:

  • ~75% reduction in BigQuery query costs
  • Faster dashboard load times and better user experience
  • Implemented in just 1–2 hours

BQ Caching

Pro Tip: Combine caching with materialized views for even better performance and cost savings on frequently accessed reports.

Infra Tips — Overprovisioned GKE Autopilot Pods 🔁 Strategic

Use Case:
A company used GKE Autopilot but had generous CPU/memory requests for most pods—even low-traffic services. Autopilot charges for requested resources, not actual usage.

Solution:
We audited resource requests with kubectl top and right-sized deployments based on actual usage metrics. Added HPA (Horizontal Pod Autoscaling) and VPA recommendations where appropriate.
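
A sketch of the tuning loop, with namespace, deployment, and numbers as placeholders (kubectl top relies on the metrics GKE collects by default):

# Compare requested vs. actual usage
kubectl top pods -n internal-tools
# Right-size an over-requested deployment; Autopilot bills on these requests
kubectl set resources deployment/internal-api -n internal-tools \
  --requests=cpu=100m,memory=256Mi --limits=cpu=500m,memory=512Mi
# Let HPA add replicas only when load actually arrives
kubectl autoscale deployment/internal-api -n internal-tools --min=1 --max=5 --cpu-percent=70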

💰 Impact:

  • 30–60% reduction in GKE Autopilot costs
  • Improved cluster efficiency without affecting reliability
  • Effort: ~1 day of tuning and testing

Security + Cost — Default VPC Firewall Rules ✅ Quick Win

Use Case:
In a company’s test projects, we discovered default VPC firewall rules still active—including open ingress from 0.0.0.0/0 to all ports. This created a quiet but serious exposure to unauthorized access, DDoS risk, and potential egress cost from abuse.

Solution:
We implemented least-privilege firewall rules, removed open ingress, and enforced project-level audits using VPC Service Controls and IAM policies.
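
A sketch of the cleanup, with rule names, network, and CIDR as placeholders (every rule was reviewed before anything was deleted):

# Review current rules, then remove the overly broad ones
gcloud compute firewall-rules list
gcloud compute firewall-rules delete default-allow-ssh --quiet
# Re-create a least-privilege equivalent scoped to a trusted range
gcloud compute firewall-rules create allow-ssh-from-office \
  --network=default --direction=INGRESS --action=ALLOW \
  --rules=tcp:22 --source-ranges=203.0.113.0/24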

💰 Impact:

  • Closed off a potential attack surface
  • Avoided unnecessary egress and bandwidth costs
  • Effort: ~2 hours for full review and remediation

VPC Firewall Rules

Pro Tip: Treat test environments with production-grade care. Automate firewall auditing using GCP’s Security Command Center or Forseti. Open ingress is never harmless.

Data Tips — Compression on BQ Exports ✅ Quick Win

Use Case:
A company’s data pipeline was exporting daily BigQuery results as uncompressed CSVs to GCS for downstream processing. This led to excessive storage usage and high egress costs, especially when transferring data out of GCP.

Solution:
We updated the export process to use GZIP compression (.csv.gz), reducing file size dramatically without changing how downstream systems consume the data.
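
A sketch of the change, with dataset, table, and bucket as placeholders; the same compression option exists in the EXPORT DATA SQL statement and the extract job API:

# Export compressed CSVs instead of plain ones
bq extract --destination_format=CSV --compression=GZIP \
  my_dataset.daily_results \
  'gs://my-export-bucket/daily/results-*.csv.gz'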

💰 Impact:

  • ~60% drop in egress and storage costs
  • Faster transfers + easier archiving
  • Effort: 1 hour, config-only change

Export formats

Pro Tip: Always compress exports—especially when dealing with large volumes or external data movement. BigQuery supports GZIP, DEFLATE, and SNAPPY out of the box.

Data Tips — Scheduled Materialized Views 🔁 Strategic

Use Case:
A company’s BI dashboards re-ran heavy analytical queries every hour, scanning large fact tables repeatedly—driving BigQuery costs up to $200/day.

Solution:
We refactored the core logic into materialized views and scheduled them to refresh hourly using BigQuery’s built-in scheduler, dramatically reducing compute per user query.
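
A sketch of one such view, with hypothetical names and aggregation; enable_refresh plus refresh_interval_minutes gives the hourly refresh without an external scheduler:

# Materialized view refreshed automatically every 60 minutes
bq query --use_legacy_sql=false '
CREATE MATERIALIZED VIEW my_dataset.daily_revenue_mv
OPTIONS (enable_refresh = true, refresh_interval_minutes = 60) AS
SELECT order_date, country, SUM(amount) AS revenue, COUNT(*) AS orders
FROM my_dataset.orders
GROUP BY order_date, country'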

💰 Impact:

  • Daily cost dropped from $200 → $70
  • Dashboards now load faster with no logic changes
  • Effort: ~2 days for view design + testing

Materialized Views

Pro Tip: Pair materialized views with BI Engine or cache-aware tools like Looker for maximum speed and savings.

Summary Table

# | Tip | Role | Effort | Impact
1 | Idle GCE to Cloud Run | 🛠️ Infra | ✅ | ~60% VM savings
2 | BQ partition + cluster | 📊 Data | 🔁 | ~60% query savings
3 | Dev env auto-shutdown | 🛠️ Infra | ✅ | ~35% compute saved
4 | GCS lifecycle policies | 🛠️ Infra | ✅ | ~70% storage saved
5 | GKE autoscaler | 🛠️ Infra | 🔁 | ~45% infra cost down
6 | Static IP cleanup | 🛠️ Infra | ✅ | Fast + clean billing
7 | BQ scheduled-query cleanup | 📊 Data | ✅ | $800 saved/mo
8 | Cloud NAT refactor | 🛠️ Infra | 🔁 | Reduced failures/cost
9 | Logging exclusions | 🔐 Sec + Cost | ✅ | ~50% less log cost
10 | Regional buckets | 🛠️ Infra | 🔁 | $900 saved/mo
11 | Nested BQ schemas | 📊 Data | 🔁 | 30–40% query reduction
12 | Cloud Armor tuning | 🔐 Sec + Cost | ✅ | Fewer false positives, lower cost
13 | Cloud SQL replica scheduling | 🛠️ Infra | ✅ | $400 saved/mo
14 | BQ caching + filters | 📊 Data | ✅ | ~75% less dashboard cost
15 | GKE Autopilot right-sizing | 🛠️ Infra | 🔁 | 30–60% Autopilot cost down
16 | Tighten VPC firewall rules | 🔐 Sec + Cost | ✅ | Avoid DDoS/egress cost risk
17 | GZIP BQ exports | 📊 Data | ✅ | ~60% egress cut
18 | Materialized views | 📊 Data | 🔁 | $130 saved/day

Let me know if you'd like this available as a downloadable PDF!


Final Words

Every example shared here comes from real-world GCP engagements with tangible results. These optimizations worked not just because they were technically sound, but because they were practical, measurable, and embraced by the teams implementing them.

💬 When engineering and cost awareness align, FinOps becomes less about restriction—and more about building smarter, faster, and more sustainably.

Want help doing the same for your infra? Hit me up at techwithmohamed.com.


💬 Join the Conversation

What’s the best cloud cost save you’ve pulled off? Share it in the comments or tag me on LinkedIn.

Updated on Jun 6, 2025