Site Reliability Engineer, Consultant

Blue Shield of CA
United States, California, Oakland
601 12th Street
Feb 02, 2026
Your Role

We are seeking an experienced Site Reliability Engineer (SRE) to lead reliability, scalability, and performance initiatives across our production systems. In this role, you will blend software engineering, automation, and systems operations to ensure that our platforms are resilient, efficient, and continuously improving. You will be part of a cross-functional team responsible for designing, implementing, and maintaining reliable systems that support millions of requests daily. This position requires a deep understanding of distributed systems, cloud infrastructure, automation, and incident response.

Your Knowledge and Experience

Education & Experience
Requires a Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience); Master's degree a plus.
7+ years of experience in building, supporting, and improving production systems and infrastructure.

Cloud Platforms
Minimum 5 years of hands-on experience with Azure, AWS, or GCP.
Demonstrated expertise in virtual machines (VMs), containers, cloud networking, identity and access management (IAM), monitoring, storage, and serverless functions.
Comfortable deploying and managing cloud-native services and infrastructure.

Programming & Scripting
Proficiency in one or more languages such as Python, Go, Java, Bash, PowerShell, or similar.
Ability to write clean, maintainable code for automation and tooling.

Containerization & Orchestration
Experience working with Kubernetes, Docker, and tools like Helm or Red Hat OpenShift.
Familiarity with managing containerized applications in production environments.

Monitoring & Observability
Working knowledge of tools such as Prometheus, Grafana, Datadog, New Relic, ELK Stack, Dynatrace, Splunk, BigPanda, and SolarWinds.
Ability to set up dashboards, alerts, and metrics to ensure system health and performance.

CI/CD & Configuration Management
Experience with CI/CD pipelines using tools like Jenkins, GitHub Actions, GitLab CI, Argo CD, Spinnaker.
Familiarity with configuration management tools such as Ansible, Chef, Puppet.

Automation & Emerging Technologies
Understanding of Agentic AI systems and automation frameworks for incident response and infrastructure optimization is a plus.
Interest in exploring intelligent automation to improve reliability and reduce manual toil.

Testing & Deployment Expertise
Experience with chaos engineering tools (e.g., Gremlin, Chaos Monkey) and methodologies.
Hands-on knowledge of Blue/Green and Canary deployment strategies in cloud-native environments.