The Site Reliability Engineer will work on our high-traffic, highly available systems, applying systems engineering, programming, and debugging expertise to implement and enhance auto-detection and auto-remediation tools and processes, and will take ownership of production uptime and stability.
This person will work closely with other developers, database administrators, and project managers. Responsibilities include but are not limited to agile development, improving existing infrastructure design, and providing production support.
- Write infrastructure as code using Terraform for AWS (ECS, ALB, S3, SNS/SQS, Elasticsearch, RDS, CloudFront, etc.), and execute and manage the CI/CD pipeline for the Program.
- Quickly diagnose production issues, document designs and procedures, scale the infrastructure to meet demand, and proactively ensure the highest levels of systems and infrastructure availability.
- Proactively identify, build, and implement services as needed to enhance monitoring, alerting, and recoverability.
- Collaborate on setting and driving our evolving infrastructure automation roadmap.
- Empower engineers to contribute infrastructure changes through pairing, support, and mentoring. Ensure knowledge is shared among team members in different locations.
- Actively support operational tasks and participate in knowledge transfers, supporting system stability while learning more about how our applications and infrastructure work together.
- Identify and implement significant improvements that reduce mean time to delivery, promote antifragile design, and strengthen security posture.
- Plan, implement, and release production changes using zero-downtime techniques.
- Support root cause analysis and identify relationships between processes and events.
- Perform ongoing design and code reviews, stressing the concepts of monitorability, resiliency and auto-recovery.
- Participate in a 24×7 on-call rotation (weekly rotation, primary/secondary).
- 5-7+ years of demonstrated DevOps experience focusing on configuration as code.
- Proven experience managing production cloud infrastructure at scale (AWS).
- Strong knowledge of infrastructure as code concepts and tools (Terraform v0.14).
- Strong Linux / EC2 systems administration experience.
- Proven experience with configuration management tools (SaltStack, Ansible).
- Experience in administration and performance tuning of application stacks (Tomcat, Apache).
- Strong experience in managing observability platforms and logging systems, at scale.
- Desirable to have Hadoop / Spark / EMR experience.
- Good knowledge of CI/CD implementation best practices, preferably using Jenkins.
- Experience programming with Python/Java.
- Proven knowledge of common security concerns and best practices (threat modeling, blast radius, hardening, penetration testing).
- Bachelor’s degree or higher in Computer Science or other technical discipline, or related practical experience.
- AWS Certified SysOps Administrator – Associate certification is preferred.
- HashiCorp Certified: Terraform Associate certification is preferred.
- Ability to thrive in a high-energy, high-growth, fast-paced, entrepreneurial environment.
- Willing to learn new skills and implement new technologies.
- Embraces a self-learning culture and follows industry developments in infrastructure as code.
- Strong organizational skills with high attention to detail.
- Excellent communication skills – written, verbal, presentation and interpersonal.
- Strong team player who can build strong relationships at all levels of the organization.
- Demonstrates good business acumen, recognizing the need to trade off timely versus perfect solutions.