Senior DevOps Engineer
- Remote (US or India)
- Full Time
- Competitive compensation
We're seeking a Senior DevOps Engineer to build and lead our infrastructure monitoring, observability, and reliability practices from the ground up. We're ready to modernize our infrastructure monitoring, and we need someone to turn it into a proactive, modern DevOps operation. You'll select and implement monitoring tools, establish alerting strategies, create runbooks, and build a culture of reliability across our development teams. This role requires someone who can see beyond traditional network monitoring and implement full-stack observability, from application performance to user experience metrics. You'll work with our development teams to build monitoring into everything we ship and establish SLIs/SLOs that reflect what matters for our healthcare platform. The role is ideal for DevOps professionals who love building observability practices from scratch and can change how an organization thinks about reliability.
Responsibilities:
- Design and implement comprehensive monitoring strategy for applications and infrastructure
- Select, install, and configure monitoring tools (APM, logging, metrics, tracing)
- Set up application performance monitoring to track response times, errors, and throughput
- Implement infrastructure monitoring for servers, databases, and cloud resources
- Create intelligent alerting rules that minimize noise and catch real issues
- Establish SLIs (Service Level Indicators) and SLOs (Service Level Objectives)
- Build dashboards for different stakeholders (developers, management, support)
- Set up centralized logging and log aggregation systems
- Implement distributed tracing for debugging complex issues
- Create automated incident response and escalation procedures
- Develop runbooks and automation for common issues
- Train development teams on observability best practices
- Establish on-call rotations and incident management processes
- Conduct post-mortems and drive continuous improvement
- Implement synthetic monitoring and automated testing
- Set up cost monitoring and optimization for cloud resources
Requirements:
- Bachelor’s degree in Computer Science or related field
- 5+ years of experience in DevOps, SRE, or infrastructure roles
- Strong experience with monitoring tools and platforms
- Expertise in cloud platforms (Azure preferred, AWS acceptable)
- Proficiency in scripting (Python, PowerShell, Bash)
- Experience with Infrastructure as Code (Terraform, ARM templates)
- Understanding of application architecture and performance patterns
- Knowledge of networking, security, and system administration
- Experience with CI/CD pipelines and automation
- Strong analytical and troubleshooting skills
- Excellent communication skills to work with diverse teams
- Experience building monitoring practices from scratch
Nice to have:
- Specific tool experience:
  - APM Tools: New Relic, Datadog, AppDynamics, Dynatrace, Application Insights
  - Metrics/Monitoring: Prometheus, Grafana, Zabbix, Nagios
  - Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Sumo Logic
  - Tracing: Jaeger, Zipkin, AWS X-Ray
  - Incident Management: PagerDuty, OpsGenie, VictorOps
  - Cloud Monitoring: Azure Monitor, CloudWatch, Google Cloud Operations
- Healthcare industry experience and HIPAA compliance knowledge
- Experience with .NET application monitoring
- Knowledge of database performance monitoring (SQL Server)
- Experience with container monitoring (Docker, Kubernetes)
- Chaos engineering and reliability testing experience
- FinOps and cloud cost optimization experience
- ITIL or incident management certifications
- Experience transforming traditional IT teams to DevOps
We offer:
- Competitive senior-level compensation package
- Opportunity to build DevOps practices from the ground up
- Full remote work with flexible schedule
- Budget for tools and platform implementation
- Generous PTO and flexible time off
- Performance-based bonuses
- Authority to select and implement tools
- Direct impact on platform reliability and performance
- Collaboration with development teams globally
- Low on-call burden once systems are properly built