
Junior Platform Support Engineer - Observability & Reliability
About Denvr
Denvr AI Services are high-performance computing services designed to empower AI developers, innovators, and business leaders, driving the advancement and adoption of AI technologies. AI Services focus on delivering solutions to companies that are developing or operating AI, including native AI developers and startups, AI-driven businesses, and enterprises and public institutions.
Denvr AI Services include:
AI Compute Services, high-performance compute services designed for AI developers, small businesses, and enterprise businesses with self-managed AI workloads. They provide on-demand or dedicated consumption, with the option of bare-metal or virtualized compute resources optimized for AI workloads, including AI model training, inference, and large-scale data processing.
AI Inference Services, a managed Inference-as-a-Service application that provides customers with scalable, low-latency, and cost-efficient solutions to deploy and manage AI/ML models. It is delivered through serverless API endpoints and does not require the customer to manage any hardware.
AI Platform Services, a managed platform product designed for organizations looking to accelerate their AI capability with turn-key solutions for deploying and operating AI infrastructure. Denvr AI Platform Services integrates state-of-the-art modular data centers, AI-optimized compute architectures, and sophisticated software solutions to streamline AI infrastructure deployments and operations, with maximum efficiency and high performance, at hyperscale.
With headquarters in Canada and the USA, at Denvr Dataworks we strive to provide exceptional customer experiences, empowering AI innovators, creators, entrepreneurs, and business leaders to achieve valuable business outcomes using AI technologies. The Denvr team is united in our pursuit of excellence; we are dedicated and committed to delivering world-class AI Services.
Joining the Denvr Dataworks team means that you are a leader, a dynamic individual who is forward-thinking and driven by continuous learning and innovation, and you thrive on learning by “doing” and working together as a team. You are highly collaborative and transparent with information, with practical and effective communication and interpersonal skills. You are self-motivated, responsible, and an effective problem solver, and you take pride in achieving exceptional results while prioritizing your health and the health of your team. You are motivated to use your knowledge, experience, relationships, and abilities to help drive an exciting and innovative business forward, and you love the idea of being part of an exceptional team that works together to compete hard in the dynamic, cutting-edge world of AI.
About the Role
We are seeking a Junior Platform Support Engineer to join our Observability and Reliability team. In this role, you will collaborate with Site Reliability Engineers (SREs) to curate and create Grafana alerts, serve as the first line (L1) responder for alert notifications, and perform initial troubleshooting to route issues effectively. You’ll also get hands-on experience with DevOps tools like Git, GitHub, Ansible, and Terraform to support automation and infrastructure management. This is a fantastic opportunity to kickstart your career in platform engineering and observability within a data-driven environment.
What You’ll Do
Observability & Alerting:
Work with the SRE team to curate, create, and maintain Grafana alerts for monitoring system health and performance.
Assist in configuring and tuning alerts using Prometheus, Grafana, and Loki to ensure timely and accurate notifications.
L1 Support & Troubleshooting:
Act as the primary (L1) point of contact for alert notifications, performing basic troubleshooting to identify the nature of issues (e.g., infrastructure, application, or network-related).
Follow existing runbooks to resolve incidents where possible; if no runbook exists, investigate and escalate tickets to the appropriate L2/L3 team members with detailed context.
Ticket Routing & Documentation:
Route escalated issues to the correct L2/L3 engineers based on problem type, ensuring clear communication and documentation of findings.
Use Git and GitHub to update internal repositories with troubleshooting steps, runbooks, and outcomes to improve future incident response.
Automation & Tooling:
Assist in using DevOps tools like Ansible and Terraform to automate repetitive tasks and manage infrastructure configurations.
Contribute to version-controlled workflows by committing changes and collaborating via GitHub.
Learning & Collaboration:
Partner with senior engineers to deepen your understanding of observability tools, Kubernetes, and DevOps practices.
Participate in team discussions to refine alerting strategies and improve platform reliability.
Who You Are
Experience:
2-3 years of experience in a technical support, IT, or systems administration role (or equivalent education, e.g., BS in Computer Science, IT, or related field).
Basic familiarity with observability tools (e.g., Grafana, Prometheus, Loki) or DevOps practices is a plus.
Required Skill Set:
Linux: Comfortable with command-line operations, basic administration, and troubleshooting.
iDRAC or BMC Software: Experience with remote server management tools like iDRAC or other Baseboard Management Controllers (BMCs).
VMware: Basic knowledge of VMware virtualization for managing virtual environments.
AWS: Familiarity with AWS services (e.g., EC2, S3, CloudWatch) and basic cloud operations.
Kubernetes: Beginner-level understanding of Kubernetes for containerized workloads (or eagerness to learn).
Grafana, Prometheus, Loki: Exposure to or willingness to learn these tools for observability and monitoring.
Git & GitHub: Basic proficiency with version control, including committing, branching, and pull requests.
Ansible: Familiarity with or interest in learning Ansible for configuration management and automation.
Terraform: Exposure to or willingness to learn Terraform for infrastructure-as-code (IaC) tasks.
Soft Skills:
Strong analytical skills to investigate and categorize issues effectively.
Excellent communication skills for documenting findings and collaborating with team members via GitHub or other tools.
Proactive attitude and a desire to learn in a dynamic, technical environment.
Mindset:
Enthusiasm for solving problems and supporting reliable systems.
Comfort with being on-call (with guidance) and handling real-time alerts.
Team-oriented approach with a focus on contributing to collective success.
Denvr Offers
Work in a dynamic, innovative environment alongside industry experts.
Competitive salary and benefits package.
Career growth and professional development opportunities.
If you are passionate about technology and want to be part of a forward-thinking company, we would love to hear from you and learn more about your skills and capabilities. Click on the link to apply!