Senior Site Reliability Engineer – Cloud Operations

Hybrid in Toronto, ON

Job Type

Full-time

About the Role

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic Cloud team. The ideal candidate will have a strong background in Linux administration, comprehensive networking knowledge, and exceptional Linux troubleshooting skills. This role requires experience with various storage platforms, proficiency in programming, and expertise in managing Kubernetes platforms on-premises. The successful candidate will be instrumental in ensuring the reliability, performance, and efficiency of our on-prem data center operations.

Responsibilities

  • Troubleshoot Linux-based systems and improve their performance, security, reliability, and scalability.

  • Utilize sound networking knowledge to manage network-related tasks, including configuration, optimization, and troubleshooting of network issues.

  • Implement and manage storage solutions, including object storage, NFS, and NAS, ensuring high availability and performance.

  • Develop and maintain automation scripts and tools using Ansible, Shell, Python, or similar programming languages to streamline operational processes.

  • Oversee the deployment, management, and operation of Kubernetes clusters on-premises, ensuring scalability, efficiency, and reliability.

  • Develop dashboards, alerts, automated remediation, and insights into the customer experience using observability tools.

  • Create and maintain Kubernetes operators, custom controllers, and other tools to intelligently scale our operational capability.

  • Commit to continuous learning and improvement, staying abreast of the latest industry trends and technologies.

  • Collaborate with development and operations teams to identify and resolve system bottlenecks, improve system performance, and implement best practices for scalability and security.

  • Participate in on-call rotations, providing expert-level support for critical incidents and ensuring swift resolution of issues impacting system availability and performance.

  • Contribute to the planning and execution of data center operations, focusing on continuous improvement and the adoption of industry best practices.

  • Document system configurations, procedures, and changes to maintain a clear and up-to-date understanding of the environment.


Skills Required


Denvr Dataworks invests in its people and values candidates who can bring their diversified experiences to our teams. 


  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.

  • 5+ years of experience in Linux administration and troubleshooting in a fast-paced data center environment.

  • Solid understanding of networking concepts and protocols, with proven experience in network troubleshooting and optimization.

  • Demonstrated experience with various storage platforms (databases, file servers, object storage, NFS, NAS).

  • Experience with Git or other version control systems.

  • Proficiency in at least one programming language, scripting language, or automation tool (such as Ansible, Shell, or Python) for automation and tool development.

  • In-depth knowledge of Kubernetes platform management, including deployment, scaling, and maintenance of containerized applications on-premises.

  • Strong analytical, problem-solving, and communication skills, with the ability to work effectively in a team environment.

  • Commitment to continuous learning and improvement, staying abreast of the latest industry trends and technologies.


If you are passionate about technology and want to be part of a forward-thinking company, please submit your resume, portfolio, and a cover letter detailing your relevant experience and design philosophy to careers@denvrdata.com. It would be great to hear from you and learn more about your skills and capabilities. Please use the subject line "Application for Senior Site Reliability Engineer – Cloud Operations - [Your Name]".

About the Company

Denvr Dataworks is an Alberta-based company that delivers High Performance Cloud Services (HPCaaS/PaaS/SaaS). Denvr operates first-of-its-kind, ultra-efficient, modular, liquid-immersion-cooled data centers, with high-density GPU- and CPU-based compute clusters along with proprietary cloud services software. The Denvr cloud is designed for customers using data- or processor-intensive applications inherent to advanced technologies such as Artificial Intelligence, Machine Learning, Deep Neural Networks, Data Rendering, Big Data, and related Data Science applications, with seamless support for hybrid cloud and edge computing scenarios.

Joining the Denvr Dataworks team means that you are a dynamic individual who is responsible yet forward-thinking and energized by continuous learning and innovation. You have practical and effective communication and interpersonal skills, you lead by example, and you are mindful of building a culture of health in all aspects of the business. You are also a self-motivated and effective problem solver, and you take pride in doing a good job and achieving great results. You are highly collaborative, transparent with information, and open to learning, and you enjoy learning by “doing”. You are motivated to use your knowledge, experience, relationships, and abilities to help drive an exciting business forward, and you love the idea of being part of an exceptional team that works together to compete hard in the dynamic, cutting-edge world of high-performance computing and cloud services.
