Job description
About the Job
Red Hat's DDIS team is seeking Senior Site Reliability Engineers (SREs) focused on MLOps. We are looking for strong site reliability engineers to define and lead SRE practice for the next generation of SaaS applications for Red Hat data products at cloud scale: warehouses, analytical workloads, and machine learning workloads. As an SRE, you will contribute to the development and operation of our GitOps-based infrastructure-as-code automation platforms, which manage data services environments, with a deep focus on aligning product, processes, and policies.
In this role, you will have the opportunity to help solve the complex challenges of scale and security involved in developing and operating Red Hat data platforms. You will also partner with Sales-, Finance-, and Marketing-focused domain teams across Red Hat, as well as compliance and security stakeholders, to deliver a secure SaaS platform for Red Hat and partner teams.
What you will do
Manage, deploy, and operate cloud solutions at scale using the principles of Site Reliability Engineering
Participate in the design, implementation, and reliability of ML pipelines
Participate in the design and development of new features to enable Data 'as-a-service'
Design and write automation software to provision, upgrade, monitor, and heal Data 'as-a-service'
Identify single points of failure and other high-risk architecture issues; propose and implement more resilient solutions
Define service level objectives (SLOs) and implement them along with runbooks
Participate in product release cycles, deploying code to integration, staging and production environments, integrating with CI/CD tooling, monitoring and change management
Interact with automated monitoring and healing infrastructure to ensure healthy environments
Help and develop peers through knowledge sharing, mentoring and collaboration
Create and maintain standard operating procedures (SOPs) for performing maintenance tasks, applying configuration changes and remediating problems in our environment
Participate in a follow-the-sun on-call rotation
Contribute software tests and participate in peer review to increase the quality of our codebase
What you will bring
Mandatory: 4+ years of software engineering experience using one or more programming languages such as Golang, Python, Java, or Ruby. You should be able to write code hands-on, on a daily basis, in any of these languages or an equivalent.
Experience developing and managing infrastructure-as-code automation platforms (Terraform or equivalent)
Experience in troubleshooting as-a-service offerings (SaaS, PaaS, etc.)
Experience with any of the public cloud services.
Mandatory: hands-on experience using Kubernetes/OpenShift
Experience writing or maintaining Kubernetes Operators is good to have.
Prior experience in building ML Pipelines is a huge plus.
Knowledge of or prior experience with GitLab pipelines, TektonCD/ArgoCD, or Kubeflow is a huge plus.
Experience developing and using monitoring and observability tools and stacks.
Experience maintaining SLOs for the services you are responsible for.
Being customer (internal or external) focused is a must.
Good communication skills and experience working within a team and collaborating with other teams.
Ability to quickly learn new technologies.
Role: Data Platform Engineer
Industry Type: IT Services & Consulting
Department: Engineering - Software & QA
Employment Type: Full Time, Permanent
Role Category: Software Development
Education
UG: B.Tech/B.E. in Any Specialization
PG: Any Postgraduate
Key Skills
software engineering
Golang
Java
cloud solutions
PaaS
SaaS
Kubeflow
OpenShift
troubleshooting
ArgoCD
GitLab
Python
Kubernetes
TektonCD