Site Reliability Engineer
Permanent Position – Toronto (Hybrid Remote)
At Riskfuel, we have pioneered the use of deep neural networks (DNNs) to speed up mathematical models used in quantitative finance.
We generate massive amounts of training data, which we then use to train our neural networks. We leverage thousands of compute cores and hundreds of GPUs spanning both bare-metal and cloud Kubernetes clusters.
You will work with and support our machine learning engineers, developing and maintaining the core internal services that power our business. The work spans CI/CD pipelines, distributed storage systems, Kubernetes cluster administration, the occasional web development project, and more.
At Riskfuel, you’ll get a chance to build something from scratch. Your work will be varied, interesting, and fast-paced, with plenty of opportunity to make impactful contributions. This role is very hands-on, and you will be responsible for mission-critical systems.
Here are some of the things you’ll be working on:
- Working with and supporting our machine learning engineers
- Designing large-scale, distributed, Kubernetes-based systems
- Managing bare-metal and cloud Kubernetes clusters
- Scaling up Riskfuel’s distributed storage cluster
- Working with state-of-the-art hardware, including NVIDIA DGX A100s
What we’re looking for:
- Experienced with Unix/Linux operating system internals and networking
- Experienced working with cloud systems and cloud providers
- Experienced with containers and container orchestration tools (Docker, Kubernetes)
- Experienced with automation tools like Ansible
- Able to install a Kubernetes cluster
Nice to haves:
- Experienced with Rook and/or Ceph
- Experienced designing and deploying CI/CD pipelines
- Interested in web development and/or machine learning
- Experienced with router hardware and software
For the most part, you’ll be working from home with regular visits to our data centre in Toronto West.