Senior Site Reliability Engineer
United Kingdom
Location: 100% Remote. The working timezone is EU/GMT.
ThinkAlpha is looking for a Senior Site Reliability Engineer to work in the core infrastructure team supporting our data analytics platform and transactional trading engine.
Our team provides solutions for real-time analytics, financial search, data integration, robust transactional systems, backtesting, and their related services. These services also cover ETL processes, live trading, natural language processing, and our platform/user interface.
In your role as an SRE, you will focus on scalability and reliability from the ground up. You will help build and shape how everything runs at THINKalpha and be a leading voice in how we work and build our infrastructure.
Your Work
Configure and maintain observability tooling with Datadog and PagerDuty (including Slack channels).
Contribute to our IaC codebase by creating and maintaining Terraform and Ansible modules, and participate in the review process for the IaC developed by the other SREs.
Help developers with their needs when it comes to infrastructure updates and account management.
Support our CI/CD infrastructure and be familiar enough with the workflow to maintain and update our reusable workflows and GitHub Actions to reflect the dev team's needs.
Keep the infrastructure systems updated and secure by performing scheduled updates to the operating systems, software packages, and services.
Run software releases for our production environments, support the software development team with testing and validation, and perform rollbacks when necessary.
Be part of the on-call rotation schedule and ensure services are healthy and performing as expected, especially during the market opening hours.
Ideal candidates want to write sustainable code, build systems that are resilient and well-tested, and work with a group of people who hold each other to high standards.
Qualifications
> Must have
Minimum 5 years of professional experience.
Extensive experience and deep understanding of infrastructure as code (Terragrunt/Terraform).
Strong understanding of Docker, Kubernetes, Linux, data pipelines, and stream processing tools.
Experience with on-premises/colocated servers and cloud infrastructure, as well as hybrid deployments spanning both types of environments.
Experience with observability platforms (e.g., Datadog) and alerting systems (e.g., PagerDuty).
> Nice to have
Coding background in at least one language (Node, JavaScript, Python, C++, etc.)
Understanding of service mesh networking in Kubernetes clusters (Istio, Linkerd, or similar) and of ArgoCD
Familiarity with managing and configuring services that rely on Git, S3, SQL, and MongoDB
Familiarity with CI and GitHub Actions