We are seeking a senior software engineer who is passionate about reliability and believes in planning
ahead to stop fires before they start, which is critical for our software solutions. As the site reliability
engineer, you will help define our strategy across the whole stack — from AWS configuration up to the
front-end application. You will establish processes and systems to help engineers test for reliability and
performance, as well as live monitoring tools to detect problems in production.
We have a variety of monitoring systems already in place but are looking for someone to push the
envelope for detecting problems with Collage.com and other Foreground properties. We hope to find an
engineer who not only keeps up with industry best practices but can also develop custom tools to solve
our hardest problems, like recording and replaying state changes in our custom application to track down
difficult bugs. We look forward to you joining us in our mission to make our software fast and bug-free
for everyone, all the time.
Duties and Responsibilities
▪ Run the production environment by monitoring availability and taking a holistic view of system
health.
▪ Build software and systems to manage platform infrastructure and applications.
▪ Improve reliability, quality, and time-to-market of our suite of software solutions.
▪ Measure and optimize system performance across the entire software stack for all Foreground
properties, consolidating and maintaining existing tools (e.g., CloudWatch, New Relic,
Datadog, TrackJS, OpsGenie) and implementing new systems where additional monitoring is
required.
▪ Gather and analyze metrics from both operating systems and applications to assist in
performance tuning and fault finding.
▪ Provide primary operational support and engineering for multiple large distributed software
applications.
▪ Prepare our services for handling 10x seasonal traffic (setting scaling policies, provisioning
resources, doing load testing, etc.).
▪ Manage processes and automate stability/performance checks that the team uses to develop fast,
reliable software.
▪ Triage and respond to alarms from our monitoring systems with the help of other engineers and
participate in an on-call rotation.
▪ Assist in migrating applications to containers running on AWS EKS, and set up monitoring
and logging for those clusters.
Qualifications
▪ Bachelor’s degree in computer science or equivalent work experience.
▪ 5+ years of experience developing modern web applications.
▪ 2+ years of experience focused on site reliability for high-traffic, high-availability applications.
▪ Full-stack debugging and performance optimization ability, including knowledge of AWS
systems (load balancing, caching, content distribution, etc.), continuous integration/build
systems, SQL databases, PHP, JavaScript, and modern web development frameworks like
React/Redux.
▪ A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
▪ Excellent planning and communication skills, including the use of spreadsheets/database queries
to analyze and present data.
▪ Ability to program (structured and OO) in one or more high-level languages, such as Python,
PHP, and JavaScript.
▪ Prior experience in a start-up environment is nice to have.

SRE/DevOps Engineer
