Site Reliability Engineer
Company: Softworld Inc
Location: Detroit
Posted on: April 25, 2025
Job Description:
Job Title: Site Reliability Engineer
Job Location: Detroit, MI 48228
Onsite Requirements: Remote
Worked 2-3 years as an SRE in Azure.
Used Terraform, GitHub, and Ansible to automate SRE job functions.
Experience must be in an SRE role, not a DevOps role.
Job Description:
The Cloud Site Reliability Engineer (SRE) works closely with the
cloud development team, IT operations team, and business partners
to streamline and implement enhanced monitoring and alerting
capability across infrastructure and application layers.
By leveraging automation tools, SREs address and resolve issues,
minimizing manual workload and enhancing system scalability and
reliability.
Their core focus lies in standardization and automation to build
and run fault-tolerant systems.
Typically, SREs possess a background in software engineering,
system engineering, or system administration, coupled with
substantial IT operations experience.
SREs oversee availability, latency, performance, efficiency, change
management, monitoring, emergency response, and capacity
planning.
Key Accountabilities:
Write and develop code to automate processes, such as
analyzing logs, testing production environments, and responding to
any issues.
Collaborate with agile teams and business partners to develop
specifications that resolve problems and address enhancement needs,
with a focus on monitoring and metrics for operational
readiness.
Identify bottlenecks in development and deployment processes and
design automation solutions to mitigate them.
Develop new capabilities for displaying, monitoring, and alerting on
key performance indicators by tracking business transactions in
real time.
Maintain and grow knowledge of platform configuration management,
monitoring of established metrics, and troubleshooting.
Provide continuous feedback to development teams on system
stability, defect analysis, and system enhancements.
Design and develop alert escalation and incident response
automation.
Provide production support for cloud service outages and incidents
and work on both tactical and strategic plans for outage
prevention.
Provide feedback on resiliency and maintainability of solutions to
Cloud and App architects.
Conduct disaster recovery scenario generation and testing.
Implement sustainable, audit-ready processes that support
information technology controls, including deployment execution,
access management, audits, incident management, and related
requirements.
Must-have Technical Skills:
Should have at least 3 years' experience as a site reliability
engineer on a cross-functional agile team working in Azure.
Have working knowledge of agile development methodologies (scrum,
sprints, Kanban, etc.) and tools (Azure DevOps, etc.).
Have at least 3 years' hands-on experience using IaC tools:
Terraform, GitHub, Ansible, and Packer.
Proven experience across testing, integration, source code
management, deployment, and containerization.
Sound problem-solving skills with the ability to quickly process
complex information and present it clearly and simply.
Experience with cloud technologies and services including those for
Compute, Storage, Databases, and API Management.
On-premises to cloud migration experience.
Required Non-technical Soft Skills:
Strong communication skills and ability to manage complex technical
decisions.
Be a team player and coach, share knowledge, and work towards
building a trusted, passionate team.
Be a thinker and not an order taker. Have the courage and ability
to think, understand, and question before acting.
Have the courage to push back and say 'NO' if that is the right
thing to do for DTE.
Have a continuous improvement mindset and be open to constantly
finding better ways of solving security issues.
**3rd party and subcontract staffing agencies are not eligible for
partnership on this position. 3rd party subcontractors need not
apply.
This position requires candidates to be eligible to work in the
United States, directly for an employer, without sponsorship now or
at any time in the future.**