Widenet is looking to hire an Operations Engineer for our client’s NOC who is passionate about providing amazing experiences for our users and wants to be part of a growing team in the thick of developing, testing, deploying, and operating highly utilized online services. This role within the DevOps team calls for an experienced, talented individual with a diverse set of skills and abilities. Its primary focus is mitigating customer impact by monitoring the platforms, identifying and resolving issues, and working with development teams on escalations to ensure we are highly performant, available, secure, scaled, and operationalized. The role also contributes to creating and updating alerts and health checks, scripting and automation, and operational and runbook documentation, and partners with the Development, Test, and Security teams to put the right solutions in place for the business.
- Support development, test, and production environments by investigating & resolving performance and functional issues with our service stack as well as upstream/downstream systems.
- Triage issues based on a clear understanding of the business problem and impact to ensure appropriate urgency in response.
- Troubleshoot and resolve complex production / application issues identified through alerts to ensure services are highly available & performant.
- Clearly and concisely document issues that cannot be fixed in the NOC and escalate to on-call resources.
- Monitor our platforms & services using enterprise-class monitoring tools, reviewing logs, and performing validation checks.
- Enhance monitoring of systems, services, and hardware to enable expedient identification and resolution of issues.
- Manage incidents during critical issues and ensure notifications to stakeholders are prompt, accurate, and consumable by non-technical audiences.
- Ensure that the incident ticketing system is regularly checked for high or critical priority tickets that need resolving or escalating to on-call teams.
- Work with developers and engineering leads to advocate for operational improvements in our software stack.
- Develop and improve operational documentation while working within the DevOps team to improve the supportability of Production.
- Maintain professional composure and an appropriate level of urgency during high severity events.
- Participate in a 24/7 on-call rotation. Must be willing and able to work non-standard work shifts, including evenings, overnight, holidays, and weekends.
Knowledge, Skills & Experience
- 3+ years of DevOps / Network Engineering / Application Engineering / Operations Engineering experience.
- 2+ years’ practical scripting and/or development experience in shell, Python and/or other automation tools.
- Ability to understand and troubleshoot networking problems, configuration, and application workflow changes.
- Experienced with Incident Management practices and processes.
- Experience with monitoring, metrics, and logging tools such as New Relic, Telegraf, Grafana or Nagios/Icinga.
- Experience working in a Linux environment and utilizing infrastructure in Amazon Web Services.
- Ability to successfully work within a team and across departments with conflicting priorities.
- Strong analytical, reasoning, and problem-solving skills.
- Detail-oriented and capable of working on multiple problems at once.
- Demonstrated business maturity and ability to convey technical information to non-technical audiences.
- Excellent verbal, written, and interpersonal communication skills.
- Comfortable in a fast-paced, dynamic, agile-driven environment.
- A strong desire to own something and make it awesome!
- Experience in the video game or trading card game industry is a plus.
- BS degree in Computer Science or Computer Engineering and/or equivalent work experience.
- DevOps/Systems Administration/System Engineer certification in Unix/Linux/AWS is a plus.
- This position may require occasional domestic travel.
- Ability to work flexible hours as required for key initiatives.
- All team members are on call periodically to respond to emergency website outages.