Role: Cloud Operations Engineer
Overview: Riversand Technologies is a Master Data Management (MDM) visionary and a Product Information Management (PIM) leader. We are a team of passionate people who are rethinking the way MDM and PIM work. We are on a trajectory of accelerated product innovation and growth. If you are a Cloud Operations Engineer looking to advance your career in cloud infrastructure support and the latest cloud and open-source technology stacks, now is the best time to join Riversand. Our solutions power enterprises worldwide in a variety of industries, including Retail, Manufacturing, Distribution, Energy, Healthcare, and Food Services.
A successful Cloud Operations Engineer needs a wide range of skills to effectively support global customers using the Riversand Software as a Service (SaaS) product. This includes managing and supporting cloud infrastructure, microservices architecture systems, the Big Data technology stack, and the latest monitoring and log management tools. The role requires exemplary production support, with attention to detail when working on deep technical issues and identifying opportunities to address root causes. Good problem-solving skills are needed to identify issues and determine how to correct them, and analytical skills are needed to review configuration and operational processes. In addition, the engineer needs communication and teamwork skills to relay information effectively and document their work properly.
Software, Master Data Management, Application Software, ERP, SaaS
Primary responsibilities include -
- Provide production support, monitoring and troubleshooting to ensure 99.9% uptime for our SaaS platform, including cloud infrastructure support, administration and escalation management for access, security, compliance and cost management.
- Work in a 24/7 production environment on shift schedules.
- Provide platform-level support, monitoring and troubleshooting for internal and external customers across technology stacks including Linux OS, Kubernetes/Docker Swarm, Elasticsearch, Kafka, Apache Storm, Netty, Nginx, MSSQL etc.
- Use, customize and administer monitoring, log management and APM tools such as Sensu, Zabbix, Grafana, Prometheus, ELK, Jenkins etc.
- Provide Operations Command Centre assistance to production customers for periodic patches, hotfixes and upgrades, along with customer notifications.
- Provide Operational functions like Event & Alert management, Incident Management and Problem resolution based on Service levels, Backup and Disaster recovery, Capacity management etc.
- Own and resolve technical and operational issues, restore service, and provide Root Cause Analysis (RCA) of incidents.
- Own and drive end-to-end technical resolution of critical incidents that may require involvement from multiple parties, ensuring the right collaboration and communication.
- Produce reports and documentation based on industry best practices.
Requirements include -
- Bachelor’s degree or equivalent in Computer Science or Engineering (or an equivalent major)
- 4+ years of relevant, progressively responsible experience in Cloud Operations & Support
- Willing to work in a 24x7 environment on a shift roster.
- Work location: Bangalore.
- Should have administered and managed production infrastructure and SaaS, and handled day-to-day incident management and problem troubleshooting for software systems on a public cloud such as Azure or AWS
- Must have working experience with Linux OS and with remotely diagnosing and troubleshooting open-source and other platform systems such as Kubernetes/Docker Swarm, Elasticsearch, Kafka, Apache Storm, Netty, Node JS, Nginx, MS SQL, IIS etc.
- Should have worked in environments with large volumes of data imports/exports, accessing and analyzing them through log management systems and monitoring tools such as Sensu, Grafana, ELK etc.
- Should have worked in a public cloud such as Azure or AWS, and must know how to access, monitor and troubleshoot PaaS services
- Should have worked in a Command Center supporting production customers with event notification, the patch/hotfix/upgrade process, and security and compliance management
- Regular administration of production and non-production systems for Backup, Recovery, DR, Capacity addition, Onboarding and De-provisioning.
- Understanding of software and application systems, including troubleshooting by analyzing API calls with tools such as Postman.
- Experience working in multi-tenant environments where customer information security and compliance are a high priority.
- Should have used a ticketing tool such as Jira Service Desk, ServiceNow or Team Foundation Server to log, manage and track customer tickets.
- Nice to have: the ability to script in shell and Python when required
Willing to work on a 24x7 roster schedule with Bangalore as the workplace (Yes/No):
4+ years of relevant work experience in cloud infrastructure support on Azure/AWS, supporting production SaaS and open-source software systems, with hands-on experience in monitoring and remote troubleshooting (Yes/No):
- Experience with Linux OS and advanced troubleshooting of cloud IaaS & PaaS systems (Basic/Intermediate/Expert):
- Prior work experience with open-source technology stacks such as Kubernetes, Docker Swarm, Elasticsearch, Kafka, web servers and database systems (Basic/Intermediate/Expert):
- Work experience setting up and analyzing monitoring, log management and APM tools such as Sensu, Grafana, ELK etc. (Basic/Intermediate/Expert):
- Working experience in production environments with security, compliance, backup/DR, patches/hotfixes and periodic upgrades (Basic/Intermediate/Expert):
- Knowledge of scripting (shell, Python) and of REST APIs using tools such as Postman (Basic/Intermediate/Expert):