Service Alert - Instability in Deskbee Workspace

Incident Report for Deskbee

Postmortem

On November 3, 2025, anomalous behavior was identified in the production environment, directly impacting the platform. The issue caused intermittent access and widespread slowness, followed by a period of total unavailability.

The incident was identified through internal monitoring alerts and user reports about login failures and slow page loading. The SRE and Engineering teams immediately began investigating after confirming the degradation, working together with the Application team to mitigate the impact.

The failure occurred due to high CPU utilization on the MySQL HeatWave instance, the main production database.
This overload caused thread queuing and severe query degradation, resulting in latency and intermittent service failures. Even after manually terminating sessions, the applications’ automatic reconnections re-established new sessions at a high rate, preventing natural recovery of the instance.
The exact cause of the increased load is still under investigation and may be related to intensive queries, simultaneous locks, or unoptimized processes.
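The reconnection behavior described above is commonly dampened with capped exponential backoff plus random jitter, so that clients do not all retry at the same instant. The following is a minimal sketch of that general technique, not a description of Deskbee's implementation; the function names, the `connect` callable, and the default delays are all hypothetical.

```python
import random
import time


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait (in seconds) per reconnection attempt.

    Capped exponential backoff with "full jitter": attempt n waits a random
    time between 0 and min(cap, base * 2**n). All defaults are illustrative.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def connect_with_backoff(connect, attempts=6, sleep=time.sleep, base=0.5, cap=30.0):
    """Call the hypothetical `connect()` until it succeeds or attempts run out.

    Each failure is followed by a jittered wait, which spreads retries out
    instead of letting every client reconnect at once.
    """
    last_error = None
    for delay in backoff_delays(attempts, base=base, cap=cap):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

In this sketch `connect` stands in for whatever function the application uses to open a database session. The jitter matters as much as the exponential growth: without it, clients that failed together would retry together.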

Immediate mitigation actions were executed to restore database performance and stability, including the controlled restart of services and a preventive rollback of the latest deployed version.
The Oracle Cloud support case remains open for in-depth analysis of the infrastructure layer, to rule out hardware failures or resource limitations in the environment.

Preventive Actions

To prevent recurrence, the following measures have been defined:

Review of the queries executed during the incident period and of the indexes they use;
Adjustment of performance configurations and appropriate instance sizing;
Engagement of a DBA specialist for continuous analysis and optimization of the database and critical processes;
Advanced monitoring of sessions and execution queues with proactive alerts.
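As an illustration of the last item, a proactive alert on sessions and execution queues could fire only when load stays high across several consecutive samples, which filters out brief spikes. This is a hedged sketch: the thresholds, the `Sample` type, and the function name are hypothetical, and the two fields are assumed to be fed from the MySQL status variables `Threads_running` and `Threads_connected`.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    threads_running: int    # assumed source: MySQL status variable Threads_running
    threads_connected: int  # assumed source: MySQL status variable Threads_connected


def should_alert(samples, running_limit=64, connected_limit=800, sustained=3):
    """Return True when the last `sustained` samples all exceed a limit.

    The limits are placeholders; real values depend on instance sizing.
    """
    if len(samples) < sustained:
        return False
    recent = samples[-sustained:]
    return all(
        s.threads_running > running_limit or s.threads_connected > connected_limit
        for s in recent
    )
```

Requiring several consecutive breaches trades a short detection delay for fewer false alarms; the right balance depends on how quickly thread queuing escalates in practice.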

The environment is currently stable and under continuous monitoring, with no new occurrences of latency or intermittency.
The SRE team will continue working with Oracle support until the investigation is complete and definitive corrective measures are in place.

Posted Nov 03, 2025 - 19:31 GMT-03:00

Resolved

This incident has been resolved.

Posted Nov 03, 2025 - 12:43 GMT-03:00

Identified

We have identified the issue and are actively working to resolve it as quickly as possible.

Please note that this time is an estimate and may be subject to change as we work to address the situation.

We appreciate your understanding and patience during the resolution process.

Posted Nov 03, 2025 - 12:12 GMT-03:00

Investigating

We are currently experiencing a high level of API errors, and our team is actively investigating the issue. We understand the impact this may be causing, and we are committed to resolving it as quickly as possible.

Please note that this time is an estimate and is subject to change as we continue to investigate and work on resolving the problem.

Posted Nov 03, 2025 - 11:49 GMT-03:00

Monitoring

Our team has implemented the necessary fix, and we are currently monitoring the results to ensure everything is back to normal.

Posted Nov 03, 2025 - 11:03 GMT-03:00

Update

We have identified the cause of the instability affecting Deskbee Workspace and are actively working on resolving it.
Please note that some API services may still experience intermittent instability during this process.

**Next Scheduled Update:** 11:00 (Brazil Time)

We appreciate your patience and understanding as we work to restore full stability as quickly as possible.

Posted Nov 03, 2025 - 10:10 GMT-03:00

Identified

We have identified the issue and are actively working to resolve it as quickly as possible.

Next Scheduled Update: Nov 03, 10:00

Please note that this time is an estimate and may be subject to change as we work to address the situation.

We appreciate your understanding and patience during the resolution process.

Posted Nov 03, 2025 - 09:12 GMT-03:00

Investigating

We want to inform you about an ongoing issue regarding instability in Deskbee Workspace that is affecting some of our services. Our team is actively investigating the problem to identify the root cause and find a resolution.

Next Scheduled Update: Nov 03, 10:00

Please be aware that this time is an estimate and may be subject to change as we continue our investigation and work towards a solution.

We apologize for any inconvenience this may cause and assure you that we are dedicated to resolving this matter promptly.

Posted Nov 03, 2025 - 08:48 GMT-03:00

This incident affected: Api Deskbee.