Rewind to Tuesday the 28th of January, 2020, around 1:30pm:
Your balance might not have been displaying correctly in your Xinja app after posting transactions;
You hit a Xinja Bank Account waitlist when you were trying to join us; or
You landed on an error screen right at the end of the sign-up process.
Not quite the royal Xinja treatment.
We advised our customers of tech issues via text and social channels and by 7:45pm all systems were back up and running.
Since then, we’ve been investigating the root cause and working through plans so we don’t run into the same issue again.
So in the spirit of transparency, we’d like to share the details of what went wrong and what we’re doing.
A bit of background
The Xinja platform is built with multiple services, each with a dedicated job to do but all working together to provide multiple features. What this means is that if one part of the platform is slow or breaks down then it doesn’t interrupt your entire Xinja experience; you can still spend your money, transfer and Stash with your balance updated appropriately.
Every single customer action is captured and stored securely to ensure that no data is lost. Therefore, in the event of a system issue, we can pinpoint the error, fix it as soon as possible, and still process your actions correctly. This architecture also makes individual parts of the Xinja platform easier to scale, and problems can be isolated more easily, which reduces customer impact.
Simply put, our Xinja platform was hit by a sudden, unprecedented spike in activity generated by the uptake of our Stash accounts off the back of our recent launch. This spike caused particular components to process account and transaction data more slowly than normal and, as a result, the data was not available to other parts of the platform as quickly as usual.
To get a little more technical, our architecture is event-driven and we expect real-time notifications of events across our landscape. In one particular area, our integration with our Core Banking Platform, we are polling for events rather than receiving notifications, which creates an inherent weak spot. We’ve been working with SAP to switch to notifications and expect to have this working fully within the next couple of weeks. We thought we had plenty of capacity to handle the initial load, but the really big spike exposed this weakness and created a backlog of events that we had trouble catching up with.
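To illustrate why polling becomes a weak spot under a spike, here is a minimal sketch in Python. It is a toy simulation, not our production code: the function name `simulate_polling`, the tick-based timing, and all of the numbers are assumptions made up for the example. It simply shows that a poller fetching a fixed maximum per poll keeps up under steady load but accumulates a backlog once arrivals exceed what each poll can fetch.

```python
from collections import deque


def simulate_polling(arrivals, poll_every, max_per_poll):
    """Return the backlog size after each tick.

    arrivals     -- number of new events produced at each tick (hypothetical)
    poll_every   -- the consumer polls once every `poll_every` ticks
    max_per_poll -- upper bound on events fetched by a single poll
    """
    pending = deque()
    backlog = []
    next_id = 0
    for tick, count in enumerate(arrivals, start=1):
        # New events are produced upstream whether or not anyone is listening.
        for _ in range(count):
            pending.append(next_id)
            next_id += 1
        # The consumer only sees events when its poll comes around.
        if tick % poll_every == 0:
            for _ in range(min(max_per_poll, len(pending))):
                pending.popleft()
        backlog.append(len(pending))
    return backlog


# Steady load: each poll clears everything that arrived since the last one.
normal = simulate_polling([2] * 6, poll_every=2, max_per_poll=5)

# A spike in the middle: polls can no longer drain what has arrived.
spike = simulate_polling([2, 2, 10, 10, 10, 2], poll_every=2, max_per_poll=5)
```

With notifications instead of polling, the consumer would be told about each event as it happens rather than waiting for the next poll, which is why the switch removes this particular bottleneck.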
The impact was that even though transactions were going through and balances were being adjusted correctly, the data was slower to permeate through the system. In some cases, updated account balances and new transactions weren’t immediately displayed in your app.
Despite this, there was no danger of customers overdrawing: even though balances may not have been displaying correctly in-app, all payments and transactions were going through successfully and balances were adjusted immediately in our platform.
The Xinja team discovered this issue using our monitoring systems and made a performance-related fix which enabled the data to be processed quickly and flow through to the app.
What we observed was that a large amount of incoming transaction and account data wasn’t being distributed across the platform quickly enough. We were processing this data in larger chunks than usual, which took longer, so the data was slower to appear in the app. To overcome this, we filtered the data more effectively and scaled the services that processed it. This resulted in smaller data chunks and more processing power, which enabled us to quickly catch up and push the updated information out to the app.
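The general shape of that kind of fix can be sketched as follows. This is an illustrative Python example under stated assumptions, not our actual services: the event types, the `is_relevant` filter, the `process_chunk` stand-in, and the chunk and worker counts are all hypothetical. It shows the three ingredients described above: filter out data that doesn’t need processing, split the rest into small chunks, and spread those chunks across more workers.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def is_relevant(event):
    # Hypothetical filter: keep only events that change what the app displays.
    return event["type"] in {"transaction_posted", "balance_changed"}


def process_chunk(chunk):
    # Stand-in for pushing updates downstream; returns how many were handled.
    return len(chunk)


def drain_backlog(events, chunk_size, workers):
    """Filter the backlog, then process it in small chunks across a worker pool.

    Returns (events handled, number of chunks processed).
    """
    relevant = [e for e in events if is_relevant(e)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        handled = list(pool.map(process_chunk, chunked(relevant, chunk_size)))
    return sum(handled), len(handled)


# Example backlog: 7 events the app cares about, 3 it does not.
events = [{"type": "transaction_posted"}] * 7 + [{"type": "heartbeat"}] * 3
result = drain_backlog(events, chunk_size=3, workers=4)
```

Smaller chunks mean each unit of work finishes sooner, and more workers mean more chunks are in flight at once, so a backlog drains faster than it would as one large batch.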
The proper fix, getting real-time notifications from SAP, will completely remove this point of weakness. We’ve been working hard on this and will have it in place soon.
Here’s what we learnt
At Xinja, we’re continually trying to strike a balance between building new stuff we know our customers want and maintaining and strengthening the core foundations (the unsexy stuff our customers expect). Below is how we’re learning from this issue.
The scaling of our platform systems needs to happen more quickly. Ironically, this was a little-known imperfection, and our engineers had already been working hard to tune the platform exactly where the problem occurred. Don’t worry, we’re due to push through improvements in our next app release.
We need a more elegant way to shut the front door when we have to, rather than just switching on a waitlist.
We need to communicate better with our customers so you know where you stand during service disruptions.
We’re experiencing many firsts here at Xinja HQ. This is the first time we’ve tried to lift the hood on what went wrong. We hope you found this post useful and welcome any feedback you might have. Thank you to all our Xinjas for your patience and support.