# ory-selfhosting
m
Hi everyone, I'm running mobile applications in production and have implemented a flow where requests are authenticated via Ory Kratos. To manage this, I've set up separate target groups:
• One for Kratos instances exposed to the internet (handling login).
• Another for internal Kratos instances used only for session token validation.

Here's the challenge I'm facing: when I shift my application traffic to go through Kratos for authentication, all requests initially return `401 Unauthorized`, which is expected since clients don't yet have a session token. The mobile apps then detect this and immediately initiate login flows to obtain tokens. At the moment of this transition, I see a sudden spike of around 40,000 requests to Kratos, which then gradually decreases after the initial login burst.

We're using bcrypt in Kratos for password hashing, and we've already tested it with different costs (12, 8, and even 4), but the results are the same: CPU usage spikes to 100%, and the instances become unresponsive during the login surge. We initially tried AWS t4g.micro instances with horizontal scaling (30 instances), but they quickly hit 100% CPU and couldn't keep up. Switching to c7g.2xlarge (15 instances) helped: they still hit 100% CPU but managed to keep processing the requests. It seems like we might not be using the most suitable instance type for this workload.

What instance types do you recommend for running Ory Kratos in production, especially in cases with high burst login traffic like this? Would love to hear what has worked for others in similar environments. Thanks in advance!
m
Hey @mysterious-kitchen-18431 Ory now offers an "Enterprise License" for Ory Kratos. This gives you access to the build of Ory Kratos that we use internally for our managed service. It also comes with full 24/7 support, architecture guidance, and onboarding to help you get to a robust and scalable setup, and it includes a number of features that are proprietary in Ory Network, such as support for multi-tenancy, B2B organizations, etc. If you use Ory software in a business-critical context, I would highly recommend looking into that. You can contact us and we can chat more about the offer and what your requirements are!

You can of course also look into using the managed Ory Network service. I think in the long run that is the most cost-efficient option, as AWS alone can get quite costly, not even factoring in the engineering effort on your side to maintain, scale, update, etc.

As for community support, it is hard for me to give a recommendation for this particular case; maybe someone else has more experience. Off the top of my head (not sure if this all makes sense, though):
1. Use Ory Network to offload the scaling complexity.
2. Use larger compute-optimized instances (c6i/c7g family) with at least 8-16 cores per instance.
3. Consider a hybrid approach where you use powerful instances for the authentication layer and smaller instances for session validation.
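For rough sizing, the relationship between burst size, core count, and hash time can be sketched as below. This is only a back-of-the-envelope model: it assumes login handling is dominated by bcrypt, that each vCPU computes one hash at a time, and that the per-hash time (`seconds_per_hash`, set to 0.25 s here) is a placeholder you would replace with a value measured on your own instances. The function names are made up for illustration. The 8 vCPUs per instance matches c7g.2xlarge.

```python
import math


def drain_seconds(pending_logins: int, instances: int,
                  vcpus_per_instance: int, seconds_per_hash: float) -> float:
    """Time to work through a login burst if hashing is the only bottleneck."""
    # Each vCPU is assumed to complete 1 / seconds_per_hash hashes per second.
    hashes_per_second = instances * vcpus_per_instance / seconds_per_hash
    return pending_logins / hashes_per_second


def instances_needed(pending_logins: int, target_seconds: float,
                     vcpus_per_instance: int, seconds_per_hash: float) -> int:
    """Instances required to drain the burst within target_seconds."""
    cpu_seconds = pending_logins * seconds_per_hash
    return math.ceil(cpu_seconds / (target_seconds * vcpus_per_instance))


if __name__ == "__main__":
    # ASSUMPTION: 0.25 s per hash is a placeholder, not a measurement.
    burst = 40_000
    print(drain_seconds(burst, instances=15, vcpus_per_instance=8,
                        seconds_per_hash=0.25))
    print(instances_needed(burst, target_seconds=30, vcpus_per_instance=8,
                           seconds_per_hash=0.25))
```

Under these assumed numbers, 15 instances with 8 vCPUs each would need well over a minute of fully saturated CPU to clear 40,000 logins, so hitting 100% during the burst is what the model predicts rather than a sign of misconfiguration.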
s
I would say you should really consider some other way to migrate, so that you avoid the sudden spike of logins as much as possible. There is probably not much you can do about how long each hash takes. Note that changing the bcrypt cost does not affect existing hashes: they were computed and stored with the old cost, so verifying them still costs the same. As Vincent said, with a support agreement we can also help you look into the specific case and how best to migrate. As is, you can just scale up to meet your needs and monitor the resource usage.
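To see why lowering the configured cost made no difference, it helps to remember that a bcrypt hash string carries its own cost factor (for example `$2a$12$...`), and verification always uses the cost stored in the hash. A minimal sketch for inspecting stored hashes, using only the Python standard library; the helper names are hypothetical and this is not part of Kratos:

```python
import re

# Standard bcrypt string: "$2a$" / "$2b$" / "$2y$" prefix, a two-digit cost,
# then 22 characters of salt and 31 characters of hash (53 in total).
_BCRYPT_RE = re.compile(r"^\$2[abxy]?\$(\d{2})\$[./A-Za-z0-9]{53}$")


def bcrypt_cost(hash_str: str) -> int:
    """Return the cost factor embedded in a bcrypt hash string."""
    match = _BCRYPT_RE.match(hash_str)
    if match is None:
        raise ValueError("not a bcrypt hash string")
    return int(match.group(1))


def relative_work(stored_cost: int, configured_cost: int) -> int:
    """How many times more work the stored cost needs than the configured one.

    bcrypt work doubles with each cost increment; this helper expects
    stored_cost >= configured_cost.
    """
    return 2 ** (stored_cost - configured_cost)


if __name__ == "__main__":
    # Synthetic, format-valid sample; not a hash of any real password.
    sample = "$2b$12$" + "x" * 53
    print(bcrypt_cost(sample))
    print(relative_work(bcrypt_cost(sample), 4))
```

If most stored credentials were created at cost 12, each login during the burst still pays for a cost-12 verification even when the configuration now says 4, which would match the "results are the same" observation.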