# talk-keto
a
I've been having some issues with keto starting up in a broken state and was wondering if anyone had any advice or suggestions on what to do. To my knowledge I'm not doing anything strange (happy to share more configs and stuff), but sometimes when keto starts up in our CI environment (it also happens on macOS with the Homebrew formula) keto will return 404 for endpoints that exist and work normally other times. Here's an example of some curl output showing this condition:
```
#!/bin/bash -eo pipefail
for i in `seq 1 10`;
do
  curl -v -X GET 'http://localhost:4468/relation-tuples' | grep '{"relation_tuples":\[\],"next_page_token":""}' && exit 0
  sleep 2
done
echo Failed waiting for keto && exit 1

Note: Unnecessary use of -X or --request, GET is already inferred.
* Uses proxy env variable NO_PROXY == '127.0.0.1,localhost,circleci-internal-outer-build-agent'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 127.0.0.1:4468...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 4468 (#0)
> GET /relation-tuples HTTP/1.1
> Host: localhost:4468
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 04 Oct 2022 19:41:32 GMT
< Content-Length: 19
< 
{ [19 bytes data]
100    19  100    19    0     0  19000      0 --:--:-- --:--:-- --:--:-- 19000
* Connection #0 to host localhost left intact
```
but other times the same check passes on the first try:
```
#!/bin/bash -eo pipefail
for i in `seq 1 10`;
do
  curl -v -X GET 'http://localhost:4468/relation-tuples' | grep '{"relation_tuples":\[\],"next_page_token":""}' && exit 0
  sleep 2
done
echo Failed waiting for keto && exit 1

Note: Unnecessary use of -X or --request, GET is already inferred.
* Uses proxy env variable NO_PROXY == '127.0.0.1,localhost,circleci-internal-outer-build-agent'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 127.0.0.1:4468...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 4468 (#0)
> GET /relation-tuples HTTP/1.1
> Host: localhost:4468
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Date: Tue, 04 Oct 2022 18:43:05 GMT
< Content-Length: 44
< 
{ [44 bytes data]
100    44  100    44    0     0  22000      0 --:--:-- --:--:-- --:--:-- 22000
* Connection #0 to host localhost left intact
{"relation_tuples":[],"next_page_token":""}
here's the dockerfile we're using and the associated scripts and configs
Any feedback or advice would be super appreciated as I'm a bit befuddled as to what's going on
b
(irrelevant if all your tests are running against a single instance) are you positive the server started without port conflicts? I generally have trouble with ports in unit testing, where a subsequent test's server can't start because the port is still occupied by a previous run. This creates flaky races: test B hits test A's port before A has closed, but test C errors because server A has since closed, and server C failed to start because server A was still holding port xxx at the time.
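A minimal guard along those lines, as a sketch (the port, config path, and launch command are placeholders for however you start it):

```bash
#!/bin/bash -eo pipefail
# Sketch: fail fast if something already holds the port instead of letting
# two servers race for it. PORT and the config path are placeholders.
PORT=4468
if lsof -nP -i ":${PORT}" -sTCP:LISTEN >/dev/null 2>&1; then
  echo "port ${PORT} is already in use:" >&2
  lsof -nP -i ":${PORT}" -sTCP:LISTEN >&2
  exit 1
fi
keto serve --config /path/to/keto.yaml &
echo "started keto with pid $!"
```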
a
It's flaky enough on CI that it's hard to say there, but I've observed it locally (running from launchd on startup) where keto will be listening on that port, i.e. lsof shows the following, but I'll still get 404s from known-good keto URLs:
```
> lsof -i :4468
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
keto    13377 nick   33u  IPv6 0x9d2250f89fcca365      0t0  TCP *:4468 (LISTEN)
```
all the tests run against a single instance started when the env is spun up (fresh every time, we use CircleCI). I added the curl check as a stage before the "unit" tests actually run because I was having a hard time believing how it was failing. The basic flow is (readiness-gate sketch below):
• start up docker containers for datastores
• init databases
• block keto starting until its postgres instance is ready
• keto starts
• block test execution with the curl loop posted in the first message
• try and run the tests
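For completeness, here's roughly what that gate could look like against a health route instead of grep-ing a real endpoint's body (a sketch; /health/ready is the standard Ory readiness route, and 4468 is our test-env port, not the default):

```bash
#!/bin/bash -eo pipefail
# Sketch: poll Keto's readiness route rather than matching a JSON body.
# curl -f turns an HTTP error (like the 404s above) into a non-zero exit code.
for i in $(seq 1 10); do
  curl -fsS 'http://localhost:4468/health/ready' >/dev/null && exit 0
  sleep 2
done
echo "Failed waiting for keto" && exit 1
```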
looking at the logs it looks like keto doesn't always read the database config options correctly. Edit: looks like that's in reference to the options of the connection, not the connection itself...
here's a failure case:
and a success:
Not seeing anything in the logs to indicate why it would fail
s
is the 404 permanent or just for a short time after the start?
we have a race on master that could be related, but until now we thought it was just a race in the tests
what version of keto are you using?
a
v0.10.0-alpha.0
From what I can tell the 404 is persistent; it doesn't go away in any amount of time I've been able to notice. But it's kinda hard to say.
Happy to try anything out (builds with logging, setting logging params, etc if that helps)
s
Hm, I'll try to reproduce. Is there anything I should set specifically? What body do you get? And since which version has it been happening?
a
I think it's an empty body. The real way I've noticed the failure is lots of 404s from read endpoints, not so many from write endpoints. This request returning 404 is indicative of the failures I've seen:
curl -v -X GET 'http://localhost:4468/relation-tuples'
(we used a different port for our test env but I've seen it on both)
Seems to only manifest on startup, i.e. if keto is working correctly you could possibly restart it until it gets into the failed state
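If it helps localize things, a quick probe of both routers right after startup (a sketch; 4466/4467 are Keto's default read/write ports, and I'm assuming both serve the health route) would show whether only the read side 404s:

```bash
# Sketch: check the read (4466) and write (4467) APIs separately after startup.
for port in 4466 4467; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:${port}/health/ready")
  echo "port ${port}: HTTP ${code}"
done
```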
I think I'll sometimes see something like this when it happens too:
listen tcp 0.0.0.0:4466: bind: address already in use
this might be something else
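When that bind error shows up, something like this would capture what's squatting on the port (a sketch; lsof may be missing on minimal CI images, so ss is a fallback):

```bash
# Sketch: record which process already owns the port when the bind error appears.
PORT=4466
lsof -nP -i ":${PORT}" -sTCP:LISTEN || echo "nothing listening on ${PORT}"
# Fallback on Linux images without lsof:
ss -ltnp "sport = :${PORT}" 2>/dev/null || true
```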
s
I have an idea what it could be, I'll investigate tomorrow
a
@steep-lamp-91158 I was wondering if there were any updates, or if you'd like me to file an issue on GitHub or somewhere else?
s
I cannot reproduce unfortunately... Are you sure there is at any point only one server running? Could it be that there are two instances, both trying to bind to the same port?
Also, I could not find any suspicious code unfortunately...
a
I'm still seeing this happen, although rarely, so I ran our CI process on a 30-minute loop to get a better sense of the frequency. I realize that as a report from some random person this is not the most trustworthy data, so I just wanted to say thanks for taking a look at it so far. Are there any logs or other things that would be helpful for diagnosing? I'm a bit concerned that this is happening both in CI and locally, as we're working on migrating our whole authentication system over to keto.
s
I'd really like to help, but I cannot reproduce the issue... It would really help to get a reproducible case. I'm also wondering whether it is an actual Keto issue, as I never saw it in our CI or production systems. Could it be related to your CI setup? E.g. when the previous run is not completely torn down? Maybe you can share how you start the server and log its PID; then in a failing run you could verify using
lsof -i :4466
whether it is the same PID on that port.
It could then be either a race condition with the next run starting too soon, or Keto sometimes not exiting properly
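Something like this in the startup script would make that check concrete (a sketch; the config path and port are placeholders):

```bash
# Sketch: record the PID we started, then compare against the actual listener.
keto serve --config /path/to/keto.yaml &
echo $! > /tmp/keto.pid
# ...later, in a failing run:
echo "expected pid: $(cat /tmp/keto.pid)"
lsof -nP -i :4466 -sTCP:LISTEN   # does the listener's PID match?
```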
a
Thanks! I really appreciate it, I'll try and follow up with our CI provider. My understanding is that the instances run in docker, so they're fresh every time. We run the database similarly and haven't seen similar issues with it. The thing that makes me think it's something in keto is that keto responds with a 404, which indicates to me that the server is running and even speaking HTTP correctly (otherwise we'd get a timeout or a connection failure)
s
maybe https://github.com/ory/keto/pull/1107 will solve this for you, we have only seen a flake in unit tests though