# talk-keto
a
I've been having some issues with keto starting up in a broken state and was wondering if anyone had any advice or suggestions on what to do. To my knowledge I'm not doing anything strange (happy to share more configs and stuff), but sometimes when keto starts up in our CI environment (it also happens on macOS with the Homebrew formula) keto will return 404 for endpoints that exist and work normally other times. Here's an example of some curl output showing this condition:
```
#!/bin/bash -eo pipefail
for i in `seq 1 10`;
do
  curl -v -X GET 'http://localhost:4468/relation-tuples' | grep '{"relation_tuples":\[\],"next_page_token":""}' && exit 0
  sleep 2
done
echo Failed waiting for keto && exit 1

Note: Unnecessary use of -X or --request, GET is already inferred.
* Uses proxy env variable NO_PROXY == '127.0.0.1,localhost,circleci-internal-outer-build-agent'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 127.0.0.1:4468...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 4468 (#0)
> GET /relation-tuples HTTP/1.1
> Host: localhost:4468
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 04 Oct 2022 19:41:32 GMT
< Content-Length: 19
< 
{ [19 bytes data]
100    19  100    19    0     0  19000      0 --:--:-- --:--:-- --:--:-- 19000
* Connection #0 to host localhost left intact
```
but other times the same check passes on the first try:
```
#!/bin/bash -eo pipefail
for i in `seq 1 10`;
do
  curl -v -X GET 'http://localhost:4468/relation-tuples' | grep '{"relation_tuples":\[\],"next_page_token":""}' && exit 0
  sleep 2
done
echo Failed waiting for keto && exit 1

Note: Unnecessary use of -X or --request, GET is already inferred.
* Uses proxy env variable NO_PROXY == '127.0.0.1,localhost,circleci-internal-outer-build-agent'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 127.0.0.1:4468...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 4468 (#0)
> GET /relation-tuples HTTP/1.1
> Host: localhost:4468
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=utf-8
< Date: Tue, 04 Oct 2022 18:43:05 GMT
< Content-Length: 44
< 
{ [44 bytes data]
100    44  100    44    0     0  22000      0 --:--:-- --:--:-- --:--:-- 22000
* Connection #0 to host localhost left intact
{"relation_tuples":[],"next_page_token":""}
here's the dockerfile we're using and the associated scripts and configs
Any feedback or advice would be super appreciated as I'm a bit befuddled as to what's going on
b
(irrelevant if all your tests are running against a single instance) are you positive the server started without port conflicts? I generally have trouble with ports in unit testing, where a subsequent test's server can't start because the port is still occupied by a previous run. This creates flaky races: test B hits test A's port before A has closed, but test C errors because server A has since closed, and server C failed to start because server A was still holding port xxx at the time.
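A minimal guard along those lines, as a sketch (the port, config path, and launch command are placeholders for however you start it):

```bash
#!/bin/bash -eo pipefail
# Sketch: fail fast if something already holds the port instead of letting
# two servers race for it. PORT and the config path are placeholders.
PORT=4468
if lsof -nP -i ":${PORT}" -sTCP:LISTEN >/dev/null 2>&1; then
  echo "port ${PORT} is already in use:" >&2
  lsof -nP -i ":${PORT}" -sTCP:LISTEN >&2
  exit 1
fi
keto serve --config /path/to/keto.yaml &
echo "started keto with pid $!"
```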
a
It's flaky enough on CI that it's hard to say there, but I've observed it locally (running from launchd on startup) where keto will be listening on that port, i.e. lsof shows the following, but I'll still get 404s from known-good keto URLs:
```
> lsof -i :4468
COMMAND   PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
keto    13377 nick   33u  IPv6 0x9d2250f89fcca365      0t0  TCP *:4468 (LISTEN)
```
all the tests run against a single instance started when the env is spun up (fresh every time, we use CircleCI). I added the curl check as a stage before the "unit" tests actually run because I was having a hard time believing how it was failing. The basic flow is (readiness-gate sketch below):
• start up docker containers for datastores
• init databases
• block keto starting until its postgres instance is ready
• keto starts
• block test execution with the curl loop posted in the first message
• try and run the tests
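For completeness, here's roughly what that gate could look like against a health route instead of grep-ing a real endpoint's body (a sketch; /health/ready is the standard Ory readiness route, and 4468 is our test-env port, not the default):

```bash
#!/bin/bash -eo pipefail
# Sketch: poll Keto's readiness route rather than matching a JSON body.
# curl -f turns an HTTP error (like the 404s above) into a non-zero exit code.
for i in $(seq 1 10); do
  curl -fsS 'http://localhost:4468/health/ready' >/dev/null && exit 0
  sleep 2
done
echo "Failed waiting for keto" && exit 1
```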
looking at the logs it looks like keto doesn't always read the database config options correctly. Edit: looks like that's in reference to the options of the connection, not the connection itself...
here's a failure case:
and a success:
Not seeing anything in the logs to indicate why it would fail
s
is the 404 permanent or just for a short time after the start?
we have a race on master that could be related, but until now we thought it was just a race in the tests
what version of keto are you using?
a
v0.10.0-alpha.0
From what I can tell the 404 is persistent; it doesn't go away in any amount of time I've been able to notice. But it's kinda hard to say.
Happy to try anything out (builds with logging, setting logging params, etc if that helps)
s
Hm, I'll try to reproduce. Is there anything I should set specifically? What body do you get? And since which version has it been happening?
a
I think it's an empty body. The real way I've noticed the failure is lots of 404s from read endpoints, not so many from write endpoints. This request returning 404 is indicative of the failures I've seen:
curl -v -X GET 'http://localhost:4468/relation-tuples'
(we used a different port for our test env but I've seen it on both)
Seems to only manifest on startup, i.e. if keto is working correctly you could possibly restart it until it gets into the failed state
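If it helps localize things, a quick probe of both routers right after startup (a sketch; 4466/4467 are Keto's default read/write ports, and I'm assuming both serve the health route) would show whether only the read side 404s:

```bash
# Sketch: check the read (4466) and write (4467) APIs separately after startup.
for port in 4466 4467; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:${port}/health/ready")
  echo "port ${port}: HTTP ${code}"
done
```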
I think I'll sometimes see something like this when it happens too:
listen tcp 0.0.0.0:4466: bind: address already in use
this might be something else
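When that bind error shows up, something like this would capture what's squatting on the port (a sketch; lsof may be missing on minimal CI images, so ss is a fallback):

```bash
# Sketch: record which process already owns the port when the bind error appears.
PORT=4466
lsof -nP -i ":${PORT}" -sTCP:LISTEN || echo "nothing listening on ${PORT}"
# Fallback on Linux images without lsof:
ss -ltnp "sport = :${PORT}" 2>/dev/null || true
```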
s
I have an idea what it could be, I'll investigate tomorrow
a
@steep-lamp-91158 I was wondering if there were any updates, or if you'd like me to file an issue on GitHub or somewhere else?
s
I cannot reproduce unfortunately... Are you sure there is at any point only one server running? Could it be that there are two instances, both trying to bind to the same port?
Also, I could not find any suspicious code unfortunately...
a
I'm still seeing this happen, although rarely, so I ran our CI process on a 30-minute loop to get a better sense of the frequency. I realize that as a report from some random person this is not the most trustworthy data, so I just wanted to say thanks for taking a look at it so far. Are there any logs or other things that would be helpful for diagnosing? I'm a bit concerned that this is happening both in CI and locally, as we're working on migrating our whole authentication system over to keto.
s
I'd really like to help, but I cannot reproduce the issue... It would really help to get a reproducible case. I'm also wondering whether it is an actual Keto issue, as I never saw it in our CI or production systems. Could it be related to your CI setup? E.g. when the previous run is not completely torn down? Maybe you can share how you start the server and log its PID; then in a failing run you could verify using
lsof -i :4466
whether it is the same PID on that port.
It could then be either a race condition with the next run starting too soon, or Keto sometimes not exiting properly
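Something like this in the startup script would make that check concrete (a sketch; the config path and port are placeholders):

```bash
# Sketch: record the PID we started, then compare against the actual listener.
keto serve --config /path/to/keto.yaml &
echo $! > /tmp/keto.pid
# ...later, in a failing run:
echo "expected pid: $(cat /tmp/keto.pid)"
lsof -nP -i :4466 -sTCP:LISTEN   # does the listener's PID match?
```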
a
Thanks! I really appreciate it, I'll try and follow up with our CI provider. My understanding is that the instances run in docker, so they're fresh every time. We run the database similarly and haven't seen similar issues with it. The thing that makes me think it's something in keto is that keto responds with a 404, which indicates to me that the server is running and even speaking HTTP correctly (otherwise we'd get a timeout or a connection failure)
s
maybe https://github.com/ory/keto/pull/1107 will solve this for you, we have only seen a flake in unit tests though