# talk-keto
r
Hi there 👋 We are using the hosted version of Ory. And we are using Keto to implement our permission system. The stored relationships are also crucial data points for our data teams. As far as I understand, with a self-hosted version of Keto we could store our data in Postgres and sync it from Postgres into our data warehouse using a change data capture tool. I couldn't find an answer for this when using the SaaS version of Ory. Could someone point me in the right direction? How do others get data out of the cloud system? Thank you! 🙂
h
What data do you want to sync in particular? 🙂 We have this on the roadmap and are looking for input / requirements / use cases @fast-lunch-54279
r
Thanks for your feedback! Great to hear that it's on the roadmap. Luckily it's also not an immediate requirement for us. We are most interested in the relationship tuples from Keto. We are not duplicating these in our data store. We have entities such as users, organisations and roles in our data store, but the links between them are only stored in Keto. We want a single source of truth for these relationships. These links are crucial for joining data in different analyses. Identity data such as email and username might also be helpful in the future, but is not as essential. When integrating data we would typically use CDC tools like Debezium or Kafka Connect. For SaaS APIs we also rely on Airbyte. If the product is popular enough and has an API that exposes the data, the chance is high that someone builds a connector for a tool like Airbyte. Ideally the data API has some notion of incremental update fetching (by passing a timestamp or some other form of cursor).
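To make the cursor idea concrete, here is a rough sketch of how a connector could page through relationship tuples incrementally. The endpoint, query parameters and response fields are all assumptions on my side, not the real Ory API:

```python
# Rough sketch of cursor-based incremental fetching.
# URL, parameters and response fields are assumptions, not the real Ory API.
import requests

BASE_URL = "https://YOUR-PROJECT.projects.oryapis.com/relation-tuples"  # hypothetical
API_TOKEN = "********"  # hypothetical API token


def fetch_page(cursor: str | None, page_size: int = 500) -> tuple[list[dict], str | None]:
    """Fetch one page of relationship tuples created after `cursor`."""
    params = {"page_size": page_size}
    if cursor:
        params["after"] = cursor  # resume where the previous sync stopped
    resp = requests.get(
        BASE_URL,
        params=params,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    body = resp.json()
    return body["relation_tuples"], body.get("next_cursor")


def incremental_sync(last_cursor: str | None) -> str | None:
    """Pull every page since the previous run; return the cursor to persist."""
    cursor = last_cursor
    while True:
        tuples, next_cursor = fetch_page(cursor)
        for t in tuples:
            print(t)  # stand-in for loading into a warehouse staging table
        if next_cursor is None:  # assumed to mean "no more pages"
            return cursor
        cursor = next_cursor  # stored between scheduled runs
```

A scheduled job would store the returned cursor and pass it back in on the next run, so each run only fetches tuples created since the last one.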
s
What is an acceptable delay for such an API? Do you have some other similar API as a comparison?
r
For our use cases these data pipelines run only a few times a day. Having data that is stale for a few hours is mostly still acceptable.
If that can go down to minutes it enables a few more use cases, such as building dashboards to look up the current state of the system and not only run historical analysis. But long-term analytics is definitely the main priority. As an example, our HubSpot integration runs every 6h and sales is fine with the data they have for building dashboards.
s
Thanks, that helps a lot 👍
r
Great to hear! Of course, other companies might have other requirements. But I would say that for integrations with SaaS providers it's the norm that data arrives with a delay. Getting data integrated in the first place is already a big challenge. Most data workflows are still batch-based. The "modern data stack" builds on tools like dbt and Airflow, which run scheduled batch jobs at intervals. So for most companies their internal data pipelines will also have a delay in processing.
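For a sense of what "scheduled batch jobs" means in practice, such a job is typically just an Airflow DAG along these lines (a minimal sketch assuming Airflow 2.x, with made-up DAG and task names):

```python
# Minimal sketch of a scheduled batch job in Airflow 2.x (made-up DAG/task names).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def sync_relation_tuples() -> None:
    """Placeholder for the actual extract/load step."""
    print("syncing relationship tuples into the warehouse")


with DAG(
    dag_id="keto_tuple_sync",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 */6 * * *",  # every 6 hours, like our HubSpot integration
    catchup=False,
) as dag:
    PythonOperator(task_id="sync_tuples", python_callable=sync_relation_tuples)
```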
s
So you think a "stream" will not be sufficient, but a batch is also needed? Or is it enough that one can start the stream at any time in the past?
r
You would be able to take one batch of events from the stream at a time and later continue at the previous offset, correct? That sounds like a concept that should work for batch tools as well.
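Roughly what I have in mind is something like this (a toy sketch; the in-memory list stands in for whatever stream API you would expose):

```python
# Toy sketch: consuming an event stream in batches, resuming at the last offset.
from typing import Callable


def read_events(stream: list[dict], after: int, limit: int) -> list[dict]:
    """Return up to `limit` events with an offset greater than `after`."""
    return [e for e in stream if e["offset"] > after][:limit]


def consume_batch(stream: list[dict], last_offset: int,
                  handle: Callable[[dict], None], batch_size: int = 1000) -> int:
    """Process one batch and return the offset to persist for the next run."""
    events = read_events(stream, after=last_offset, limit=batch_size)
    for event in events:
        handle(event)  # e.g. upsert the relationship tuple in the warehouse
    return events[-1]["offset"] if events else last_offset


# Each scheduled batch run picks up where the previous one stopped.
stream = [{"offset": i, "tuple": f"user:{i} is member of group:devs"} for i in range(1, 6)]
offset = 0
offset = consume_batch(stream, offset, handle=print, batch_size=3)  # offsets 1-3
offset = consume_batch(stream, offset, handle=print, batch_size=3)  # offsets 4-5
```

As long as the stream lets a consumer resume from a stored offset, a batch tool can simply run this on a schedule instead of listening continuously.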