With a regression test. The behaviour is:
- Delete the connection, such that incoming requests will end up with M_UNKNOWN_POS
- The next request will then return HTTP 401.
This has knock-on effects:
- We no longer send HTTP 502 if /whoami returns 401; instead we return 401.
- When the token is expired (pollers get 401), the device is deleted from the DB.
Previously, we would not send unread count INCREASES to the client,
as we would expect the actual event update to wake up the client conn.
This was great because it meant the event+unread count arrived atomically
on the client. This was implemented as "parse unread counts first, then events".
However, this introduced a bug when there were >1 user in the same room. In this
scenario, one poller may get the event first, which would go through to the client.
The subsequent unread count update would then be dropped and not sent to the client.
This would just be an unfortunate UI bug if it weren't for the `by_notification_count`
and `by_notification_level` sort orders. Both of these sort operations use the unread counts
to determine room list ordering. This list would be updated on the server, but no
list operation would be sent to the client, causing the room lists to de-sync, and
resulting in incorrect DELETE/INSERT ops. This would manifest as duplicate rooms
on the room list.
In the process of fixing this, also fix a bug where typing notifications would not
always be sent to the client: they would only be sent when piggybacked onto other
updates, due to incorrect type switches.
Also fix another bug which prevented receipts from always being sent to the client.
This was caused by the extensions handler not checking if the receipt extension had
data to determine if it should return. This then interacted with an as-yet-unfixed bug
which cleared the extension on subsequent updates, causing the receipt to be lost entirely.
A fix for this will be inbound soon.
- Scope transaction IDs to the device ID (access token) rather
  than the user ID, as this more closely matches the spec.
- Batch up all transaction ID lookups for all rooms being returned
into a single query. Previously, we would sequentially call SELECT
n times, one per room being returned, which was taking lots of time
just due to RTTs to the database server (often this table is empty).
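A minimal sketch of the batched lookup, assuming a hypothetical `syncv3_txns` table with `device_id`, `event_id` and `txn_id` columns (not necessarily the real schema), using a postgres array parameter so all rooms are covered in one round trip:

```go
package txns

import (
	"github.com/jmoiron/sqlx"
	"github.com/lib/pq"
)

// TxnIDsForEvents returns event_id -> txn_id for the given device in a single
// query, instead of one SELECT per room being returned.
func TxnIDsForEvents(db *sqlx.DB, deviceID string, eventIDs []string) (map[string]string, error) {
	result := make(map[string]string, len(eventIDs))
	rows, err := db.Query(
		`SELECT event_id, txn_id FROM syncv3_txns
		 WHERE device_id = $1 AND event_id = ANY($2)`,
		deviceID, pq.StringArray(eventIDs),
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var eventID, txnID string
		if err := rows.Scan(&eventID, &txnID); err != nil {
			return nil, err
		}
		result[eventID] = txnID
	}
	return result, rows.Err()
}
```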
- `Conn`s now expose a direct `OnUpdate(caches.Update)` function
for updates which concern a specific device ID.
- Add a bitset in `DeviceData` to indicate if the OTK or fallback keys were changed (see the sketch after the flow below).
- Pass through the affected `DeviceID` in `pubsub.V2DeviceData` updates.
- Remove `DeviceDataTable.SelectFrom` as it was unused.
- Refactor how the poller invokes `OnE2EEData`: it now only does this if
there are changes to OTK counts and/or fallback key types and/or device lists,
and _only_ sends those fields, setting the rest to the zero value.
- Remove noisy logging.
- Add `caches.DeviceDataUpdate` which has no data but serves to wake up the long poller.
- Only send OTK counts / fallback key types when they have changed, not constantly. This
  matches the behaviour described in MSC3884.
The entire flow now looks like:
- Poller notices a diff against the in-memory version of the OTK count and invokes `OnE2EEData`.
- Handler updates the device data table and sets the changed bit for OTK counts.
- The other handler gets the pubsub update, directly finds the `Conn` based on the `DeviceID`
  and invokes `OnUpdate(caches.DeviceDataUpdate)`.
- This update is handled by the E2EE extension which then pulls the data out from the database
and returns it.
- On initial connections, all OTK / fallback data is returned.
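A minimal sketch of the changed-bit idea, with assumed field and constant names (the real `DeviceData` in the proxy may differ):

```go
package sync2

// Bits recording which parts of the device's E2EE data changed since last read.
const (
	ChangedOTKCounts        = 1 << iota // bit 0: OTK counts changed
	ChangedFallbackKeyTypes             // bit 1: fallback key types changed
)

type DeviceData struct {
	UserID           string
	DeviceID         string
	OTKCounts        map[string]int
	FallbackKeyTypes []string
	ChangedBits      int
}

func (dd *DeviceData) SetOTKCountChanged()       { dd.ChangedBits |= ChangedOTKCounts }
func (dd *DeviceData) SetFallbackKeysChanged()   { dd.ChangedBits |= ChangedFallbackKeyTypes }
func (dd *DeviceData) OTKCountChanged() bool     { return dd.ChangedBits&ChangedOTKCounts != 0 }
func (dd *DeviceData) FallbackKeysChanged() bool { return dd.ChangedBits&ChangedFallbackKeyTypes != 0 }

// ClearChanged is called once the E2EE extension has read the data, so OTK counts /
// fallback key types are only sent when they have changed, matching MSC3884.
func (dd *DeviceData) ClearChanged() { dd.ChangedBits = 0 }
```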
Features:
- Add `typing` extension.
- Add `receipts` extension.
- Add comprehensive Prometheus `/metrics`, activated via `SYNCV3_PROM`.
- Add `SYNCV3_PPROF` support.
- Add `by_notification_level` sort order.
- Add `include_old_rooms` support.
- Add support for `$ME` and `$LAZY`.
- Add correct filtering when `*,*` is used as `required_state`.
- Add `num_live` to each room response to indicate how many timeline entries are live.
Bug fixes:
- Use a stricter comparison function on ranges: fixes an issue whereby unit tests fail on go1.19 due to a change in the sorting algorithm.
- Send back an `errcode` on HTTP errors (e.g. expired sessions).
- Remove `unsigned.txn_id` on insertion into the DB. Otherwise users would see other users' txn IDs :(
- Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
- Send HTTP 400 for invalid range requests.
- Don't publish no-op unread counts, which just add extra noise.
- Fix leaking DB connections which could eventually consume all available connections.
- Ensure we always unblock `WaitUntilInitialSync`, even on invalid access tokens. Other code relies on `WaitUntilInitialSync()` actually returning at _some_ point: e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.
Improvements:
- Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
- Massively improve initial v3 sync times by refactoring `JoinedRoomsTracker`: from ~47s to <1s.
- Add `SlidingSyncUntil...` in tests to reduce races.
- Tweak the API shape of `JoinedUsersForRoom` to reduce state block processing time for large rooms from 63s to 39s.
- Add trace task for initial syncs.
- Include the proxy version in UA strings.
- HTTP errors now wait 1s before returning to stop clients tight-looping on error.
- Pending event buffer is now 2000.
- Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
- Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderately sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
We don't care about that as they never form part of the timeline.
Also, only send up a `timeline_limit: 1` filter to sync v2 when there
is no `?since` token. Otherwise, we want a timeline limit >1 so we
can ensure that we remain gapless (else the proxy drops events).
- Completely ignore events in the `state` block when processing
sync v3 requests with a large `timeline_limit`. We should never
have been including them in the first place as they are not
chronological at all.
- Perform sync v2 requests with a timeline limit of 1 to ensure
we can always return a `prev_batch` token to the caller. This
means on the first startup, clicking a room will force a `/messages`
hit until there have been `$limit` new events, at which point the proxy
will be able to serve these events from the local DB. Critically,
this ensures that we never send back an empty `prev_batch`, which
causes clients to believe that there is no history in a room.
We can do this now because we store the access token for each device.
Throttled at 16 concurrent sync requests to avoid causing
thundering herds on startup.
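A minimal sketch of the 16-way throttle using a buffered-channel semaphore; the real implementation may use a different mechanism:

```go
package sync2

// At most 16 devices will be performing their initial v2 sync at any one time,
// to avoid a thundering herd against the homeserver on startup.
var initialSyncSemaphore = make(chan struct{}, 16)

// doInitialSync runs the supplied sync function once a slot is free.
func doInitialSync(sync func() error) error {
	initialSyncSemaphore <- struct{}{}        // acquire a slot (blocks when 16 are in flight)
	defer func() { <-initialSyncSemaphore }() // release the slot when done
	return sync()
}
```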
- Add `SYNCV3_SECRET` env var which is SHA256'd and used as an AES
key to encrypt/decrypt tokens.
- Add column `v2_token_encrypted` to `syncv3_sync2_devices`.
- Update unit tests to check encryption/decryption work.
This provides an extra layer of security in case the database is
compromised and real user access tokens are leaked. This forces
an attacker to obtain both the database table _and_ the secret
env var (which will typically be stored in secure storage e.g.
k8s secrets). Unfortunately, we need to have the access_token
in plaintext, so we cannot rely on password-style storage algorithms
like bcrypt/scrypt, which would be safer.
Fixes https://github.com/matrix-org/sliding-sync/issues/23
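A minimal sketch of the scheme, assuming AES-GCM with a random nonce and hex encoding; the proxy's exact cipher mode, encoding and function names may differ:

```go
package sync2

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// keyFromSecret hashes SYNCV3_SECRET into a 32-byte AES-256 key.
func keyFromSecret(secret string) []byte {
	h := sha256.Sum256([]byte(secret))
	return h[:]
}

// encryptToken encrypts a v2 access token so that a leaked database alone is not
// enough to recover it; the attacker also needs SYNCV3_SECRET.
func encryptToken(secret, token string) (string, error) {
	block, err := aes.NewCipher(keyFromSecret(secret))
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return "", err
	}
	// Prepend the nonce so decryptToken can recover it.
	return hex.EncodeToString(gcm.Seal(nonce, nonce, []byte(token), nil)), nil
}

// decryptToken reverses encryptToken: split off the nonce, then open the ciphertext.
func decryptToken(secret, encrypted string) (string, error) {
	raw, err := hex.DecodeString(encrypted)
	if err != nil {
		return "", err
	}
	block, err := aes.NewCipher(keyFromSecret(secret))
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	if len(raw) < gcm.NonceSize() {
		return "", fmt.Errorf("ciphertext too short")
	}
	nonce, ciphertext := raw[:gcm.NonceSize()], raw[gcm.NonceSize():]
	plain, err := gcm.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return "", err
	}
	return string(plain), nil
}
```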
- Add `InvitesTable`.
- Allow invites to be sorted/searched the same as any other room by
  implementing `RoomMetadata` for the invite (though this is best effort
  as we don't have heroes).
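A minimal, hypothetical sketch of the best-effort metadata derivation from stripped `invite_state` events (function and type names are assumptions; the `m.room.name` / `m.room.canonical_alias` fallback is standard Matrix room naming):

```go
package sync3

import "encoding/json"

type strippedEvent struct {
	Type     string          `json:"type"`
	StateKey string          `json:"state_key"`
	Sender   string          `json:"sender"`
	Content  json.RawMessage `json:"content"`
}

// inviteRoomName picks a display name for an invited room from its stripped state:
// prefer m.room.name, fall back to m.room.canonical_alias, and finally the
// inviter's user ID (no heroes are available for invites).
func inviteRoomName(inviteState []strippedEvent, inviter string) string {
	var alias string
	for _, ev := range inviteState {
		switch ev.Type {
		case "m.room.name":
			var c struct {
				Name string `json:"name"`
			}
			if json.Unmarshal(ev.Content, &c) == nil && c.Name != "" {
				return c.Name
			}
		case "m.room.canonical_alias":
			var c struct {
				Alias string `json:"alias"`
			}
			if json.Unmarshal(ev.Content, &c) == nil && c.Alias != "" {
				alias = c.Alias
			}
		}
	}
	if alias != "" {
		return alias
	}
	return inviter
}
```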
Clients rely on transaction IDs coming down their /sync streams so they
can pair up an incoming event with an event they just sent but have not
yet got the event ID for.
The proxy has not historically handled this because of the shared work
model of operation, where we store exactly 1 copy of the event in the
database and no more. This means if Alice and Bob are running in the
same proxy, then Alice sends a message, Bob's /sync stream may get the
event first and that will NOT contain the `transaction_id`. This then
gets written into the database. Later, when Alice /syncs, she will not
get the `transaction_id` for the event which she sent.
This commit fixes this by having a TTL cache which maps (user, event)
-> txn_id. Transaction IDs are inherently ephemeral, so keeping the
last 5 minutes worth of txn IDs in-memory is an easy solution which
will be good enough for the proxy. Actual server implementations of
sliding sync will be able to trivially deal with this behaviour natively.
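A minimal sketch of such a TTL cache with assumed names; the real cache may use a library or evict differently:

```go
package caches

import (
	"sync"
	"time"
)

type txnEntry struct {
	txnID   string
	expires time.Time
}

// TxnIDCache maps (user, event) -> txn_id for a short window, so only the sender
// of an event sees their own transaction_id.
type TxnIDCache struct {
	mu      sync.Mutex
	entries map[string]txnEntry // key: userID + "\x00" + eventID
	ttl     time.Duration
}

func NewTxnIDCache() *TxnIDCache {
	return &TxnIDCache{entries: make(map[string]txnEntry), ttl: 5 * time.Minute}
}

func (c *TxnIDCache) Store(userID, eventID, txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[userID+"\x00"+eventID] = txnEntry{txnID: txnID, expires: time.Now().Add(c.ttl)}
}

// Get returns the txn_id if it was stored within the TTL, deleting expired entries lazily.
func (c *TxnIDCache) Get(userID, eventID string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := userID + "\x00" + eventID
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expires) {
		delete(c.entries, key)
		return "", false
	}
	return e.txnID, true
}
```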
- Modify the API to instead have `WaitUntilInitialSync()` which is backed by a `WaitGroup`.
- Call this new function when a poller exists and hasn't been terminated. Previously,
we would assume that if a poller exists then it has done an initial sync, which may
not always be true. This could lead to position mismatches as a connection would be
re-created after EnsurePolling returned.
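A minimal sketch of the `WaitGroup`-backed API, with assumed struct and method names:

```go
package sync2

import "sync"

type Poller struct {
	// Add(1) when the poller is created; Done() after the first v2 sync
	// completes or the poller is terminated.
	wg sync.WaitGroup
}

func NewPoller() *Poller {
	p := &Poller{}
	p.wg.Add(1)
	return p
}

// OnInitialSyncComplete must be called exactly once, even on failure/termination,
// so that callers of WaitUntilInitialSync are always unblocked.
func (p *Poller) OnInitialSyncComplete() { p.wg.Done() }

// WaitUntilInitialSync blocks until the poller has performed its first v2 sync
// (or has terminated), rather than assuming an existing poller has already synced.
func (p *Poller) WaitUntilInitialSync() { p.wg.Wait() }
```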
- Persist OTK counts and device list changes in-memory per Poller.
- Expose a new `E2EEFetcher` to allow the E2EE extension code to
grab said E2EE data from the Poller.
- OTK counts are replaced outright.
- Device lists are updated in a `user_id -> changed|left` map which is then
  deleted when read (see the sketch after this list).
- Add tests for basic functionality and some edge cases like ensuring that
v3 request retries still return changed|left values.
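A minimal sketch of the per-Poller in-memory state described above (names are assumptions): OTK counts are replaced wholesale, and device list diffs are handed over and cleared when read:

```go
package sync2

import "sync"

type pollerE2EEState struct {
	mu              sync.Mutex
	otkCounts       map[string]int    // algorithm -> count, replaced outright each poll
	deviceListDiffs map[string]string // user_id -> "changed" | "left"
}

func (s *pollerE2EEState) UpdateOTKCounts(counts map[string]int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.otkCounts = counts
}

func (s *pollerE2EEState) AccumulateDeviceListChanges(changed, left []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.deviceListDiffs == nil {
		s.deviceListDiffs = make(map[string]string)
	}
	for _, userID := range changed {
		s.deviceListDiffs[userID] = "changed"
	}
	for _, userID := range left {
		s.deviceListDiffs[userID] = "left"
	}
}

// DeviceListChanges returns the accumulated diffs and deletes them, so each
// change is only handed to the E2EE extension once.
func (s *pollerE2EEState) DeviceListChanges() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	diffs := s.deviceListDiffs
	s.deviceListDiffs = nil
	return diffs
}
```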
This means we can serve rooms/events from the v3 database
immediately if they exist. The downside is that we still do
need to hit v2 to pull in to-device messages, but they can
come in later.
- Add `AccountDataTable` with tests.
- Read global and per-room account data from sync v2 and add new callbacks to the poller.
- Update the `SyncV3Handler` to persist account data from sync v2 then notify the user cache.
- Update the `UserCache` to update `UserRoomData.IsDM` status on `m.direct` events.
- Read `m.direct` event from the DB when `UserCache` is created to track DM status per-room.
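A minimal sketch of deriving per-room DM status from `m.direct` content (the `user_id -> [room_id]` shape is per the Matrix spec; the function name is hypothetical):

```go
package caches

import "encoding/json"

// DMRoomsFromMDirect returns the set of room IDs which are DMs according to the
// given m.direct account data content.
func DMRoomsFromMDirect(mDirectContent []byte) (map[string]bool, error) {
	var content map[string][]string // user_id -> list of room IDs
	if err := json.Unmarshal(mDirectContent, &content); err != nil {
		return nil, err
	}
	dmRooms := make(map[string]bool)
	for _, roomIDs := range content {
		for _, roomID := range roomIDs {
			dmRooms[roomID] = true // UserCache then sets UserRoomData.IsDM = true
		}
	}
	return dmRooms, nil
}
```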
- Only have a single database for all tests, like CI.
- Calling `PrepareDBConnectionString` drops all tables before returning
the string.
- Tests must be run with no concurrency, else they will step on each other
  due to the previous point.
This should prevent cases where local tests pass but CI fails.
Document a nasty race condition which can happen if >1 user is joined
to the same room. Fix it to ensure that `GlobalCache` always stays
in sync with the database without having to hit the database.