24 Commits

Author SHA1 Message Date
Kegan Dougal
4d54faa1a6 Fix remaining race conditions; add -race to CI 2024-03-11 10:30:03 +00:00
David Robertson
f595aed2c5
Add a separate payload for redacting state
So that we don't end up nuking conns unnecessarily.
2023-11-01 19:03:17 +00:00
David Robertson
84a5ae5dc4
A batch of useful comments
pulled out of #329
2023-10-19 14:17:39 +01:00
David Robertson
afe589921e
Invalidation: don't bother propagating a snapshot 2023-09-08 18:17:13 +01:00
David Robertson
be78e6f6e4
Define V2InvalidateRoom 2023-09-07 18:45:32 +01:00
David Robertson
b6534aa45e
Add Success field to V2InitialSyncComplete 2023-09-06 11:15:06 +01:00
kegsay
a61a3fdde2
Merge pull request #235 from matrix-org/kegan/leave-event-shouldnt-snapshot
Do not make snapshots for lone leave events
2023-08-02 04:53:40 -07:00
Kegan Dougal
6623ddb9e3 Do not make snapshots for lone leave events
Specifically, this targets invite rejections, where the leave
event is inside the leave block of the sync v2 response.

Previously, we would make a snapshot with this leave event. If the
proxy wasn't in this room, it would mean the room state would just
be the leave event, which is wrong. If the proxy was in the room,
then state would correctly be rolled forward.
2023-07-31 17:53:15 +01:00
David Robertson
4a6623ff77
Include room ID in the txnid payload 2023-07-25 19:08:11 +01:00
David Robertson
008157c146
poller: send all-clear 2023-07-25 19:08:10 +01:00
Kegan Dougal
f36c038cf8 Rate limit pubsub.V2DeviceData updates to be at most 1 per second
The db writes are still instant, but the notifications are now delayed
by up to 1 second, in order to not swamp the pubsub channels.
2023-06-26 21:04:02 -07:00
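
The rate limiting described in the commit above can be sketched as a small coalescer: changes are queued as they happen, and a single notification covering every affected device is emitted once the interval elapses. This is a minimal sketch under stated assumptions, not the code from the commit. The names `deviceKey`, `coalescer`, `newCoalescer`, `Queue` and `flush` are hypothetical, and the real payload shape may differ.

```go
package main

import (
	"sync"
	"time"
)

// deviceKey identifies a device whose data changed (hypothetical type).
type deviceKey struct {
	UserID   string
	DeviceID string
}

// coalescer batches device keys and notifies at most once per interval.
type coalescer struct {
	mu       sync.Mutex
	pending  map[deviceKey]struct{}
	interval time.Duration
	timer    *time.Timer
	notify   func(batch []deviceKey)
}

func newCoalescer(interval time.Duration, notify func([]deviceKey)) *coalescer {
	return &coalescer{
		pending:  make(map[deviceKey]struct{}),
		interval: interval,
		notify:   notify,
	}
}

// Queue records a change. The first change in a window arms the timer;
// later changes in the same window are merged into the same batch.
func (c *coalescer) Queue(k deviceKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[k] = struct{}{}
	if c.timer == nil {
		c.timer = time.AfterFunc(c.interval, c.flush)
	}
}

// flush emits the pending batch, if any, and resets the window.
func (c *coalescer) flush() {
	c.mu.Lock()
	batch := make([]deviceKey, 0, len(c.pending))
	for k := range c.pending {
		batch = append(batch, k)
	}
	c.pending = make(map[deviceKey]struct{})
	if c.timer != nil {
		c.timer.Stop()
		c.timer = nil
	}
	c.mu.Unlock()
	if len(batch) > 0 {
		c.notify(batch) // called outside the lock so it may call Queue again
	}
}
```

With `newCoalescer(time.Second, publish)`, where `publish` stands in for whatever sends the pubsub message, a caller would write to the database immediately and then call `Queue`. Subscribers hear about the change up to one second later.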
David Robertson
6a951908af
Emit a User and DeviceID in txn message 2023-06-12 11:52:33 +01:00
David Robertson
32f393ddac
Define new pubsub for txnids 2023-06-10 12:12:39 +01:00
David Robertson
06b7d91b08
Bunch of comments 2023-05-15 12:10:55 +01:00
David Robertson
ca8a2d72c4
Make ConnID hold a UserID 2023-04-28 18:50:42 +01:00
David Robertson
181cfba19e
Introduce PollerID 2023-04-28 17:05:46 +01:00
David Robertson
4f62e7af50
More work on fetching tokens from DB 2023-04-28 12:29:51 +01:00
Kegan Dougal
6bdef5feba bugfix: expire connections when the access token gets invalidated
With regression test. The behaviour is:
 - Delete the connection, such that incoming requests will end up with M_UNKNOWN_POS
 - The next request will then return HTTP 401.

This has knock-on effects:
 - We no longer send HTTP 502 if /whoami returns 401, instead we return 401.
 - When the token is expired (pollers get 401), the device is deleted from the DB.
2023-03-01 16:40:15 +00:00
Kegan Dougal
2139eda047 tests: add test for full connection buffers and expiry
Fixed a bug in notification code which could cause integration
tests to not be as deterministic as intended; should fix flaky
tests.
2023-02-03 10:00:45 +00:00
Kegan Dougal
95a5af3abe perf: immediately send to-device messages to listening conns 2023-01-09 11:53:17 +00:00
Kegan Dougal
6c4f7d3722 improvement: completely refactor device data updates
- `Conn`s now expose a direct `OnUpdate(caches.Update)` function
  for updates which concern a specific device ID.
- Add a bitset in `DeviceData` to indicate if the OTK or fallback keys were changed.
- Pass through the affected `DeviceID` in `pubsub.V2DeviceData` updates.
- Remove `DeviceDataTable.SelectFrom` as it was unused.
- Refactor how the poller invokes `OnE2EEData`: it now only does this if
  there are changes to OTK counts and/or fallback key types and/or device lists,
  and _only_ sends those fields, setting the rest to the zero value.
- Remove noisy logging.
- Add `caches.DeviceDataUpdate` which has no data but serves to wake up the long poller.
- Only send OTK counts / fallback key types when they have changed, not constantly. This
  matches the behaviour described in MSC3884

The entire flow now looks like:
- Poller notices a diff against in-memory version of otk count and invokes `OnE2EEData`
- Handler updates device data table, bumps the changed bit for otk count.
- Other handler gets the pubsub update, directly finds the `Conn` based on the `DeviceID`.
  Invokes `OnUpdate(caches.DeviceDataUpdate)`
- This update is handled by the E2EE extension which then pulls the data out from the database
  and returns it.
- On initial connections, all OTK / fallback data is returned.
2022-12-22 15:08:42 +00:00
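
The "changed bit" idea in the commit above can be illustrated as follows: setters mark a bit only when the value actually differs, and the reader returns just the changed fields (or everything on an initial connection) before clearing the bits. This is a hedged sketch. The identifiers `deviceData`, `changedOTKCounts`, `changedFallbackTypes`, `setOTKCounts`, `setFallbackTypes` and `takeChanges` are hypothetical and are not taken from the repository, which stores this state in a database table rather than in memory.

```go
package main

// Bit flags recording which parts of the device data changed (hypothetical names).
const (
	changedOTKCounts uint8 = 1 << iota
	changedFallbackTypes
)

// deviceData is an in-memory stand-in for one row of device data.
type deviceData struct {
	OTKCounts     map[string]int
	FallbackTypes []string
	ChangedBits   uint8
}

// e2eeResponse carries only the fields that need sending; the rest stay zero.
type e2eeResponse struct {
	OTKCounts     map[string]int
	FallbackTypes []string
}

func sameCounts(a, b map[string]int) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func sameTypes(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// setOTKCounts stores the counts and bumps the changed bit only on a real diff.
func (d *deviceData) setOTKCounts(counts map[string]int) {
	if sameCounts(d.OTKCounts, counts) {
		return
	}
	d.OTKCounts = counts
	d.ChangedBits |= changedOTKCounts
}

// setFallbackTypes does the same for fallback key types.
func (d *deviceData) setFallbackTypes(types []string) {
	if sameTypes(d.FallbackTypes, types) {
		return
	}
	d.FallbackTypes = types
	d.ChangedBits |= changedFallbackTypes
}

// takeChanges returns changed fields (all fields if initial) and clears the bits.
func (d *deviceData) takeChanges(initial bool) e2eeResponse {
	var resp e2eeResponse
	if initial || d.ChangedBits&changedOTKCounts != 0 {
		resp.OTKCounts = d.OTKCounts
	}
	if initial || d.ChangedBits&changedFallbackTypes != 0 {
		resp.FallbackTypes = d.FallbackTypes
	}
	d.ChangedBits = 0
	return resp
}
```

In this sketch the wake-up message itself needs no data, since the reader pulls the current values and the bits tell it which ones to include.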
Kegan Dougal
233d21ad2e Type switch payload types; add Prometheus instructions
The type names should make it self-explanatory what kinds of
payloads are being processed.
2022-12-16 10:52:08 +00:00
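
As an illustration of the type-switch approach named in the commit above, the sketch below dispatches on the concrete payload type. The payload names `V2DeviceData`, `V2InitialSyncComplete` and `V2InvalidateRoom` appear elsewhere in this history, but the field sets, the `Payload` interface shape and the `describe` function are assumptions made for the example.

```go
package main

import "fmt"

// Payload is anything that can travel over a pubsub channel (assumed shape).
type Payload interface {
	Type() string
}

// The field sets below are illustrative assumptions.
type V2DeviceData struct {
	UserID   string
	DeviceID string
}

func (*V2DeviceData) Type() string { return "V2DeviceData" }

type V2InitialSyncComplete struct {
	UserID   string
	DeviceID string
	Success  bool
}

func (*V2InitialSyncComplete) Type() string { return "V2InitialSyncComplete" }

type V2InvalidateRoom struct {
	RoomID string
}

func (*V2InvalidateRoom) Type() string { return "V2InvalidateRoom" }

// describe switches on the concrete type, so each case has typed access
// to the payload's fields and the case labels document what is handled.
func describe(p Payload) string {
	switch pl := p.(type) {
	case *V2DeviceData:
		return fmt.Sprintf("device data changed for %s/%s", pl.UserID, pl.DeviceID)
	case *V2InitialSyncComplete:
		return fmt.Sprintf("initial sync for %s/%s finished, success=%t", pl.UserID, pl.DeviceID, pl.Success)
	case *V2InvalidateRoom:
		return fmt.Sprintf("invalidate room %s", pl.RoomID)
	default:
		// Also reached for a nil Payload.
		return "unhandled payload"
	}
}
```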
Kegan Dougal
aa28df161c Rename package -> github.com/matrix-org/sliding-sync 2022-12-15 11:08:50 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby unit tests fail on go1.19 due to a change in the sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise other users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point, e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderate sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
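
The `[0,20] -> [20,30]` case from the bug-fix list above is a window that moves while still sharing one index with its previous position. The sketch below shows one way to compute what was added, removed and kept when a single inclusive range changes. It is a simplified illustration, not the project's algorithm, and the names `span` and `delta` are hypothetical.

```go
package main

// span is an inclusive index range, e.g. {0, 20} covers 21 positions.
type span struct {
	Lo, Hi int
}

// delta compares an old window with a new one and reports which index
// ranges entered the window, left it, or stayed inside it.
func delta(old, next span) (added, removed, same []span) {
	// Intersection of the two inclusive ranges.
	lo, hi := old.Lo, old.Hi
	if next.Lo > lo {
		lo = next.Lo
	}
	if next.Hi < hi {
		hi = next.Hi
	}
	if lo > hi {
		// No overlap: everything old is removed, everything new is added.
		return []span{next}, []span{old}, nil
	}
	same = []span{{lo, hi}}
	// Parts of the old range on either side of the intersection were removed.
	if old.Lo < lo {
		removed = append(removed, span{old.Lo, lo - 1})
	}
	if old.Hi > hi {
		removed = append(removed, span{hi + 1, old.Hi})
	}
	// Parts of the new range on either side of the intersection were added.
	if next.Lo < lo {
		added = append(added, span{next.Lo, lo - 1})
	}
	if next.Hi > hi {
		added = append(added, span{hi + 1, next.Hi})
	}
	return added, removed, same
}
```

For `delta(span{0, 20}, span{20, 30})` this yields `{21, 30}` as added, `{0, 19}` as removed and `{20, 20}` as unchanged, so the single shared index is neither re-sent nor dropped.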