- Connections are unique for the 3-tuple of (user, device, connection) IDs.
The code was only checking (user, device). This means we would delete
ALL connections for a device if ANY connection expired.
- ...except we wouldn't, because of the 2nd bug, which is in the deletion
code itself: it is missing an `i--`, so we would skip the ID check on
the element that shifts into a deleted index (see the sketch below).
Both of these issues have now been fixed.
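To illustrate the fix (type and function names here are hypothetical, not the proxy's actual code), the cleanup loop now matches on the full tuple and decrements the index after removing an element:

```go
package conncleanup

// connKey identifies a connection by the full (user, device, connection)
// tuple; hypothetical names, not the proxy's actual types.
type connKey struct {
	UserID, DeviceID, ConnID string
}

// removeExpired deletes entries matching the full tuple and re-checks the
// element that shifts into the freed index.
func removeExpired(conns []connKey, expired connKey) []connKey {
	for i := 0; i < len(conns); i++ {
		c := conns[i]
		// Match on (user, device, connection), not just (user, device).
		if c.UserID == expired.UserID && c.DeviceID == expired.DeviceID && c.ConnID == expired.ConnID {
			conns = append(conns[:i], conns[i+1:]...)
			i-- // without this, the element after the deleted index is never checked
		}
	}
	return conns
}
```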
The server will wait 1s if clients:
- repeat the same request (same `?pos=`)
- repeatedly hit `/sync` without a `?pos=`.
Both of these failure modes have been seen in the wild.
Fixes #93.
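A minimal sketch of that throttling, assuming hypothetical per-connection bookkeeping of the previous `?pos=` (the real implementation differs):

```go
package syncthrottle

import (
	"net/http"
	"time"
)

// maybeDelay sleeps for 1s when a client repeats the same ?pos= value, or
// keeps hitting /sync with no ?pos= at all. prevPos and seenRequest are
// hypothetical per-connection bookkeeping, not the proxy's actual fields.
func maybeDelay(req *http.Request, prevPos *string, seenRequest *bool) {
	pos := req.URL.Query().Get("pos")
	if *seenRequest && pos == *prevPos {
		time.Sleep(time.Second)
	}
	*prevPos = pos
	*seenRequest = true
}
```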
With regression test. The behaviour is:
- Delete the connection, such that incoming requests will end up with `M_UNKNOWN_POS`.
- The next request will then return HTTP 401.
This has knock-on effects:
- We no longer send HTTP 502 if `/whoami` returns 401; instead we return 401.
- When the token is expired (pollers get 401), the device is deleted from the DB.
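A rough sketch of the new error propagation, assuming a hypothetical helper that is handed the `/whoami` status code (the errcode and wording here are illustrative, not the proxy's actual API):

```go
package syncauth

import (
	"encoding/json"
	"net/http"
)

// writeAuthError mirrors the behaviour described above: when the homeserver's
// /whoami rejects the access token with 401, forward a 401 with a Matrix
// errcode rather than a generic 502; the expired device can then be deleted
// from the DB.
func writeAuthError(w http.ResponseWriter, whoamiStatus int) {
	if whoamiStatus == http.StatusUnauthorized {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusUnauthorized)
		json.NewEncoder(w).Encode(map[string]string{
			"errcode": "M_UNKNOWN_TOKEN",
			"error":   "access token expired or unknown",
		})
		return
	}
	w.WriteHeader(http.StatusBadGateway)
}
```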
Features:
- Add `typing` extension.
- Add `receipts` extension.
- Add comprehensive Prometheus `/metrics`, activated via `SYNCV3_PROM`.
- Add `SYNCV3_PPROF` support.
- Add `by_notification_level` sort order.
- Add `include_old_rooms` support.
- Add support for `$ME` and `$LAZY`.
- Add correct filtering when `*,*` is used as `required_state`.
- Add `num_live` to each room response to indicate how many timeline entries are live.
Bug fixes:
- Use a stricter comparison function on ranges: fixes an issue whereby unit tests fail on go1.19 due to a change in the sorting algorithm.
- Send back an `errcode` on HTTP errors (e.g. expired sessions).
- Remove `unsigned.txn_id` on insertion into the DB. Otherwise users would see other users' txn IDs :(
- Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
- Send HTTP 400 for invalid range requests.
- Don't publish no-op unread counts, which just add extra noise.
- Fix leaking DB connections which could eventually consume all available connections (see the sketch after this list).
- Ensure we always unblock WaitUntilInitialSync, even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point: e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.
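One common cause of leaked connections in Go is never closing the `rows` iterator, which keeps the pooled connection checked out. A generic sketch of the safe pattern (not necessarily the proxy's actual bug or query; the table and column names are made up):

```go
package syncdb

import "database/sql"

// latestEventIDs illustrates the safe pattern: always close the rows iterator
// so the underlying connection is returned to the pool, and check rows.Err.
func latestEventIDs(db *sql.DB, roomID string) ([]string, error) {
	rows, err := db.Query(`SELECT event_id FROM events WHERE room_id = $1`, roomID)
	if err != nil {
		return nil, err
	}
	defer rows.Close() // without this, each call can leak a pooled connection
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```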
Improvements:
- Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: with a modest amount of data, creating the handler used to take ~28s; it now takes 4s.
- Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`: from ~47s to <1s.
- Add `SlidingSyncUntil...` in tests to reduce races.
- Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
- Add trace task for initial syncs.
- Include the proxy version in UA strings.
- Wait 1s before returning HTTP errors, to stop clients tight-looping on error.
- Increase the pending event buffer size to 2000.
- Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
- Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the amount of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderate sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
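For illustration, the filtering roughly pairs `m.room_key_request` requests with their cancellations by `request_id` and drops both. A standalone sketch of that idea (the proxy does this bookkeeping against the database, and the types below are invented):

```go
package todevice

import "encoding/json"

// toDeviceEvent is a minimal illustrative shape for a to-device event.
type toDeviceEvent struct {
	Type    string          `json:"type"`
	Content json.RawMessage `json:"content"`
}

// dropCancelledKeyRequests removes m.room_key_request events whose request has
// since been cancelled, along with the cancellation itself.
func dropCancelledKeyRequests(events []toDeviceEvent) []toDeviceEvent {
	type keyReqContent struct {
		Action    string `json:"action"`
		RequestID string `json:"request_id"`
	}
	cancelled := make(map[string]bool)
	// First pass: record which request IDs have been cancelled.
	for _, ev := range events {
		if ev.Type != "m.room_key_request" {
			continue
		}
		var c keyReqContent
		if json.Unmarshal(ev.Content, &c) == nil && c.Action == "request_cancellation" {
			cancelled[c.RequestID] = true
		}
	}
	// Second pass: keep everything except cancelled requests and their cancellations.
	out := events[:0]
	for _, ev := range events {
		if ev.Type == "m.room_key_request" {
			var c keyReqContent
			if json.Unmarshal(ev.Content, &c) == nil && cancelled[c.RequestID] {
				continue
			}
		}
		out = append(out, ev)
	}
	return out
}
```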
`sync3` contains data structures and logic which is very isolated and
testable (think ConnMap, Room, Request, SortableRooms, etc) whereas
`sync3/handler` contains control flow which calls into `sync3` data
structures.
This has numerous benefits:
- Gnarly complicated structs like `ConnState` are now more isolated
from the codebase, forcing better API design on `sync3` structs.
- The inability to do import cycles forces structs in `sync3` to remain
simple: they cannot pull in control flow logic from `sync3/handler`
without causing a compile error.
- It's significantly easier for new developers to figure out where to start
looking for the code that executes when a new request is received.
- It reduces the number of things that `ConnState` can touch. Previously
we were gut-wrenching out of convenience, but now we're forced to move
more logic from `ConnState` into `sync3` (depending on the API design):
for example, adding `SortableRooms.RoomIDs()` (see the sketch below).
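A toy sketch of the dependency direction (the real structs are richer; this is not the actual code). `sync3` holds pure, testable data structures; `sync3/handler` imports `sync3` for control flow, and the reverse import would be a cycle and fail to compile:

```go
package sync3

// SortableRooms is illustrative only; the real struct is richer.
type SortableRooms struct {
	roomIDs []string
}

// RoomIDs is the kind of accessor the split forces us to add, instead of
// letting handler code gut-wrench into the slice directly.
func (s *SortableRooms) RoomIDs() []string {
	out := make([]string, len(s.roomIDs))
	copy(out, s.roomIDs)
	return out
}
```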
Let ConnState directly subscribe to GlobalCache rather than
the awful indirection of ConnMap -> Conn -> ConnState we had before.
That indirection existed because ConnMap is responsible for destroying old
connections (based on the TTL cache), so we could just subscribe once
and then look through the map to see who to notify. In the interests
of decoupling logic, we now just call ConnState.Destroy() when the
connection is removed from ConnMap which allows ConnState to subscribe
to GlobalCache on creation and remove its subscription on Destroy().
This makes it significantly clearer where callbacks are firing from and to,
and means ConnMap is now simply in charge of maintaining the map of
user IDs -> Conn, as well as terminating connections when they expire
via TTL.
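A simplified sketch of the new lifecycle; the method names approximate the description above rather than the exact code:

```go
package sync3

// Subscriber is whatever GlobalCache needs to notify; illustrative only.
type Subscriber interface {
	OnNewEvent(eventJSON []byte)
}

// GlobalCache sketch: subscriptions keyed by user ID.
type GlobalCache struct {
	subs map[string]Subscriber
}

func (c *GlobalCache) Subscribe(userID string, s Subscriber) { c.subs[userID] = s }
func (c *GlobalCache) Unsubscribe(userID string)             { delete(c.subs, userID) }

// ConnState subscribes on creation and removes its subscription on Destroy,
// so ConnMap no longer routes cache callbacks through Conn.
type ConnState struct {
	userID string
	cache  *GlobalCache
}

func NewConnState(userID string, cache *GlobalCache) *ConnState {
	cs := &ConnState{userID: userID, cache: cache}
	cache.Subscribe(userID, cs)
	return cs
}

// OnNewEvent would update sorted room lists etc; elided here.
func (cs *ConnState) OnNewEvent(eventJSON []byte) {}

// Destroy is called by ConnMap when the connection expires from its TTL cache.
func (cs *ConnState) Destroy() { cs.cache.Unsubscribe(cs.userID) }
```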
Add a `LoadJoinedRoomsOverride` to allow tests to override
and bypass DB checks. We need the joined rooms in the cache in order to
synchronise loading connection state with live updates, so that we
process events exactly once.
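For illustration, this is the usual swappable-function-field pattern; the signature below is invented, not the real one:

```go
package caches

// GlobalCache sketch: LoadJoinedRoomsOverride lets tests bypass the database
// entirely when populating joined rooms.
type GlobalCache struct {
	LoadJoinedRoomsOverride func(userID string) (joinedRoomIDs []string, err error)
}

func (c *GlobalCache) LoadJoinedRooms(userID string) ([]string, error) {
	if c.LoadJoinedRoomsOverride != nil {
		return c.LoadJoinedRoomsOverride(userID)
	}
	// ... the real implementation loads joined rooms from the database ...
	return nil, nil
}
```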
Keep it pure (not dependent on `state.Storage`) to make testing
easier. The responsibility for fanning out user cache updates
lies with the Handler, as it generally deals with glue code.
Adding this filter fundamentally changes the query, optimising it so that it
does not pull out the entire room state. This will be used when calculating
the `required_state` response.
Also add tests for `RoomStateAfterEventPosition` and `RoomStateBeforeEventPosition`.
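A hedged sketch of what such a filter can look like when pushed down into SQL; the column names and filter type below are hypothetical, not the proxy's actual schema:

```go
package state

import (
	"fmt"
	"strings"
)

// requiredStateFilter holds (event type, state key) pairs; "*" is a wildcard,
// mirroring the shape of the required_state API.
type requiredStateFilter [][2]string

// whereClause builds a SQL fragment restricting a state query to the requested
// types/state keys instead of pulling out the entire room state.
func (f requiredStateFilter) whereClause() (string, []interface{}) {
	var clauses []string
	var args []interface{}
	for _, pair := range f {
		evType, stateKey := pair[0], pair[1]
		switch {
		case evType == "*" && stateKey == "*":
			return "", nil // wildcard on both: the caller wants all state
		case stateKey == "*":
			clauses = append(clauses, fmt.Sprintf("(event_type = $%d)", len(args)+1))
			args = append(args, evType)
		default:
			clauses = append(clauses, fmt.Sprintf("(event_type = $%d AND state_key = $%d)", len(args)+1, len(args)+2))
			args = append(args, evType, stateKey)
		}
	}
	if len(clauses) == 0 {
		return "", nil
	}
	return "WHERE " + strings.Join(clauses, " OR "), args
}
```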