Previously, we would not send unread count INCREASES to the client,
as we would expect the actual event update to wake up the client conn.
This was great because it meant the event+unread count arrived atomically
on the client. This was implemented as "parse unread counts first, then events".
However, this introduced a bug when there was more than one user in the same room. In this
scenario, one poller may get the event first, which would go through to the client.
The subsequent unread count update would then be dropped and not sent to the client.
This would just be an unfortunate UI bug were it not for the `by_notification_count`
and `by_notification_level` sort orders. Both of these sort operations use the unread counts
to determine room list ordering. The ordering would be updated on the server, but no
list operation would be sent to the client, causing the room lists to de-sync, and
resulting in incorrect DELETE/INSERT ops. This would manifest as duplicate rooms
on the room list.
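A rough illustration of why dropping the update matters (hypothetical types and helper names, not the proxy's real code): if the server re-sorts its lists when a count changes but the corresponding ops are never emitted, the client's list indexes drift from the server's, and later DELETE/INSERT ops point at the wrong rooms.

```go
// A minimal sketch, assuming hypothetical types: the server re-sorts on a
// count change and must emit the resulting ops, otherwise the client's
// copy of the list no longer matches the server's.
package main

import "fmt"

type room struct {
	id    string
	notif int
}

// diffOps returns DELETE/INSERT-style ops for rooms whose position changed.
func diffOps(before, after []room) []string {
	pos := func(list []room, id string) int {
		for i, r := range list {
			if r.id == id {
				return i
			}
		}
		return -1
	}
	var ops []string
	for _, r := range after {
		from, to := pos(before, r.id), pos(after, r.id)
		if from != to {
			ops = append(ops, fmt.Sprintf("DELETE %d, INSERT %d %s", from, to, r.id))
		}
	}
	return ops
}

func main() {
	before := []room{{"!b:s", 0}, {"!a:s", 0}}
	after := []room{{"!a:s", 1}, {"!b:s", 0}} // !a:s gained a notification and moved up
	// If these ops are dropped because the count increase was suppressed,
	// the client still thinks !b:s is at index 0.
	fmt.Println(diffOps(before, after))
}
```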
In the process of fixing this, also fix a bug where typing notifications would not
always be sent to the client: they were only sent when piggybacked onto other updates,
due to incorrect type switches.
Also fix another bug which prevented receipts from always being sent to the client.
This was caused by the extensions handler not checking whether the receipt extension
had data when deciding whether it should return. This then interacted with an
as-yet-unfixed bug which cleared the extension on subsequent updates, causing the
receipt to be lost entirely. A fix for that bug will be inbound soon.
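A minimal sketch of the missing check, using hypothetical extension types rather than the proxy's real ones: the handler should only treat an extensions response as empty when no enabled extension produced data, so a receipts-only update is not dropped.

```go
package extensions

import "encoding/json"

// Hypothetical extension payloads, for illustration only.
type Receipts struct {
	Rooms map[string]json.RawMessage
}

type Typing struct {
	Rooms map[string]json.RawMessage
}

type Response struct {
	Receipts *Receipts
	Typing   *Typing
}

// HasData reports whether any extension produced data. The bug was that
// the receipts extension was not consulted here, so a receipts-only
// update looked like "nothing to send" and was never returned.
func (r *Response) HasData() bool {
	return (r.Receipts != nil && len(r.Receipts.Rooms) > 0) ||
		(r.Typing != nil && len(r.Typing.Rooms) > 0)
}
```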
The RoomFinder accesses s.allRooms and is used when sorting the
room list, where we would expect many accesses. Previously, we
returned copies of room metadata, which caused significant amounts
of GC churn, enough to show up on traces.
Swap to using pointers and rename the function to `ReadOnlyRoom(roomID)`
to indicate that it isn't safe to write to this return value.
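A sketch of the resulting accessor, with hypothetical type definitions (the real `RoomConnMetadata` has many more fields): callers get a shared pointer rather than a fresh copy, trading a copy per access for a read-only contract.

```go
package sync3

// RoomConnMetadata is heavily simplified here; the point is only the
// pointer-returning accessor.
type RoomConnMetadata struct {
	RoomID            string
	CanonicalisedName string
	NotificationCount int
}

type InternalRequestLists struct {
	allRooms map[string]*RoomConnMetadata
}

// ReadOnlyRoom returns the single shared copy of the room's metadata.
// Callers must treat the result as immutable; mutating it would race
// with other readers and corrupt the shared state.
func (s *InternalRequestLists) ReadOnlyRoom(roomID string) *RoomConnMetadata {
	return s.allRooms[roomID]
}
```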
This results in flaky tests and a bad UX because one SS response can
say "the count changed from 0 to 1" while the message itself arrives in another
SS response. We therefore _only_ send notification counts if they _decrease_,
and piggyback increases onto the events which caused the counts to go up.
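A minimal sketch of that rule (hypothetical helper, not the proxy's actual code path): decreases have no accompanying event so they are sent on their own, while increases are held back and delivered with the event that caused them.

```go
package sync3

// shouldSendCountOnItsOwn decides whether a notification count change is
// pushed to the client immediately. Increases return false because the
// event which caused the increase will carry the new count with it.
func shouldSendCountOnItsOwn(oldCount, newCount int) bool {
	return newCount < oldCount
}
```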
Features:
- Add `typing` extension.
- Add `receipts` extension.
- Add comprehensive Prometheus `/metrics`, activated via `SYNCV3_PROM`.
- Add `SYNCV3_PPROF` support.
- Add `by_notification_level` sort order.
- Add `include_old_rooms` support.
- Add support for `$ME` and `$LAZY`.
- Add correct filtering when `*,*` is used as `required_state`.
- Add `num_live` to each room response to indicate how many timeline entries are live.
Bug fixes:
- Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to a change in the sorting algorithm.
- Send back an `errcode` on HTTP errors (e.g. expired sessions).
- Remove `unsigned.txn_id` on insertion into the DB, otherwise users would see other users' txn IDs :(
- Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
- Send HTTP 400 for invalid range requests.
- Don't publish no-op unread counts, which just adds extra noise.
- Fix leaking DB connections which could eventually consume all available connections.
- Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point; e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, and we must not hog a worker forever.
Improvements:
- Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
- Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
- Add `SlidingSyncUntil...` in tests to reduce races.
- Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
- Add trace task for initial syncs.
- Include the proxy version in UA strings.
- HTTP errors now wait 1s before returning to stop clients tight-looping on error.
- Pending event buffer is now 2000.
- Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
- Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the amount of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderate sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
This is so clients can accurately calculate the push rule:
```
{"kind":"room_member_count","is":"2"}
```
Also fixed a bug in the global room metadata where the joined/invited
counts could be wrong: Synapse can send duplicate join events, and we
were tracking +/-1 deltas. We now calculate these
counts based on the set of user IDs in each membership state.
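A sketch of the set-based counting (hypothetical structure, not the proxy's real metadata types): because membership is keyed by user ID, a duplicate join for the same user is idempotent and cannot skew the count the way a +1/-1 delta would.

```go
package state

// roomMembers tracks each user's current membership; counts are derived
// from this set rather than maintained as running deltas.
type roomMembers struct {
	membership map[string]string // user ID -> "join", "invite", "leave", ...
}

func (r *roomMembers) setMembership(userID, membership string) {
	if r.membership == nil {
		r.membership = make(map[string]string)
	}
	// Idempotent: a duplicate join event overwrites the same key.
	r.membership[userID] = membership
}

func (r *roomMembers) counts() (joined, invited int) {
	for _, m := range r.membership {
		switch m {
		case "join":
			joined++
		case "invite":
			invited++
		}
	}
	return joined, invited
}
```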
Then just loop over the list deltas when processing the event. This
ensures we don't needlessly loop over lists which did not care and
still do not care about the incoming update.
This is part of a series of refactors aimed to improve the performance
and complexity of calculating list deltas, which up until now exists in
its current form due to organic growth of the codebase.
This specific refactor introduces a new interface `RoomFinder` which
can map room IDs to `*RoomConnMetadata` which is used by `ConnState`.
All the sliding sync lists now use the `RoomFinder` instead of keeping
their own copies of `RoomConnMetadata`, meaning per-connection, rooms
just have 1 copy in-memory. This cuts down on memory usage as well as
cuts down on GC churn as we would constantly be replacing N rooms for
each update, where N is the total number of lists on that connection.
For Element-Web, N=7 currently to handle Favourites, Low Priority, DMs,
Rooms, Spaces, Invites, Search. This also has the benefit of creating
a single source of truth in `InternalRequestLists.allRooms` which can
be updated once and then a list of list deltas can be calculated off
the back of that. Previously, `allRooms` was _only_ used to seed new
lists, which created a weird imbalance as we would need to update both
`allRooms` _and_ each `FilteredSortableRooms` to keep things in-sync.
This refactor is incomplete in its present form, as we need to make
use of the new `RoomDelta` struct to efficiently package list updates.
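A sketch of the intended update flow (hypothetical signatures; the real `RoomDelta` packaging is the still-to-come part): the room is written once into `allRooms`, then each list is asked for the delta that write caused.

```go
package sync3

// Hypothetical, heavily simplified types for illustration.
type RoomConnMetadata struct{ RoomID string }

type ListDelta struct {
	List      int
	FromIndex int
	ToIndex   int
}

type filteredList struct {
	roomIDs []string
}

// resort recalculates this list's ordering for the given room and
// returns where it moved from/to (stubbed here).
func (l *filteredList) resort(roomID string) ListDelta {
	return ListDelta{FromIndex: -1, ToIndex: -1}
}

type InternalRequestLists struct {
	allRooms map[string]*RoomConnMetadata // single source of truth
	lists    []*filteredList
}

// SetRoom updates the one shared copy, then loops over the lists to
// collect their deltas, so each update is one write plus one pass.
func (s *InternalRequestLists) SetRoom(r RoomConnMetadata) []ListDelta {
	s.allRooms[r.RoomID] = &r
	deltas := make([]ListDelta, 0, len(s.lists))
	for i, l := range s.lists {
		d := l.resort(r.RoomID)
		d.List = i
		deltas = append(deltas, d)
	}
	return deltas
}
```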
Relevant actions include:
- People joining/leaving a room
- An `m.room.name` or `m.room.canonical_alias` event is sent
- etc.
Prior to this, we just set the room name field for initial=true
rooms only. This meant that if a room name was updated whilst it was
in the visible range (or currently subscribed to), we wouldn't set
this field resulting in stale names for clients. This was particularly
prominent when you created a room, as the initial member event would
cause the room to appear in the list as "Empty room" which then would
never be updated even if there was a subsequent `m.room.name` event
sent.
Fixed with regression tests.
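A sketch of the intent of the fix, with hypothetical names (the real response struct has many more fields): the name field is populated for live updates that can change it, not only when `initial=true`.

```go
package sync3

// Hypothetical response struct, for illustration only.
type Room struct {
	Initial bool   `json:"initial,omitempty"`
	Name    string `json:"name,omitempty"`
}

// buildRoom previously set Name only when initial was true, so a rename
// arriving while the room was visible (or subscribed to) never reached
// the client.
func buildRoom(initial, nameChanged bool, currentName string) Room {
	r := Room{Initial: initial}
	if initial || nameChanged {
		r.Name = currentName
	}
	return r
}
```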
- Rename `SortableRoomLists` to `InternalRequestLists` as it's more accurate.
- Move `allRooms` into `InternalRequestLists` rather than having it in connstate.go
and force accessors through `InternalRequestLists`. This ensures that we create
new lists in one place with the right rooms, consistently.
Specifically:
- Remove top-level `ops`, and replace with `lists`.
- Remove list indexes from `ops`, and rely on contextual location information.
- Remove top-level `counts` and instead embed them into each list contextually.
- Refactor connstate to reflect new API shape.
Still to do:
- Remove `rooms` / `room` from the op response, and bundle it into the
top-level `rooms`.
- Remove `UPDATE` op.
- Add `room_id` / `room_ids` field to ops to let clients know which rooms each op relates to.
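A sketch of the resulting response shape, as hypothetical Go structs rather than the proxy's real types: counts and ops are embedded per-list, with no top-level `ops` or `counts` to cross-reference by list index.

```go
package sync3

import "encoding/json"

// ResponseList embeds the count and ops for a single list.
type ResponseList struct {
	Count int               `json:"count"`
	Ops   []json.RawMessage `json:"ops,omitempty"`
}

type Response struct {
	Lists []ResponseList             `json:"lists"`
	Rooms map[string]json.RawMessage `json:"rooms,omitempty"` // "still to do": ops' room data moves here
}
```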