- Print it out in red so you can spot it easily in a long test log
- Pretty-print the response JSON, because life's too short to train your
brain to be a JSON parser.
- `Conn`s now expose a direct `OnUpdate(caches.Update)` function
for updates which concern a specific device ID.
- Add a bitset in `DeviceData` to indicate whether the OTK counts or fallback key types changed (see the sketch after the flow description below).
- Pass through the affected `DeviceID` in `pubsub.V2DeviceData` updates.
- Remove `DeviceDataTable.SelectFrom` as it was unused.
- Refactor how the poller invokes `OnE2EEData`: it now only does this if
there are changes to OTK counts and/or fallback key types and/or device lists,
and _only_ sends those fields, setting the rest to the zero value.
- Remove noisy logging.
- Add `caches.DeviceDataUpdate` which has no data but serves to wake up the long poller.
- Only send OTK counts / fallback key types when they have changed, not constantly. This
matches the behaviour described in MSC3884.
The entire flow now looks like:
- Poller notices a diff against the in-memory version of the OTK count and invokes `OnE2EEData`.
- Handler updates the device data table and sets the changed bit for the OTK count.
- The other handler receives the pubsub update, finds the `Conn` directly via the `DeviceID`,
and invokes `OnUpdate(caches.DeviceDataUpdate)`.
- This update is handled by the E2EE extension which then pulls the data out from the database
and returns it.
- On initial connections, all OTK / fallback data is returned.
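
To make the changed-bits idea concrete, here is a minimal Go sketch; all names (`ChangedBits`, the bit constants, the accessor methods) are illustrative and may not match the proxy's actual `DeviceData`:

```
package internal

// Illustrative bit flags; the real names/values in the proxy may differ.
const (
	ChangedBitOTKCount = 1 << iota // OTK counts changed since last read
	ChangedBitFallbackKeys         // fallback key types changed
)

// DeviceData sketches the per-device E2EE payload plus a bitset recording
// which parts changed since the client last saw them.
type DeviceData struct {
	OTKCounts        map[string]int
	FallbackKeyTypes []string
	ChangedBits      int
}

func (dd *DeviceData) SetOTKCountChanged()       { dd.ChangedBits |= ChangedBitOTKCount }
func (dd *DeviceData) OTKCountChanged() bool     { return dd.ChangedBits&ChangedBitOTKCount != 0 }
func (dd *DeviceData) SetFallbackKeysChanged()   { dd.ChangedBits |= ChangedBitFallbackKeys }
func (dd *DeviceData) FallbackKeysChanged() bool { return dd.ChangedBits&ChangedBitFallbackKeys != 0 }
```

The extension can then consult these bits to send only the changed fields, zeroing the rest, matching MSC3884's only-on-change behaviour.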
Features:
- Add `typing` extension.
- Add `receipts` extension.
- Add comprehensive Prometheus `/metrics`, activated via `SYNCV3_PROM`.
- Add `SYNCV3_PPROF` support.
- Add `by_notification_level` sort order.
- Add `include_old_rooms` support.
- Add support for `$ME` and `$LAZY`.
- Add correct filtering when `*,*` is used as `required_state`.
- Add `num_live` to each room response to indicate how many timeline entries are live.
Bug fixes:
- Use a stricter comparison function on ranges: fixes an issue whereby unit tests fail on go1.19 due to a change in its sorting algorithm.
- Send back an `errcode` on HTTP errors (e.g. expired sessions).
- Remove `unsigned.txn_id` on insertion into the DB. Otherwise users would see other users' txn IDs :(
- Improve the range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic (see the range-delta sketch after this list).
- Send HTTP 400 for invalid range requests.
- Don't publish no-op unread counts, which just adds extra noise.
- Fix leaking DB connections which could eventually consume all available connections.
- Ensure we always unblock `WaitUntilInitialSync`, even on invalid access tokens. Other code relies on `WaitUntilInitialSync()` actually returning at _some_ point: on startup we have N workers which bound the number of concurrent pollers at any one time, so a call that never returns would hog a worker forever (sketched after this list).
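
To illustrate why a stuck `WaitUntilInitialSync` is harmful, here is a minimal Go sketch of the worker-bounding pattern described above; names and numbers are illustrative, not the proxy's actual code:

```
package main

import (
	"fmt"
	"time"
)

type poller struct{ id int }

// waitUntilInitialSync stands in for the real blocking initial sync.
func (p poller) waitUntilInitialSync() { time.Sleep(10 * time.Millisecond) }

func main() {
	const numWorkers = 4
	sem := make(chan struct{}, numWorkers) // bounds concurrent initial syncs

	for i := 0; i < 16; i++ {
		p := poller{id: i}
		sem <- struct{}{} // acquire a worker slot; blocks when all are busy
		go func() {
			// If waitUntilInitialSync never returned (e.g. on an invalid
			// access token, pre-fix), this release would never run and the
			// slot would be hogged forever.
			defer func() { <-sem }()
			p.waitUntilInitialSync()
		}()
	}
	// Drain: reacquire every slot to wait for all pollers to finish.
	for i := 0; i < numWorkers; i++ {
		sem <- struct{}{}
	}
	fmt.Println("all pollers completed initial sync")
}
```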
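And the range-delta sketch referenced above: a naive illustration of the intended semantics for `[0,20] -> [20,30]`, not the proxy's actual implementation:

```
package main

import "fmt"

// span is an inclusive [start,end] index range, e.g. [0,20].
type span struct{ start, end int64 }

// delta reports which indexes are removed, retained and added when the
// requested window moves. For [0,20] -> [20,30] this yields removed 0..19,
// same {20} and added 21..30, instead of panicking.
func delta(from, to span) (removed, same, added []int64) {
	for i := from.start; i <= from.end; i++ {
		if i < to.start || i > to.end {
			removed = append(removed, i)
		} else {
			same = append(same, i)
		}
	}
	for i := to.start; i <= to.end; i++ {
		if i < from.start || i > from.end {
			added = append(added, i)
		}
	}
	return
}

func main() {
	removed, same, added := delta(span{0, 20}, span{20, 30})
	fmt.Println(removed, same, added)
}
```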
Improvements:
- Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
- Massively improve initial v3 sync times by refactoring `JoinedRoomsTracker`: from ~47s to <1s.
- Add `SlidingSyncUntil...` in tests to reduce races.
- Tweak the API shape of `JoinedUsersForRoom` to reduce state block processing time for large rooms from 63s to 39s.
- Add trace task for initial syncs.
- Include the proxy version in UA strings.
- HTTP errors now wait 1s before returning to stop clients tight-looping on error.
- Pending event buffer is now 2000.
- Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
- Remove cancelled `m.room_key_requests` from the to-device inbox. This cuts the number of events in the inbox by ~94% for very large (20k+) inboxes and ~50% for moderately sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client (see the sketch below).
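
A sketch of the cancellation culling, assuming a simplified event shape; the real logic is table-backed and may differ:

```
package main

import "fmt"

// toDeviceEvent is a simplified shape of an m.room_key_request event.
type toDeviceEvent struct {
	Type      string
	RequestID string
	Action    string // "request" or "request_cancellation"
}

// cullCancelled drops request/cancellation pairs: if a request was
// cancelled before the client read it, neither event needs delivering.
func cullCancelled(inbox []toDeviceEvent) []toDeviceEvent {
	cancelled := map[string]bool{}
	for _, ev := range inbox {
		if ev.Type == "m.room_key_request" && ev.Action == "request_cancellation" {
			cancelled[ev.RequestID] = true
		}
	}
	out := inbox[:0]
	for _, ev := range inbox {
		if ev.Type == "m.room_key_request" && cancelled[ev.RequestID] {
			continue // drop both the request and its cancellation
		}
		out = append(out, ev)
	}
	return out
}

func main() {
	inbox := []toDeviceEvent{
		{"m.room_key_request", "r1", "request"},
		{"m.room_key_request", "r1", "request_cancellation"},
		{"m.room_key_request", "r2", "request"},
	}
	fmt.Println(cullCancelled(inbox)) // only r2's request survives
}
```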
- Randomly move elements 10,000 times in a sliding window (a fuzz-style stress test).
- Fixed a bug found as a result, which caused the algorithm to fail to issue a
DELETE/INSERT when the room was _inserted_ at the very end of the window range:
it misfired against the logic that suppresses operations for no-op moves
(sketched below).
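
The corrected suppression condition, sketched with illustrative names:

```
package sync3

// isNoOpMove reports whether a room move needs no DELETE/INSERT ops.
// Pre-fix, a room inserted at the very end of the window had from == to,
// which wrongly matched the suppression even though the room was
// previously outside the window.
func isNoOpMove(from, to int, wasInsideWindow bool) bool {
	return from == to && wasInsideWindow
}
```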
This is so clients can accurately calculate the push rule:
```
{"kind":"room_member_count","is":"2"}
```
Also fixed a bug in the global room metadata where the joined/invited counts
could be wrong: Synapse can send duplicate join events, and we were tracking
±1 deltas per event. We now calculate these counts from the set of user IDs
in each membership state (see the sketch below).
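
A sketch of the set-based counting, with illustrative names:

```
package state

// roomMetadata sketches set-based membership counting: duplicate join
// events for the same user overwrite the map entry rather than double
// counting, unlike +1/-1 deltas applied per event.
type roomMetadata struct {
	memberships map[string]string // user ID -> "join", "invite", "leave", ...
}

func newRoomMetadata() *roomMetadata {
	return &roomMetadata{memberships: make(map[string]string)}
}

func (m *roomMetadata) applyMember(userID, membership string) {
	m.memberships[userID] = membership
}

func (m *roomMetadata) counts() (joined, invited int) {
	for _, ms := range m.memberships {
		switch ms {
		case "join":
			joined++
		case "invite":
			invited++
		}
	}
	return
}
```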
This was caused by the `GlobalCache` not having a metadata entry for
the new room, which in some cases prevented a stub from being made.
With regression test.
In preparation for migrating end-to-end style integration tests
to be actual end-to-end tests. The intended split is:
- Does the test exclusively use the public sliding sync API for test assertions?
- Does the test exclusively use the public sync v2 API for configuring the test?
If the answer to both questions is YES, then they should be end-to-end tests.
Some examples of this include testing core functionality of the API like
room subscriptions, multiple lists, filters, extensions, etc.
Some examples of tests which are NOT end-to-end tests include:
- Testing connection handling (e.g. sending multiple duplicate requests).
- Ensuring outstanding requests get cancelled.
- Testing restarts of the proxy.
- Testing out-of-order responses.
- Benchmarks.
These all involve configuring the test / asserting different things, which would
be extremely difficult to reliably engineer using a real homeserver.
Along with a battery of tests to make sure we send per-room account data only
for rooms being tracked in the sliding lists; global account data is always
sent on (see the sketch below).
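
A sketch of the routing rule, with illustrative names:

```
package extensions

// shouldSendAccountData sketches the routing rule: global account data
// (no room ID) is always forwarded; per-room account data only when the
// room is currently visible in a sliding list or room subscription.
func shouldSendAccountData(roomID string, visibleRooms map[string]bool) bool {
	if roomID == "" { // global account data carries no room ID
		return true
	}
	return visibleRooms[roomID]
}
```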
- Only have a single database for all tests, like CI.
- Calling `PrepareDBConnectionString` drops all tables before returning
the string.
- Tests must be run with no concurrency, else they will step on each other
due to the previous point.
This should prevent cases where local tests pass but CI fails.
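
A sketch of the drop-before-use behaviour, assuming Postgres; the real `PrepareDBConnectionString` may differ:

```
package testutils

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver, assumed for this sketch
)

// prepareDB sketches the drop-before-use behaviour: wipe the shared
// schema so each run starts with an empty database, then hand back the
// connection string.
func prepareDB(connStr string) (string, error) {
	db, err := sql.Open("postgres", connStr)
	if err != nil {
		return "", err
	}
	defer db.Close()
	// Dropping the schema removes every table in one statement.
	if _, err := db.Exec(`DROP SCHEMA public CASCADE; CREATE SCHEMA public;`); err != nil {
		return "", err
	}
	return connStr, nil
}
```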
We lazily load timelines for rooms as the client fetches them. If a previously
lazily-loaded timeline goes out of the window and then comes back in, it results
in a DELETE/INSERT. We would detect that we already had a timeline for the room
and just return it, but that timeline was stale. We now keep these timelines
in sync with live events.
With regression test.
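
A sketch of keeping cached timelines live, with illustrative names:

```
package caches

import "encoding/json"

// timelineCache sketches keeping lazily-loaded timelines fresh: live
// events are appended to any room timeline already in the cache, so a
// DELETE/INSERT that re-enters the window reuses up-to-date data.
type timelineCache struct {
	timelines map[string][]json.RawMessage // room ID -> ordered events
}

func (c *timelineCache) onLiveEvent(roomID string, ev json.RawMessage) {
	if tl, ok := c.timelines[roomID]; ok {
		c.timelines[roomID] = append(tl, ev) // only rooms already loaded
	}
}
```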
Add a `LoadJoinedRoomsOverride` to allow tests to override and bypass DB
checks. We need the joined rooms in the cache in order to synchronise loading
connection state with live updates, ensuring we process events exactly once.
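
A hypothetical shape for the override; the real field and signature may differ:

```
package caches

// GlobalCache here is a hypothetical shape: when the override is non-nil
// it replaces the DB query that returns the latest event position and
// the user's joined room IDs, letting tests seed the cache directly.
type GlobalCache struct {
	LoadJoinedRoomsOverride func(userID string) (pos int64, joinedRooms []string)
}

func (c *GlobalCache) loadJoinedRooms(userID string) (int64, []string) {
	if c.LoadJoinedRoomsOverride != nil {
		return c.LoadJoinedRoomsOverride(userID)
	}
	// ... fall back to the real database query (elided) ...
	return 0, nil
}
```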