This properly propagates the Go Context down to all HTTP calls, which means that outgoing requests have the OTLP trace context.
This also adds the Jaeger propagator to the list of OTEL propagators, so that Synapse properly gets the incoming trace context.
It also upgrades all the OTEL libraries.
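As a rough sketch (not the proxy's actual setup code), the wiring could look like the following: the Jaeger propagator is registered alongside the W3C ones, and outgoing HTTP calls go through an `otelhttp` transport so they carry the current trace context.

```go
package example

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/contrib/propagators/jaeger"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// configureTracing registers a composite propagator (W3C TraceContext,
// Baggage, and Jaeger) and returns an HTTP client whose transport injects
// the current trace context into outgoing requests. Illustrative setup,
// not the proxy's actual code.
func configureTracing() *http.Client {
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
		jaeger.Jaeger{},
	))
	return &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
}
```

Outgoing requests only carry a trace if they are built with `http.NewRequestWithContext(ctx, ...)`, which is why plumbing the Go Context down matters.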
- Move processing of to-device msgs to the end, so we don't double-process them.
- Use `internal.DataError` when we fail to load a snapshot correctly, i.e. missing events in the snapshot.
On receipt of errors, do not advance the since token. This is only applied to
functions where losing data is bad (events, to-device msgs, etc.).
Comes with unit tests, which actually caught some interesting failure modes.
Specifically this is targeting invite rejections, where the leave
event is inside the leave block of the sync v2 response.
Previously, we would make a snapshot with this leave event. If the
proxy wasn't in this room, it would mean the room state would just
be the leave event, which is wrong. If the proxy was in the room,
then state would correctly be rolled forward.
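A minimal sketch of the "don't advance the since token on error" rule described above, with hypothetical names and types standing in for the proxy's real poller:

```go
package example

// Hypothetical types standing in for the proxy's sync v2 structures.
type SyncV2Response struct {
	NextBatch string
	Rooms     []string
	ToDevice  []string
}

type poller struct {
	since string
}

func (p *poller) handleRoomEvents(rooms []string) error      { return nil }
func (p *poller) handleToDeviceMessages(msgs []string) error { return nil }

// processResponse sketches the rule: room data is handled first, to-device
// messages are handled last (so they are never processed twice), and the
// since token is only advanced once everything that must not lose data
// has succeeded.
func (p *poller) processResponse(res *SyncV2Response) error {
	if err := p.handleRoomEvents(res.Rooms); err != nil {
		return err // since token NOT advanced; retry from the same position
	}
	if err := p.handleToDeviceMessages(res.ToDevice); err != nil {
		return err // likewise
	}
	p.since = res.NextBatch // safe to advance now
	return nil
}
```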
Thank God for GoLand's refactoring tools.
This will (untested) associate Sentry events from the sync2 part of the
code with User IDs and Device IDs, without having to constantly invoke
sentry.WithScope(). (Not all of the handler methods currently have that
information.) It also leaves the door open for us to include more data
on poller Sentry reports (e.g. access token hash, time of last token
activity on the sync3 side, ...)
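A hedged sketch of how this association could work with `sentry-go`, using a cloned hub stored on the context; the function names here are illustrative rather than the proxy's actual API:

```go
package example

import (
	"context"

	"github.com/getsentry/sentry-go"
)

// hubForPoller returns a context carrying a Sentry hub that is pre-tagged
// with the poller's user and device IDs, so any event captured via that hub
// is associated with them without needing sentry.WithScope() at every call
// site. (Illustrative name, not the proxy's actual API.)
func hubForPoller(ctx context.Context, userID, deviceID string) context.Context {
	hub := sentry.CurrentHub().Clone()
	hub.ConfigureScope(func(scope *sentry.Scope) {
		scope.SetUser(sentry.User{ID: userID})
		scope.SetTag("device_id", deviceID)
	})
	return sentry.SetHubOnContext(ctx, hub)
}

// reportError captures err on the hub attached to ctx, falling back to the
// global hub if none is present.
func reportError(ctx context.Context, err error) {
	hub := sentry.GetHubFromContext(ctx)
	if hub == nil {
		hub = sentry.CurrentHub()
	}
	hub.CaptureException(err)
}
```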
With regression test. The behaviour is:
- Delete the connection, such that incoming requests will end up with M_UNKNOWN_POS
- The next request will then return HTTP 401.
This has knock-on effects:
- We no longer send HTTP 502 if /whoami returns 401, instead we return 401.
- When the token is expired (pollers get 401), the device is deleted from the DB.
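For illustration, a minimal sketch of writing the Matrix-style JSON error bodies involved; the helper is hypothetical, and while `M_UNKNOWN_POS` comes from the behaviour above, the errcode used for the 401 case is an assumption:

```go
package example

import (
	"encoding/json"
	"net/http"
)

// writeMatrixError writes a Matrix-style JSON error body alongside the HTTP
// status code, e.g. 400 {"errcode":"M_UNKNOWN_POS"} when the stored
// connection has been deleted, or 401 with an errcode such as
// M_UNKNOWN_TOKEN (assumed here) when the access token has expired.
// Sketch only, not the proxy's actual handler.
func writeMatrixError(w http.ResponseWriter, status int, errcode, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(map[string]string{
		"errcode": errcode,
		"error":   msg,
	})
}
```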
Features:
- Add `typing` extension.
- Add `receipts` extension.
- Add comprehensive Prometheus `/metrics` activated via `SYNCV3_PROM`.
- Add `SYNCV3_PPROF` support.
- Add `by_notification_level` sort order.
- Add `include_old_rooms` support.
- Add support for `$ME` and `$LAZY`.
- Add correct filtering when `*,*` is used as `required_state`.
- Add `num_live` to each room response to indicate how many timeline entries are live.
Bug fixes:
- Use a stricter comparison function on ranges: fixes an issue whereby unit tests fail on go1.19 due to a change in the sorting algorithm.
- Send back an `errcode` on HTTP errors (e.g. expired sessions).
- Remove `unsigned.txn_id` on insertion into the DB. Otherwise users would see other users' txn IDs :(
- Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic (see the sketch after this list).
- Send HTTP 400 for invalid range requests.
- Don't publish no-op unread counts which just adds extra noise.
- Fix leaking DB connections which could eventually consume all available connections.
- Ensure we always unblock `WaitUntilInitialSync` even on invalid access tokens. Other code relies on `WaitUntilInitialSync()` actually returning at _some_ point: e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.
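To illustrate the range delta fix mentioned above, a hedged sketch of the kind of calculation involved, handling overlapping and disjoint windows such as `[0,20] -> [20,30]` without panicking (illustrative only, not the proxy's actual implementation):

```go
package example

// rangeDelta works out which indexes a client gained and lost when its
// window moved from [oldStart, oldEnd] to [newStart, newEnd] (inclusive).
// Disjoint windows like [0,20] -> [21,30] simply produce the full new
// window as "added" and the full old window as "removed".
func rangeDelta(oldStart, oldEnd, newStart, newEnd int64) (added, removed []int64) {
	inOld := func(i int64) bool { return i >= oldStart && i <= oldEnd }
	inNew := func(i int64) bool { return i >= newStart && i <= newEnd }
	for i := newStart; i <= newEnd; i++ {
		if !inOld(i) {
			added = append(added, i)
		}
	}
	for i := oldStart; i <= oldEnd; i++ {
		if !inNew(i) {
			removed = append(removed, i)
		}
	}
	return added, removed
}
```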
Improvements:
- Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler; now it takes 4s.
- Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
- Add `SlidingSyncUntil...` in tests to reduce races.
- Tweak the API shape of `JoinedUsersForRoom` to reduce state block processing time for large rooms from 63s to 39s.
- Add trace task for initial syncs.
- Include the proxy version in UA strings.
- HTTP errors now wait 1s before returning to stop clients tight-looping on error.
- Pending event buffer is now 2000.
- Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
- Remove cancelled `m.room_key_requests` from the to-device inbox (see the sketch below). Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderately sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
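To picture the `m.room_key_request` pruning mentioned in the last item, here is a hedged sketch (the proxy's real implementation may differ, e.g. it may do this at the database layer); it relies on the spec'd content fields `action`, `request_id` and `requesting_device_id`:

```go
package example

import "encoding/json"

// ToDeviceEvent is a minimal stand-in for a stored to-device event.
type ToDeviceEvent struct {
	Type    string
	Sender  string
	Content json.RawMessage
}

type keyRequestContent struct {
	Action             string `json:"action"`
	RequestID          string `json:"request_id"`
	RequestingDeviceID string `json:"requesting_device_id"`
}

// dropCancelledKeyRequests prunes the inbox: if an m.room_key_request with
// action "request_cancellation" matches an earlier "request" (same sender,
// device and request_id), both are dropped. Illustrative only.
func dropCancelledKeyRequests(events []ToDeviceEvent) []ToDeviceEvent {
	cancelled := map[string]bool{}
	for _, ev := range events {
		if ev.Type != "m.room_key_request" {
			continue
		}
		var c keyRequestContent
		if json.Unmarshal(ev.Content, &c) != nil {
			continue
		}
		if c.Action == "request_cancellation" {
			cancelled[ev.Sender+"|"+c.RequestingDeviceID+"|"+c.RequestID] = true
		}
	}
	var kept []ToDeviceEvent
	for _, ev := range events {
		if ev.Type == "m.room_key_request" {
			var c keyRequestContent
			if json.Unmarshal(ev.Content, &c) == nil &&
				cancelled[ev.Sender+"|"+c.RequestingDeviceID+"|"+c.RequestID] {
				continue // drop both the request and its cancellation
			}
		}
		kept = append(kept, ev)
	}
	return kept
}
```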
Fixes https://github.com/matrix-org/sliding-sync/issues/23
- Added `InvitesTable`.
- Allow invites to be sorted/searched the same as any other room by
implementing `RoomMetadata` for the invite (though this is best effort
as we don't have heroes).
Clients rely on transaction IDs coming down their /sync streams so they
can pair up an incoming event with an event they just sent but have not
yet got the event ID for.
The proxy has not historically handled this because of the shared work
model of operation, where we store exactly 1 copy of the event in the
database and no more. This means if Alice and Bob are running in the
same proxy and Alice sends a message, Bob's /sync stream may get the
event first, and that copy will NOT contain the `transaction_id`. This then
gets written into the database. Later, when Alice /syncs, she will not
get the `transaction_id` for the event she sent.
This commit fixes this by having a TTL cache which maps (user, event)
-> txn_id. Transaction IDs are inherently ephemeral, so keeping the
last 5 minutes worth of txn IDs in-memory is an easy solution which
will be good enough for the proxy. Actual server implementations of
sliding sync will be able to trivially deal with this behaviour natively.
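A minimal sketch of such a TTL cache, with illustrative names rather than the proxy's actual types:

```go
package example

import (
	"sync"
	"time"
)

// txnIDCache maps (user ID, event ID) -> transaction ID and forgets entries
// after a fixed lifetime, since transaction IDs are only useful for the
// brief window between sending an event and seeing it come down /sync.
type txnIDCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

type entry struct {
	txnID   string
	expires time.Time
}

func newTxnIDCache(ttl time.Duration) *txnIDCache {
	return &txnIDCache{ttl: ttl, entries: make(map[string]entry)}
}

// Store remembers the txn ID for the (user, event) pair for the TTL.
func (c *txnIDCache) Store(userID, eventID, txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[userID+"|"+eventID] = entry{txnID: txnID, expires: time.Now().Add(c.ttl)}
}

// Get returns the transaction ID for (userID, eventID) if it was stored
// within the TTL, so only the sender ever sees their own txn_id.
func (c *txnIDCache) Get(userID, eventID string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[userID+"|"+eventID]
	if !ok || time.Now().After(e.expires) {
		delete(c.entries, userID+"|"+eventID)
		return "", false
	}
	return e.txnID, true
}
```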
- Modify the API to instead have `WaitUntilInitialSync()` which is backed by a `WaitGroup`.
- Call this new function when a poller exists and hasn't been terminated. Previously,
we would assume that if a poller exists then it has done an initial sync, which may
not always be true. This could lead to position mismatches as a connection would be
re-created after EnsurePolling returned.
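A sketch of the `WaitGroup`-backed approach, with hypothetical type and method names:

```go
package example

import "sync"

// pollerWaiter shows how WaitUntilInitialSync() can be backed by a
// sync.WaitGroup: the group is incremented when the poller starts and
// decremented exactly once when its first v2 sync completes, so callers
// block until the initial sync has genuinely finished rather than merely
// until a poller exists. Illustrative; names differ from the proxy.
type pollerWaiter struct {
	wg   sync.WaitGroup
	once sync.Once
}

func newPollerWaiter() *pollerWaiter {
	p := &pollerWaiter{}
	p.wg.Add(1)
	return p
}

// OnInitialSyncComplete is called by the poller when the first sync
// response has been fully processed.
func (p *pollerWaiter) OnInitialSyncComplete() {
	p.once.Do(p.wg.Done)
}

// WaitUntilInitialSync blocks until the initial sync has completed.
func (p *pollerWaiter) WaitUntilInitialSync() {
	p.wg.Wait()
}
```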
- Add `AccountDataTable` with tests.
- Read global and per-room account data from sync v2 and add new callbacks to the poller.
- Update the `SyncV3Handler` to persist account data from sync v2 then notify the user cache.
- Update the `UserCache` to update `UserRoomData.IsDM` status on `m.direct` events.
- Read `m.direct` event from the DB when `UserCache` is created to track DM status per-room.
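For illustration, deriving per-room DM status from `m.direct` content (whose content maps a user ID to the list of room IDs that are DMs with that user) could look like this hedged sketch; the function name and return shape are assumptions, not the `UserCache` API:

```go
package example

import "encoding/json"

// updateDMStatus derives a room ID -> "is DM" map from the content of an
// m.direct account data event. Illustrative only.
func updateDMStatus(mDirectContent json.RawMessage) (map[string]bool, error) {
	var directed map[string][]string // other user ID -> DM room IDs
	if err := json.Unmarshal(mDirectContent, &directed); err != nil {
		return nil, err
	}
	isDM := make(map[string]bool)
	for _, roomIDs := range directed {
		for _, roomID := range roomIDs {
			isDM[roomID] = true
		}
	}
	return isDM, nil
}
```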
Document a nasty race condition which can happen if >1 user is joined
to the same room. Fixed to ensure that `GlobalCache` always stays
in sync with the database without having to hit the database.