59 Commits

Author SHA1 Message Date
Kegan Dougal
05a82a43dc Same race pattern as timeSince for timeSleep 2024-03-11 12:06:13 +00:00
David Robertson
c239cacc83
Initialise: handle gappy polls and ditch prependStateEvents 2023-11-03 15:42:25 +00:00
Kegan Dougal
32c2f6b93d Actually use the provided value 2023-10-11 13:21:52 +01:00
Kegan Dougal
97d53448d7 Fix poller race condition 2023-10-11 12:58:05 +01:00
Kegan Dougal
0856a8d53d bugfix: give up polling if the /sync response keeps erroring for >50min 2023-10-03 13:02:17 +01:00
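The give-up rule in the commit above can be sketched as follows. This is a minimal illustration, not the proxy's code: `failureTracker`, `Observe`, `t0` and `errSync` are hypothetical names, and the sketch assumes the 50 minutes are measured from the first error in an unbroken run of failures.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// giveUpAfter mirrors the ">50min" figure from the commit message.
const giveUpAfter = 50 * time.Minute

// failureTracker is a hypothetical helper: it remembers when the current
// unbroken run of /sync errors started.
type failureTracker struct {
	firstFailure time.Time // zero value means the last request succeeded
}

// Observe records the outcome of one /sync attempt made at time now and
// reports whether the poller should stop.
func (f *failureTracker) Observe(now time.Time, err error) (giveUp bool) {
	if err == nil {
		f.firstFailure = time.Time{} // any success resets the run
		return false
	}
	if f.firstFailure.IsZero() {
		f.firstFailure = now
	}
	return now.Sub(f.firstFailure) > giveUpAfter
}

// Fixed values so the example is deterministic.
var (
	t0      = time.Date(2023, 10, 3, 12, 0, 0, 0, time.UTC)
	errSync = errors.New("/sync returned an error")
)

func main() {
	var f failureTracker
	fmt.Println(f.Observe(t0, errSync))                     // false: the run has just started
	fmt.Println(f.Observe(t0.Add(49*time.Minute), errSync)) // false: still under the limit
	fmt.Println(f.Observe(t0.Add(51*time.Minute), errSync)) // true: erroring for >50min
}
```

Passing the current time in, rather than calling `time.Now()` inside, keeps the rule testable without sleeping.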
David Robertson
a28e419d5d
Update mockClient to match new interface 2023-09-26 13:35:24 +01:00
David Robertson
e75a462d4c
Merge pull request #300 from matrix-org/dmr/invalidate-timelines 2023-09-20 14:29:55 +01:00
David Robertson
d3ba1f1c30
Move TimelineResponse back to sync2 2023-09-19 12:41:25 +01:00
David Robertson
957bdee9d2
Merge branch 'main' into dmr/invalidate-timelines 2023-09-19 12:40:13 +01:00
Kegan Dougal
e4cedaabcd Merge branch 'main' into kegan/poll-retry-loop-bad-create-event 2023-09-14 09:29:44 +01:00
David Robertson
df01e50438
Pass TimelineResponse struct around 2023-09-13 19:17:53 +01:00
Quentin Gliech
af5e8579b2 Better propagate request context
This properly propagates the Go Context down to all HTTP calls, which means that outgoing requests carry the OTLP trace context.
This also adds the Jaeger propagator to the list of OTEL propagators, so that Synapse properly gets the incoming trace context.
It also upgrades all the OTEL libraries.
2023-09-13 19:41:52 +02:00
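Why the context has to reach every HTTP call can be shown with the standard library alone. In this sketch `traceKey`, `tracingTransport`, `echoTransport`, `doSync`, `demo` and the `X-Trace-Id` header are all hypothetical stand-ins (the real change uses the OTEL libraries, which are not reproduced here); only `http.NewRequestWithContext` and the `http.RoundTripper` interface are real API.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// traceKey is a hypothetical context key holding a trace ID.
type traceKey struct{}

// tracingTransport stands in for an instrumented transport: it copies the
// trace ID found in the request's context into an outgoing header.
type tracingTransport struct{ next http.RoundTripper }

func (t tracingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if id, ok := req.Context().Value(traceKey{}).(string); ok {
		req = req.Clone(req.Context()) // a RoundTripper must not mutate the caller's request
		req.Header.Set("X-Trace-Id", id)
	}
	return t.next.RoundTrip(req)
}

// echoTransport fakes the upstream server so the example needs no network:
// it reports back which trace ID it received.
type echoTransport struct{}

func (echoTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	h := http.Header{}
	h.Set("X-Seen-Trace-Id", req.Header.Get("X-Trace-Id"))
	return &http.Response{StatusCode: http.StatusOK, Header: h, Body: http.NoBody, Request: req}, nil
}

// doSync builds the request with the caller's ctx, so anything stored in the
// context is visible to the transport.
func doSync(ctx context.Context, client *http.Client, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Header.Get("X-Seen-Trace-Id"), nil
}

// demo returns the trace ID the fake upstream saw.
func demo(traceID string) (string, error) {
	client := &http.Client{Transport: tracingTransport{next: echoTransport{}}}
	ctx := context.WithValue(context.Background(), traceKey{}, traceID)
	return doSync(ctx, client, "http://homeserver.invalid/sync")
}

func main() {
	seen, err := demo("abc123")
	fmt.Println(seen, err) // abc123 <nil>
}
```

Had `doSync` built its request with `context.Background()` instead, the transport would find no trace ID and the upstream would see an empty header.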
Kegan Dougal
7c80b5424a Prioritise retriable errors over unretriable errors
Bump to Go 1.20 for errors.Join and added introspection to
errors.As to inspect []error.
2023-09-12 14:57:40 +01:00
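The Go 1.20 behaviour this commit relies on can be sketched as below: `errors.Join` bundles several errors, and `errors.As` searches the whole bundle. `RetriableError`, `classify`, `errMalformed` and `errTimeout` are hypothetical names for illustration; the proxy's own error types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// RetriableError is a hypothetical marker for failures worth retrying.
type RetriableError struct{ Err error }

func (e *RetriableError) Error() string { return "retriable: " + e.Err.Error() }
func (e *RetriableError) Unwrap() error { return e.Err }

var (
	errMalformed = errors.New("malformed event")
	errTimeout   = errors.New("database timeout")
)

// classify joins errs and reports what the caller should do. A single
// retriable error anywhere in the bundle wins over unretriable ones.
func classify(errs ...error) string {
	joined := errors.Join(errs...) // nil when every element is nil
	if joined == nil {
		return "ok"
	}
	var re *RetriableError
	if errors.As(joined, &re) { // since Go 1.20, As also walks joined errors
		return "retry"
	}
	return "skip"
}

func main() {
	fmt.Println(classify(nil, nil))                                       // ok
	fmt.Println(classify(errMalformed))                                   // skip
	fmt.Println(classify(errMalformed, &RetriableError{Err: errTimeout})) // retry
}
```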
David Robertson
d34a053927
Brief unit test 2023-09-06 15:49:19 +01:00
David Robertson
fca1318095
Let PollerMap.EnsurePolling return an error 2023-09-06 11:28:20 +01:00
Kegan Dougal
9c5ebb2f2b Guard for when the test has finished 2023-08-16 15:08:50 +01:00
Kegan Dougal
980d6423a5 Fix concurrent map writes 2023-08-16 14:00:40 +01:00
David Robertson
ff7120245a
Merge pull request #242 from matrix-org/dmr/purge-inactive-pollers 2023-08-16 13:43:46 +01:00
Kegan Dougal
066327d407 Add internal.DataError to skip over bad responses
- Process to-device msgs last, so we don't double process.
- Use internal.DataError when we fail to load a snapshot correctly, i.e. when events are missing from the snapshot.
2023-08-16 10:52:35 +01:00
Kegan Dougal
9c7c7b7be2 Unbreak UTs 2023-08-15 19:11:21 +01:00
Kegan Dougal
d63864f494 Modify V2DataReceiver to allow error returns
On receipt of errors, do not advance the since token. Only added to
functions where losing data is bad (events, to-device msgs, etc).

With unit tests, which actually caught some interesting failure modes.
2023-08-15 18:51:11 +01:00
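The rule "on receipt of errors, do not advance the since token" can be sketched as below. All names (`syncResponse`, `dataReceiver`, `poller`, `handle`, `flakyReceiver`) are hypothetical and much smaller than the real V2DataReceiver interface; the point is only that the token moves after the data is safely handed over, so a failed response is requested again.

```go
package main

import (
	"errors"
	"fmt"
)

// syncResponse is a heavily trimmed, hypothetical sync v2 response.
type syncResponse struct {
	NextBatch string
	Events    []string
}

// dataReceiver is a one-method stand-in for the receiver interface.
type dataReceiver interface {
	Accumulate(events []string) error
}

type poller struct {
	since    string
	receiver dataReceiver
}

// handle passes the response to the receiver and advances the since token
// only if the receiver accepted the data.
func (p *poller) handle(resp syncResponse) error {
	if err := p.receiver.Accumulate(resp.Events); err != nil {
		return fmt.Errorf("keeping since token %q: %w", p.since, err)
	}
	p.since = resp.NextBatch
	return nil
}

// flakyReceiver fails its first call and succeeds afterwards.
type flakyReceiver struct {
	calls  int
	stored []string
}

func (r *flakyReceiver) Accumulate(events []string) error {
	r.calls++
	if r.calls == 1 {
		return errors.New("database unavailable")
	}
	r.stored = append(r.stored, events...)
	return nil
}

func main() {
	recv := &flakyReceiver{}
	p := &poller{since: "s1", receiver: recv}
	resp := syncResponse{NextBatch: "s2", Events: []string{"$event"}}

	fmt.Println(p.handle(resp) != nil, p.since) // true s1
	fmt.Println(p.handle(resp) != nil, p.since) // false s2
}
```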
David Robertson
d659824edf
Expire pollers method 2023-08-09 11:46:12 +01:00
Till Faelligen
5846873d43
Merge branch 'main' of github.com:matrix-org/sliding-sync into s7evink/typing 2023-08-02 14:02:44 +02:00
Kegan Dougal
6623ddb9e3 Do not make snapshots for lone leave events
Specifically this is targeting invite rejections, where the leave
event is inside the leave block of the sync v2 response.

Previously, we would make a snapshot with this leave event. If the
proxy wasn't in this room, it would mean the room state would just
be the leave event, which is wrong. If the proxy was in the room,
then state would correctly be rolled forward.
2023-07-31 17:53:15 +01:00
Till Faelligen
3a2001f07d
Use PollerID instead of device ID 2023-07-27 12:33:10 +02:00
Till Faelligen
8dc8d4897f
Let only one device handle typing notifications 2023-07-24 08:40:23 +02:00
Till Faelligen
22f640a352
Check that calls to /sync use the expected since token 2023-07-19 14:56:44 +02:00
Till Faelligen
46d56b8433
Add test to check that the since token is only stored in the database periodically
2023-07-19 12:17:47 +02:00
Till Faelligen
f6f1106fc4
Update test to include ToDevice messages 2023-07-18 14:37:33 +02:00
David Robertson
e5eb4f12ba
Plumb a ctx through to sync2
Thank God for Goland's refactoring tools.

This will (untested) associate sentry events from the sync2 part of the
code with User IDs and Device IDs, without having to constantly invoke
sentry.WithScope(). (Not all of the handler methods currently have that
information.) It also leaves the door open for us to include more data
on poller sentry reports (e.g. access token hash, time of last token
activity on the sync3 side, ...)
2023-05-25 22:22:15 +01:00
David Robertson
b428ede1ca
Update txns table 2023-05-02 18:16:14 +01:00
David Robertson
c1b1de5456
Delete tokens on expiry, to force /whoami lookup 2023-04-28 18:50:43 +01:00
David Robertson
181cfba19e
Introduce PollerID 2023-04-28 17:05:46 +01:00
David Robertson
5621423295
Fix tests 2023-04-18 15:16:42 +01:00
David Robertson
846197e996
Have WhoAmI extract the device_id
Useful for #51, small enough to include in isolation
2023-04-11 22:14:15 +01:00
Kegan Dougal
a6c3f8f3fc When a device is deleted, remove all device data with it (to-device events, device lists) 2023-03-01 16:56:04 +00:00
Kegan Dougal
6bdef5feba bugfix: expire connections when the access token gets invalidated
With regression test. The behaviour is:
 - Delete the connection, such that incoming requests will end up with M_UNKNOWN_POS
 - The next request will then return HTTP 401.

This has knock-on effects:
 - We no longer send HTTP 502 if /whoami returns 401, instead we return 401.
 - When the token is expired (pollers get 401), the device is deleted from the DB.
2023-03-01 16:40:15 +00:00
Kegan Dougal
48f28f9f6c perf: filter out all rooms when doing an initial sync on 2nd+ pollers
Fixes #17 in theory, as now the initial sync request will have no
rooms and hence be faster to return. In theory. Maybe. Let's see.
2023-01-05 18:25:25 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to change in sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise other users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts, which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point, e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, and we must not hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderately sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
Kegan Dougal
d77e21138d refactor: remove spurious code; rename OnRetireInvite to OnLeftRoom
Add HasLeft to the user room metadata to control whether the
list algo will nuke the room from the list.
2022-08-31 14:48:14 +01:00
Kegan Dougal
5dc1c38764 Add prev_batch column to events table
This will be used to return prev batch tokens to the client
on a best-effort basis.
2022-03-31 14:29:26 +01:00
Kegan Dougal
873edd7315 bugfix: rework how invites are handled
Fixes https://github.com/matrix-org/sliding-sync/issues/23

- Added InvitesTable
- Allow invites to be sorted/searched the same as any other room by
  implementing RoomMetadata for the invite (though this is best effort
  as we don't have heroes)
2022-03-29 09:44:18 +01:00
Kegan Dougal
2920191a44 feature: add txnids to events
Clients rely on transaction IDs coming down their /sync streams so they
can pair up an incoming event with an event they just sent but have not
yet got the event ID for.

The proxy has not historically handled this because of the shared work
model of operation, where we store exactly 1 copy of the event in the
database and no more. This means if Alice and Bob are running in the
same proxy, then Alice sends a message, Bob's /sync stream may get the
event first and that will NOT contain the `transaction_id`. This then
gets written into the database. Later when Alice /syncs, she will not
get the `transaction_id` for the event she sent.

This commit fixes this by having a TTL cache which maps (user, event)
-> txn_id. Transaction IDs are inherently ephemeral, so keeping the
last 5 minutes worth of txn IDs in-memory is an easy solution which
will be good enough for the proxy. Actual server implementations of
sliding sync will be able to trivially deal with this behaviour natively.
2022-03-28 15:19:42 +01:00
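A TTL cache of the kind described above, mapping (user, event) to txn_id, might look like the following sketch. `TxnCache`, `Store`, `Lookup`, `fakeClock` and `defaultTTL` are hypothetical names, and lazy eviction on lookup is an assumption made for brevity; the proxy's actual cache may be built differently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// defaultTTL mirrors the "last 5 minutes" figure from the commit message.
const defaultTTL = 5 * time.Minute

type txnKey struct{ userID, eventID string }

type txnEntry struct {
	txnID   string
	expires time.Time
}

// TxnCache maps (user, event) to a transaction ID for a limited time.
type TxnCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injected so tests can control time
	entries map[txnKey]txnEntry
}

func NewTxnCache(ttl time.Duration, now func() time.Time) *TxnCache {
	return &TxnCache{ttl: ttl, now: now, entries: make(map[txnKey]txnEntry)}
}

// Store remembers txnID for this user and event until the TTL elapses.
func (c *TxnCache) Store(userID, eventID, txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[txnKey{userID, eventID}] = txnEntry{txnID: txnID, expires: c.now().Add(c.ttl)}
}

// Lookup returns the txn ID if one is cached and has not expired.
func (c *TxnCache) Lookup(userID, eventID string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := txnKey{userID, eventID}
	e, ok := c.entries[k]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expires) {
		delete(c.entries, k) // lazily evict expired entries
		return "", false
	}
	return e.txnID, true
}

// fakeClock is a manually advanced clock for the demonstration.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clock := &fakeClock{t: time.Date(2022, 3, 28, 15, 0, 0, 0, time.UTC)}
	cache := NewTxnCache(defaultTTL, clock.Now)
	cache.Store("@alice:example.org", "$event", "txn1")

	txn, ok := cache.Lookup("@alice:example.org", "$event")
	fmt.Println(txn, ok) // txn1 true
	_, ok = cache.Lookup("@bob:example.org", "$event")
	fmt.Println(ok) // false: Bob never sent this event
	clock.Advance(defaultTTL)
	_, ok = cache.Lookup("@alice:example.org", "$event")
	fmt.Println(ok) // false: the entry has expired
}
```

Keying on the user as well as the event is what stops Bob's stream from ever picking up Alice's transaction ID.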
Kegan Dougal
3e36037844 bugfix: ensure we have done an initial sync before returning from EnsurePolling
- Modify the API to instead have `WaitUntilInitialSync()` which is backed by a `WaitGroup`.
- Call this new function when a poller exists and hasn't been terminated. Previously,
  we would assume that if a poller exists then it has done an initial sync, which may
  not always be true. This could lead to position mismatches as a connection would be
  re-created after EnsurePolling returned.
2022-03-18 12:31:31 +00:00
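A `WaitGroup`-backed `WaitUntilInitialSync()` can be sketched as follows. Only the method name comes from the commit message; `poller`, `newPoller`, `Poll`, `waitWithTimeout`, `demoTimeout` and `errInvalidToken` are hypothetical, and the use of `sync.Once` to release waiters exactly once (including when the first sync fails) is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const demoTimeout = time.Second

var errInvalidToken = errors.New("invalid access token")

type poller struct {
	wg   sync.WaitGroup
	once sync.Once
}

func newPoller() *poller {
	p := &poller{}
	p.wg.Add(1) // released when the first sync attempt finishes
	return p
}

// Poll runs one sync attempt. Waiters are released after the first attempt
// whether it succeeded or not, so nobody blocks forever on a bad token.
func (p *poller) Poll(doSync func() error) error {
	defer p.once.Do(p.wg.Done)
	return doSync()
}

// WaitUntilInitialSync blocks until the first sync attempt has finished.
func (p *poller) WaitUntilInitialSync() { p.wg.Wait() }

// waitWithTimeout reports whether WaitUntilInitialSync returned within d.
func waitWithTimeout(p *poller, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		p.WaitUntilInitialSync()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	p := newPoller()
	fmt.Println(waitWithTimeout(p, 10*time.Millisecond)) // false: no sync attempted yet
	go p.Poll(func() error { return errInvalidToken })
	fmt.Println(waitWithTimeout(p, demoTimeout)) // true: released even though the sync failed
}
```

The existence of a poller says nothing about whether it has synced, which is why callers wait on the group instead of checking that the poller exists.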
Kegan Dougal
24be8252f7 Change the retry schedule for the v2 poller to always be 3s
Comments explain why.
2021-12-15 09:56:58 +00:00
Kegan Dougal
0e021eb560 Pass to-device messages through to the client
- Treat to-device messages as opaque JSON blobs
- Add basic integration test to ensure the messages make it from v2 to v3.
2021-12-14 11:51:47 +00:00
Kegan Dougal
a2d6774024 Support filters.is_dm
- Add `AccountDataTable` with tests.
- Read global and per-room account data from sync v2 and add new callbacks to the poller.
- Update the `SyncV3Handler` to persist account data from sync v2 then notify the user cache.
- Update the `UserCache` to update `UserRoomData.IsDM` status on `m.direct` events.
- Read `m.direct` event from the DB when `UserCache` is created to track DM status per-room.
2021-11-09 15:08:08 +00:00
Kegan Dougal
6c12077f62 Ensure the first sync is snappy if there is no traffic 2021-10-29 13:15:39 +01:00
Kegan Dougal
9f3364d9ed PollerMap: ensure callbacks are always called from a single goroutine
Document a nasty race condition which can happen if >1 user is joined
to the same room. Fixed to ensure that `GlobalCache` will always stay
in-sync with the database without having to hit the database.
2021-10-28 16:15:17 +01:00
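One way to guarantee that callbacks always run on a single goroutine is to funnel them through a channel that only one goroutine reads. The sketch below illustrates that idea and is not taken from the repository: `callbackQueue`, `Do`, `Close` and `runDemo` are hypothetical names. Because every callback runs on the same goroutine, the shared map needs no mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// callbackQueue executes submitted functions one at a time on one goroutine.
type callbackQueue struct {
	calls chan func()
	done  chan struct{}
}

func newCallbackQueue() *callbackQueue {
	q := &callbackQueue{calls: make(chan func()), done: make(chan struct{})}
	go func() {
		defer close(q.done)
		for fn := range q.calls {
			fn()
		}
	}()
	return q
}

// Do queues fn and blocks until the executor goroutine has run it.
func (q *callbackQueue) Do(fn func()) {
	finished := make(chan struct{})
	q.calls <- func() {
		defer close(finished)
		fn()
	}
	<-finished
}

// Close stops the executor after all queued callbacks have run.
func (q *callbackQueue) Close() {
	close(q.calls)
	<-q.done
}

// runDemo starts several poller goroutines that all report events for the
// same room, and returns the number of events counted.
func runDemo(pollers, eventsEach int) int {
	q := newCallbackQueue()
	eventCount := map[string]int{} // deliberately not protected by a mutex
	var wg sync.WaitGroup
	for i := 0; i < pollers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < eventsEach; j++ {
				q.Do(func() { eventCount["!room:example.org"]++ })
			}
		}()
	}
	wg.Wait()
	q.Close()
	return eventCount["!room:example.org"]
}

func main() {
	fmt.Println(runDemo(8, 100)) // 800
}
```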
Kegan Dougal
fb9394d73b Add UnreadTable to track per-user per-room unread counters
With tests. Add function to V2DataReceiver interface.
2021-10-08 12:31:56 +01:00