41 Commits

Author SHA1 Message Date
Kegan Dougal
8ccb0185b3 Include mutexes in spans 2024-02-26 13:01:18 +00:00
David Robertson
f3037861a7 Cancel outstanding requests when destroying conns 2023-10-26 15:58:06 +01:00
David Robertson
fe74488a58 Clear queues on receipt of txn payload 2023-07-26 13:47:33 +01:00
Kegan Dougal
6c83b3a75b Adjust spam intervals 2023-07-24 18:33:20 +01:00
David Robertson
cc8e6d9fb0 Track the time before processing a request
in particular load() and setupConnection()
2023-06-22 17:40:22 +01:00
David Robertson
9037dc06fb Don't report errors in the sync3 handler twice 2023-05-25 21:18:36 +01:00
David Robertson
2cc84501ca Remove dead code
err is only set if panicErr != nil; so this branch is never hit.
2023-05-25 20:17:28 +01:00
Kegan Dougal
afaea53064 feat: add rate limiting
The server will wait 1s if clients:
 - repeat the same request (same `?pos=`)
 - repeatedly hit `/sync` without a `?pos=`.

Both of these failure modes have been seen in the wild.
Fixes #93.
2023-05-22 17:44:04 +01:00
kegsay
285e5263c1 Update sync3/conn.go
Co-authored-by: David Robertson <davidr@element.io>
2023-05-10 17:54:23 +01:00
Kegan Dougal
1d48ebea2f Add conn_id as per the MSC
Also fix a bug whereby required_state would not cause new state
to be sent to clients if it was updated as part of a room subscription.
2023-05-10 17:31:07 +01:00
David Robertson
ca8a2d72c4 Make ConnID hold a UserID 2023-04-28 18:50:42 +01:00
David Robertson
646232dcb0 Explanatory comments 2023-04-05 17:54:25 +01:00
David Robertson
32d482edd3 Maybe this fixes the segfault? 2023-04-05 17:14:47 +01:00
David Robertson
c8add54e59 Fix segfault when trying to report a panic 2023-04-05 15:51:54 +01:00
David Robertson
c208c6cc60 Report non-panic, internal errors 2023-04-05 15:04:17 +01:00
David Robertson
faef68bc6f Don't use Sentry's middleware 2023-04-04 17:36:21 +01:00
Kegan Dougal
c2a3c53542 tracing: do runtime/trace and OTLP at the same time 2023-02-20 14:57:49 +00:00
Kegan Dougal
ff212bac48 bugfix: fix data races in UserCache listeners and cancelOutstandingReq
The user cache listeners slice is written to by HTTP goroutines
when clients make requests, and is read by the callbacks from v2
pollers. This slice wasn't protected from bad reads, only writes
were protected. Expanded the mutex to be RW to handle this.

cancelOutstandingReq is the context cancellation function to terminate
previous requests when a new request arrives. Whilst the request itself
is held in a mutex, invoking this cancellation function was not held
by anything. Added an extra mutex for this.
2023-02-02 11:43:18 +00:00
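The listeners half of the fix amounts to guarding the slice with a sync.RWMutex: poller callbacks take the read lock while HTTP goroutines take the write lock. A minimal sketch with invented names:

```go
package main

import "sync"

// userCache sketches the locking fix described above: listeners are
// appended by HTTP goroutines (write lock) and iterated by v2 poller
// callbacks (read lock), so both paths go through the same RWMutex.
type userCache struct {
	mu        sync.RWMutex
	listeners []func(update string)
}

func (c *userCache) subscribe(fn func(string)) {
	c.mu.Lock() // writer: mutating the slice
	defer c.mu.Unlock()
	c.listeners = append(c.listeners, fn)
}

func (c *userCache) onNewEvent(update string) {
	c.mu.RLock() // reader: previously unprotected, racing with subscribe
	defer c.mu.RUnlock()
	for _, fn := range c.listeners {
		fn(update)
	}
}
```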
Kegan Dougal
a09bd2b854 Add more useful trace logs 2023-02-01 12:13:42 +00:00
Kegan Dougal
d1eaecc46c Add more tracing tasks and logs 2023-01-18 11:41:20 +00:00
Kegan Dougal
6c4f7d3722 improvement: completely refactor device data updates
- `Conn`s now expose a direct `OnUpdate(caches.Update)` function
  for updates which concern a specific device ID.
- Add a bitset in `DeviceData` to indicate if the OTK or fallback keys were changed.
- Pass through the affected `DeviceID` in `pubsub.V2DeviceData` updates.
- Remove `DeviceDataTable.SelectFrom` as it was unused.
- Refactor how the poller invokes `OnE2EEData`: it now only does this if
  there are changes to OTK counts and/or fallback key types and/or device lists,
  and _only_ sends those fields, setting the rest to the zero value.
- Remove noisy logging.
- Add `caches.DeviceDataUpdate` which has no data but serves to wake up the long poller.
- Only send OTK counts / fallback key types when they have changed, not constantly. This
  matches the behaviour described in MSC3884.

The entire flow now looks like:
- Poller notices a diff against in-memory version of otk count and invokes `OnE2EEData`
- Handler updates device data table, bumps the changed bit for otk count.
- Other handler gets the pubsub update, directly finds the `Conn` based on the `DeviceID`.
  Invokes `OnUpdate(caches.DeviceDataUpdate)`
- This update is handled by the E2EE extension which then pulls the data out from the database
  and returns it.
- On initial connections, all OTK / fallback data is returned.
2022-12-22 15:08:42 +00:00
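The "changed bit" bookkeeping can be pictured as a small bitset on the device data row. The field and constant names below are illustrative, not the proxy's actual API:

```go
package main

// Illustrative changed-bit flags for a device data row: the handler
// sets a bit when it updates the table, and the E2EE extension clears
// it after sending the corresponding field to the client.
const (
	changedOTKCounts = 1 << iota
	changedFallbackKeys
)

type deviceData struct {
	changedBits int
}

func (d *deviceData) setChanged(bit int)      { d.changedBits |= bit }
func (d *deviceData) hasChanged(bit int) bool { return d.changedBits&bit != 0 }
func (d *deviceData) clearChanged(bit int)    { d.changedBits &^= bit }
```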
Kegan Dougal
aa28df161c Rename package -> github.com/matrix-org/sliding-sync 2022-12-15 11:08:50 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to change in sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts which just adds extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point, e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time; we need to not just hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the amount of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderate sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
Kegan Dougal
a37aee4c2b Improve logging; remove useless fields 2022-08-16 14:23:05 +01:00
Kegan Dougal
f2cd4034c7 bugfix: don't delete the acking response the moment it is ACKed
Else if the client retries that request (because the new response is lost)
then we will HTTP 400 them with an unknown pos.
2022-08-05 12:43:22 +01:00
Kegan Dougal
7a049ec3a3 Adjust the timeout value when we are forced to process requests with buffered responses
The problem is that there is NOT a 1:1 relationship between request/response,
due to cancellations needing to be processed (else state diverges between client/server).
Whilst we were buffering responses and returning them eagerly if the request data did
not change, we were processing new requests if the request data DID change. This puts us
in an awkward position. We have >1 response waiting to send to the client, but we
cannot just _ignore_ their new request else we'll just drop it to the floor, so we're
forced to process it and _then_ return the buffered response. This is great so long as
the request processing doesn't take long: which it will if we are waiting for live updates.
To get around this, when we detect this scenario, we artificially reduce the timeout value
to ensure request processing is fast.

If we just use websockets this problem goes away...
2022-08-04 12:06:22 +01:00
Kegan Dougal
ccbe1a81db Add response buffering to Conn
With unit/integration tests
2022-08-03 17:14:31 +01:00
Kegan Dougal
0d3157d610 Add support for txn_id in request/response
Missing buffering
2022-08-03 15:33:56 +01:00
Kegan Dougal
fbac9dde32 connstatelive: remove isSubscribedToRoom and getDeltaRoomData and simplify code paths 2022-05-31 12:32:11 +01:00
Kegan Dougal
2e571dd417 bugfix: cancel previous requests before acquiring conn locks
We used to rely on the HTTP conn being cancelled for this behaviour.
When the sliding sync proxy is used behind a reverse proxy there is
no guarantee that the upstream conn will be cancelled, causing very
laggy and poor performance. We now manually cancel() the previous
request.
2022-04-13 12:43:08 +01:00
Kegan Dougal
ebb9919614 Add trace logging 2022-04-12 12:27:20 +01:00
Kegan Dougal
24f70c9a8d bugfix: don't tightloop requests on panic/errors 2022-03-29 10:37:30 +01:00
Kegan Dougal
b0b65667f3 Add isInitial flag to OnIncomingRequest
Pass it to extensions for them to determine if they want to short-circuit
the sync loop. E2EE wants to short-circuit OTK counts on the first request,
as they aren't enough to short-circuit mid-connection.
2022-03-18 18:32:56 +00:00
Kegan Dougal
b208a2e2b3 Add room name filtering; Remove session IDs entirely
Should fix #19
2022-02-18 16:49:26 +00:00
Kegan Dougal
8f27160a88 Make pos a string and not an int 2022-01-04 15:32:50 +00:00
Kegan Dougal
d12863b9fa Remove connections when the buffer overflows
Else we block for 5s for each event resulting in a backlog of events.
2021-12-01 12:22:56 +00:00
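The rule described here is that pushing an update to a connection must never block: if the buffered channel is full, the connection is removed rather than stalling the fan-out. A sketch with invented names:

```go
package main

// conn sketches the overflow handling above: if the buffered updates
// channel is full, mark the connection destroyed and drop the update
// instead of blocking the event fan-out behind one slow client.
type conn struct {
	updates   chan string
	destroyed bool
}

func (c *conn) push(update string) bool {
	select {
	case c.updates <- update:
		return true
	default: // buffer full: sacrifice this conn, not the whole pipeline
		c.destroyed = true
		return false
	}
}
```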
Kegan Dougal
11b1260d07 Split sync3 into sync3 and sync3/handler
`sync3` contains data structures and logic which is very isolated and
testable (think ConnMap, Room, Request, SortableRooms, etc) whereas
`sync3/handler` contains control flow which calls into `sync3` data
structures.

This has numerous benefits:
 - Gnarly complicated structs like `ConnState` are now more isolated
   from the codebase, forcing better API design on `sync3` structs.
 - The inability to do import cycles forces structs in `sync3` to remain
   simple: they cannot pull in control flow logic from `sync3/handler`
   without causing a compile error.
 - It's significantly easier to figure out where to start looking for
   code that executes when a new request is received, for new developers.
 - It simplifies the number of things that `ConnState` can touch. Previously
   we were gut wrenching out of convenience but now we're forced to move
   more logic from `ConnState` into `sync3` (depending on the API design).
   For example, adding `SortableRooms.RoomIDs()`.
2021-11-05 15:45:04 +00:00
Kegan Dougal
488c638e7b Streamline how new events are pushed to ConnState
Let ConnState directly subscribe to GlobalCache rather than
the awful indirection of ConnMap -> Conn -> ConnState we had before.
We had that before because ConnMap is responsible for destroying old
connections (based on the TTL cache), so we could just subscribe once
and then look through the map to see who to notify. In the interests
of decoupling logic, we now just call ConnState.Destroy() when the
connection is removed from ConnMap which allows ConnState to subscribe
to GlobalCache on creation and remove its subscription on Destroy().

This makes it significantly clearer who and where callbacks are firing
from and to, and now means ConnMap is simply in charge of maintaining
maps of user IDs -> Conn as well as terminating them when they expire
via TTL.
2021-10-22 17:21:47 +01:00
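The subscribe-on-create / unsubscribe-on-Destroy lifecycle can be sketched as follows (names and shapes are simplified stand-ins for GlobalCache, ConnState and ConnMap):

```go
package main

// globalCache and connState sketch the decoupling described above:
// connState subscribes on creation and drops its subscription in
// Destroy(), which the conn map calls when the TTL cache expires it.
type globalCache struct {
	subscribers map[int]func(ev string)
	nextID      int
}

func newGlobalCache() *globalCache {
	return &globalCache{subscribers: map[int]func(string){}}
}

func (g *globalCache) subscribe(fn func(string)) int {
	id := g.nextID
	g.nextID++
	g.subscribers[id] = fn
	return id
}

func (g *globalCache) unsubscribe(id int) { delete(g.subscribers, id) }

type connState struct {
	cache *globalCache
	subID int
}

func newConnState(g *globalCache) *connState {
	s := &connState{cache: g}
	s.subID = g.subscribe(func(ev string) { /* buffer ev for the conn */ })
	return s
}

func (s *connState) Destroy() { s.cache.unsubscribe(s.subID) }
```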
Kegan Dougal
48613956d1 Add UserCache and move unread count tracking to it
Keep it pure (not dependent on `state.Storage`) to make testing
easier. The responsibility for fanning out user cache updates
is with the Handler as it generally deals with glue code.
2021-10-11 16:22:41 +01:00
Kegan Dougal
5f19fccd07 Implement unread counts on the client 2021-10-08 14:15:36 +01:00
Kegan Dougal
e20a8ad067 Move synclive to sync3 2021-10-05 16:22:02 +01:00