273 Commits

Author SHA1 Message Date
David Robertson
5621423295
Fix tests 2023-04-18 15:16:42 +01:00
David Robertson
2bae78476d
Propagate prepend events to poller, and inject 2023-04-18 14:56:25 +01:00
David Robertson
666823d211
Introduce return struct for Initialise 2023-04-17 20:05:32 +01:00
David Robertson
d70818ca8f
Track the size of returned poller timelines 2023-04-17 14:06:35 +01:00
David Robertson
2bdb88fffe
Update test expectations 2023-04-14 18:05:21 +01:00
David Robertson
aadc358581
Request timeline limit of 50 instead of HS default 2023-04-14 17:57:04 +01:00
David Robertson
601e3fce49
More sentry logging 2023-04-13 15:02:46 +01:00
David Robertson
80e46c234d
Merge pull request #61 from matrix-org/dmr/sentry-2
Send error logs to Sentry
2023-04-12 20:37:53 +01:00
David Robertson
846197e996
Have WhoAmI extract the device_id
Useful for #51, small enough to include in isolation
2023-04-11 22:14:15 +01:00
David Robertson
1f3f14f30c
Report errors to Sentry, plumbing ctxs if needed 2023-04-05 18:24:01 +01:00
Kegan Dougal
a6c3f8f3fc When a device is deleted, remove all device data with it (to-device events, device lists) 2023-03-01 16:56:04 +00:00
Kegan Dougal
6bdef5feba bugfix: expire connections when the access token gets invalidated
With regression test. The behaviour is:
 - Delete the connection, such that incoming requests will end up with M_UNKNOWN_POS
 - The next request will then return HTTP 401.

This has knock-on effects:
 - We no longer send HTTP 502 if /whoami returns 401, instead we return 401.
 - When the token is expired (pollers get 401), the device is deleted from the DB.
2023-03-01 16:40:15 +00:00
Kegan Dougal
7fa433f732 bugfix: fix a bug with list ops when sorting with unread counts; fix a bug which could cause typing/receipts to not be live streamed
Previously, we would not send unread count INCREASES to the client,
as we would expect the actual event update to wake up the client conn.
This was great because it meant the event+unread count arrived atomically
on the client. This was implemented as "parse unread counts first, then events".

However, this introduced a bug when there were >1 user in the same room. In this
scenario, one poller may get the event first, which would go through to the client.
The subsequent unread count update would then be dropped and not sent to the client.
This would just be an unfortunate UI bug if it weren't for sorting by_notification_count
and sorting by_notification_level. Both of these sort operations use the unread counts
to determine room list ordering. This list would be updated on the server, but no
list operation would be sent to the client, causing the room lists to de-sync, and
resulting in incorrect DELETE/INSERT ops. This would manifest as duplicate rooms
on the room list.

In the process of fixing this, also fix a bug where typing notifications would not
always be sent to the client - it would only do so when piggybacked due to incorrect
type switches.

Also fix another bug which prevented receipts from always being sent to the client.
This was caused by the extensions handler not checking if the receipt extension had
data to determine if it should return. This then interacted with an as-yet unfixed bug
which cleared the extension on subsequent updates, causing the receipt to be lost entirely.
A fix for this will be inbound soon.
2023-02-07 13:34:26 +00:00
Kegan Dougal
fc7f2a183b bugfix: fix data race in poller termination code
Just swap to using atomic.Bool as it's easier. Remove the
unnecessary channel.
2023-02-02 12:04:22 +00:00
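The switch to `atomic.Bool` described in fc7f2a183b can be illustrated with a minimal sketch. The `poller` type and its method names here are hypothetical stand-ins, not the proxy's actual code; the point is only that a single atomic flag can replace a termination channel without racing.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// poller is a hypothetical stand-in for the proxy's v2 poller.
type poller struct {
	terminated atomic.Bool // safe to read and write from any goroutine
}

// Terminate marks the poller as stopped. It reports whether this call
// performed the transition, so calling it more than once is harmless.
func (p *poller) Terminate() bool {
	return p.terminated.CompareAndSwap(false, true)
}

// Terminated reports whether Terminate has been called.
func (p *poller) Terminated() bool {
	return p.terminated.Load()
}

func main() {
	p := &poller{}
	var wg sync.WaitGroup
	var wins atomic.Int32

	// Several goroutines race to terminate; exactly one should win.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if p.Terminate() {
				wins.Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(p.Terminated(), wins.Load()) // true 1
}
```

`atomic.Bool` requires Go 1.19 or later.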
Kegan Dougal
7eb139191f Suppress duplicate typing events from waking up connections 2023-02-01 11:51:06 +00:00
Kegan Dougal
22abc32ca6 perf/bugfix: refactor event transaction_id handling
- Scope transaction IDs to the device ID (access token) rather
  than the user ID, as this is more accurate with the spec.
- Batch up all transaction ID lookups for all rooms being returned
  into a single query. Previously, we would sequentially call SELECT
  n times, one per room being returned, which was taking lots of time
  just due to RTTs to the database server (often this table is empty).
2023-01-16 11:55:37 +00:00
Kegan Dougal
0da350dd1a Move e2e/txn interfaces closer to where they are used; rather than sync2 where they were used previously 2023-01-16 10:53:48 +00:00
Kegan Dougal
95a5af3abe perf: immediately send to-device messages to listening conns 2023-01-09 11:53:17 +00:00
Kegan Dougal
48f28f9f6c perf: filter out all rooms when doing an initial sync on 2nd+ pollers
Fixes #17 in theory, as now the initial sync request will have no
rooms and hence be faster to return. In theory. Maybe. Let's see.
2023-01-05 18:25:25 +00:00
Kegan Dougal
6c4f7d3722 improvement: completely refactor device data updates
- `Conn`s now expose a direct `OnUpdate(caches.Update)` function
  for updates which concern a specific device ID.
- Add a bitset in `DeviceData` to indicate if the OTK or fallback keys were changed.
- Pass through the affected `DeviceID` in `pubsub.V2DeviceData` updates.
- Remove `DeviceDataTable.SelectFrom` as it was unused.
- Refactor how the poller invokes `OnE2EEData`: it now only does this if
  there are changes to OTK counts and/or fallback key types and/or device lists,
  and _only_ sends those fields, setting the rest to the zero value.
- Remove noisy logging.
- Add `caches.DeviceDataUpdate` which has no data but serves to wake-up the long poller.
- Only send OTK counts / fallback key types when they have changed, not constantly. This
  matches the behaviour described in MSC3884.

The entire flow now looks like:
- Poller notices a diff against in-memory version of otk count and invokes `OnE2EEData`
- Handler updates device data table, bumps the changed bit for otk count.
- Other handler gets the pubsub update, directly finds the `Conn` based on the `DeviceID`.
  Invokes `OnUpdate(caches.DeviceDataUpdate)`
- This update is handled by the E2EE extension which then pulls the data out from the database
  and returns it.
- On initial connections, all OTK / fallback data is returned.
2022-12-22 15:08:42 +00:00
Kegan Dougal
aa28df161c Rename package -> github.com/matrix-org/sliding-sync 2022-12-15 11:08:50 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to change in sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise other users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts, which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point, e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the amount of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderate sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
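The range delta case mentioned in be8543a21a (`[0,20] -> [20,30]`) can be worked through with a small sketch. This is a simplified illustration for a single pair of inclusive ranges, not the proxy's algorithm, which handles multiple ranges; the `span`, `delta` and `subtract` names are hypothetical.

```go
package main

import "fmt"

// span is an inclusive index range [start, end].
type span [2]int64

// subtract returns the parts of a that are not covered by b.
func subtract(a, b span) []span {
	if b[1] < a[0] || b[0] > a[1] {
		return []span{a} // no overlap: all of a survives
	}
	var out []span
	if b[0] > a[0] {
		out = append(out, span{a[0], b[0] - 1}) // piece to the left of b
	}
	if b[1] < a[1] {
		out = append(out, span{b[1] + 1, a[1]}) // piece to the right of b
	}
	return out
}

// delta reports which indexes are newly covered (added) and which are
// no longer covered (removed) when a window moves from prev to next.
func delta(prev, next span) (added, removed []span) {
	return subtract(next, prev), subtract(prev, next)
}

func main() {
	added, removed := delta(span{0, 20}, span{20, 30})
	fmt.Println(added, removed) // [[21 30]] [[0 19]]
}
```

Index 20 sits in both ranges, so it is neither added nor removed.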
Kegan Dougal
b90a18a62a Fix #45: ensure we don't send null when we mean [] 2022-09-20 13:19:28 +01:00
Kegan Dougal
d77e21138d refactor: remove spurious code; rename OnRetireInvite to OnLeftRoom
Add HasLeft to the user room metadata to control whether or not the
list algo will nuke the room from the list.
2022-08-31 14:48:14 +01:00
Kegan Dougal
47ddc04652 E2EE extension: Add support for device_unused_fallback_key_types
With tests
2022-08-09 10:05:18 +01:00
Kegan Dougal
0a17d0a4e4 perf: don't alert connstate for every event in state
We don't care about that as they never form part of the timeline.
Also, only send up a timeline limit: 1 filter to sync v2 when there
is no ?since token. Otherwise, we want a timeline limit >1 so we
can ensure that we remain gapless (else the proxy drops events).
2022-07-21 16:47:13 +01:00
Kegan Dougal
b9196db30b BREAKING(db): refactor how history is calculated
- Completely ignore events in the `state` block when processing
  sync v3 requests with a large `timeline_limit`. We should never
  have been including them in the first place as they are not
  chronological at all.
- Perform sync v2 requests with a timeline limit of 1 to ensure
  we can always return a `prev_batch` token to the caller. This
  means on the first startup, clicking a room will force a `/messages`
  hit until there have been `$limit` new events, in which case it
  will be able to serve these events from the local DB. Critically,
  this ensures that we never send back an empty `prev_batch`, which
  causes clients to believe that there is no history in a room.
2022-07-21 16:20:59 +01:00
Kegan Dougal
976875ba7a Skip unreadable access tokens 2022-07-20 11:37:26 +01:00
Kegan Dougal
47b74a6be6 Automatically start v2 pollers on startup
We can do this now because we store the access token for each device.

Throttled at 16 concurrent sync requests to avoid causing
thundering herds on startup.
2022-07-14 10:48:45 +01:00
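The startup throttling in 47b74a6be6 can be sketched with a buffered channel used as a semaphore. This is an assumption about the general technique, not the commit's actual code; `runThrottled` and its returned peak count are hypothetical and exist only to make the bound observable.

```go
package main

import (
	"fmt"
	"sync"
)

// runThrottled calls start once per device ID, with at most limit calls
// in flight at any time. It returns the peak number of concurrent calls
// observed, which is useful for checking the bound.
func runThrottled(deviceIDs []string, limit int, start func(deviceID string)) int {
	sem := make(chan struct{}, limit)
	var (
		wg             sync.WaitGroup
		mu             sync.Mutex
		inFlight, peak int
	)
	for _, id := range deviceIDs {
		sem <- struct{}{} // blocks while limit calls are already running
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next device

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			start(id)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return peak
}

func main() {
	ids := make([]string, 100)
	for i := range ids {
		ids[i] = fmt.Sprintf("DEVICE_%d", i)
	}
	peak := runThrottled(ids, 16, func(deviceID string) {
		// A real poller would issue its first sync v2 request here.
	})
	fmt.Println(peak <= 16) // true
}
```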
Kegan Dougal
ed9e9ed48c Persist v2 access tokens in the database, encrypted
- Add `SYNCV3_SECRET` env var which is SHA256'd and used as an AES
  key to encrypt/decrypt tokens.
- Add column `v2_token_encrypted` to `syncv3_sync2_devices`
- Update unit tests to check encryption/decryption work.

This provides an extra layer of security in case the database is
compromised and real user access tokens are leaked. This forces
an attacker to obtain both the database table _and_ the secret
env var (which will typically be stored in secure storage, e.g.
k8s secrets). Unfortunately, we need to have the access_token
in plaintext, so we cannot rely on password-style storage algorithms
like bcrypt/scrypt, which would be safer.
2022-07-13 17:03:40 +01:00
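The scheme in ed9e9ed48c (SHA-256 the secret, use the digest as an AES key) can be sketched as below. The commit does not name the cipher mode or the storage encoding, so AES-GCM with a random nonce and hex encoding are assumptions of this sketch, as are the function names.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// newAEAD derives a 32-byte AES-256 key by hashing the secret with SHA-256.
// GCM is an assumption; the commit message does not state the mode.
func newAEAD(secret string) (cipher.AEAD, error) {
	key := sha256.Sum256([]byte(secret))
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encryptToken returns hex(nonce || ciphertext) for the given token.
func encryptToken(secret, token string) (string, error) {
	aead, err := newAEAD(secret)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	// Seal appends the ciphertext to the nonce so both are stored together.
	return hex.EncodeToString(aead.Seal(nonce, nonce, []byte(token), nil)), nil
}

// decryptToken reverses encryptToken and fails if the secret is wrong
// or the stored value was tampered with.
func decryptToken(secret, encoded string) (string, error) {
	aead, err := newAEAD(secret)
	if err != nil {
		return "", err
	}
	raw, err := hex.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	if len(raw) < aead.NonceSize() {
		return "", errors.New("ciphertext too short")
	}
	nonce, ciphertext := raw[:aead.NonceSize()], raw[aead.NonceSize():]
	plain, err := aead.Open(nil, nonce, ciphertext, nil)
	if err != nil {
		return "", err
	}
	return string(plain), nil
}

func main() {
	enc, err := encryptToken("my-secret", "syt_example_token")
	if err != nil {
		panic(err)
	}
	dec, err := decryptToken("my-secret", enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(dec == "syt_example_token") // true
}
```

Unlike bcrypt/scrypt, this is reversible by design: the proxy has to recover the original token to talk to the homeserver.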
Kegan Dougal
ebb9919614 Add trace logging 2022-04-12 12:27:20 +01:00
Kegan Dougal
5dc1c38764 Add prev_batch column to events table
This will be used to return prev batch tokens to the client
on a best-effort basis.
2022-03-31 14:29:26 +01:00
Kegan Dougal
873edd7315 bugfix: rework how invites are handled
Fixes https://github.com/matrix-org/sliding-sync/issues/23

- Added InvitesTable
- Allow invites to be sorted/searched the same as any other room by
  implementing RoomMetadata for the invite (though this is best effort
  as we don't have heroes)
2022-03-29 09:44:18 +01:00
Kegan Dougal
2920191a44 feature: add txnids to events
Clients rely on transaction IDs coming down their /sync streams so they
can pair up an incoming event with an event they just sent but have not
yet got the event ID for.

The proxy has not historically handled this because of the shared work
model of operation, where we store exactly 1 copy of the event in the
database and no more. This means if Alice and Bob are running in the
same proxy, then Alice sends a message, Bob's /sync stream may get the
event first and that will NOT contain the `transaction_id`. This then
gets written into the database. Later when Alice /syncs, she will not
get the `transaction_id` for her event which she sent.

This commit fixes this by having a TTL cache which maps (user, event)
-> txn_id. Transaction IDs are inherently ephemeral, so keeping the
last 5 minutes worth of txn IDs in-memory is an easy solution which
will be good enough for the proxy. Actual server implementations of
sliding sync will be able to trivially deal with this behaviour natively.
2022-03-28 15:19:42 +01:00
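The TTL cache described in 2920191a44 can be sketched roughly as follows. The type and method names are hypothetical and eviction here happens lazily on lookup, which is an assumption; the commit only states that a (user, event) -> txn_id mapping is kept in memory for about 5 minutes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// txnKey identifies one user's view of one event.
type txnKey struct{ userID, eventID string }

type txnEntry struct {
	txnID   string
	expires time.Time
}

// txnCache is a hypothetical in-memory TTL cache of (user, event) -> txn_id.
type txnCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[txnKey]txnEntry
}

func newTxnCache(ttl time.Duration) *txnCache {
	return &txnCache{ttl: ttl, entries: make(map[txnKey]txnEntry)}
}

// Store remembers the transaction ID the user sent the event with.
func (c *txnCache) Store(userID, eventID, txnID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[txnKey{userID, eventID}] = txnEntry{txnID, time.Now().Add(c.ttl)}
}

// Lookup returns the transaction ID if it exists and has not expired.
// Expired entries are removed when they are encountered.
func (c *txnCache) Lookup(userID, eventID string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := txnKey{userID, eventID}
	entry, ok := c.entries[key]
	if !ok {
		return "", false
	}
	if !time.Now().Before(entry.expires) {
		delete(c.entries, key)
		return "", false
	}
	return entry.txnID, true
}

func main() {
	cache := newTxnCache(5 * time.Minute)
	cache.Store("@alice:example.org", "$event1", "txn-abc")

	txnID, ok := cache.Lookup("@alice:example.org", "$event1")
	fmt.Println(txnID, ok) // txn-abc true

	// Bob never sent this event, so he must not see Alice's txn ID.
	_, ok = cache.Lookup("@bob:example.org", "$event1")
	fmt.Println(ok) // false
}
```

Keying on the user as well as the event is what stops one user's transaction ID from leaking into another user's stream.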
Kegan Dougal
5c666a8e50 use constants for alice/aliceToken in integration tests 2022-03-25 13:07:12 +00:00
Kegan Dougal
3e36037844 bugfix: ensure we have done an initial sync before returning from EnsurePolling
- Modify the API to instead have `WaitUntilInitialSync()` which is backed by a `WaitGroup`.
- Call this new function when a poller exists and hasn't been terminated. Previously,
  we would assume that if a poller exists then it has done an initial sync, which may
  not always be true. This could lead to position mismatches as a connection would be
  re-created after EnsurePolling returned.
2022-03-18 12:31:31 +00:00
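The `WaitGroup`-backed `WaitUntilInitialSync()` from 3e36037844 can be sketched as below. Only the `WaitUntilInitialSync` name comes from the commit; the `poller` type, `newPoller`, `markInitialSyncDone` and the use of `sync.Once` are assumptions made so that the sketch is safe to signal more than once.

```go
package main

import (
	"fmt"
	"sync"
)

// poller is a hypothetical stand-in for a v2 poller.
type poller struct {
	initialSync sync.WaitGroup
	done        sync.Once
}

// newPoller returns a poller whose initial sync is still outstanding.
func newPoller() *poller {
	p := &poller{}
	p.initialSync.Add(1)
	return p
}

// markInitialSyncDone releases all waiters. It must be called on success
// and on failure (for example an invalid token), otherwise callers of
// WaitUntilInitialSync would block forever. Repeat calls are no-ops.
func (p *poller) markInitialSyncDone() {
	p.done.Do(p.initialSync.Done)
}

// WaitUntilInitialSync blocks until the first sync attempt has finished.
func (p *poller) WaitUntilInitialSync() {
	p.initialSync.Wait()
}

func main() {
	p := newPoller()
	go func() {
		// A real poller would perform its first sync v2 request here.
		p.markInitialSyncDone()
	}()
	p.WaitUntilInitialSync()
	fmt.Println("initial sync finished")
}
```

Callers that find an existing poller can then wait on it instead of assuming it has already synced.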
Kegan Dougal
41e73206c6 Log to_device counts 2022-03-17 15:33:56 +00:00
Kegan Dougal
b71a2b7769 Use GMSL timestamps 2022-02-21 20:35:17 +00:00
Kegan Dougal
e680a3c66d Include invited rooms in the room list
With a very basic test to make sure it appears.
2022-02-21 20:31:54 +00:00
Kegan Dougal
d4eee49f63 Implement E2EE extension, with tests.
- Persist OTK counts and device list changes in-memory per Poller.
- Expose a new `E2EEFetcher` to allow the E2EE extension code to
  grab said E2EE data from the Poller.
- OTK counts are replaced outright.
- Device lists are updated in a user_id->changed|left map which is then
  deleted when read.
- Add tests for basic functionality and some edge cases like ensuring that
  v3 request retries still return changed|left values.
2021-12-16 18:12:09 +00:00
Kegan Dougal
24be8252f7 Change the retry schedule for the v2 poller to always be 3s
Comments explain why.
2021-12-15 09:56:58 +00:00
Kegan Dougal
0e021eb560 Pass to-device messages through to the client
- Treat to-device messages as opaque JSON blobs
- Add basic integration test to ensure the messages make it from v2 to v3.
2021-12-14 11:51:47 +00:00
Kegan Dougal
344cd5dbc1 Don't wait for a full sync if the user has synced before
This means we can serve rooms/events from the v3 database
immediately if they exist. The downside is that we still do
need to hit v2 to pull in to-device messages, but they can
come in later.
2021-12-10 14:40:30 +00:00
Kegan Dougal
c6de2270ed bugfix: is_dm filter was ignored for new live events
Caused by not loading the `UserRoomData` when applying the filter.
Added regression test.
2021-11-11 12:39:19 +00:00
Kegan Dougal
a2d6774024 Support filters.is_dm
- Add `AccountDataTable` with tests.
- Read global and per-room account data from sync v2 and add new callbacks to the poller.
- Update the `SyncV3Handler` to persist account data from sync v2 then notify the user cache.
- Update the `UserCache` to update `UserRoomData.IsDM` status on `m.direct` events.
- Read `m.direct` event from the DB when `UserCache` is created to track DM status per-room.
2021-11-09 15:08:08 +00:00
Kegan Dougal
6e55f7f608 tests: make a consistent test env for both local and CI runs
- Only have a single database for all tests, like CI.
- Calling `PrepareDBConnectionString` drops all tables before returning
  the string.
- Tests must be run with no concurrency else they will step on each other
  due to the previous point.

This should prevent cases where local tests pass but CI fails.
2021-11-09 10:15:48 +00:00
Kegan Dougal
7ca81ef68a bugfix: ensure notification counts don't get reset on new events
With regression tests
2021-11-03 11:07:01 +00:00
Kegan Dougal
6c12077f62 Ensure the first sync is snappy if there is no traffic 2021-10-29 13:15:39 +01:00
Kegan Dougal
b3e9f0b32e wait for v2 responses to be processed before returning 2021-10-28 16:25:24 +01:00
Kegan Dougal
9f3364d9ed PollerMap: ensure callbacks are always called from a single goroutine
Document a nasty race condition which can happen if >1 user is joined
to the same room. Fixed to ensure that `GlobalCache` will always stay
in-sync with the database without having to hit the database.
2021-10-28 16:15:17 +01:00