59 Commits

Author SHA1 Message Date
Kegan Dougal
5dae70069b Clean the syncv3_snapshots table periodically
Also cleans the transaction table periodically.

Fixes https://github.com/matrix-org/sliding-sync/issues/372

On testing, this cuts db size to about 1/3 of its original size.
2024-04-22 08:55:05 +01:00
Kegan Dougal
7e06813fe2 Fix #365: only return the last joined range
If we returned multiple distinct ranges, we always assumed that
the history visibility was "joined", so we would never return
events in the invite/shared state. This would be fine if the client
had a way to fetch those events sent before they were joined, but
they did not have a way as the prev_batch token would not be set
correctly. We now only return a single range of events and the prev
batch for /that/ range only, and defer to the upstream HS for
history visibility calculations.

Add end-to-end test to assert this new behaviour works.
2023-11-06 11:54:15 +00:00
David Robertson
f0ea7cbd4d
Add FetchMemberships function
Pulled out of #329.
2023-11-02 15:47:17 +00:00
David Robertson
d3ba1f1c30
Move TimelineResponse back to sync2 2023-09-19 12:41:25 +01:00
David Robertson
957bdee9d2
Merge branch 'main' into dmr/invalidate-timelines 2023-09-19 12:40:13 +01:00
David Robertson
3150c17cde
Test helper drive-by comment 2023-09-13 19:17:53 +01:00
David Robertson
df01e50438
Pass TimelineResponse struct around 2023-09-13 19:17:53 +01:00
David Robertson
773a28cf14
Make circularSlice generic 2023-09-08 18:17:13 +01:00
David Robertson
777cb357fe
Factor out AccumulateResult struct 2023-09-07 20:41:11 +01:00
Kegan Dougal
b2c26b7e93 Redact events in the DB on m.room.redaction
Fixes #279
2023-08-31 17:06:44 +01:00
Kegan Dougal
6623ddb9e3 Do not make snapshots for lone leave events
Specifically this is targeting invite rejections, where the leave
event is inside the leave block of the sync v2 response.

Previously, we would make a snapshot with this leave event. If the
proxy wasn't in this room, it would mean the room state would just
be the leave event, which is wrong. If the proxy was in the room,
then state would correctly be rolled forward.
2023-07-31 17:53:15 +01:00
Kegan Dougal
019661eb76 Calculate heroes from the returned joined/invited members 2023-07-19 18:23:09 +01:00
Kegan Dougal
1895080e84 Remove unused functions 2023-07-17 17:47:37 +01:00
Kegan Dougal
9ebe7634ec Implement table tests 2023-07-17 16:25:28 +01:00
Kegan Dougal
fbd865abba wip tests 2023-07-17 15:55:10 +01:00
Kegan Dougal
fc04171c7c Combine invite/join calcs into 1 query for speed 2023-07-17 10:48:21 +01:00
David Robertson
1717408dc3
Use fewer DB conns when events into the UserCache 2023-06-19 17:58:56 +01:00
David Robertson
5636f11984
Bugger, I need join event timestamps too 2023-06-06 14:22:51 +01:00
David Robertson
3e4eaa0219
Fix tests this time? 2023-06-05 18:10:35 +01:00
David Robertson
c07ef096bc
Cleanup tests again? 2023-06-05 14:49:50 +01:00
David Robertson
dcc37926e3
Fixup tests 2023-06-05 14:03:30 +01:00
David Robertson
6574101a7b
GlobalCache: LoadJoinedRooms also loads join NIDs 2023-06-01 20:05:42 +01:00
Kegan Dougal
fa6746796c perf: improve startup speeds by using temp tables
When the proxy is run with large DBs (10m+ events), the
startup queries are very slow (around 30min to load the initial snapshot).

After much EXPLAIN ANALYZEing, the cause is Postgres' query planner
not making good decisions when the tables are that large. Specifically,
the startup queries need to pull all joined members in all rooms, which
ends up being nearly 50% of the entire events table of 10m rows. When this
query is embedded in a subselect, the query planner assumes that the subselect
will return only a few rows, and decides to pull those rows via an index. In this
particular case, indexes are the wrong choice, as there are SO MANY rows a Seq Scan
is often more appropriate. By using an index (which is a btree), this ends up doing
log(n) operations _per row_ or `O(0.5 * n * log(n))` assuming we pull 50% of the
table of n rows. As n increases, this is increasingly the wrong call over a basic
O(n) seq scan. When n=10m, a seq scan has a cost of 10m, but using indexes has a
cost of 16.6m. By dumping the result of the subselect to a temporary table, this
allows the query planner to notice that using an index is the wrong thing to do,
resulting in better performance. On large DBs, this decreases the startup time
from 30m to ~5m.
2023-05-18 16:45:02 +01:00
Kegan Dougal
513aec4c61 Unbreak tests 2023-05-12 10:11:25 +01:00
David Robertson
666823d211
Introduce return struct for Initialise 2023-04-17 20:05:32 +01:00
Kegan Dougal
a7eed93722 Add comprehensive regression test for GlobalSnapshot(); ensure we clear db conns when tests end 2023-01-18 14:54:26 +00:00
Kegan Dougal
00e4b8238c BREAKING(db) perf: Massively improve time to exec RoomStateAfterEventPosition
The previous query would:
 - Map room IDs to snapshot NIDs
 - UNNEST(events) on all those state snapshots
 - Compare if the type/state_key match the filter

This was very slow under the following circumstances:
 - The rooms have lots of members (e.g Matrix HQ)
 - The required_state has no filter on m.room.member

This is what Element X does.

To improve this, we now have _two_ columns per state snapshot:
 - membership_events : only the m.room.member events
 - events : everything else

Now if a query comes in which doesn't need m.room.member events, we just need
to look in the everything-else bucket of events which is significantly smaller.
This reduces these queries to about 50ms, from 500ms.
2023-01-12 17:11:09 +00:00
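The two-column split described in 00e4b8238c can be sketched roughly as below. This is only an illustration of the idea, not the proxy's actual code: the `Event` and `Snapshot` types and the `splitSnapshot` and `candidateNIDs` helpers are hypothetical names invented for this example.

```go
package main

import "fmt"

// Event is a hypothetical, pared-down state event (assumption: the real types differ).
type Event struct {
	NID      int64
	Type     string
	StateKey string
}

// Snapshot mirrors the two-column idea: m.room.member events are kept
// apart from every other state event.
type Snapshot struct {
	MembershipEvents []int64 // only the m.room.member events
	Events           []int64 // everything else
}

// splitSnapshot sorts each state event into one of the two buckets.
func splitSnapshot(state []Event) Snapshot {
	var snap Snapshot
	for _, ev := range state {
		if ev.Type == "m.room.member" {
			snap.MembershipEvents = append(snap.MembershipEvents, ev.NID)
		} else {
			snap.Events = append(snap.Events, ev.NID)
		}
	}
	return snap
}

// candidateNIDs returns the event NIDs a query has to inspect. When the
// filter does not ask for member events, the (usually large) membership
// bucket is skipped entirely.
func candidateNIDs(snap Snapshot, needMembers bool) []int64 {
	out := append([]int64{}, snap.Events...)
	if needMembers {
		out = append(out, snap.MembershipEvents...)
	}
	return out
}

func main() {
	snap := splitSnapshot([]Event{
		{NID: 1, Type: "m.room.create"},
		{NID: 2, Type: "m.room.member", StateKey: "@alice:example.org"},
		{NID: 3, Type: "m.room.name"},
	})
	fmt.Println("without members:", candidateNIDs(snap, false))
	fmt.Println("with members:", candidateNIDs(snap, true))
}
```

In a room with many members, the slice scanned for a filter without m.room.member is a small fraction of the full state, which is the effect the commit message describes.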
Kegan Dougal
b5661a3c16 perf: rewrite MetadataForAllRooms to not do redundant work
We already extracted the joined users in all rooms, but then
this function would do another query to pull out the join counts.
This query was particularly inefficient, clocking in at 4s (!) on
my test server. Removed it entirely and instead do len(joinedUsers)
by calling AllJoinedMembers first.
2023-01-03 14:43:31 +00:00
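The change in b5661a3c16 amounts to deriving join counts from data that is already in memory. A minimal sketch, assuming the joined members are available as a map from room ID to user IDs (the `joinCounts` helper and the map shape are assumptions for illustration, not the real signature of AllJoinedMembers):

```go
package main

import "fmt"

// joinCounts derives per-room join counts from an already-loaded map of
// room ID -> joined user IDs, so no extra counting query is needed.
func joinCounts(joinedUsers map[string][]string) map[string]int {
	counts := make(map[string]int, len(joinedUsers))
	for roomID, users := range joinedUsers {
		counts[roomID] = len(users)
	}
	return counts
}

func main() {
	joined := map[string][]string{
		"!a:example.org": {"@alice:example.org", "@bob:example.org"},
		"!b:example.org": {"@alice:example.org"},
	}
	fmt.Println(joinCounts(joined))
}
```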
Kegan Dougal
aa28df161c Rename package -> github.com/matrix-org/sliding-sync 2022-12-15 11:08:50 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to change in sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise other users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point e.g on startup we have N workers which bound the number of concurrent pollers made at any one time, we need to not just hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the amount of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderate sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
Kegan Dougal
5ca156afe9 spaces: synchronise space updates between global/user caches
Add request filter for spaces(!)
2022-07-29 15:19:20 +01:00
Kegan Dougal
1a55076478 Add NewJoinEvent shorthand for tests 2022-07-12 15:12:02 +01:00
Kegan Dougal
1380a71f80 bugfix: fix several issues which could cause corrupt state snapshots
A fundamental assumption in the proxy has been that the order of events
in `timeline` in v2 will be the same all the time. There's some evidence
to suggest this isn't true in the wild. This commit refactors the proxy
to not assume this. It does this by:
  - Not relying on the number of newly inserted rows and slicing the events
    to figure out _which_ events are new. Now the INSERT has `RETURNING event_id, event_nid`
    and we return a map from event ID to event NID to explicitly say which
    events are new.
  - Add more paranoia when calculating new state snapshots: if we see the
    same (type, state key) tuple more than once in a snapshot we error out.
  - Add regression tests which try to insert events out of order to trip the
    proxy up.
2022-06-08 18:20:10 +01:00
Kegan Dougal
5339dc8ce3 perf: cache the prev batch tokens for each room with an LRU cache
- Replace `PrevBatch string` in user room data with `PrevBatches lru.Cache`.
  This allows us to persist prev batch tokens in-memory rather than doing
  N sequential DB lookups which would take ~4s for ~150 rooms on the postgres
  instance running the database. The tokens are keyed off a tuple of the
  event ID being searched and the latest event in the room, to allow prev
  batches to be assigned when new sync v2 responses arrive.
- Thread through context to complex storage functions for profiling
2022-04-26 14:42:30 +01:00
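The keying scheme from 5339dc8ce3 can be illustrated with a small LRU cache. The commit uses an `lru.Cache`; the sketch below swaps in a hand-rolled LRU built on the standard library's `container/list` to stay self-contained, and the `prevBatchKey` and `prevBatchCache` names are assumptions made for this example.

```go
package main

import (
	"container/list"
	"fmt"
)

// prevBatchKey is the tuple the tokens are keyed off: the event being
// searched for and the latest event in the room.
type prevBatchKey struct {
	EventID       string
	LatestEventID string
}

type cacheEntry struct {
	key   prevBatchKey
	token string
}

// prevBatchCache is a minimal, non-thread-safe LRU cache of prev batch tokens.
type prevBatchCache struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[prevBatchKey]*list.Element
}

func newPrevBatchCache(capacity int) *prevBatchCache {
	return &prevBatchCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[prevBatchKey]*list.Element),
	}
}

// Set stores a token, evicting the least recently used entry when full.
func (c *prevBatchCache) Set(key prevBatchKey, token string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*cacheEntry).token = token
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&cacheEntry{key: key, token: token})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).key)
	}
}

// Get returns the cached token and marks the entry as recently used.
func (c *prevBatchCache) Get(key prevBatchKey) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*cacheEntry).token, true
}

func main() {
	cache := newPrevBatchCache(2)
	key := prevBatchKey{EventID: "$old", LatestEventID: "$latest"}
	cache.Set(key, "token_a")
	token, ok := cache.Get(key)
	fmt.Println(token, ok)
}
```

Because the latest event in the room is part of the key, a new sync v2 response that advances the room produces a different key, so a stale token is never returned for the new position.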
Kegan Dougal
17cc4e6ec1 perf: reduce the number of SQL queries further when pulling required_state 2022-04-25 20:35:27 +01:00
Kegan Dougal
0d8e22fc88 perf: refactor how required_state is queried from the database
Use a single SQL query per request rather than sequentially performing
1 query per room.
2022-04-25 17:12:00 +01:00
Kegan Dougal
234d068d97 optimisation: only extract needed events for required_state where possible
Previously, we would only optimise pulling out event types i.e. if you want
state events with types A and B we only pull out all current state with event
type A or B. This falls down when the client wants their own member event,
as m.room.member is the bulk of the current state. This commit optimises the
SQL queries to also take into account the state key asked for, whilst still
supporting wildcards '*' when they are requested.
2022-04-22 12:12:51 +01:00
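The matching rule behind 234d068d97 (filter on both event type and state key, with '*' as a wildcard) can be sketched in memory as below. The real optimisation lives in the SQL queries; this Go version only illustrates the predicate, and the `requiredState` type and `wanted` function are hypothetical names.

```go
package main

import "fmt"

// requiredState is one (event type, state key) pair requested by the
// client; "*" in either position matches anything.
type requiredState struct {
	Type     string
	StateKey string
}

// wanted reports whether a state event with the given type and state key
// is selected by any of the requested pairs.
func wanted(filters []requiredState, evType, stateKey string) bool {
	for _, f := range filters {
		typeOK := f.Type == "*" || f.Type == evType
		keyOK := f.StateKey == "*" || f.StateKey == stateKey
		if typeOK && keyOK {
			return true
		}
	}
	return false
}

func main() {
	// The client asks for the room name and only its own member event.
	filters := []requiredState{
		{Type: "m.room.name", StateKey: ""},
		{Type: "m.room.member", StateKey: "@me:example.org"},
	}
	fmt.Println(wanted(filters, "m.room.member", "@me:example.org"))
	fmt.Println(wanted(filters, "m.room.member", "@other:example.org"))
}
```

Filtering on the state key as well as the type means a request for a single member event no longer has to pull every m.room.member event in the room.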
Kegan Dougal
dd6e6da50c Inject prev_batch values into timeline UserRoomData 2022-03-31 15:10:42 +01:00
Kegan Dougal
5dc1c38764 Add prev_batch column to events table
This will be used to return prev batch tokens to the client
on a best-effort basis.
2022-03-31 14:29:26 +01:00
Kegan Dougal
873edd7315 bugfix: rework how invites are handled
Fixes https://github.com/matrix-org/sliding-sync/issues/23

- Added InvitesTable
- Allow invites to be sorted/searched the same as any other room by
  implementing RoomMetadata for the invite (though this is best effort
  as we don't have heroes)
2022-03-29 09:44:18 +01:00
Kegan Dougal
c15c3f290e Integration tests for transaction IDs
Also standardise testutils.NewEvent to match testutils.NewStateEvent
to allow With... modifiers.
2022-03-28 15:52:25 +01:00
Kegan Dougal
53480c18a7 Revert "Load invite rooms on initial connection correctly"
This reverts commit 991b597e6e4b167f1d67fcb2ea696204aabca8f2.
2022-03-25 14:34:23 +00:00
Kegan Dougal
991b597e6e Load invite rooms on initial connection correctly 2022-03-25 13:44:20 +00:00
Kegan Dougal
e680a3c66d Include invited rooms in the room list
With a very basic test to make sure it appears.
2022-02-21 20:31:54 +00:00
Kegan Dougal
3f4a7459b4 Make more store functions private 2021-10-27 18:28:53 +01:00
Kegan Dougal
26ed9b9a40 Merge SortableRoom and HeroInfo into RoomMetadata
RoomMetadata stores the current invite/join count, heroes for the
room, most recent timestamp, name event content, canonical alias, etc

This information is consistent across all users so can be globally
cached for future use. Make ConnState call CalculateRoomName with
RoomMetadata to run the name algorithm.

This is *almost* complete but as there are no Heroes yet in the
metadata, things don't quite render correctly yet.
2021-10-27 18:16:43 +01:00
Kegan Dougal
e9d179fe4a tests: remove check for absolute room counts as it varies on the tests run 2021-10-27 11:02:35 +01:00
Kegan Dougal
51e6ac5469 HeroInfoForAllRooms: add queries for join/invite counts 2021-10-27 11:01:28 +01:00
Kegan Dougal
eaea3402a2 Use gmsl.Timestamp in more places 2021-10-26 10:01:45 +01:00
Kegan Dougal
d7913c8e26 Return the most recent timeline events for each room
TODO: the global cache isn't being kept updated so live
streamed events don't load (though they sort correctly)
2021-10-22 18:18:02 +01:00