43 Commits

Author SHA1 Message Date
Kegan Dougal
4d54faa1a6 Fix remaining race conditions; add -race to CI 2024-03-11 10:30:03 +00:00
David Robertson
c239cacc83 Initialise: handle gappy polls and ditch prependStateEvents 2023-11-03 15:42:25 +00:00
David Robertson
f595aed2c5 Add a separate payload for redacting state
So that we don't end up nuking conns unnecessarily.
2023-11-01 19:03:17 +00:00
David Robertson
41ed56ecd7 Tweak test prose 2023-09-19 15:37:41 +01:00
David Robertson
a4c90fbd78 Fixup merges 2023-09-19 12:55:14 +01:00
David Robertson
d3ba1f1c30 Move TimelineResponse back to sync2 2023-09-19 12:41:25 +01:00
David Robertson
957bdee9d2 Merge branch 'main' into dmr/invalidate-timelines 2023-09-19 12:40:13 +01:00
David Robertson
a65a69b7bc Set missing_parents field in the DB 2023-09-13 19:17:53 +01:00
David Robertson
df01e50438 Pass TimelineResponse struct around 2023-09-13 19:17:53 +01:00
David Robertson
e83a9d6218 Unit test the cache reload emission logic 2023-09-12 19:04:16 +01:00
David Robertson
777cb357fe Factor out AccumulateResult struct 2023-09-07 20:41:11 +01:00
David Robertson
e960d7ff80 Fix integration test to include a create event 2023-08-22 16:12:34 +01:00
David Robertson
be4b2dc9a1 Initialise: don't snapshot without create event 2023-08-22 15:56:38 +01:00
Kegan Dougal
6623ddb9e3 Do not make snapshots for lone leave events
Specifically, this targets invite rejections, where the leave
event is inside the leave block of the sync v2 response.

Previously, we would make a snapshot with this leave event. If the
proxy wasn't in this room, it would mean the room state would just
be the leave event, which is wrong. If the proxy was in the room,
then state would correctly be rolled forward.
2023-07-31 17:53:15 +01:00
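The check described in this commit (together with be4b2dc9a1, "don't snapshot without create event") can be sketched as a guard run before a snapshot is built. This is a minimal illustration, not the proxy's code: the `Event` type and `canSnapshot` function are hypothetical, and it assumes "complete state" is detected purely by the presence of an `m.room.create` event.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down state event holding only the
// fields needed to decide whether a snapshot can be built.
type Event struct {
	Type     string
	StateKey string
}

// canSnapshot reports whether a block of state events can seed a room
// snapshot. Assumed rule: complete room state always includes the
// m.room.create event, so a block without it (such as a lone leave event
// from an invite rejection) is skipped.
func canSnapshot(state []Event) bool {
	for _, ev := range state {
		if ev.Type == "m.room.create" && ev.StateKey == "" {
			return true
		}
	}
	return false
}

func main() {
	// An invite rejection: the leave block carries only the leave event.
	loneLeave := []Event{{Type: "m.room.member", StateKey: "@alice:example.org"}}
	// A normal join: the state block includes the create event.
	fullState := []Event{
		{Type: "m.room.create", StateKey: ""},
		{Type: "m.room.member", StateKey: "@alice:example.org"},
	}
	fmt.Println(canSnapshot(loneLeave)) // false
	fmt.Println(canSnapshot(fullState)) // true
}
```

Rejecting the snapshot outright, rather than storing one containing only the leave event, avoids the "room state is just the leave event" failure described above.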
Kegan Dougal
ae29d14c6f Remove unused code 2023-07-19 15:56:43 +01:00
Kegan Dougal
e947612ad9 Fix #192: ignore unseen old events 2023-07-11 19:08:32 +01:00
Kegan Dougal
e753f51d24 Add concurrency test 2023-06-08 14:06:41 +01:00
Kegan Dougal
c2f4b53fdd Use a txn for accumulator.Accumulate and make one in Storage.Accumulate 2023-06-08 13:54:46 +01:00
David Robertson
33b174dd67 Fixup test code 2023-04-17 20:26:40 +01:00
Kegan Dougal
a7eed93722 Add comprehensive regression test for GlobalSnapshot(); ensure we clear db conns when tests end 2023-01-18 14:54:26 +00:00
Kegan Dougal
00e4b8238c BREAKING(db) perf: Massively improve time to exec RoomStateAfterEventPosition
The previous query would:
 - Map room IDs to snapshot NIDs
 - UNNEST(events) on all those state snapshots
 - Compare if the type/state_key match the filter

This was very slow under the following circumstances:
 - The rooms have lots of members (e.g. Matrix HQ)
 - The required_state has no filter on m.room.member

This is what Element X does.

To improve this, we now have _two_ columns per state snapshot:
 - membership_events : only the m.room.member events
 - events : everything else

Now if a query comes in which doesn't need m.room.member events, we just need
to look in the everything-else bucket of events which is significantly smaller.
This reduces these queries from about 500ms to about 50ms.
2023-01-12 17:11:09 +00:00
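The real change is a database schema change (two columns per snapshot, queried in SQL). The Go sketch below only models the selection logic it enables: which bucket of NIDs a query has to scan. `StateSnapshot` and `candidateNIDs` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// StateSnapshot is a hypothetical Go mirror of the two-column layout:
// membership events are kept apart from all other state events. NIDs are
// plain int64 event numeric IDs.
type StateSnapshot struct {
	MembershipEvents []int64 // NIDs of m.room.member events
	Events           []int64 // NIDs of every other state event
}

// candidateNIDs returns the NIDs a query has to scan. When the query needs
// no m.room.member events, only the (usually far smaller) non-membership
// bucket is touched.
func candidateNIDs(snap StateSnapshot, needMembers bool) []int64 {
	if !needMembers {
		return snap.Events
	}
	all := make([]int64, 0, len(snap.Events)+len(snap.MembershipEvents))
	all = append(all, snap.Events...)
	return append(all, snap.MembershipEvents...)
}

func main() {
	// A room with 3 non-member state events and 5 members.
	snap := StateSnapshot{
		MembershipEvents: []int64{10, 11, 12, 13, 14},
		Events:           []int64{1, 2, 3},
	}
	fmt.Println(len(candidateNIDs(snap, false))) // 3
	fmt.Println(len(candidateNIDs(snap, true)))  // 8
}
```

In a room like Matrix HQ the membership bucket dominates, so skipping it is what turns a scan over tens of thousands of NIDs into a scan over a handful.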
Kegan Dougal
aa28df161c Rename package -> github.com/matrix-org/sliding-sync 2022-12-15 11:08:50 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to a change in the sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise, users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts, which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync, even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point: e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times by refactoring `JoinedRoomsTracker`: from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderately sized (200 events) inboxes. Adds bookkeeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
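The range delta bug fix above concerns overlapping windows such as `[0,20] -> [20,30]`. A minimal sketch of one panic-free way to compute the delta, assuming inclusive `[start, end]` ranges with `start <= end` (invalid ranges are rejected with HTTP 400 per the list above): treat it as interval subtraction in both directions. `Range`, `subtract` and `delta` are hypothetical names, not the proxy's API.

```go
package main

import "fmt"

// Range is an inclusive [start, end] window over a sorted room list.
type Range [2]int64

// subtract returns the parts of a not covered by b: zero, one or two ranges.
func subtract(a, b Range) []Range {
	if b[1] < a[0] || b[0] > a[1] {
		return []Range{a} // disjoint: all of a survives
	}
	var out []Range
	if a[0] < b[0] {
		out = append(out, Range{a[0], b[0] - 1}) // part of a before b
	}
	if a[1] > b[1] {
		out = append(out, Range{b[1] + 1, a[1]}) // part of a after b
	}
	return out
}

// delta reports which indexes enter and leave the window when a client
// moves from prev to next. Overlapping windows such as [0,20] -> [20,30]
// share index 20, which must appear in neither result.
func delta(prev, next Range) (added, removed []Range) {
	return subtract(next, prev), subtract(prev, next)
}

func main() {
	added, removed := delta(Range{0, 20}, Range{20, 30})
	fmt.Println(added, removed) // [[21 30]] [[0 19]]
}
```

Because `subtract` handles the disjoint, overlapping and containing cases explicitly, the shared boundary index never produces an empty or inverted range.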
Kegan Dougal
1380a71f80 bugfix: fix several issues which could cause corrupt state snapshots
A fundamental assumption in the proxy has been that the order of events
in `timeline` in v2 will be the same all the time. There's some evidence
to suggest this isn't true in the wild. This commit refactors the proxy
to not assume this. It does this by:
  - Not relying on the number of newly inserted rows and slicing the events
    to figure out _which_ events are new. Now the INSERT has `RETURNING event_id, event_nid`
    and we return a map from event ID to event NID to explicitly say which
    events are new.
  - Adding more paranoia when calculating new state snapshots: if we see the
    same (type, state key) tuple more than once in a snapshot, we error out.
  - Adding regression tests which try to insert events out of order to trip the
    proxy up.
2022-06-08 18:20:10 +01:00
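The "more paranoia" check from the second bullet can be sketched as a uniqueness test over (type, state key) tuples before a snapshot is stored. This is an illustration under assumptions: `StateEvent`, `tuple` and `checkSnapshot` are hypothetical, and the real check may live elsewhere in the snapshot calculation.

```go
package main

import "fmt"

// StateEvent is a hypothetical minimal state event with its numeric ID.
type StateEvent struct {
	NID      int64
	Type     string
	StateKey string
}

// tuple is the (type, state_key) pair that must be unique within a snapshot.
type tuple struct{ Type, StateKey string }

// checkSnapshot returns an error if any (type, state_key) tuple appears more
// than once, instead of silently storing a corrupt snapshot.
func checkSnapshot(events []StateEvent) error {
	seen := make(map[tuple]int64, len(events))
	for _, ev := range events {
		k := tuple{ev.Type, ev.StateKey}
		if prev, ok := seen[k]; ok {
			return fmt.Errorf("duplicate (%s, %q) in snapshot: NIDs %d and %d",
				ev.Type, ev.StateKey, prev, ev.NID)
		}
		seen[k] = ev.NID
	}
	return nil
}

func main() {
	good := []StateEvent{{1, "m.room.create", ""}, {2, "m.room.member", "@a:x"}}
	bad := append(good, StateEvent{3, "m.room.member", "@a:x"})
	fmt.Println(checkSnapshot(good))
	fmt.Println(checkSnapshot(bad))
}
```

Failing loudly here surfaces an ordering bug at write time instead of leaving a corrupt snapshot to be discovered later by clients.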
Kegan Dougal
5dc1c38764 Add prev_batch column to events table
This will be used to return prev batch tokens to the client
on a best-effort basis.
2022-03-31 14:29:26 +01:00
Kegan Dougal
e04b38726a Fix a bug which can happen when v2 sync returns dupe events
Add regression tests as well.
2021-10-01 16:55:50 +01:00
Kegan Dougal
e838fab449 Load initial ConnState; document and fix races with loading/streaming
Specifically, we can double-process events if we don't take into account
the event NID. This happens because we can receive live events before
we have loaded the initial connection state (list of joined rooms). We
base the initial load around an event NID, so we need to make sure to
ignore any live streamed events which are <= the initial load NID.

We could have alternatively loaded the initial connection state and
/then/ register to receive live events, but this means we could drop
events in the gap between loading events and making the register call
which is arguably worse. We could slap a mutex around it all to atomically
do this, but this means that getting pushed new events is tied to
loading (potentially a lot of) state for a single Conn, increasing
lock contention.
2021-09-23 17:46:34 +01:00
Kegan Dougal
66c1a8a3e1 Remove membership_log_table and fold behaviour into events_table
Add another index to make queries asking for "all $type events in room X" fast
2021-08-23 15:21:51 +01:00
Kegan Dougal
65cbdb07c8 Add type|state_key cols to events table; refactor select in
- Add `verifyAll` flag to assert that all events are in the SELECT result
- Factor out `testutils.NewStateEvent`
2021-08-20 15:56:17 +01:00
Kegan Dougal
45e9e432bc Track the state before on each event rather than after
It's easier to roll forward than to roll backwards. Add a 'replaces_nid' field
on the events table which records which NID in the snapshot gets replaced, if any.
2021-08-18 18:21:40 +01:00
Kegan Dougal
30366983ab Remove snapshot_ref table; it's easier to not track this for now
We still track tokens, though, so we can retrospectively clean up snapshots.
2021-08-18 15:50:51 +01:00
Kegan Dougal
9fdf3901df Hook up the notifier and test it
Test that v3 requests can time out, can be notified when v2 returns a response,
and can return immediately.
2021-07-23 16:40:32 +01:00
Kegan Dougal
90dceee6f6 Add V2DataReceiver interface and indirect via SyncV3Handler
This allows us to notify the Notifier as well as store the data
in the database. Reshuffle where streams live (it's a sync v3 concept).
2021-07-23 15:39:41 +01:00
Kegan Dougal
e09c749bdc Add Storage and put Accumulator inside it 2021-06-16 17:18:04 +01:00
Kegan Dougal
fc35269134 Fix test data 2021-06-15 17:43:13 +01:00
Kegan Dougal
9588597a67 Modify snapshot ref table to track entities 2021-06-15 17:26:17 +01:00
Kegan Dougal
1b28768ca2 Fix tests 2021-06-11 14:47:39 +01:00
Kegan Dougal
5fa1548a27 Add membership log accumulator tests 2021-06-11 14:07:25 +01:00
Kegan Dougal
502274b475 Dump v2 responses into the accumulator 2021-06-03 16:18:01 +01:00
Kegan Dougal
0dcd3fac09 Make tests work for others, add timeline calculations 2021-06-03 14:35:34 +01:00
Kegan Dougal
5909d1a6b0 Implement Accumulate 2021-05-28 16:07:28 +01:00
Kegan Dougal
91dd5609f7 Impl most of Accumulator.Accumulate 2021-05-28 12:08:10 +01:00
Kegan Dougal
cd20d07d9f Add Accumulator.Initialise with tests 2021-05-27 19:20:36 +01:00