553 Commits

Author SHA1 Message Date
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive Prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby unit tests fail on go1.19 due to a change in the sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts, which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync, even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point: e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.
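
The `WaitUntilInitialSync` fix above amounts to guaranteeing that waiters are released on every exit path. A minimal sketch in Go, assuming a hypothetical `poller` type whose first sync attempt records its error and then closes a channel exactly once via `sync.Once`; the bounded worker pool is modelled as a buffered channel used as a semaphore. Only the name `WaitUntilInitialSync` comes from the commit, everything else is illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// poller is a hypothetical stand-in for a per-device v2 poller.
type poller struct {
	token       string
	err         error
	initialDone chan struct{}
	once        sync.Once
}

func newPoller(token string) *poller {
	return &poller{token: token, initialDone: make(chan struct{})}
}

// markInitialSyncDone may be called any number of times; only the first call closes the channel.
func (p *poller) markInitialSyncDone() {
	p.once.Do(func() { close(p.initialDone) })
}

// Poll performs the first sync. The deferred call releases waiters on every
// return path, including a rejected access token.
func (p *poller) Poll() {
	defer p.markInitialSyncDone()
	if p.token == "" {
		p.err = errors.New("M_UNKNOWN_TOKEN")
		return
	}
	// A real poller would keep looping over /sync here after the first response.
}

// WaitUntilInitialSync blocks until the first sync attempt has finished,
// successfully or not, and returns its error.
func (p *poller) WaitUntilInitialSync() error {
	<-p.initialDone
	return p.err
}

// startPollers runs at most `workers` initial syncs at once and returns how many failed.
func startPollers(tokens []string, workers int) int {
	sem := make(chan struct{}, workers)
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for _, tok := range tokens {
		wg.Add(1)
		go func(p *poller) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it once the initial sync is over
			go p.Poll()
			// Without the deferred markInitialSyncDone in Poll, a bad token
			// would block here forever and permanently hog this slot.
			if err := p.WaitUntilInitialSync(); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(newPoller(tok))
	}
	wg.Wait()
	return failed
}

func main() {
	fmt.Println("failed pollers:", startPollers([]string{"good", "", "also_good"}, 2))
}
```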

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, from ~47s to <1s, by refactoring `JoinedRoomsTracker`.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_request` events from the to-device inbox. Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes and by ~50% for moderately sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
Kegan Dougal
b90a18a62a Fix #45: ensure we don't send null when we mean [] 2022-09-20 13:19:28 +01:00
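
In Go's `encoding/json`, a nil slice marshals to `null` while an empty, non-nil slice marshals to `[]`, which is the usual root cause of this class of bug. A minimal sketch, assuming a hypothetical `listResponse` type with an `ops` field, of normalising nil slices before encoding:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// listResponse is a hypothetical response fragment.
type listResponse struct {
	Ops []string `json:"ops"`
}

// newListResponse replaces a nil slice with an empty one so clients always see an array.
func newListResponse(ops []string) listResponse {
	if ops == nil {
		ops = []string{}
	}
	return listResponse{Ops: ops}
}

func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(listResponse{}))       // {"ops":null}
	fmt.Println(encode(newListResponse(nil))) // {"ops":[]}
}
```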
Kegan Dougal
d591dd0584 BREAKING(db): Add predecessor_room_id to the rooms table
This tracks the room ID supplied in the create event.
2022-09-08 15:44:48 +01:00
Kegan Dougal
55ed63ef97 Assert that the new room has a predecessor set 2022-09-08 15:15:42 +01:00
Kegan Dougal
2b4f3a8bc2 Log more information in responses 2022-09-08 09:42:16 +01:00
Kegan Dougal
e492e2d443 Factor out canonicalised name calculations 2022-09-08 09:32:16 +01:00
Kegan Dougal
bfbccb045a Make the rest of the proxy aware of the upgraded room ID rather than just is_tombstoned 2022-09-07 17:44:04 +01:00
Kegan Dougal
fb1cd41637 BREAKING(db): replace is_tombstoned column with upgraded_room_id in rooms table
This is in preparation to allow us to walk over tombstones automatically
in the proxy. Currently, we just emulate `is_tombstoned` behaviour by
checking if `upgraded_room_id != NULL`.

This is a breaking database change as there is no migration path.
2022-09-07 17:34:49 +01:00
Kegan Dougal
f4a5200150 bugfix: ensure room name calculation updates are correct
Previously we would fail to update to the correct room name
because we didn't remove the client from the list of Heroes.
2022-09-07 15:26:46 +01:00
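
The fix hinges on the rule that the syncing user is never one of their own room's heroes. A simplified sketch, assuming hypothetical `heroes` and `calculatedName` helpers that work on raw user IDs; the spec's real naming algorithm uses display names and also accounts for invited and left members, which is omitted here.

```go
package main

import (
	"fmt"
	"strings"
)

// heroes returns up to limit members, skipping the requesting user.
func heroes(members []string, self string, limit int) []string {
	out := make([]string, 0, limit)
	for _, m := range members {
		if m == self {
			continue
		}
		if len(out) == limit {
			break
		}
		out = append(out, m)
	}
	return out
}

// calculatedName joins heroes as "A", "A and B" or "A, B and C".
func calculatedName(hs []string) string {
	switch len(hs) {
	case 0:
		return "Empty room"
	case 1:
		return hs[0]
	default:
		return strings.Join(hs[:len(hs)-1], ", ") + " and " + hs[len(hs)-1]
	}
}

func main() {
	members := []string{"@me:example.org", "@alice:example.org", "@bob:example.org"}
	fmt.Println(calculatedName(heroes(members, "@me:example.org", 5)))
}
```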
Kegan Dougal
564f1863ba v0.4.1 v0.4.1 2022-09-05 17:28:27 +01:00
Kegan Dougal
6a14e03d12 bugfix: fix a bug where move updates could go missing between windows
If a room moved from one window range to another window range, and the
index of the destination was the leading edge of a different window,
this would trip up the code into thinking it was a no-op move and hence
not issue a DELETE/INSERT for the 2nd window, even though it was in fact
needed. For example:
```
   w1           w2
[0,1,2] 3,4,5 [6,7,8]

Move 1 to 6 turns the list into:
[0,2,3] 4,5,6 [1,7,8]

which should be the operations:
DELETE 1, INSERT 2 (val=3)
DELETE 6, INSERT 6 (val=1)

but because DELETE/INSERT both have the same index value, and the target
room is the updated room, we thought this was the same as when you have:

[0,1,2] 3,4,5

Move 0 to 0

which should no-op.
```

Fixed by ensuring that we also check that there is only 1 move operation.
If there are >1 move operations then we are moving between lists and should
include the DELETE/INSERT operation with the same index. This could manifest
itself in updated rooms spontaneously disappearing and/or neighbouring rooms
being duplicated.
2022-09-05 16:47:05 +01:00
Kegan Dougal
d529ed52d5 bugfix: honour the windows that the ranges represent to avoid mismatched DELETE/INSERT operations
Previously, we would just focus on finding _any_ window boundary and then
assume that was the boundary which matched the window for the purposes of
DELETE/INSERT move operations. However, this wasn't always true, especially
in the following case:
```
0..9 [10..20] 21...29 [30...40]
then move 30 to 10
0..9 [30,10...19] 20...28 [29,31...40]

expect:
 - DELETE 30, INSERT 30 (val=29)
 - DELETE 20, INSERT 10 (val=30)

but we would get:
 - DELETE 30, INSERT 20 (val=19)
 - DELETE 20, INSERT 10 (val=30)

because the code assumed that there was a window range [20,30] which there wasn't.
```
2022-09-05 15:58:56 +01:00
Kegan Dougal
78e8564d36 tests: add additional testing when the window is not at [0]
Also assert new room subscriptions are correct.
2022-09-05 13:59:14 +01:00
Kegan Dougal
8ebb1be2c1 bugfix: add torture test for list delta ops
- Randomly move elements 10,000 times in a sliding window.
- Fixed a bug as a result which would cause the algorithm to
  fail to issue a DELETE/INSERT when the room was _inserted_
  to the very end of the window range, due to it misfiring
  with the logic to not issue operations for no-op moves.
2022-09-05 13:43:12 +01:00
Kegan Dougal
cff1be0f1e bugfix: ensure we always INSERT a shifted room when handling a deletion
Previously, this would fail:
```
//                0    1    2    3    4    5    6    7    8
before: []string{"a", "b", "c", "d", "e", "f", "g", "h", "i"},
after:  []string{"b", "c", "d", "e", "f", "g", "h", "i"},
ranges: SliceRanges{{1, 3}, {5, 7}},
```

because the 2nd window range perfectly matched the list size, it would
ignore the `INSERT,7,i`.
2022-08-31 18:45:22 +01:00
Kegan Dougal
dcad80f51f bugfix: send correct deltas for deletions at the front of windows
Previously we wouldn't send deletions for this, even though they shift
all elements to the left. Add a battery of unit tests for the list delta
algorithm, and standardise on the practice of issuing a DELETE prior to
an INSERT for newly inserted rooms, regardless of where in the window
they appear. Previously, we might skip the DELETE at the end of the list,
which was just inconsistent of us.
2022-08-31 17:54:07 +01:00
Kegan Dougal
dcf3cfb4b0 Remove special handling for invites for updates; not needed anymore 2022-08-31 14:51:24 +01:00
Kegan Dougal
d77e21138d refactor: remove spurious code; rename OnRetireInvite to OnLeftRoom
Add HasLeft to the user room metadata to control whether or not the
list algo will nuke the room from the list.
2022-08-31 14:48:14 +01:00
Kegan Dougal
10bd0da932 refactor: add ops.go to calculate list ops
re-jig ConnState to use this new function, with unit tests.
2022-08-31 13:43:09 +01:00
Kegan Dougal
eadebd6c89 Remove ok from CalculateMoveIndexes as it's obsolete 2022-08-30 18:56:01 +01:00
Kegan Dougal
7e7a8a98ce feat/bugfix: Add invited|joined_count to room response JSON
This is so clients can accurately calculate the push rule:
```
{"kind":"room_member_count","is":"2"}
```
Also fixed a bug in the global room metadata where the joined/invited
counts could be wrong: we were tracking +-1 deltas, so duplicate join
events sent by Synapse were counted more than once. We now calculate these
counts based on the set of user IDs in a specific membership state.
2022-08-30 17:27:58 +01:00
Kegan Dougal
cf56283e6f Remove AddRoomIfNotExists as we now always use the same codepath via SetRoom 2022-08-30 14:33:49 +01:00
Kegan Dougal
a3bb77e60a Use the ref to UserRoomMetadata instead of pulling from the user cache directly 2022-08-30 14:02:35 +01:00
Kegan Dougal
21d0776e56 refactor: add ListOp and pre-calculate them before processing lists
Then just loop over the list deltas when processing the event. This
ensures we don't needlessly loop over lists which did not care and
still do not care about the incoming update.
2022-08-26 13:54:44 +01:00
Kegan Dougal
19f8b4dbf7 refactor: add RoomFinder and use it in InternalRequestLists
This is part of a series of refactors aimed at improving the performance
and reducing the complexity of calculating list deltas, which until now has
existed in its current form due to organic growth of the codebase.

This specific refactor introduces a new interface `RoomFinder` which
can map room IDs to `*RoomConnMetadata` which is used by `ConnState`.
All the sliding sync lists now use the `RoomFinder` instead of keeping
their own copies of `RoomConnMetadata`, meaning per-connection, rooms
just have 1 copy in memory. This reduces both memory usage and GC churn,
since previously we would constantly be replacing N rooms for
each update, where N is the total number of lists on that connection.
For Element-Web, N=7 currently to handle Favourites, Low Priority, DMs,
Rooms, Spaces, Invites, Search. This also has the benefit of creating
a single source of truth in `InternalRequestLists.allRooms` which can
be updated once and then a list of list deltas can be calculated off
the back of that. Previously, `allRooms` was _only_ used to seed new
lists, which created a weird imbalance as we would need to update both
`allRooms` _and_ each `FilteredSortableRooms` to keep things in-sync.

This refactor is incomplete in its present form, as we need to make
use of the new `RoomDelta` struct to efficiently package list updates.
2022-08-26 10:09:41 +01:00
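
A minimal sketch of the single-source-of-truth idea, assuming a hypothetical `Room` method on `RoomFinder` and made-up `RoomConnMetadata` fields (only the two type names come from the commit): lists hold room IDs and resolve them through the finder, so updating a room once is immediately visible to every list on the connection.

```go
package main

import "fmt"

// RoomConnMetadata is reduced to two hypothetical fields for this sketch.
type RoomConnMetadata struct {
	RoomID string
	Name   string
}

// RoomFinder maps a room ID to the connection's single copy of its metadata.
type RoomFinder interface {
	Room(roomID string) *RoomConnMetadata
}

// allRooms is the one place room metadata lives for a connection.
type allRooms map[string]*RoomConnMetadata

func (a allRooms) Room(roomID string) *RoomConnMetadata { return a[roomID] }

// list keeps only room IDs in sort order and resolves them on demand.
type list struct {
	roomIDs []string
	finder  RoomFinder
}

func (l *list) names() []string {
	out := make([]string, 0, len(l.roomIDs))
	for _, id := range l.roomIDs {
		if r := l.finder.Room(id); r != nil {
			out = append(out, r.Name)
		}
	}
	return out
}

func main() {
	rooms := allRooms{"!a:example.org": {RoomID: "!a:example.org", Name: "Old name"}}
	favourites := &list{roomIDs: []string{"!a:example.org"}, finder: rooms}
	dms := &list{roomIDs: []string{"!a:example.org"}, finder: rooms}

	rooms["!a:example.org"].Name = "New name"    // one update...
	fmt.Println(favourites.names(), dms.names()) // ...seen by both lists
}
```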
Kegan Dougal
1155d24314 bugfix: fixed a bug whereby a DELETE could specify an index of -1
This could happen with 1-length windows e.g. `[0,0]` where an element
was moved from outside the range e.g. i=5 to the window index e.g. 0.
This then triggered an off-by-one error in the code which snapped
indexes to windows. Fixed with regression tests.
2022-08-25 20:48:38 +01:00
Kegan Dougal
657f8ccc5d Use pointers to RoomConnMetadata 2022-08-25 15:30:07 +01:00
Kegan Dougal
6e0ea54c57 v0.4.0 v0.4.0 2022-08-23 16:15:28 +01:00
Kegan Dougal
edf581f0e7 bugfix: resort lists when room tags are updated
Previously we didn't, which would cause problems when
tag changes caused rooms to appear/disappear from lists.
2022-08-23 09:49:26 +01:00
Kegan Dougal
c071cee921 Add support for not_tags 2022-08-22 18:31:44 +01:00
Kegan Dougal
a4faf5a97f Actually parse tags according to the spec format, with wrapping tags object 2022-08-22 18:26:43 +01:00
Kegan Dougal
b5b13b75a6 Add support for room tag filters
This includes favourites and low priority rooms. With integration
tests.
2022-08-22 18:02:48 +01:00
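
A sketch of tag filtering, assuming the `m.tag` account data shape from the Matrix spec (tags wrapped in a `tags` object) and our reading of the sliding sync filter semantics: `tags` matches rooms with any listed tag, while `not_tags` excludes rooms with any listed tag and takes priority. `parseTags` and `matchesTagFilter` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagContent mirrors m.tag account data: {"tags": {"m.favourite": {"order": 0.5}}}.
type tagContent struct {
	Tags map[string]json.RawMessage `json:"tags"`
}

// parseTags returns the set of tag names on a room.
func parseTags(raw []byte) (map[string]bool, error) {
	var c tagContent
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	set := make(map[string]bool, len(c.Tags))
	for name := range c.Tags {
		set[name] = true
	}
	return set, nil
}

// matchesTagFilter: not_tags wins over tags; an empty tags list matches everything.
func matchesTagFilter(roomTags map[string]bool, tags, notTags []string) bool {
	for _, t := range notTags {
		if roomTags[t] {
			return false
		}
	}
	if len(tags) == 0 {
		return true
	}
	for _, t := range tags {
		if roomTags[t] {
			return true
		}
	}
	return false
}

func main() {
	set, err := parseTags([]byte(`{"tags":{"m.favourite":{"order":0.5}}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(matchesTagFilter(set, []string{"m.favourite"}, nil))   // true
	fmt.Println(matchesTagFilter(set, nil, []string{"m.favourite"}))   // false
	fmt.Println(matchesTagFilter(set, []string{"m.lowpriority"}, nil)) // false
}
```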
Kegan Dougal
fdd530350e bugfix: remove the conn when the buffer is exceeded
Previously, we would only remove the conn due to TTL expiry.
If the buffer filled in the meantime, we risked returning
no messages at all.
2022-08-19 18:12:09 +01:00
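
A sketch of the policy, assuming each hypothetical `conn` has a fixed-size buffered channel of pending updates: sends never block, and a full buffer removes the conn so the client is forced to start afresh, rather than updates being silently dropped while the conn waits for its TTL to expire. The `registry` type is illustrative.

```go
package main

import "fmt"

// conn is a hypothetical connection with a bounded buffer of pending updates.
type conn struct {
	updates chan string
}

// push enqueues without blocking and reports false if the buffer is full.
func (c *conn) push(update string) bool {
	select {
	case c.updates <- update:
		return true
	default:
		return false
	}
}

// registry tracks live conns by ID.
type registry struct {
	conns map[string]*conn
}

// broadcast delivers an update to every conn, removing any whose buffer is full.
func (r *registry) broadcast(update string) {
	for id, c := range r.conns {
		if !c.push(update) {
			delete(r.conns, id) // deleting during range is safe in Go
		}
	}
}

func main() {
	r := &registry{conns: map[string]*conn{
		"slow": {updates: make(chan string, 2)},
		"fast": {updates: make(chan string, 10)},
	}}
	for i := 0; i < 3; i++ {
		r.broadcast(fmt.Sprintf("update %d", i))
	}
	_, slowAlive := r.conns["slow"]
	_, fastAlive := r.conns["fast"]
	fmt.Println(slowAlive, fastAlive) // false true
}
```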
Kegan Dougal
daa200c0ba v0.3.3 v0.3.3 2022-08-19 11:32:02 +01:00
Kegan Dougal
d8ef0d3a6b bugfix: immediately send global account data updates
Fixes #28: regression test lives in cypress/react-sdk.
2022-08-19 11:05:28 +01:00
Kegan Dougal
9c40797135 bugfix: sorting by room name didn't work correctly when room names were updated
Caused by us not updating the `CanonicalisedName` which is what we use to sort on.
This field is a bit of an oddity: it lived outside the user/global cache
fields because it is a calculated value from the global cache data in the scope
of a user, whereas other user cache values are derived directly from specific
data (notif counts, DM-ness). This is a silly distinction however, since spaces
are derived from global room data as well, so move `CanonicalisedName` to the
UserCache and keep it updated when the room name changes.

Longer term: we need to clean this up so only the user cache is responsible
for updating user cache fields, and connstate treats user room data and global
room data as immutable. This is _mostly_ true today, but isn't always, and it
causes headaches. In addition, it looks like we maintain O(n) caches based on
the number of lists the user has made: we needn't do this and should lean
much more heavily on `s.allRooms`, just keeping pointers to this slice from
whatever lists the user requests.
2022-08-18 13:11:05 +01:00
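
A sketch of keeping a derived sort key in step with the value it is derived from, assuming a hypothetical `userRoomData` struct and a made-up `canonicalise` rule (lower-case, strip a leading `#`, `!` or `@`); the proxy's real canonicalisation may differ. Routing every name change through one setter makes it much harder to update `Name` without `CanonicalisedName`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// userRoomData is a hypothetical per-user room record.
type userRoomData struct {
	RoomID            string
	Name              string
	CanonicalisedName string // the key lists sort on
}

// canonicalise is an assumed normalisation for sorting purposes.
func canonicalise(name string) string {
	return strings.ToLower(strings.TrimLeft(name, "#!@"))
}

// SetName updates the display name and its sort key together.
func (u *userRoomData) SetName(name string) {
	u.Name = name
	u.CanonicalisedName = canonicalise(name)
}

func sortByName(rooms []*userRoomData) {
	sort.SliceStable(rooms, func(i, j int) bool {
		return rooms[i].CanonicalisedName < rooms[j].CanonicalisedName
	})
}

func main() {
	a, b := &userRoomData{RoomID: "!a"}, &userRoomData{RoomID: "!b"}
	a.SetName("Alpha")
	b.SetName("#beta")
	rooms := []*userRoomData{b, a}
	sortByName(rooms)
	fmt.Println(rooms[0].RoomID, rooms[1].RoomID) // !a !b
	a.SetName("Zulu")                              // a rename must move the room
	sortByName(rooms)
	fmt.Println(rooms[0].RoomID, rooms[1].RoomID) // !b !a
}
```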
Kegan Dougal
86a0d5484d v0.3.2 v0.3.2 2022-08-16 16:46:43 +01:00
Kegan Dougal
306b720ebe bugfix: fix stuck invites when the server is restarted
We just never removed the invites from the invites table when
the user accepted the invite; we only did it for rejected invites.
2022-08-16 16:24:12 +01:00
Kegan Dougal
bcb1c42ccb bugfix: update the roomIDToIndex map when rooms are removed
We weren't doing this previously, but things didn't blow up because
we would almost always call resort() shortly afterwards which _would_
update the map with the new sort positions. In some cases this could
cause the lists to sort using incorrect index positions, notably:
 - when an invite is retired.
 - when a room no longer meets filter criteria and is removed.

This could be a major source of duplicate rooms.
2022-08-16 15:56:16 +01:00
Kegan Dougal
3c23e4bb4d v0.3.1 v0.3.1 2022-08-16 14:36:40 +01:00
Kegan Dougal
a37aee4c2b Improve logging; remove useless fields 2022-08-16 14:23:05 +01:00
Kegan Dougal
ba78a33bb8 bugfix: use a map to store joined room tracker info
Previously we used a slice as this is slightly cheaper, but
since Synapse can return multiple join events in the timeline
it could cause a user ID to be present multiple times in a room.

When this happened, it would cause the `UserCache` callbacks to
be invoked twice for every event, clearly not ideal. By using a
set instead, we make sure that we don't add the same user more
than one time.

Ref: https://github.com/matrix-org/synapse/issues/9768
2022-08-16 11:13:03 +01:00
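
A sketch of the set-based tracker, assuming a hypothetical shape for `JoinedRoomsTracker` (room ID to a set of user IDs, with `map[string]struct{}` as Go's idiomatic set) and a made-up `forEachJoined` fan-out standing in for the `UserCache` callbacks. A duplicate join is absorbed by the set, so each joined user is visited once per event.

```go
package main

import (
	"fmt"
	"sort"
)

// joinedRoomsTracker maps room ID -> set of joined user IDs.
type joinedRoomsTracker struct {
	rooms map[string]map[string]struct{}
}

// userJoined records a join and reports whether the user was newly added.
func (t *joinedRoomsTracker) userJoined(roomID, userID string) bool {
	users, ok := t.rooms[roomID]
	if !ok {
		users = make(map[string]struct{})
		t.rooms[roomID] = users
	}
	if _, exists := users[userID]; exists {
		return false
	}
	users[userID] = struct{}{}
	return true
}

// forEachJoined invokes fn once per joined user, in a stable order.
func (t *joinedRoomsTracker) forEachJoined(roomID string, fn func(userID string)) {
	ids := make([]string, 0, len(t.rooms[roomID]))
	for id := range t.rooms[roomID] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fn(id)
	}
}

func main() {
	t := &joinedRoomsTracker{rooms: map[string]map[string]struct{}{}}
	fmt.Println(t.userJoined("!r", "@alice")) // true
	fmt.Println(t.userJoined("!r", "@alice")) // false: duplicate join from the timeline
	calls := 0
	t.forEachJoined("!r", func(string) { calls++ })
	fmt.Println(calls) // 1
}
```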
Kegan Dougal
0656d79abe Add additional e2e test to check that accepting invites works correctly 2022-08-16 11:04:23 +01:00
Kegan Dougal
8af8f7413e bugfix: ensure live-streamed invites include an invite_state
With regression tests. Comments explain the edge case, but basically
previously we were not calling `builder.AddRoomsToSubscription` with
the new room because we didn't know it was brand new, as it had a
valid swap operation. It only had a valid swap op because we "inserted"
(read: pretended) that the room had always been there at `len(list)`
so the from index was outside the known range. This works great most
of the time, but failed in the case where you use a large window size
e.g. `[[0,20]]` for 3 rooms, then the 4th room is still "inside the range"
and hence is merely an update, not a brand new room, so we wouldn't add
the room to the builder.

Fixed by decoupling adding rooms to the builder and expecting swap/insert
ops; they aren't mutually exclusive.
2022-08-15 18:40:43 +01:00
Kegan Dougal
59cddd08c7 bugfix: update the 'name' field on rooms when relevant actions occur
Relevant actions include:
 - People joining/leaving a room
 - An m.room.name or m.room.canonical_alias event is sent
 - etc.

Prior to this, we just set the room name field for initial=true
rooms only. This meant that if a room name was updated whilst it was
in the visible range (or currently subscribed to), we wouldn't set
this field resulting in stale names for clients. This was particularly
prominent when you created a room, as the initial member event would
cause the room to appear in the list as "Empty room" which then would
never be updated even if there was a subsequent `m.room.name` event
sent.

Fixed with regression tests.
2022-08-11 15:07:36 +01:00
Kegan Dougal
54e1cfbb0e bugfix: ensure newly joined live-stream rooms don't cause 500s
This was caused by the GlobalCache not having a metadata entry for
the new room, which in some cases prevented a stub from being made.

With regression test.
2022-08-10 19:48:03 +01:00
Kegan Dougal
47ddc04652 E2EE extension: Add support for device_unused_fallback_key_types
With tests
2022-08-09 10:05:18 +01:00
Kegan Dougal
ca2b19310e v0.3.0 v0.3.0 2022-08-05 13:10:48 +01:00
Kegan Dougal
f2cd4034c7 bugfix: don't delete the acking response the moment it is ACKed
Else if the client retries that request (because the new response is lost)
then we will HTTP 400 them with an unknown pos.
2022-08-05 12:43:22 +01:00
Kegan Dougal
7a049ec3a3 Adjust the timeout value when we are forced to process requests with buffered responses
The problem is that there is NOT a 1:1 relationship between request/response,
due to cancellations needing to be processed (else state diverges between client/server).
Whilst we were buffering responses and returning them eagerly if the request data did
not change, we were processing new requests if the request data DID change. This puts us
in an awkward position. We have >1 response waiting to send to the client, but we
cannot just _ignore_ their new request else we'll just drop it to the floor, so we're
forced to process it and _then_ return the buffered response. This is great so long as
the request processing doesn't take long: which it will if we are waiting for live updates.
To get around this, when we detect this scenario, we artificially reduce the timeout value
to ensure request processing is fast.

If we just use websockets this problem goes away...
2022-08-04 12:06:22 +01:00