27 Commits

Author SHA1 Message Date
Kegan Dougal
365e7cf11a Bump complement@main 2023-11-10 15:12:03 +00:00
dependabot[bot]
34ca02956a
Bump google.golang.org/grpc from 1.58.0 to 1.58.3
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.58.0 to 1.58.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.58.0...v1.58.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-26 16:24:59 +00:00
Kegan Dougal
37aa1469a5 WIP: use complement libraries 2023-10-11 12:23:46 +01:00
Kegan Dougal
e4cedaabcd Merge branch 'main' into kegan/poll-retry-loop-bad-create-event 2023-09-14 09:29:44 +01:00
Quentin Gliech
af5e8579b2 Better propagate request context
This properly propagates the Go Context down to all HTTP calls, which means that outgoing requests have the OTLP trace context.
This also adds the Jaeger propagator to the list of OTEL propagators, so that Synapse properly gets the incoming trace context.
It also upgrades all the OTEL libraries.
2023-09-13 19:41:52 +02:00
Kegan Dougal
7c80b5424a Prioritise retriable errors over unretriable errors
Bump to Go 1.20 for errors.Join and added introspection to
errors.As to inspect []error.
2023-09-12 14:57:40 +01:00
Kegan Dougal
b2c26b7e93 Redact events in the DB on m.room.redaction
Fixes #279
2023-08-31 17:06:44 +01:00
Kegan Dougal
9ae4a04824 Move from Jaeger to OTLP exporter
Jaeger spans can be sent as OTLP so this is mostly semantics for
the collector, which is more flexible if it accepts OTLP traces
rather than jaeger.thrift traces.
2023-08-23 12:52:47 +01:00
Kegan Dougal
4714088231 device data: use CBOR instead of JSONB
Using JSONB columns adds too much DB load. Prefer a slightly
faster serialisation format instead, and use the old system of
handling BYTEA, which is about 2x faster.
```
BenchmarkSerialiseDeviceDataJSON-12    	    1770	    576646 ns/op	  426297 B/op	    6840 allocs/op
BenchmarkSerialiseDeviceDataCBOR-12    	    4635	    247509 ns/op	  253971 B/op	    4796 allocs/op
```
This was using a growing list of 1000 device list changes.
2023-08-14 18:53:45 +01:00
Till Faelligen
a66070c50b
Update deps 2023-07-28 11:30:30 +02:00
David Robertson
8219d72a04
go get sentry 2023-04-04 16:37:54 +01:00
Kegan Dougal
5de7fa72f6 Read TraceContext headers to get full client/server spans 2023-02-21 10:50:39 +00:00
Kegan Dougal
1d64febf49 metrics: add Jaeger spans by setting SYNCV3_JAEGER_URL
This is a WIP but is mostly there. Jaeger debug logging goes
to the wrong logger currently (e.g. if you enter an invalid URL).
2023-02-20 17:55:34 +00:00
Kegan Dougal
c2a3c53542 tracing: do runtime/trace and OTLP at the same time 2023-02-20 14:57:49 +00:00
Kegan Dougal
05ddb6812b extensions refactor: automatically handle the enabled flag
Part of a series of refactors on the extensions code.
2023-02-08 11:30:54 +00:00
Till Faelligen
c710a039e9
Update deps 2023-01-12 08:40:54 +01:00
Kegan Dougal
aa28df161c Rename package -> github.com/matrix-org/sliding-sync 2022-12-15 11:08:50 +00:00
Kegan Dougal
be8543a21a add extensions for typing and receipts; bugfixes and additional perf improvements
Features:
 - Add `typing` extension.
 - Add `receipts` extension.
 - Add comprehensive prometheus `/metrics` activated via `SYNCV3_PROM`.
 - Add `SYNCV3_PPROF` support.
 - Add `by_notification_level` sort order.
 - Add `include_old_rooms` support.
 - Add support for `$ME` and `$LAZY`.
 - Add correct filtering when `*,*` is used as `required_state`.
 - Add `num_live` to each room response to indicate how many timeline entries are live.

Bug fixes:
 - Use a stricter comparison function on ranges: fixes an issue whereby UTs fail on go1.19 due to change in sorting algorithm.
 - Send back an `errcode` on HTTP errors (e.g. expired sessions).
 - Remove `unsigned.txn_id` on insertion into the DB. Otherwise other users would see other users' txn IDs :(
 - Improve range delta algorithm: previously it didn't handle cases like `[0,20] -> [20,30]` and would panic.
 - Send HTTP 400 for invalid range requests.
 - Don't publish no-op unread counts, which just add extra noise.
 - Fix leaking DB connections which could eventually consume all available connections.
 - Ensure we always unblock WaitUntilInitialSync even on invalid access tokens. Other code relies on WaitUntilInitialSync() actually returning at _some_ point, e.g. on startup we have N workers which bound the number of concurrent pollers made at any one time, so we must not hog a worker forever.

Improvements:
 - Greatly improve startup times of sync3 handlers by improving `JoinedRoomsTracker`: a modest amount of data would take ~28s to create the handler, now it takes 4s.
 - Massively improve initial v3 sync times, by refactoring `JoinedRoomsTracker`, from ~47s to <1s.
 - Add `SlidingSyncUntil...` in tests to reduce races.
 - Tweak the API shape of JoinedUsersForRoom to reduce state block processing time for large rooms from 63s to 39s.
 - Add trace task for initial syncs.
 - Include the proxy version in UA strings.
 - HTTP errors now wait 1s before returning to stop clients tight-looping on error.
 - Pending event buffer is now 2000.
 - Index the room ID first to cull the most events when returning timeline entries. Speeds up `SelectLatestEventsBetween` by a factor of 8.
 - Remove cancelled `m.room_key_requests` from the to-device inbox. Cuts down the number of events in the inbox by ~94% for very large (20k+) inboxes, ~50% for moderately sized (200 events) inboxes. Adds book-keeping to remember the unacked to-device position for each client.
2022-12-14 18:53:55 +00:00
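The range delta fix mentioned above concerns moves such as `[0,20] -> [20,30]`, where the old and new ranges share only a boundary index. The sketch below is an assumption about what such a computation could look like, not the project's algorithm: `Range`, `subtract` and `delta` are hypothetical names, and ranges are treated as inclusive.

```go
package main

import "fmt"

// Range is a hypothetical inclusive index range [start, end].
type Range [2]int64

// subtract returns the parts of a that are not covered by b.
func subtract(a, b Range) []Range {
	if b[1] < a[0] || b[0] > a[1] {
		return []Range{a} // no overlap: all of a remains
	}
	var out []Range
	if a[0] < b[0] {
		out = append(out, Range{a[0], b[0] - 1}) // part of a before b
	}
	if a[1] > b[1] {
		out = append(out, Range{b[1] + 1, a[1]}) // part of a after b
	}
	return out
}

// delta reports which indexes enter and leave the window when it moves.
func delta(prev, next Range) (added, removed []Range) {
	return subtract(next, prev), subtract(prev, next)
}

func main() {
	added, removed := delta(Range{0, 20}, Range{20, 30})
	fmt.Println(added, removed) // [[21 30]] [[0 19]]
}
```

Index 20 appears in neither result because it is present in both windows.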
Kegan Dougal
a82615978b 2/? : Refactor API shape to be closer to the current MSC
Specifically:
 - Remove top-level `ops`, and replace with `lists`.
 - Remove list indexes from `ops`, and rely on contextual location information.
 - Remove top-level `counts` and instead embed them into each list contextually.
 - Refactor connstate to reflect new API shape.

Still to do:
 - Remove `rooms` / `room` from the op response, and bundle it into the
   top-level `rooms`.
 - Remove `UPDATE` op.
 - Add `room_id` / `room_ids` field to ops to let clients know which rooms each op relates to.
2022-05-25 11:36:30 +01:00
Kegan Dougal
5339dc8ce3 perf: cache the prev batch tokens for each room with an LRU cache
- Replace `PrevBatch string` in user room data with `PrevBatches lru.Cache`.
  This allows us to persist prev batch tokens in-memory rather than doing
  N sequential DB lookups which would take ~4s for ~150 rooms on the postgres
  instance running the database. The tokens are keyed off a tuple of the
  event ID being searched and the latest event in the room, to allow prev
  batches to be assigned when new sync v2 responses arrive.
- Thread through context to complex storage functions for profiling
2022-04-26 14:42:30 +01:00
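An in-memory LRU cache keyed by a tuple, as the commit above describes, might look roughly like this. The commit refers to an `lru.Cache` type from a library; the sketch instead uses only `container/list` so it is self-contained, and `prevBatchKey`, `prevBatchCache` and their methods are hypothetical names.

```go
package main

import (
	"container/list"
	"fmt"
)

// prevBatchKey is a hypothetical tuple key: the event being searched for
// plus the latest event in the room at the time of the lookup.
type prevBatchKey struct {
	EventID       string
	LatestEventID string
}

type entry struct {
	key   prevBatchKey
	token string
}

// prevBatchCache is a small LRU cache. It is not safe for concurrent use.
type prevBatchCache struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[prevBatchKey]*list.Element
}

func newPrevBatchCache(capacity int) *prevBatchCache {
	return &prevBatchCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[prevBatchKey]*list.Element),
	}
}

// Get returns the cached token and marks the entry as recently used.
func (c *prevBatchCache) Get(k prevBatchKey) (string, bool) {
	el, ok := c.items[k]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).token, true
}

// Add stores a token, evicting the least recently used entry when full.
func (c *prevBatchCache) Add(k prevBatchKey, token string) {
	if el, ok := c.items[k]; ok {
		el.Value.(*entry).token = token
		c.order.MoveToFront(el)
		return
	}
	c.items[k] = c.order.PushFront(&entry{key: k, token: token})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := newPrevBatchCache(2)
	k1 := prevBatchKey{"$a", "$z"}
	k2 := prevBatchKey{"$b", "$z"}
	cache.Add(k1, "t1")
	cache.Add(k2, "t2")
	cache.Get(k1)                            // k1 is now the most recently used
	cache.Add(prevBatchKey{"$c", "$z"}, "t3") // evicts k2
	_, ok := cache.Get(k2)
	fmt.Println(ok) // false
}
```

Because Go structs with comparable fields can be map keys, the tuple needs no string concatenation or hashing.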
Kegan Dougal
66659936b4 go mod tidy 2021-10-26 12:46:59 +01:00
Kegan Dougal
c12345481f Bump GMSL 2021-10-26 12:46:22 +01:00
Kegan Dougal
62f1eb0ee6 Conn: handle positions, retries and blocking operations
This abstracts the long-pollness of the HTTP connection.
Note that we cannot just maintain a server-side buffer of
events to feed down the connection because the client can
drastically alter _which_ events should be fed to the client.
There still needs to be a request/response cycle, except we
can factor out retry handling (duplicate request detection)
and incrementing of the positions.
2021-09-21 16:00:06 +01:00
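The retry handling and position tracking that this commit factors out could be sketched as follows. This is an assumption-laden illustration, not the project's `Conn`: the `Conn` and `Response` types, their fields and `OnRequest` are hypothetical. It treats a request that repeats the previous position as a duplicate and replays the cached response, and it ignores the case where a retry carries a different body.

```go
package main

import (
	"fmt"
	"sync"
)

// Response is a hypothetical response carrying the next position to use.
type Response struct {
	Pos  int64
	Body string
}

// Conn is a hypothetical connection that tracks positions and retries.
type Conn struct {
	mu            sync.Mutex
	serverPos     int64     // position the client is expected to send next
	lastClientPos int64     // position sent with the previous request
	lastResp      *Response // cached response for duplicate detection
	handle        func(body string) string
}

// OnRequest replays the cached response for a duplicate request, rejects
// unknown positions, and otherwise runs the handler and bumps the position.
func (c *Conn) OnRequest(clientPos int64, body string) (*Response, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.lastResp != nil && clientPos == c.lastClientPos {
		return c.lastResp, nil // retry: the handler is not run again
	}
	if clientPos != c.serverPos {
		return nil, fmt.Errorf("unknown position %d", clientPos)
	}
	c.serverPos++
	c.lastClientPos = clientPos
	c.lastResp = &Response{Pos: c.serverPos, Body: c.handle(body)}
	return c.lastResp, nil
}

func main() {
	calls := 0
	c := &Conn{handle: func(body string) string {
		calls++
		return "handled " + body
	}}
	first, _ := c.OnRequest(0, "a")
	retry, _ := c.OnRequest(0, "a")
	fmt.Println(first.Pos, retry.Pos, calls) // 1 1 1
	_, err := c.OnRequest(7, "b")
	fmt.Println(err) // unknown position 7
}
```

Only one cached response is kept, which matches a strict request/response cycle: the client either acknowledges the last response by sending its position, or repeats the previous request.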
Kegan Dougal
0075b46bc9 Flesh out to_device_table with tests, update gjson dep 2021-08-03 09:33:38 +01:00
Kegan Dougal
7f8d84d79e Add filter tables; remove alice dep 2021-07-21 12:12:57 +01:00
Kegan Dougal
f3e0f96d91 Add room state tables 2021-05-26 20:01:56 +01:00
Kegan Dougal
9d372e15de Initial commit 2021-05-14 16:49:33 +01:00