OpenTelemetry Support [Experimental]

Experimental OpenTelemetry Support

OpenTelemetry (https://opentelemetry.io/):

OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

OpenTelemetry aims to become the open standard for instrumenting code with logs, metrics, and tracing telemetry. The project is maturing rapidly and has both stable and successful APIs to prove it. We think that’s a noble aim, and we intend to offer first-class support for OpenTelemetry in the RedwoodJS framework.

RedwoodJS + OpenTelemetry = Easy

Demo

Setup

To set up OpenTelemetry you can run one simple command:

yarn rw experimental setup-opentelemetry

or use the exp abbreviation of the new experimental CLI section like so:

yarn rw exp setup-opentelemetry

After that you’ll need to update your GraphQL handler configuration to include the new openTelemetryOptions option:

export const handler = createGraphQLHandler({
  authDecoder,
  getCurrentUser,
  loggerConfig: { logger, options: {} },
  directives,
  sdls,
  services,
  onException: () => {
    // Disconnect from your database with an unhandled exception.
    db.$disconnect()
  },
  // This is new and you'll need to manually add it
  openTelemetryOptions: {
    resolvers: true,
    result: true,
    variables: true,
  }
})

This will enable OpenTelemetry instrumentation inside your GraphQL server.

Changes

TOML
The setup command adds the following values to the redwood.toml file:

[experimental.opentelemetry]
  enabled = true
  apiSdk = "/home/linux-user/redwood-project/api/src/opentelemetry.js"

The enabled option simply turns OpenTelemetry on or off - beware that this doesn’t turn off the GraphQL OpenTelemetry instrumentation; you’ll need to remove the plugin options for that to happen. The apiSdk option should point to the JS file which is loaded before your application code to set up the OpenTelemetry SDK. You will likely want to leave this as the default value.

SDK File
The api/src/opentelemetry.js|ts file generated by the setup command is where the OpenTelemetry SDK is defined. For more information on the contents of this file - including what options you may wish to edit to suit your own needs - please see the documentation at Instrumentation | OpenTelemetry.
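As a rough sketch of the shape of that file (illustrative only - the generated file in your project is the source of truth and contains more options):

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { NodeTracerProvider, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-node'

// Exporter: where spans are sent, e.g. a local OTLP collector such as the
// studio listening at http://127.0.0.1:4318/v1/traces
const exporter = new OTLPTraceExporter({
  url: 'http://127.0.0.1:4318/v1/traces',
})

// Processor: SimpleSpanProcessor exports each span as soon as it ends;
// BatchSpanProcessor is generally preferred outside of development
const processor = new SimpleSpanProcessor(exporter)

// Provider: registers the processor and the global tracer
const provider = new NodeTracerProvider()
provider.addSpanProcessor(processor)
provider.register()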

Availability

The setup command is currently available from the canary version of Redwood. You can try this out in a new project by running yarn rw upgrade --tag canary and following any general upgrade steps recommended on the forums.

Limitations

Currently we only support OpenTelemetry on the “api” side of Redwood, but we aim to add the “web” side soon.

We also only support OpenTelemetry during development at the moment: yarn rw dev will automatically enable OpenTelemetry when you have it set up and enabled within the TOML. Other commands like yarn rw serve do not currently do this, but we hope to add that in the future too.

Known Issues

There are a few known issues which we will be addressing shortly:

  1. Services which return a promise but are not marked as async do not have the correct timings reported (see the illustrative example below).
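To illustrate, here is what the two shapes look like in a hypothetical posts service (the service and the post model are made up for this example):

import { db } from 'src/lib/db'

// Marked as async: the reported timing covers the resolved result
export const posts = async () => {
  return db.post.findMany()
}

// Returns a promise without being marked as async: per the issue above,
// the reported timing does not currently include resolving that promise
export const postsUnawaited = () => {
  return db.post.findMany()
}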

Feedback

Please leave feedback as comments to this forum post. We would love to hear what’s broken, what isn’t clear and what additions or changes you’d like to see!

We would also welcome any form of collaboration on this feature!

4 Likes

Hi,
I tried to use Redwood Studio and OpenTelemetry. First of all, it is super nice to get such thorough insights on queries!
When I am using it I sometimes get these logs on my api side:

{"stack":"Error: connect ECONNREFUSED 127.0.0.1:4318\n    at __node_internal_captureLargerStackTrace (node:internal/errors:484:5)\n    at __node_internal_exceptionWithHostPort (node:internal/errors:662:12)\n    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16)\n    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17)","message":"connect ECONNREFUSED 127.0.0.1:4318","errno":"-4078","code":"ECONNREFUSED","syscall":"connect","address":"127.0.0.1","port":"4318","name":"Error"}

And some of the queries are not named properly, but I guess this is a known issue.
Another thing I observed is decreased performance. Queries that would normally take some 80ms can suddenly take up to 4000ms (querying a collection of 1000 items), and if I have relational resolvers they seem to lead to N+1 issues, resulting in a query of 20000ms. In the opentelemetry.ts a SimpleSpanProcessor is mentioned, and that it is preferable to use the BatchSpanProcessor in production. I tried it in dev, but I think it did not change much.
I am now just wondering if the decreased performance also influences the analysis, and whether I can rely on the traces?

I hope we see more on this tooling, it would be really helpful for production, too 🙂

1 Like

Thanks for the feedback @dennemark, it’s really appreciated!

I have to admit I did not test with collections that large. I’ll make a note to try this out and see if I can reproduce this poor performance and track down the cause. We certainly can’t have this take seconds!

The naming issues are surprising - I’ve not come across that before. Is there any more information you can provide about that? Was it any particular type of query or mutation? Likewise, I’ve never seen the issue with not being able to communicate with the local server at 4318 when studio was run from the CLI. I’ll make sure to keep an eye out for these too when I investigate the poor performance.

One thing I do have to get back to and fix/check is around services returning promises. Right now the timing for services doesn’t reflect resolving any promise it’s returning. That could also be an issue in other places where redwood lets you return a promise that we’ll resolve internally at some point.

Thanks again for the feedback! We need people trying it out, pointing out what needs to be better or fixed and if they find it useful. This lets us justify spending time on these experimental features.

Ah I thought the naming issue might be related to:

It seems Redwood Studio is collecting all queries in a trace and sometimes it takes a bit longer. But there are cases where it only shows a blank title as if it did not fully collect the trace.

1 Like

Hi - OpenTelemetry and Redwood Studio integration is top notch and extremely convenient to use, with a very low barrier to entry. Makes finding hot paths a breeze! Congrats guys!

Would be fantastic if something similar worked for the web side.

FYI, I spotted the warning below spamming the console using RW 6.5.0 (not sure if it’s related to this release)

(node:87265) [FSTDEP017] FastifyDeprecation: You are accessing the deprecated "request.routerPath" property. Use "request.routeOptions.url" instead. Property "req.routerPath" will be removed in `fastify@5`.

Hi there, I am trying to make it work to send traces to a (local) jaeger-collector or even directly to Grafana Cloud, for example. The problem is that no matter where I send the traces, I get hundreds or thousands of error messages from the otlp-exporter saying this:

api | {"stack":"OTLPExporterError: Request Timeout\n at ClientRequest. (/home/user/git/project/node_modules/@opentelemetry/otlp-exporter-base/src/platform/node/util.ts:147:21)\n at (/home/user/git/project/node_modules/@opentelemetry/context-async-hooks/src/AbstractAsyncHooksContextManager.ts:75:49)\n at AsyncLocalStorage.run (node:async_hooks:338:14)\n at AsyncLocalStorageContextManager.with (/home/user/git/project/node_modules/@opentelemetry/context-async-hooks/src/AsyncLocalStorageContextManager.ts:40:36)\n at ClientRequest.contextWrapper (/home/user/git/project/node_modules/@opentelemetry/context-async-hooks/src/AbstractAsyncHooksContextManager.ts:75:26)\n at ClientRequest.emit (node:events:514:28)\n at TLSSocket.socketCloseListener (node:_http_client:474:11)\n at TLSSocket.emit (node:events:526:35)\n at node:net:323:12\n at TCP.done (node:_tls_wrap:588:7)","message":"Request Timeout","name":"OTLPExporterError","code":"ECONNRESET"}

I thought it would be related to how slowly or quickly the collector accepts answers, but sadly, no matter what I do, I get these timeouts.
Interestingly, not all of the traces get rejected: the first 10 to 20 will get through, after that nothing comes through.

Is this known in any way, or related to the Redwood implementation?

The opentelemetry.ts in Redwood looks like this:


import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { FastifyInstrumentation } from '@opentelemetry/instrumentation-fastify'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'
import { Resource } from '@opentelemetry/resources'
import { NodeTracerProvider, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-node'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { PrismaInstrumentation } from '@prisma/instrumentation'

const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'redwood-app',
    [SemanticResourceAttributes.SERVICE_VERSION]: '0.0.0',
  })
)

const exporter = new OTLPTraceExporter({
  // Update this URL to point to where your OTLP compatible collector is listening
  // The redwood development studio (`yarn rw exp studio`) can collect your telemetry at `http://127.0.0.1:4318/v1/traces`
  url: process.env.OTEL_HOST,
  headers: {
    Authorization:
      'Basic ' +
      Buffer.from(`${process.env.OTEL_USER}:${process.env.OTEL_PASSWORD}`).toString('base64'),
  },
})

// You may wish to switch to BatchSpanProcessor in production as it is the recommended choice for performance reasons
const processor = new SimpleSpanProcessor(exporter)

const provider = new NodeTracerProvider({
  resource: resource,
})
provider.addSpanProcessor(processor)

// Optionally register instrumentation libraries here
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    new HttpInstrumentation(),
    new FastifyInstrumentation(),
    new PrismaInstrumentation({
      middleware: true,
    }),
  ],
})

provider.register()

If you have any idea how I could fix this, or if you need me to provide more information, please let me know!

Thanks a lot

1 Like

Hey @pcace 👋 I haven’t experienced this so can’t give a direct answer I know will work. One thing I might try is changing from a SimpleSpanProcessor to a BatchSpanProcessor. It could be that sending each individual span is a little too much for the network/ingestion to handle.
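In the generated api/src/opentelemetry file that should be a small change, something like this (just a sketch):

// import BatchSpanProcessor alongside (or instead of) SimpleSpanProcessor
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node'

// Buffers spans and exports them in batches instead of one request per span
const processor = new BatchSpanProcessor(exporter)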

Hi, thank you so much for your fast reply!
Using the BatchSpanProcessor it gets a little bit better - since the issue gets a bit more obvious:

api | Dropped 11785 spans because maxQueueSize reached
api | Dropped 4329 spans because maxQueueSize reached
api | Dropped 3574 spans because maxQueueSize reached
api | Dropped 1726 spans because maxQueueSize reached

What is still strange to me is that even if I set the maxQueueSize crazy high, it changes nothing and the traces get lost:

provider.addSpanProcessor(processor, {
  // The maximum queue size. After the size is reached spans are dropped.
  maxQueueSize: 200000,
  // The maximum batch size of every export. It must be smaller or equal to maxQueueSize.
  maxExportBatchSize: 100000,
  // The interval between two consecutive exports
  // scheduledDelayMillis: 50000,
  // How long the export can run before it is cancelled
  // exportTimeoutMillis: 300000,
})

Any idea on that? Thanks a lot again!!

No problem! Not too sure about the queue size, unless there is a maximum limit the config can’t exceed. My next gut reaction would be to see if making the scheduledDelayMillis small helps to keep the queue size low?
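For what it’s worth, the OpenTelemetry JS SDK reads those limits from the BatchSpanProcessor constructor rather than from addSpanProcessor, so they may be silently ignored where they are right now. A sketch with illustrative numbers:

const processor = new BatchSpanProcessor(exporter, {
  // The maximum queue size; spans are dropped once it is full
  maxQueueSize: 5000,
  // Must be smaller than or equal to maxQueueSize
  maxExportBatchSize: 512,
  // The interval (ms) between two consecutive exports
  scheduledDelayMillis: 1000,
  // How long (ms) an export can run before it is cancelled
  exportTimeoutMillis: 30000,
})

provider.addSpanProcessor(processor)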

I have to find time to get back to using OTel with redwood and improving the default experience. After I implemented the initial experimental dev support I haven’t had the time to go back and improve it.

Hi, thanks again! It seems to be a little better. I guess the problem comes from the fact that we are using row-level security and need to pull lots of objects from the db.
Disabling PrismaInstrumentation makes the problem disappear completely.

One other problem @mat mentioned is that the console gets flooded by logs like this too:

api | (node:549304) [FSTDEP017] FastifyDeprecation: You are accessing the deprecated "request.routerPath" property. Use "request.routeOptions.url" instead. Property "req.routerPath" will be removed in `fastify@5`.

Any idea if there could be a fix?

cheers and thanks!

Okay, that’s a useful data point for us when we revisit for sure. We of course would like it to JustWork even with features like row-level security.

I’ll prioritise looking into that deprecation notice since it’s such a pain/spammer. I probably won’t get to it this week though, so I might only be able to get back to you all over the weekend or early next week.

A little update from my side - with a little bit of configuring it seems to be working much better after adding a TraceIdRatioBasedSampler to the provider. It also works with PrismaInstrumentation. This is what it looks like for now, and it is working. I now need to figure out how useful the traces are if only 10% are traced. Maybe there is another sampler which makes more sense (for example one which samples only long-running traces or failing traces…)

const { diag, DiagConsoleLogger, DiagLogLevel } = require('@opentelemetry/api')
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http')

const { registerInstrumentations } = require('@opentelemetry/instrumentation')
const { FastifyInstrumentation } = require('@opentelemetry/instrumentation-fastify')
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http')
const { Resource } = require('@opentelemetry/resources')
const {
  NodeTracerProvider,
  BatchSpanProcessor,
  TraceIdRatioBasedSampler,
} = require('@opentelemetry/sdk-trace-node')
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions')
const { PrismaInstrumentation } = require('@prisma/instrumentation')
const { NodeSDK } = require('@opentelemetry/sdk-node')

// You may wish to set this to DiagLogLevel.DEBUG when you need to debug opentelemetry itself
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO) // INFO

const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'redwood-app',
    [SemanticResourceAttributes.SERVICE_VERSION]: '0.0.0',
  })
)

const exporter = new OTLPTraceExporter({
  // Update this URL to point to where your OTLP compatible collector is listening
  // The redwood development studio (`yarn rw exp studio`) can collect your telemetry at `http://127.0.0.1:4318/v1/traces`
  url: process.env.OTEL_HOST,
  headers: {
    Authorization:
      'Basic ' +
      Buffer.from(`${process.env.OTEL_USER}:${process.env.OTEL_PASSWORD}`).toString('base64'),
  },
  // concurrencyLimit: 100000, // an optional limit on pending requests
})

// You may wish to switch to BatchSpanProcessor in production as it is the recommended choice for performance reasons
const processor = new BatchSpanProcessor(exporter)

// https://opentelemetry.io/docs/languages/js/sampling/
// https://opentelemetry.io/docs/languages/sdk-configuration/general/
// TraceIdRatioBasedSampler expects a number; environment variables are strings
const samplePercentage = parseFloat(
  process.env.OTEL_TRACES_SAMPLER_ARG
    ? process.env.OTEL_TRACES_SAMPLER_ARG
    : '0.1'
)

const provider = new NodeTracerProvider({
  resource: resource,
  sampler: new TraceIdRatioBasedSampler(samplePercentage),
})

provider.addSpanProcessor(processor, {
  // The maximum queue size. After the size is reached spans are dropped.
  maxQueueSize: 5000,
  // The maximum batch size of every export. It must be smaller or equal to maxQueueSize.
  // maxExportBatchSize: 100000,
  // The interval between two consecutive exports
  scheduledDelayMillis: 10,
  // How long the export can run before it is cancelled
  exportTimeoutMillis: 3000,
})

// Optionally register instrumentation libraries here
registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [
    new HttpInstrumentation(),
    new FastifyInstrumentation(),
    new PrismaInstrumentation({
      middleware: true,
    }),
  ],
})

provider.register()
2 Likes

I set this up, and it worked seamlessly when pointing it at Honeycomb. However, when pointing it at RW studio, I see the following logs over and over:

{"level":50,"time":1708664248315,"pid":90954,"hostname":"pemulis.local","msg":"Failed to ingest span"}
{"level":50,"time":1708664248315,"pid":90954,"hostname":"pemulis.local","err":{"type":"TypeError","message":"Cannot convert undefined to a BigInt","stack":"TypeError: Cannot convert undefined to a BigInt\n    at BigInt (<anonymous>)\n    at convertLongToBigInt (/Users/aoife/Code/app/node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js:52:36)\n    at createSpan (/Users/aoife/Code/app/node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js:246:22)\n    at async __rw_handler (/Users/aoife/Code/app/node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js:281:11)\n    at async requestHandler (/Users/aoife/Code/app/node_modules/@redwoodjs/api-server/dist/requestHandlers/awsLambdaFastify.js:92:30)"},"msg":"Cannot convert undefined to a BigInt"}

I’m using RW 7.0.2 - any ideas what’s going on? Also, is there any way to just turn on OTel in production?

2 Likes

Thanks for raising this with us @aoifelee and thanks for trying out the studio!

The conversion of undefined to BigInt is a bug. I’ll raise an issue for this on our studio repository here. I will try to get round to fixing it soon but can’t commit to a timescale yet, sorry!

I’m not sure about just enabling it during production. So far we have only really supported this as a development time feature. I’d like to hear more about your needs for it during production and how you’ve been configuring it to run in production right now. We can certainly try to make this easier but it might not be a top-level priority on our roadmap right now.

I haven’t configured it to run in production since it seems that by default the telemetry script only runs in dev. I asked because I’d like to have OTel-based observability in prod, and the integration with Honeycomb was so easy to set up (and I assume it would also be easy for other vendors).

1 Like

I too hit this error as I upgraded to v7 and tried out Studio.
I have it introspecting the schema and showing that, but the traces look to have the same error:

{"level":30,"time":1709152623137,"pid":16862,"hostname":"Ryans-Mac-mini.local","reqId":"req-22r","res":{"statusCode":200},"responseTime":1.2016669996082783,"msg":"request completed"}
{"level":30,"time":1709152623138,"pid":16862,"hostname":"Ryans-Mac-mini.local","reqId":"req-22y","req":{"method":"POST","url":"/.redwood/functions/otel-trace","hostname":"127.0.0.1:4318","remoteAddress":"127.0.0.1","remotePort":51988},"msg":"incoming request"}
{"level":50,"time":1709152623137,"pid":16862,"hostname":"Ryans-Mac-mini.local","err":{"type":"TypeError","message":"Cannot convert undefined to a BigInt","stack":"TypeError: Cannot convert undefined to a BigInt\n    at BigInt (<anonymous>)\n    at convertLongToBigInt (/Users/ryanquinn/Documents/projects/SwankyCloud/Redwood-Swanky/node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js:52:36)\n    at createSpan (/Users/ryanquinn/Documents/projects/SwankyCloud/Redwood-Swanky/node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js:246:22)\n    at async __rw_handler (/Users/ryanquinn/Documents/projects/SwankyCloud/Redwood-Swanky/node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js:281:11)\n    at async requestHandler (/Users/ryanquinn/Documents/projects/SwankyCloud/Redwood-Swanky/node_modules/@redwoodjs/api-server/dist/requestHandlers/awsLambdaFastify.js:92:30)"},"msg":"Cannot convert undefined to a BigInt"}
{"level":50,"time":1709152623138,"pid":16862,"hostname":"Ryans-Mac-mini.local","msg":"Failed to ingest span"}

I dove into it a bit and I think I got it working, at least for my setup.
Inside node_modules/@redwoodjs/studio/api/dist/functions/otel-trace/otel-trace.js

I changed this part to just not use convertLongToBigInt and was at least able to get throughput to studio. Now the question is whether I can skip this function:

The part here was:

await import_db.db.oTelTraceSpan.create({
    data: {
      typeId: spanTypeId,
      brief: spanBrief,
      traceId: span.traceId,
      traceState: span.traceState,
      spanId: span.spanId,
      parentId: span.parentSpanId,
      name: span.name,
      flags: (0, import_flags.default)(span),
      kind: span.kind,
      startTimeNano: convertLongToBigInt(span.startTimeUnixNano),
      endTimeNano: convertLongToBigInt(span.endTimeUnixNano),
      attributes: {
        connect: (0, import_map.default)(attributeIds).call(attributeIds, (id) => ({
          id
        }))
      },
      events: {
        create: events
      },
      links: {
        create: links
      },
      statusMessage: span.status.message,
      statusCode: span.status.code,
      resourceId,
      scopeId
    },
    select: {
      id: true
    }
  });

The convertLongToBigInt function used for:
startTimeNano: convertLongToBigInt(span.startTimeUnixNano),
endTimeNano: convertLongToBigInt(span.endTimeUnixNano)
is expecting an object with {low, high, unsigned} inputs, but the OpenTelemetry span’s value is just a string of unix nanoseconds:

So I changed it over to:

await import_db.db.oTelTraceSpan.create({
    data: {
      typeId: spanTypeId,
      brief: spanBrief,
      traceId: span.traceId,
      traceState: span.traceState,
      spanId: span.spanId,
      parentId: span.parentSpanId,
      name: span.name,
      flags: (0, import_flags.default)(span),
      kind: span.kind,
      startTimeNano: span.startTimeUnixNano,
      endTimeNano: span.endTimeUnixNano,
      attributes: {
        connect: (0, import_map.default)(attributeIds).call(attributeIds, (id) => ({
          id
        }))
      },
      events: {
        create: events
      },
      links: {
        create: links
      },
      statusMessage: span.status.message,
      statusCode: span.status.code,
      resourceId,
      scopeId
    },
    select: {
      id: true
    }
  });
{"traceId":"202ff2bdf87712836a370f642cfc4f7d","spanId":"e7351b0f99977297","name":"POST /:routeName","kind":2,"startTimeUnixNano":"1709165544582000000","endTimeUnixNano":"1709165544756285625","attributes":[{"key":"http.url","value":{"stringValue":"http://localhost:8910/graphql"}},{"key":"http.host","value":{"stringValue":"localhost:8910"}},{"key":"net.host.name","value":{"stringValue":"localhost"}},{"key":"http.method","value":{"stringValue":"POST"}},{"key":"http.scheme","value":{"stringValue":"http"}},{"key":"http.target","value":{"stringValue":"/graphql"}},{"key":"http.user_agent","value":{"stringValue":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"}},{"key":"http.request_content_length_uncompressed","value":{"intValue":512}},{"key":"http.flavor","value":{"stringValue":"1.1"}},{"key":"net.transport","value":{"stringValue":"ip_tcp"}},{"key":"net.host.ip","value":{"stringValue":"::1"}},{"key":"net.host.port","value":{"intValue":8911}},{"key":"net.peer.ip","value":{"stringValue":"::1"}},{"key":"net.peer.port","value":{"intValue":56826}},{"key":"http.status_code","value":{"intValue":200}},{"key":"http.status_text","value":{"stringValue":"OK"}},{"key":"http.route","value":{"stringValue":"/:routeName"}}],"droppedAttributesCount":0,"events":[],"droppedEventsCount":0,"status":{"code":0},"links":[],"droppedLinksCount":0}

After that edit it seems to be working for me.
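For completeness, a version of the helper that accepts both shapes (the {low, high, unsigned} object and the plain nanosecond string) might look something like this untested sketch:

function convertLongToBigInt(value) {
  // Plain string of unix nanoseconds, e.g. "1709165544582000000"
  if (typeof value === 'string') {
    return BigInt(value)
  }
  // protobuf Long-style object { low, high, unsigned }; assumes a non-negative value
  return (BigInt(value.high >>> 0) << 32n) | BigInt(value.low >>> 0)
}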

1 Like

Hello, when I create a new Redwood application I get an error - can you confirm that, or does it only happen to me? xD

@Parcero thanks! This is due to a fault with our internal telemetry platform and won’t cause you any problems with development. We’re correcting this on our end and hopefully it’ll be fixed soon!

1 Like

@QuinnsCode Thanks for the detailed response! I’ll use this to help get a fix in.

1 Like

Trying rw studio and I was having the same problem as @QuinnsCode
rw 7.1.0

In my case:

  1. I had to fix the Prisma client issue (using darwin, but darwin was not included in Studio).
    I copied the darwin client into node_modules/@redwoodjs/studio/node_modules/@prisma/engines/libquery_engine-darwin.dylib.node

  2. This got rw studio working but the traces were empty.

  3. I then made Quinn’s changes and now see info in my studio dashboard.

1 Like