Levels are easy. You have three: DEBUG, INFO, and ERROR. DEBUG is stuff that's interesting to you, the author of the code. INFO is things that are interesting to the person/team that runs your software in production. ERROR is for things that require human attention.
My opinion is that INFO logs are the primary user interface for your server-side software, and demand the level of attention that you would expect a frontend designer to devote to your brand or whatever. That's where the software gets to interface with the person that's responsible for its care and feeding. Garbage in there wastes their time. Everything in INFO should be logged for a reason. (Yes, sometimes there's spam. Usually request/response data. Most of the time the requests go great, but you don't know they've gone great until you've logged the fact that they were made. It's a necessary evil.)
Meanwhile, DEBUG should probably be pretty chatty. If your program is consuming resources like CPU, it better justify itself in the DEBUG logs. A program sitting there using 100% of the CPU and not logging anything at level DEBUG is very very suspicious. My philosophy is to "show your work" on every critical operation. Log when something starts, what decisions it makes along the way, and when it's done. A message like "failed to contact server example.com: i/o timeout after 10 minutes" is 10 minutes too late to be interesting. "connect to example.com: started", "connect to example.com: i/o timeout after 10 minutes" is much better, because 10 minutes before the error you can read the logs and realize example.com is NOT where your database is hosted. (You can really go overboard here and I would never stop you. connect to example.com: resolve DNS: start, connect to example.com: resolve DNS: A 1.2.3.4, connect to example.com: dial 1.2.3.4: start, .... In production, I compromise and log "start", "ongoing long HTTP request" (at 10 seconds), and "finished" with the status code.) Every branch is a place to log which branch you selected and why. If there's a bug and you only know the outcome, but not the software's "thought process" that led you there, you have nothing to go on. "Hey, the FrobTheBaz endpoint is returning 400s." That is useless information. You have to read the code and guess. You do not want to be guessing during an outage.
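Here's a rough sketch of what that looks like in Go, using the standard log/slog package; the host, port, and timeout are placeholders, not anything from a real system:

    package main

    import (
        "context"
        "log/slog"
        "net"
        "os"
        "time"
    )

    func main() {
        // DEBUG-level JSON logs on stdout.
        logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelDebug}))

        host := "example.com:5432" // placeholder
        logger.Debug("connect: start", "host", host)

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        start := time.Now()
        var d net.Dialer
        conn, err := d.DialContext(ctx, "tcp", host)
        if err != nil {
            // The failure line carries the same context as the start line, so you can
            // match them up and see exactly how long you waited and on what.
            logger.Debug("connect: failed", "host", host, "elapsed", time.Since(start).String(), "err", err.Error())
            return
        }
        defer conn.Close()
        logger.Debug("connect: done", "host", host, "elapsed", time.Since(start).String())
    }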
Errors are a whole can of worms. If someone reading the logs can't fix the problem, it's not an error. "Error: cannot write data to the database". That's a big problem. Disconnect that popcorn popper and plug the power cable to your database back in! "Error: login for root from spambot@1.2.3.4: invalid signature" is a problem for your firewall. What is someone reading the logs going to do about that? Call up 1.2.3.4 and ask them to stop guessing SSH keys? At best, that's an INFO log. There are about 3.4 × 10^38 possible keys; they aren't ever going to guess it. If you decide "this isn't even worth logging", I can definitely agree with you. (I'd log it, though. An increase in failed attempts is good ancillary information when something else draws your attention to the system.)
Finally, newline-delimited JSON ("jsonl") is the only correct format for logs. Nobody wants to write a parser to figure out "how many 400s did we serve today". Just let jq [https://jqlang.github.io/jq/] do it for you. If you want pretty colors for viewing, use something like jlog: https://github.com/jrockway/json-logs. You don't need to write VT100 control characters to your log file. I know for a fact you don't even have a VT100.
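If you're in Go, the standard log/slog package already emits JSON lines via its JSONHandler. A minimal sketch; the field names and the hardcoded 400 are just for illustration:

    package main

    import (
        "log/slog"
        "net/http"
        "os"
        "time"
    )

    func main() {
        // JSONHandler writes one JSON object per log call, one line per object.
        logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            start := time.Now()
            status := http.StatusBadRequest // pretend this request failed validation
            w.WriteHeader(status)
            logger.Info("request finished",
                "method", r.Method,
                "path", r.URL.Path,
                "status", status,
                "elapsed_ms", time.Since(start).Milliseconds(),
            )
        })
        logger.Info("listening", "addr", ":8080")
        logger.Error("server exited", "err", http.ListenAndServe(":8080", nil))
    }

From there, something like jq -c 'select(.status == 400)' app.log | wc -l (the file name is whatever you collect logs into) answers the "how many 400s" question without anyone writing a parser.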
All other formats of telemetry are just logs with fancier user interfaces. Metrics? Just write them out to your logs: "meter:http_bytes_sent delta:16384 x-request-id:813db394-75a1-4b22-94ad-e7850d9cdd25". You can still tail your logs and feed the metrics into a proper time series database (with less cardinality than "per request", of course), but you can now also do investigations in context. For example, I've implemented exactly this system at work. A customer shows up, saying that we're using too much memory. I ask for the logs and see something like "span start: compaction", "meter:self_rss_bytes delta:2000000000", "span end: compaction, error:nil", "meter:self_rss_bytes delta:-1000000000". Hey, we leaked a gig of RAM in compaction. You would never figure out the memory leak with "compaction done after 5 minutes" and a graph of memory usage across your fleet.
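In code, a meter doesn't need to be anything fancier than a log call. A rough sketch; the meter helper is invented for illustration, and the field values match the example above:

    package main

    import (
        "log/slog"
        "os"
    )

    var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // meter emits one structured log line per metric increment. A collector can tail
    // these lines and roll them up into a time series, while a human reading the raw
    // log still sees each delta next to the surrounding request context.
    func meter(name string, delta int64, requestID string) {
        logger.Info("meter",
            "meter", name,
            "delta", delta,
            "x-request-id", requestID,
        )
    }

    func main() {
        meter("http_bytes_sent", 16384, "813db394-75a1-4b22-94ad-e7850d9cdd25")
    }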
While you weren't looking I also threw traces in there! Have a logging primitive like Span("operate(foo)", func { ... }) that logs the start and the end, and throw a generated ID in every message. Concatenate the span IDs with each recursive call to Span (I abuse context.Context for this, sorry Rob Pike), and log them with every message in the span. Now you have distributed tracing. As always, sure, parse those logs and throw the log-less spans into Jaeger so you can efficiently search for rare events and have a nice data flow graph or whatever.
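Here's roughly what that primitive could look like in Go, with the signature adjusted to take the context explicitly; the ID scheme is made up for the sketch, and a real implementation would use proper trace/span ID formats:

    package main

    import (
        "context"
        "crypto/rand"
        "encoding/hex"
        "log/slog"
        "os"
    )

    var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

    type spanKey struct{}

    // spanPath returns the concatenated span IDs stored in ctx, if any.
    func spanPath(ctx context.Context) string {
        path, _ := ctx.Value(spanKey{}).(string)
        return path
    }

    // Span logs the start and end of an operation, and runs f with a context whose
    // span path has a fresh ID appended, so nested calls build up a trace.
    func Span(ctx context.Context, name string, f func(ctx context.Context) error) error {
        id := make([]byte, 4)
        rand.Read(id) // crypto/rand; error ignored for brevity in this sketch
        path := spanPath(ctx) + "/" + hex.EncodeToString(id)
        ctx = context.WithValue(ctx, spanKey{}, path)

        logger.Info("span start", "name", name, "span", path)
        err := f(ctx)
        logger.Info("span end", "name", name, "span", path, "err", err)
        return err
    }

    func main() {
        Span(context.Background(), "operate(foo)", func(ctx context.Context) error {
            logger.Info("doing the thing", "span", spanPath(ctx))
            return Span(ctx, "sub-operation", func(ctx context.Context) error {
                logger.Info("nested work", "span", spanPath(ctx))
                return nil
            })
        })
    }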
Logs are so simple and so great. The problem you run into is not producing enough of them. Disks cost nothing. You can delete the debug logs after a short period of time, or not even produce them until you suspect an issue. Don't sweat the log levels; pick the 3 I picked and try to enforce the rules in code review. Aim the log messages at the right audience: dev team, ops team, emergencies. Show your work. Log contextual information. Put metrics in the logs, and aggregate them for dashboards and monitoring. Annotate logs with trace IDs, and put those into Jaeger. You'll never have to wonder why your system is broken again; you printed out everything it's doing. The aggregation will tell you you're not meeting your error budget, and the logs for a representative request will tell you why that one failed.
Once you know what's broken, the fix is always easy.
Whenever I tell people about my system, they tell me it can't scale. When I worked at Google, I wrote exactly this system for monitoring Google Fiber (well, we didn't really have log levels...), and we could answer questions about the entire CPE fleet as easily as we could answer questions about one device or one bug. Over many years, I've finally gotten this into the "real world" (like weeks ago), and it's been amazing.
Good logs are the best thing you can do for someone that has to run your software. For a lot of folks, that's you! For other folks, that's your customers. Can your customer set up Prometheus, Elasticsearch, and Jaeger, and dump that all out and send it to you in Slack? Unlikely. Can they dump a file that's 10MB of JSON lines into Slack? They sure can. And now you can diagnose the bug and fix it without bothering them for more info. It's good for everyone.