
Planet Libre  -  Francois Aichelbaum : JSON, MongoDB and rsyslog are on a boat

 -  December 2020 - 

JSON, MongoDB and rsyslog are on a boat

When dealing with log management, obvious solutions emerge early, often before you have even discussed the purpose of the log management. We can talk about ELK, Graylog, Splunk, … Those are great tools, but they may not fit all your needs. Lately, I had to work with one of my customers on setting up log management for billing purposes. Rsyslog was already in place to collect and centralize all the logs (and manage their backups). MongoDB seemed like a perfect fit for storing JSON extracts of the logs to generate the proper stats.

Invoice Management - DronaHQ

JSON is standard

From json.org:

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

On top of that format, we can use a more structured syntax, adapted to log management: the CEE Log Syntax. This format is well supported by the various tools involved in the present case: nginx, rsyslog, and MongoDB. Up to a certain level…
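As a quick illustration (with a hypothetical log line, not one from this setup), a CEE-structured message is just the `@cee:` cookie followed by a plain JSON object, which any JSON parser can consume once the cookie is stripped:

```python
import json

# A hypothetical @cee-structured syslog payload: the "@cee:" cookie
# followed by a regular JSON object.
payload = '@cee: {"vhost":"example.com","status":200,"bytes_sent":512}'

# Strip the cookie, then hand the rest to any JSON parser.
cookie = "@cee:"
assert payload.startswith(cookie)
event = json.loads(payload[len(cookie):])

print(event["vhost"], event["status"])  # example.com 200
```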

Structured JSON

Nginx knows about @cee

Sort of. Nginx has native syslog export for both error_log and access_log. Out of the box, it will simply push all logs in the default log formats to syslog. This won’t be very convenient for proper management afterwards. But nginx is easy to use, and you can define your own custom format. You’ll find plenty of blog posts and tutorials, such as Any IT here? Help me!, describing the proper @cee JSON log format you need:

log_format mainJSON escape=json '@cee: {'
        '"vhost":"$server_name",'
        '"remote_addr":"$remote_addr",'
        '"time_iso8601":"$time_iso8601",'
        '"request_uri":"$request_uri",'
        '"request_length":"$request_length",'
        '"request_method":"$request_method",'
        '"request_time":"$request_time",'
        '"server_port":"$server_port",'
        '"server_protocol":"$server_protocol",'
        '"ssl_protocol":"$ssl_protocol",'
        '"status":"$status",'
        '"bytes_sent":"$bytes_sent",'
        '"http_referer":"$http_referer",'
        '"http_user_agent":"$http_user_agent",'
        '"upstream_response_time":"$upstream_response_time",'
        '"upstream_addr":"$upstream_addr",'
        '"upstream_connect_time":"$upstream_connect_time",'
        '"upstream_cache_status":"$upstream_cache_status",'
        '"tcpinfo_rtt":"$tcpinfo_rtt",'
        '"tcpinfo_rttvar":"$tcpinfo_rttvar"'
        '}';

They all give you the very same structure. Not sure who started it, but that’s not the point: they’re all wrong about it!

I am this close - Terminator 2

They are close, indeed, but some details have to be fixed first. Why are they wrong? Because JSON, and especially the @cee flavor of it, has datatypes. Numbers (integers and floats) should not be wrapped in quotes: quotes are for strings only.
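The difference is easy to see with any JSON parser; quoting a number silently turns it into a string (a quick Python illustration):

```python
import json

# The same nginx status code, once quoted and once not.
quoted = json.loads('{"status": "200"}')
unquoted = json.loads('{"status": 200}')

print(type(quoted["status"]))    # <class 'str'>
print(type(unquoted["status"]))  # <class 'int'>

# A numeric comparison only works on the unquoted version.
print(unquoted["status"] >= 400)  # False
```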

log_format mainJSON escape=json '@cee: {'
        '"vhost":"$server_name",'
        '"remote_addr":"$remote_addr",'
        '"time_iso8601":"$time_iso8601",'
        '"request_uri":"$request_uri",'
        '"request_length":$request_length,'
        '"request_method":"$request_method",'
        '"request_time":$request_time,'
        '"server_port":$server_port,'
        '"server_protocol":"$server_protocol",'
        '"ssl_protocol":"$ssl_protocol",'
        '"status":$status,'
        '"bytes_sent":$bytes_sent,'
        '"http_referer":"$http_referer",'
        '"http_user_agent":"$http_user_agent",'
        '"upstream_response_time":$upstream_response_time,'
        '"upstream_addr":"$upstream_addr",'
        '"upstream_connect_time":$upstream_connect_time,'
        '"upstream_cache_status":"$upstream_cache_status",'
        '"tcpinfo_rtt":$tcpinfo_rtt,'
        '"tcpinfo_rttvar":$tcpinfo_rttvar'
        '}';

Now we are good to go. Then, as the documentation states, we can use this log format to push logs to syslog:

access_log      syslog:server=10.10.10.2:514,facility=local7,tag=nginx,severity=info mainJSON;

Of course, you can adapt the facility, the tag, and the severity, along with the log format name.
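Under the hood, what travels on the wire is a classic syslog datagram: a `<PRI>` field computed as facility × 8 + severity (local7 is 23, info is 6, so `<190>`), followed by the tag and the message. A minimal, simplified Python sketch (sending to a local test socket instead of the real collector, and omitting the timestamp/hostname a real syslog line carries):

```python
import socket

# facility local7 = 23, severity info = 6 → PRI = 23 * 8 + 6 = 190
pri = 23 * 8 + 6
message = f"<{pri}>nginx: @cee: {{\"vhost\":\"example.com\",\"status\":200}}"

# Use a local listener instead of the real 10.10.10.2:514 collector.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))  # pick any free port
addr = listener.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(message.encode("utf-8"), addr)

data, _ = listener.recvfrom(4096)
print(data.decode("utf-8"))

sender.close()
listener.close()
```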

Rsyslog is great. Its documentation is not.

It’s a fact: technical documentation, written by technical people for technical people, is rarely the best. Rsyslog might be one of the best examples of this statement. Rainer Gerhards did great work with his tool, one of the best syslog managers in my humble opinion. Though, the documentation is hell to read. Add to this that it is not as accurate as it should be.

In the use case I was working on, the infrastructure benefits from rsyslog v8-stable. And as stated, we want to use the MongoDB exporter, aka ommongodb. Seems easy and straightforward? Almost. But remember this IT mantra:

If all goes well, we forget something.

For reasons that escape me while writing this post, I can tell you that it just does not work out of the box. Several aspects have to be considered before dealing with those logs:

  1. You need to take care of the encoding: BSON (the JSON variant MongoDB uses internally) only supports valid UTF-8 characters. So you need to fix the encoding before sending the logs. There is a module for that.
  2. The structured @cee data sent to MongoDB with the default configuration will just be pushed “as-is”: the first and main issue you’ll hit is that the date will be stored as a regular string.
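The first point is what rsyslog’s mmutf8fix module (loaded later in this post) takes care of: replacing offending bytes so the message is guaranteed to be valid UTF-8 before it reaches BSON. The idea can be sketched in Python (this mimics the effect, not rsyslog’s exact replacement character):

```python
# A byte string with an invalid UTF-8 sequence (0xFF is never valid UTF-8).
raw = b"GET /caf\xff HTTP/1.1"

# Replace invalid bytes so the result is guaranteed valid UTF-8,
# roughly what mmutf8fix achieves before the message reaches BSON.
fixed = raw.decode("utf-8", errors="replace")

print(fixed)           # GET /caf� HTTP/1.1
fixed.encode("utf-8")  # now round-trips without error
```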

But why? - Ryan Reynolds

Why? Simply because rsyslog only understands numbers and strings as datatypes on one end, and MongoDB doesn’t auto-detect dates and timestamps on the other end. Is it an issue in the end? If you want to benefit from MongoDB’s filtering features on dates, yes it is. For that purpose, you need the ISODate() type that only MongoDB knows about.

After a tremendous number of attempts, trying to make sense of the documentation to find the proper format, I decided to read the ommongodb module source code. Pretty easy, as it’s well-written C code:

		/* Convert text to ISODATE when needed */
		if (strncmp(name, "date", 5) == 0 || strncmp(name, "time", 5) == 0 ) {

You’ve read it well: the code does not expect the field name to start with date or time, it expects the field name to be exactly date or time (strncmp with a length of 5 compares the literal’s terminating NUL too, so only exact matches pass).
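A quick Python equivalent of that C check (a hypothetical helper, just to illustrate the behavior):

```python
def converts_to_isodate(name: str) -> bool:
    # Mirrors strncmp(name, "date", 5) == 0: five bytes are compared,
    # including the literal's trailing NUL, so only exact matches pass.
    return name == "date" or name == "time"

print(converts_to_isodate("date"))          # True
print(converts_to_isodate("time_iso8601"))  # False, despite the prefix
```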

Two solutions there:

  1. either update the log format in the nginx configuration, renaming time_iso8601 to date,
  2. or update your rsyslog configuration.

I’m not fond of the string-based template used in rsyslog’s default documentation for MongoDB: we’re using JSON as input and we expect JSON as output, so why should we use strings in between?

For that reason, I moved to JSON manipulation, thanks to the mmjsonparse module and a list-type template:

module(load="mmutf8fix")
module(load="mmjsonparse")
module(load="ommongodb")

template(name="json-syslog" type="list" option.jsonf="on") {
        property(outname="vhost" name="$!vhost" format="jsonf" datatype="string" onEmpty="null")
        property(outname="remote_addr" name="$!remote_addr" format="jsonf" datatype="string" onEmpty="null")
        property(outname="time_iso8601" name="$!time_iso8601" format="jsonf")
        property(outname="date" name="$!time_iso8601")
        property(outname="request_uri" name="$!request_uri" format="jsonf" datatype="string" onEmpty="null")
        property(outname="request_length" name="$!request_length" format="jsonf" datatype="number")
        property(outname="request_method" name="$!request_method" format="jsonf" datatype="string" onEmpty="null")
        property(outname="request_time" name="$!request_time" format="jsonf" datatype="number")
        property(outname="server_port" name="$!server_port" format="jsonf" datatype="number")
        property(outname="server_protocol" name="$!server_protocol" format="jsonf" datatype="string" onEmpty="null")
        property(outname="ssl_protocol" name="$!ssl_protocol" format="jsonf" datatype="string" onEmpty="null")
        property(outname="status" name="$!status" format="jsonf" datatype="number")
        property(outname="bytes_sent" name="$!bytes_sent" format="jsonf" datatype="number")
        property(outname="http_referer" name="$!http_referer" format="jsonf" datatype="string" onEmpty="null")
        property(outname="http_user_agent" name="$!http_user_agent" format="jsonf" datatype="string" onEmpty="null")
        property(outname="upstream_response_time" name="$!upstream_response_time" format="jsonf" datatype="string" onEmpty="null")
        property(outname="upstream_addr" name="$!upstream_addr" format="jsonf" datatype="string" onEmpty="null")
        property(outname="upstream_connect_time" name="$!upstream_connect_time" format="jsonf" datatype="string" onEmpty="null")
        property(outname="upstream_cache_status" name="$!upstream_cache_status" format="jsonf" datatype="string" onEmpty="null")
        property(outname="tcpinfo_rtt" name="$!tcpinfo_rtt" format="jsonf" datatype="number")
        property(outname="tcpinfo_rttvar" name="$!tcpinfo_rttvar" format="jsonf" datatype="number")
}

Now we can talk. - Kyle Maclachlan - Twin Peaks

That’s not all folks!

Following those changes, nginx logs are pushed to MongoDB, allowing easy statistics aggregation for billing purposes. As we only want to push these lines to MongoDB, the best approach is something like:

if ($syslogfacility-text == 'local7' and $syslogseverity-text == 'info') then {
        action(type="mmutf8fix")
        action(type="mmjsonparse")
        action(type="ommongodb" uristr="mongodb://--REDACTED--:--REDACTED--@--REDACTED--:27017/?authSource=logs&authMechanism=SCRAM-SHA-1" db="logs" collection="nginx" template="json-syslog")
        & stop
}

The uristr is a bit different from the documentation because, once again, the documentation is not really explicit enough about it. For most real-life scenarios, even if you use a MongoDB cluster, you want to rely on a dedicated database and a dedicated user, with the proper set of permissions. To benefit from that, you need to add some details to the uristr, such as the user:password pair, but also some query parameters:

  • authSource: the database to authenticate against, as your user only has permissions on it
  • authMechanism: which mechanism to use, since you’re pushing a password via the DSN

At the end

We made it through. We’ve set up nginx to push @cee-compliant JSON logs to syslog, then we prepared the logs to be properly pushed to MongoDB, and published them.

Now our folks can run their micro-batching to generate live billing and usage statistics for the customers.
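As a sketch of what such a micro-batch computes (pure Python over sample documents using the field names from the json-syslog template above; the real job would presumably run as a MongoDB aggregation pipeline):

```python
from collections import defaultdict

# Sample documents shaped like the ones the json-syslog template produces.
logs = [
    {"vhost": "a.example.com", "bytes_sent": 512, "status": 200},
    {"vhost": "a.example.com", "bytes_sent": 1024, "status": 200},
    {"vhost": "b.example.com", "bytes_sent": 2048, "status": 404},
]

# Group traffic per vhost, the basis of usage-based billing.
usage = defaultdict(lambda: {"requests": 0, "bytes": 0})
for doc in logs:
    usage[doc["vhost"]]["requests"] += 1
    usage[doc["vhost"]]["bytes"] += doc["bytes_sent"]

for vhost, stats in sorted(usage.items()):
    print(vhost, stats)
```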

How would you have tackled this kind of need? Have you suffered from technical documentation that was inadequate or not up to date?

Original post by Francois Aichelbaum.
