Forcing the Analog log analyzer not to strip useful data from reports

Mikael Willberg

10.5.2010

I use the Analog web log analyzer to generate some statistics. While checking the daily reports I noticed a few 404 (Not Found) errors.

Failure Report

Listing files, sorted by the number of failed requests.

reqs: file
----: ----
   2: /fi/article/253

Failed Referrer Report

Listing referring URLs, sorted by the number of failed requests.

reqs: URL
----: ---
   2: http://mig.hyper.fi/article/253

The confusing thing was that the failed requests were referred from the very page that itself seemed to return the error. Checking the actual server log file revealed that Analog had silently stripped some data from the requests.

xxx.xxx.xxx.xxx - - [09/May/2010:07:57:35 +0300] "GET /fi/article/253#comment-2 HTTP/1.1" 404 7872 "http://mig.hyper.fi/article/253" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
xxx.xxx.xxx.xxx - - [09/May/2010:07:57:48 +0300] "GET /fi/article/253#comment-2 HTTP/1.1" 404 7872 "http://mig.hyper.fi/article/253" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
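
Lines like these can be picked out of a log mechanically. The following is only a minimal sketch: the function name request_has_hash is hypothetical, and it assumes the request line is the first double-quoted field of each log line and contains no escaped quotes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Analog): returns 1 if the first
   double-quoted field of a log line, assumed to be the request line,
   contains a '#', and 0 otherwise. */
int request_has_hash(const char *line)
{
  const char *start = strchr(line, '"');
  const char *end;

  if (start == NULL)
    return 0;                      /* no quoted field at all */
  start++;                         /* skip the opening quote */
  end = strchr(start, '"');
  if (end == NULL)
    return 0;                      /* unterminated request field */

  /* Search only inside the request field, so a '#' in the referrer
     or user agent does not count. */
  return memchr(start, '#', (size_t)(end - start)) != NULL;
}
```

Feeding each log line to this function and printing the ones that return 1 would list the requests that Analog reports in truncated form.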

There was no configuration option to change this, so I had to edit the Analog 6.0 sources a little.

--- analog-6.0/src/alias.c.orig 2010-05-10 13:23:13.000000000 +0300
+++ analog-6.0/src/alias.c      2010-05-10 13:23:50.000000000 +0300
@@ -99,8 +99,8 @@
 
   /* Zerothly, strip off #'s. These shouldn't get in the request, but do
      for some broken agents (particularly spiders). */
-  if ((c = strchr(name, '#')) != NULL)
-    *c = '\0';
+  //if ((c = strchr(name, '#')) != NULL)
+  //  *c = '\0';
 
   /* Halfthly, strip from semicolon to the end of the URL stem.
      (e.g. jsessionid). */
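
The effect of the two disabled lines can be reproduced outside Analog. This is a standalone sketch, not Analog code: the function name strip_fragment is hypothetical, and only the strchr truncation itself is taken from the patch above.

```c
#include <string.h>

/* Hypothetical helper (not part of Analog): cuts the string at the
   first '#', which is what the two disabled lines did to `name`. */
void strip_fragment(char *name)
{
  char *c = strchr(name, '#');

  if (c != NULL)
    *c = '\0';                     /* everything from '#' on is lost */
}
```

Applied to a buffer holding /fi/article/253#comment-2, this leaves /fi/article/253, which matches the shortened name seen in the original Failure Report.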

A quick recompile and voilà, the report shows more useful data.

Failure Report

Listing files, sorted by the number of failed requests.

reqs: file
----: ----
   2: /fi/article/253#comment-2

Failed Referrer Report

Listing referring URLs, sorted by the number of failed requests.

reqs: URL
----: ---
   2: http://mig.hyper.fi/article/253

The comment in the source suggests that this change could break certain assumptions in the code, but the probability of that is practically nonexistent. Even the RFC states that # characters in a request URL must be encoded.
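
For illustration, a well-behaved client that really wanted a literal # in the request path would send it percent-encoded as %23. A minimal sketch of that encoding follows; the function name encode_hash and its buffer interface are hypothetical, and it handles only the # character, not full URL encoding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, writing every '#' as "%23".
   Returns 0 on success, -1 if dst (dstlen bytes) is too small. */
int encode_hash(const char *src, char *dst, size_t dstlen)
{
  size_t used = 0;

  if (dstlen == 0)
    return -1;                     /* no room even for the terminator */

  for (; *src != '\0'; src++) {
    size_t need = (*src == '#') ? 3 : 1;

    if (used + need + 1 > dstlen)  /* +1 keeps room for the terminator */
      return -1;
    if (*src == '#')
      memcpy(dst + used, "%23", 3);
    else
      dst[used] = *src;
    used += need;
  }
  dst[used] = '\0';
  return 0;
}
```

With a sufficiently large buffer, /fi/article/253#comment-2 becomes /fi/article/253%23comment-2, a request that contains no raw # for a log analyzer to strip.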

You might ask why the "invalid" request was made in the first place. It was just a poorly coded spam robot that posted a new comment to an article in the blog and then tried to check whether it had been published successfully.

Edit 2018.06.01 - analog.cx is now maintained under the name Analog C:Amie Edition

Mikael 'Mig' Willberg
