Forcing the Analog log analyzer not to strip useful data from reports
Mikael Willberg
10.5.2010 English Projects · Hacking Modification · Software
I use the Analog web log analyzer to generate some statistics. While checking the daily reports I noticed that there were a few 404 (Not Found) errors.
Failure Report
Listing files, sorted by the number of failed requests.

reqs: file
----: ----
   2: /fi/article/253

Failed Referrer Report
Listing referring URLs, sorted by the number of failed requests.

reqs: URL
----: ---
   2: http://mig.hyper.fi/article/253
The confusing thing was that these errors appeared to come from the very page that was reported as failing. Checking the actual server log file revealed that Analog had silently stripped some data from the requests.
xxx.xxx.xxx.xxx - - [09/May/2010:07:57:35 +0300] "GET /fi/article/253#comment-2 HTTP/1.1" 404 7872 "http://mig.hyper.fi/article/253" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
xxx.xxx.xxx.xxx - - [09/May/2010:07:57:48 +0300] "GET /fi/article/253#comment-2 HTTP/1.1" 404 7872 "http://mig.hyper.fi/article/253" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
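To spot such requests in the raw log without going through Analog, something like the following C sketch could be used. It assumes the combined log format shown above, and `request_has_fragment` is a hypothetical helper name of my own, not anything from Analog or the web server.

```c
#include <string.h>

/* Hypothetical helper (not part of Analog): inspects one combined-format
   log line and reports whether the request target contains a '#'.
   Returns 1 if it does, 0 if it does not, -1 if the line cannot be parsed. */
static int request_has_fragment(const char *line)
{
    const char *q1, *q2, *sp, *target, *end;

    q1 = strchr(line, '"');                 /* opening quote of the request */
    if (q1 == NULL)
        return -1;
    q2 = strchr(q1 + 1, '"');               /* closing quote of the request */
    if (q2 == NULL)
        return -1;

    /* The target starts after the space that follows the method. */
    sp = memchr(q1 + 1, ' ', (size_t)(q2 - (q1 + 1)));
    if (sp == NULL)
        return -1;
    target = sp + 1;

    /* The target ends at the next space, or at the closing quote. */
    end = memchr(target, ' ', (size_t)(q2 - target));
    if (end == NULL)
        end = q2;
    if (end == target)
        return -1;

    return memchr(target, '#', (size_t)(end - target)) != NULL;
}
```

Feeding each log line to this function and printing the ones where it returns 1 would list exactly the kind of requests shown above.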
There was no configuration option to change this, so I had to edit the Analog 6.0 sources a little.
--- analog-6.0/src/alias.c.orig 2010-05-10 13:23:13.000000000 +0300
+++ analog-6.0/src/alias.c 2010-05-10 13:23:50.000000000 +0300
@@ -99,8 +99,8 @@
 
   /* Zerothly, strip off #'s. These shouldn't get in the request, but do for
      some broken agents (particularly spiders). */
-  if ((c = strchr(name, '#')) != NULL)
-    *c = '\0';
+  //if ((c = strchr(name, '#')) != NULL)
+  //  *c = '\0';
 
   /* Halfthly, strip from semicolon to the end of the URL stem.
      (e.g. jsessionid). */
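To make the effect of those two lines concrete, here is a minimal standalone sketch of the same truncation. The function name `strip_fragment` is my own and does not exist in Analog; only the `strchr` logic is taken from the patch above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of Analog): cuts the string at the first
   '#', exactly like the two lines the patch comments out.
   Returns 1 if something was stripped, 0 otherwise. */
static int strip_fragment(char *name)
{
    char *c;

    assert(name != NULL);
    if ((c = strchr(name, '#')) != NULL) {
        *c = '\0';      /* everything from '#' onwards is lost */
        return 1;
    }
    return 0;
}
```

Applied to a writable copy of `/fi/article/253#comment-2`, this leaves `/fi/article/253`, which is why the Failure Report listed a file name identical to the page in the referrer.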
A quick recompile and voilà, the report shows more useful data.
Failure Report
Listing files, sorted by the number of failed requests.

reqs: file
----: ----
   2: /fi/article/253#comment-2

Failed Referrer Report
Listing referring URLs, sorted by the number of failed requests.

reqs: URL
----: ---
   2: http://mig.hyper.fi/article/253
The source comment suggests that this change could cause certain assumptions in the code to fail, but the probability of that is practically nonexistent. Even the RFC states that a # character in a request URL must be encoded.
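For illustration, a well-behaved client that really wanted a literal # in the request path would send it percent-encoded as %23. A rough sketch of that encoding in C follows; `encode_hash` is a hypothetical helper name, and it deliberately handles only the # character rather than full URL encoding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src to dst, replacing each '#' with "%23".
   Only '#' is handled; this is not a general URL encoder.
   Returns 0 on success, -1 if dst (dstlen bytes) is too small. */
static int encode_hash(const char *src, char *dst, size_t dstlen)
{
    size_t o = 0;   /* number of bytes written to dst so far */

    for (; *src != '\0'; src++) {
        if (*src == '#') {
            /* Need room for three characters plus the terminating NUL. */
            if (dstlen < 4 || o > dstlen - 4)
                return -1;
            memcpy(dst + o, "%23", 3);
            o += 3;
        } else {
            /* Need room for one character plus the terminating NUL. */
            if (dstlen < 2 || o > dstlen - 2)
                return -1;
            dst[o++] = *src;
        }
    }
    if (o >= dstlen)
        return -1;
    dst[o] = '\0';
    return 0;
}
```

With a large enough buffer, `/fi/article/253#comment-2` becomes `/fi/article/253%23comment-2`, so a raw # in a logged request is a reasonable sign of a broken agent.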
You might ask why the "invalid" request was made in the first place. It was just a poorly coded spam robot that posted a new comment on an article in the blog and then tried to check whether it had been published successfully.
Edit 2018.06.01 - analog.cx is now maintained under the name Analog C:Amie Edition.