Heuristics for Metrics

The special page Metrics reads the log files to produce web analytics. Larger websites can have large log files, and a significant share of the hits is generated by bots. SofaWiki uses the following heuristics to measure traffic efficiently while remaining accurate.

The metrics are generated in the following steps:

  • The Relation command "logs stats month" creates the statistics for one month.
  • The monthly reports apply Relation code to these statistics and save the results to the cache as "logsv2-month.csv", "logsv2-names-month.csv" and "logsv2-users-month.csv".
  • The yearly statistics read the cache files and apply Relation code to them.

Dealing with errors

  • Invalid log lines are ignored
  • Invalid paths are corrected. A path should not contain slashes, except for a sublang page
  • Lines with errors are removed from hits

Dealing with high volume

If there are more than 1024 lines, only a sample of the lines is scanned: the scan frequency is reduced by a factor of ceil(log2(hits) - 10). To compensate, each scanned line is counted more than once, so that the hit totals stay roughly correct.
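The sampling rule can be illustrated with a minimal Python sketch. It is not SofaWiki's code. It assumes that "slowing down the scan" means keeping every n-th line and that each kept line is then weighted by n; the function names sampling_factor and sample_lines are hypothetical.

```python
import math

def sampling_factor(line_count):
    """Return the factor ceil(log2(hits) - 10); 1 means every line is scanned."""
    if line_count <= 1024:
        return 1
    # Between 1025 and 2048 lines the formula still yields 1.
    return max(1, math.ceil(math.log2(line_count) - 10))

def sample_lines(lines):
    """Keep every n-th line and attach the weight n to it (assumed correction)."""
    factor = sampling_factor(len(lines))
    return [(line, factor) for line in lines[::factor]]

lines = [f"hit {i}" for i in range(4096)]
sampled = sample_lines(lines)
estimated_hits = sum(weight for _, weight in sampled)
```

With 4096 lines the factor is 2, so half of the lines are scanned and each counts twice. The factor grows only logarithmically, so even a log with about a million lines is still sampled at one line in ten.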

Dealing with snowflakes

Users with only a single hit in a day are removed, and their lines are removed from the hits. These may have been human users, but as they did not interact with the site, SofaWiki does not count them as visitors.
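As an illustration, the snowflake filter could look like the following Python sketch. The data layout (a list of (user, page) pairs for one day) and the name remove_snowflakes are assumptions made for the example, not SofaWiki's actual structures.

```python
from collections import Counter

def remove_snowflakes(day_hits):
    """Drop every hit of a user who appears only once in the day's hits."""
    hits_per_user = Counter(user for user, _page in day_hits)
    return [(user, page) for user, page in day_hits if hits_per_user[user] > 1]

day_hits = [
    ("10.0.0.1", "home"),
    ("10.0.0.1", "features"),
    ("10.0.0.2", "home"),  # single hit: treated as a snowflake
]
kept = remove_snowflakes(day_hits)
```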

Dealing with bots

SofaWiki does not keep a list of known bots, but identifies them by their behaviour on a daily basis. The heuristics for bots are:

  • they consult a lot of pages
  • they consult pages throughout all 24 hours of the day
  • they consult the robots.txt page

A score is calculated for each user as a sum of

  • ln(hits)
  • sum( -(pagehit/hits) * ln(pagehit/hits) ) // Shannon index
  • hits(robots.txt)

The average score and the standard deviation are calculated. If the score of a user is higher than avg + 2 * stddev and the user is not a known user, the user is considered a bot.

Bot hits are removed from the hits, and the pages consulted by bots are removed from the visited pages.
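The scoring can be sketched in Python as follows. This is a hedged illustration, not SofaWiki's implementation: the input layout (a mapping from user to the list of pages hit that day), the names bot_score and detect_bots, and the use of the population standard deviation are all assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def bot_score(pages):
    """Score = ln(hits) + Shannon index of the page distribution + robots.txt hits."""
    hits = len(pages)  # at least 2 once snowflakes have been removed
    counts = Counter(pages)
    shannon = -sum((n / hits) * math.log(n / hits) for n in counts.values())
    return math.log(hits) + shannon + counts["robots.txt"]

def detect_bots(pages_by_user, known_users=frozenset()):
    """Return users whose score exceeds avg + 2 * stddev and who are not known."""
    scores = {user: bot_score(pages) for user, pages in pages_by_user.items()}
    threshold = mean(scores.values()) + 2 * pstdev(scores.values())  # assumed: population stddev
    return {u for u, s in scores.items() if s > threshold and u not in known_users}

# Twenty ordinary visitors reading one page twice, and one crawler.
pages_by_user = {f"visitor{i}": ["home", "home"] for i in range(20)}
pages_by_user["crawler"] = ["robots.txt"] + [f"page{i}" for i in range(50)]
bots = detect_bots(pages_by_user)
```

Because the threshold depends on that day's mean and standard deviation, the same user can fall above it on one day and below it on another. With very few users per day, no score can exceed avg + 2 * stddev, so nobody would be flagged in this sketch.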

As bot detection is done on a daily basis, a user may be considered a bot on one day and not on another.