Sofawiki Database Primer

The Sofawiki database is part of the Sofawiki CMS. The properties of the database are

There is no predefined data schema. Fields are defined ad hoc.
There is no interface for data inserting or updating. Data fields are defined ad hoc when you edit the text of a wiki page.
Fields have no type.
FIelds can have multiple values on a page.
Queries can be defined ad hoc in the Special:Query page.
Queries can be integrated into pages with the query function.
Fields of a speficic page can be integrated into pages with the value or the extvalue function.
As in relational algebra, tuples are always unique.

Syntax of data fields

A datafield can be inserted at any position within a is delimited by double square brackets. The key and the value are separated with double colons.

[[key::value]]

If you have multiple values, you can either use multiple fields, or separate the values with double colons. The following two notations are equivalent

[[key::value1]] [[key::value1]]

[[key::value1::value2]]

Order does not matter between keys. If you have multiple values for a key, order does matter. Each iteration of a field belongs to its tuple. If the number of iterations is not the same between the keys, the iterations are padded.

Note also that the query can be tricky if the number of values is not the same per key. See How Query deals multiple values.

The key can be any string if it does not contain a colon or a pipe character. However, to be useful with the query function, start with a lowercase letters and use only lowercase letters and numbers. Do not start with an underscore, that is reserved for internal values

Good keys: name, grade, title, director, stars
Valid keys, but diffficult to use: NAME, Title 2, director-assistant
Invalid keys: _stars, 3stars, article:title
Predefined keys you can use for queries, but not for defining values

_name: Name of the page
_content: Entire content of the page
_revision: Revision of the page

The value can be any string, even with line returns, but it cannot use the chracacter double colon, pipe and square brackets. You will also have problems with curly brackets. Use the predefined tags colon, pipe, leftsquare, rightsquare, leftcurly and rightcurly. The query function is aware of the predefined tags.

The fields are not rendered directly on the page. They are invisible. You use a template or a function to render them to wikitext.

Querying the database

You can either lookup the values of a specific page or querying the entire database.

To lookup specific values that are on the same page, use the value function.
To lookup specific values that are on a page you know, use the extvalue function.
To query the database generally, use the query function.

The query function

The query function accepts a list of statements. Each statement operates on a stack of two-dimensional tables. The result is the top table, which then can be formatted as table, tabbed text, fields or text. As in each function, statements are separated by the pipe character.

Each statement starts with a keyword in uppercase and then a list of arguments.

The most important statement is the SELECT statement. It works like the SQL statement, but you must define the fields explicitely and you can check only against one condition. In Sofawiki, you break up the conditions in a set of simple conditions and then combine them with boolean statements. If you do not know what fields are in the database, use the FIELDS statement.

SELECT fieldlist (FROM namespace)? (WHERE field operator string|WHERE field dollaroperator expression)?

fieldlist = field (, field)*

The fieldlist is a comma-separated list of fields. You cannot aggregate at this step (see GROUP statement).
FROM is optional. It restricts the query to pages where the name starts with namespace. "FROM main:" restricts to main namespace.
WHERE is optional. It adds one condition. You can either compare a field against a string or you can compare the field against an expression.

FROM VIRTUAL has a special meaning. It looks for fields only in the specified page. But while normally only the source text is searched, here the page is parsed with the template parser and then searched only. You can therefore use a query function with OUTPUT FIELDS to create what would be a VIEW in SQL. This is not faster, but creates more readable query functions.

SELECT fieldlist FROM VIRTUAL pagename (WHERE field operator string|WHERE field dollaroperator expression)?

There are many operators

numeric operators compare the numeric values = <> < > <= >=
string operators compare the string value == != << >> <= >=, but also =* (starts with) *= (ends with) *=* (contains)
relaxed string operators work on the url value of the string ~~ !~ ~* *~ |*~*
the * operator checks if the field is present. It does not need a string
the !0 operator checks if the field is not empty
the r= operator makes a regex comparison. Note that you may need to use the predefined tags to escape brackets and pipes

You can compare also against an expression, by preceding the operator with the $ sign. Expressions are calculations within the fields of the tuple using an RPN notation. An expression can contain

fields
numbers
quoted strings (use double quotes)
+ - * / . (concat) :: (concat with double ::)
ABS SIGN COPY POP SWAP SUBSTR REPLACE

As you see, the functions are uppercase like all keywords in the query function.

The IMPORT statement imports table from foreign formats, saved in a wiki page. Actually, implemented are:

the TAB format is, using the first line as labels and the following lines as records, all separated by tabs.
the FIELDS format

The WHERE statement further filters the top table

WHERE field (operator string|dollaroperator expression)

The GROUP statement aggregates the top table. Add aggregation keywords COUNT, SUM, MIN, MAX, AVG after the fields in the field values. If there is no BY keyword, the aggregation is global

GROUP (fieldaggregation) (, fieldaggregation)* BY fieldlist

fieldaggregation = field (COUNT | SUM | MIN | MAX | AVG)?

The ORDER statement order the top table. You can order albhabetic or numeric, ascendant or descendant

ORDER ORDER fieldorder (, fieldorder)*

fieldorder = field (NUMERIC)? (DESC)?

The LIMIT statement selects only a certain number of rows in the table. Start is zero-based

LIMIT start limit

The PROJECT statement selects only some columns. Use this to get rid of columns used before. The RENAME statement renames one field of the column or appends with * a postfix to each columns (use this before self joins). With # you can recover the key value in a normal visible column.

PROJECT fieldlist

RENAME oldfield newfield, otheroldfield newfield

RENAME * postfix

RENAME # field

The CALC statement assigns a value on a column based on a expression using values of the tuple. The FORMAT statement applies a sprintf formatation on the string. The NUMBERFORMAT statement makes pretty thousands-sepataror numbers with a chosen number of digits.

CALC field expression

FORMAT field formatstring

NUMBERFORMAT field digits, otherfield digits

The UNION and EXCEPT statements make boolean operations on the two top tables.

UNION

EXCEPT field

The JOIN statement makes a natural join based on the provided field (not optional). The CROSS statement makes a cross join. The left join adds null values for missing tuples on the right side, the outerjoin adds null values for missing matching tuples on both sides. There is no right join. You can swap and left join.

JOIN field

LEFTJOIN field

OUTERJOIN field

CROSS

The ROTATE statement rotates columns and rows. The keys become column names and the column names keys

The COPY, POP and SWAP operator work on the stack.

There are several options for the output. By default there is a HTML table. You can apply a TEMPLATE to each field. With the OUTPUT statement, you can specify a deep field list, a textarea with tabbed text to copy paste, text or a fixed format like SQL tables. Finally you can style the table

TEMPLATE field templatename

OUTPUT (HTML | FIXED (fieldlist)* | FIELDS | TAB | TABHEADER | TEXT | TEXTSPACE | ROWTEMPLATE string)

STYLE cssclass

The deep list format expects that the columns are ordered from general to detail.

To debug, finally, use the SHOWKEY and the VERBOSE statement.

The inner workings of the query function

The SELECT function uses the sofawiki PHP function swFilter. All other statements are calculated directly in the function each time the page is requested.

The swFilter function scans linearily through the revisions. For each revision, it checks if the condition is met and then extracts the fields of the fieldlist. The result of this search is persistant. It is saved in the queries folder. So even if a linear scan may seem long, it is done only one time for each revision because the Sofawiki knows that revisions are permament, they do not change overtime. This is a kind of selfindexing.

Sometimes the scanning is very long. The swFilter has a timeout, where it stops scanning and returns the result so far. At this moment you get the message Incomplete results. You can refresh the page to get further results.

The swFilter function then compares the found revisions with the current revisions (that have the status "ok" or "protected"). The revision that stay are then returned.

The set of found revisions are hold in a swBitmap, which is a set of booleans that take exactly one bit of memory space per boolean. We can then make boolean operations on the bitmaps.

The swFilter functions makes several optimizations not to search each revision

The namespace limits the search
It compares against queries that checked the existence of the field.
For certain string operators (== *= 0* *=* ~~ *~ ~* *~*) it can check if the page contains the text. It uses therefore a global trigram index.