Examples
********

.. _pivoting:

.. highlight:: mysql

Pivot
=====

First, import the data from a TSV (Tab Separated Values) file with the :func:`~functions.vtable.file.file` virtual table function.
:download:`sales.tsv <../../functions/row/testing/sales.tsv>`

.. code-block:: python

        >>> sql("""create table sales as
        ...             select Product,Region,Month,cast(Sales as int) as Sales
        ...                 from
        ...                     (file 'testing/sales.tsv' 'dialect:tsv' header:t)""")
        >>> sql("select * from sales")
        Product | Region | Month   | Sales
        ----------------------------------
        Cars    | Athens | 2010-01 | 200
        Cars    | Athens | 2010-02 | 130
        Bikes   | NY     | 2010-01 | 10
        Bikes   | NY     | 2010-02 | 30
        Cars    | NY     | 2010-01 | 100
        Cars    | NY     | 2010-02 | 160
        Cars    | Paris  | 2010-01 | 70
        Cars    | Paris  | 2010-02 | 20
        Bikes   | Paris  | 2010-01 | 100
        Bikes   | Paris  | 2010-02 | 20
        Boats   | Paris  | 2010-01 | 200

Let's say we want the result to be a table with columns:

- Product
- NY: the total sales in NY for this product
- Paris: the total sales in Paris for this product
- Athens: the total sales in Athens for this product

We first apply the aggregate function *sum* to the *Sales* column, grouping by the *Product* and *Region* columns.
Then :func:`~functions.aggregate.packing.vecpack` is applied to the *Region* values and the *Sales* sums, grouping by *Product*.

The first argument of :func:`~functions.aggregate.packing.vecpack` must be a pack of the dimensions. Here this is the
result of packing the distinct *Region* values over the whole table.

.. code-block:: python

    >>> sql("""select Product,unpackcol(vpck)
    ...      from
    ...        (select Product,vecpack(rpk,Region,salessum) as vpck
    ...         from
    ...             (select pack(distinct Region) as rpk from sales),
    ...             (select Product,Region,sum(sales) as salessum
    ...                 from sales group by Product,Region)
    ...         group by Product)""")
    Product | Paris | NY  | Athens
    ------------------------------
    Bikes   | 120   | 40  | 0
    Boats   | 200   | 0   | 0
    Cars    | 90    | 260 | 330

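For comparison, when the regions are known in advance the same table can be produced with plain conditional aggregation.
The following is a minimal sketch that uses Python's standard ``sqlite3`` module rather than madIS. It re-enters the rows shown above,
the helper name ``pivot_sales`` is made up for this example, and it does not show how
:func:`~functions.aggregate.packing.vecpack` works internally.

.. code-block:: python

    import sqlite3

    # Rows copied from the sales table shown above.
    ROWS = [
        ("Cars", "Athens", "2010-01", 200),
        ("Cars", "Athens", "2010-02", 130),
        ("Bikes", "NY", "2010-01", 10),
        ("Bikes", "NY", "2010-02", 30),
        ("Cars", "NY", "2010-01", 100),
        ("Cars", "NY", "2010-02", 160),
        ("Cars", "Paris", "2010-01", 70),
        ("Cars", "Paris", "2010-02", 20),
        ("Bikes", "Paris", "2010-01", 100),
        ("Bikes", "Paris", "2010-02", 20),
        ("Boats", "Paris", "2010-01", 200),
    ]

    def pivot_sales(rows):
        """Return one (Product, Paris, NY, Athens) tuple per product."""
        con = sqlite3.connect(":memory:")
        con.execute("create table sales (Product text, Region text, Month text, Sales int)")
        con.executemany("insert into sales values (?, ?, ?, ?)", rows)
        # One conditional sum per region; regions without sales yield 0.
        result = con.execute("""
            select Product,
                   sum(case when Region = 'Paris'  then Sales else 0 end) as Paris,
                   sum(case when Region = 'NY'     then Sales else 0 end) as NY,
                   sum(case when Region = 'Athens' then Sales else 0 end) as Athens
            from sales
            group by Product
            order by Product""").fetchall()
        con.close()
        return result

    pivot = pivot_sales(ROWS)
    for row in pivot:
        print(row)

Unlike the :func:`~functions.aggregate.packing.vecpack` query, this version hard-codes the region names,
so it has to be edited whenever a new region appears in the data.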

.. _applexample:


Application scenario
====================


In this example we implement a simple application for query recommendation based on the user's country of origin. The input data come from the logs of a web portal.

The query mining workflow includes the following five main steps:

1. Import the portal log files into a relational table.
2. Use time heuristics to identify coherent query sessions, then
   assign and store a new, reconstructed session id for each record in the logs.
3. Preprocess and clean the text of the logged queries.
4. Retrieve IP-to-country information from the web and combine it with the IP information from the portal log files
   to assign a country code to each logged session.
5. Apply an `Apriori-based <http://en.wikipedia.org/wiki/Apriori_algorithm>`_ technique to extract frequent term sets per country.

A sample from the log file used is presented below:

.. code-block:: none

   36506	guest	X.X.X.134	p93t9q5eqaa0u0p8isd9skp1n6	en	("paradis riedinger")	search_sim  2010-01-01 00:28:23
   36507	guest	X.X.X.134	p93t9q5eqaa0u0p8isd9skp1n6	en	("paradis riedinger")	view_brief  2010-01-01 00:28:34
   36508	guest	X.X.X.134	p93t9q5eqaa0u0p8isd9skp1n6	en	("paradis riedinger")	view_brief  2010-01-01 00:28:34
   36509	guest	X.X.X.134	p93t9q5eqaa0u0p8isd9skp1n6	en	("paradis riedinger")	view_full  2010-01-01 00:28:42
   36510	guest	X.X.X.55	a9qo379qnl5hbbria6tj6nlp95	de  (creator all "frank leonhard")  search_adv  2010-01-01 00:28:44
   36511	guest	X.X.X.55	a9qo379qnl5hbbria6tj6nlp95	de  (creator all "frank leonhard")	view_brief  2010-01-01 00:29:10
   36512	guest	X.X.X.55	a9qo379qnl5hbbria6tj6nlp95	de  (creator all "frank leonhard")	search_res  2010-01-01 00:29:18


**Step 1**

Initially, the log files, which are available in Tab Separated Values format, are imported into a relational table.
The import is implemented in madIS using the :func:`~functions.vtable.file.file` function,
selecting the appropriate columns from the ".tsv" file.

::

   create table logs as
      select C1 as id, C2 as userid, C3 as userip, C4 as sesid, C6 as query, C7 as action, C8 as date
      from file('raw_logs.tsv','delimiter:\t','quoting:QUOTE_NONE');
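
The option names ``delimiter`` and ``quoting:QUOTE_NONE`` resemble the parameters of Python's standard ``csv`` module.
The sketch below shows, in plain Python and outside madIS, what such an import amounts to. It assumes that
``C1`` to ``C8`` denote the columns by position; the helper ``read_logs`` and the ``COLUMNS`` mapping are made up for this example.

.. code-block:: python

    import csv
    import io

    # Two records from the log sample above, with their fields joined by tabs.
    RAW = "\n".join([
        "\t".join(["36506", "guest", "X.X.X.134", "p93t9q5eqaa0u0p8isd9skp1n6",
                   "en", '("paradis riedinger")', "search_sim", "2010-01-01 00:28:23"]),
        "\t".join(["36510", "guest", "X.X.X.55", "a9qo379qnl5hbbria6tj6nlp95",
                   "de", '(creator all "frank leonhard")', "search_adv", "2010-01-01 00:28:44"]),
    ])

    # Zero-based positions of the columns kept by the query (the fifth column is skipped).
    COLUMNS = {"id": 0, "userid": 1, "userip": 2, "sesid": 3,
               "query": 5, "action": 6, "date": 7}

    def read_logs(stream):
        """Parse tab-separated lines and keep only the named columns."""
        # QUOTE_NONE keeps the double quotes inside the query text untouched.
        reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [{name: row[pos] for name, pos in COLUMNS.items()} for row in reader]

    logs = read_logs(io.StringIO(RAW))
    print(logs[0]["query"])

Disabling quote handling matters here because the logged queries contain literal double quotes that are part of the query text.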

**Step 2**

Next, session reconstruction, a common task in web usage mining, is performed.
In this example, a predefined inactivity threshold is used to break incoming sessions.
madIS provides this functionality through the :func:`~functions.aggregate.subgroup.datediffbreak` function,
which takes the inactivity threshold (along with some additional parameters) as an argument and returns the new session ids.
In the following madSQL segment, which shows the function in use,
the inactivity threshold is set to 30 minutes (given in milliseconds).

::

   alter table logs add sesidnew text;
   update logs
   	set sesidnew=
   	(select bgroupid
   		from
   		( cache select datediffbreak(sesid,id,date,30*60*1000,'order',date,id)
   				from logs
   				where sesid not null group by sesid)
   		where C1=id )
   	where sesid is not null;
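
The idea behind threshold-based session breaking can be sketched in plain Python as follows.
This is not the implementation of :func:`~functions.aggregate.subgroup.datediffbreak`: the helper ``split_sessions``,
the ``sesid_N`` naming of the new ids, the strict ``>`` comparison and the sample records are all assumptions made for illustration.

.. code-block:: python

    from datetime import datetime
    from itertools import groupby

    THRESHOLD_MS = 30 * 60 * 1000  # 30 minutes, as in the query above

    def split_sessions(records, threshold_ms=THRESHOLD_MS):
        """Map each record id to a new session id, breaking on inactivity gaps.

        records: iterable of (sesid, record_id, 'YYYY-MM-DD HH:MM:SS') tuples.
        """
        new_ids = {}
        # Order by original session, then by time, then by record id.
        ordered = sorted(records, key=lambda r: (r[0], r[2], r[1]))
        for sesid, group in groupby(ordered, key=lambda r: r[0]):
            part, previous = 0, None
            for _, record_id, stamp in group:
                current = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
                if previous is not None:
                    gap_ms = (current - previous).total_seconds() * 1000
                    if gap_ms > threshold_ms:
                        part += 1  # long silence: start a new sub-session
                previous = current
                new_ids[record_id] = f"{sesid}_{part}"
        return new_ids

    # Hypothetical records: session "s1" is idle for more than 30 minutes before record 3.
    sample = [
        ("s1", 1, "2010-01-01 00:28:23"),
        ("s1", 2, "2010-01-01 00:28:34"),
        ("s1", 3, "2010-01-01 01:10:00"),
        ("s2", 4, "2010-01-01 00:28:44"),
    ]
    session_ids = split_sessions(sample)
    print(session_ids)

Records 1 and 2 stay in the same reconstructed session, while record 3 starts a new one because it follows a gap of more than 41 minutes.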

**Step 3**

Then, focusing on the issued queries, the distinct queries per session are retrieved from the logs
and several query text processing steps take place.
Because a large number of malformed queries was observed, only queries whose text is valid UTF-8 are selected.
This filtering is performed with the :func:`~functions.row.text.isvalidutf8` function in the where clause of the corresponding madSQL fragment.

Stop words are then removed from the selected queries through the corresponding function.
Moreover, since queries are issued and logged in Common Query Language (CQL) syntax, an additional filtering step removes *CQL* constructs
through the madIS function :func:`~functions.row.text.cqlkeywords`. The processed queries are then stored in the table *QueriesPerSession*.
One advantage of the madIS framework is that Python's capabilities and
open source libraries for text processing can be used to rapidly implement customized functions,
which significantly eases “flow-level programming” in madSQL.

::

   create table QueriesPerSession as
   select filterstopwords(cqlkeywords(query)) as cleanquery , sesidnew, userid
   from logs
   where action like 'search%' and cleanquery!='' and isvalidutf8(query)
   group by query, sesidnew;
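
To illustrate the kind of customized Python text processing mentioned above, here is a minimal sketch of the three cleaning steps.
It is not the code of the madIS functions: the helpers ``is_valid_utf8`` and ``clean_query`` are hypothetical,
the two word lists are small made-up samples, and lowercasing the query is an assumption of this sketch.

.. code-block:: python

    import re

    # Small illustrative word lists; the lists used by madIS are not shown in this document.
    CQL_WORDS = {"and", "or", "not", "prox", "all", "any", "creator", "title"}
    STOP_WORDS = {"the", "a", "an", "of", "in"}

    def is_valid_utf8(raw):
        """Return True if the byte string decodes cleanly as UTF-8."""
        try:
            raw.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def clean_query(query):
        """Drop punctuation, CQL words and stop words from a logged query."""
        words = re.findall(r"\w+", query.lower())
        kept = [w for w in words if w not in CQL_WORDS and w not in STOP_WORDS]
        return " ".join(kept)

    print(clean_query('(creator all "frank leonhard")'))
    print(is_valid_utf8(b"caf\xc3\xa9"), is_valid_utf8(b"caf\xe9"))

A query that consists only of CQL constructs and stop words becomes an empty string, which is why the fragment above also filters on ``cleanquery!=''``.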

**Step 4**

To extract term associations per country, an external data source containing the mapping between IP ranges and countries
is fetched from the web.
Again, the :func:`~functions.vtable.file.file` function is used to fetch the URL resource, and the data are imported into an indexed
main-memory table using the :func:`~functions.vtable.cache.cache` function of madIS. The fetched data source specifies the IP ranges in IP long number format,
so the logged IPs are converted to the same format with the :func:`~functions.row.iptools.ip2long` function and then matched
against the imported ranges. Since a session may contain requests from multiple IPs (due to dynamic IPs, etc.),
only the first IP encountered in each session is considered.
The first IP of each session is obtained using the :func:`~functions.aggregate.selection.minrow` function, comparing the corresponding records’ ids.
The resulting mapping from each session to a country code is stored in the table *Session2Country*.

::

   create temporary table Session2Country as
   select sesidnew, CountryCode
   from
   	(select ip2long(minrow(id,userip)) as iplong, sesidnew
		from logs
		group by sesidnew),
	(cache select cast(C3 as integer) as ipfrom, cast(C4 as integer) as ipto, C5 as CountryCode
		from file('http://.../GeoIPCountryCSV.zip','dialect:csv','compression:t'))
   where iplong>=ipfrom and iplong <= ipto;
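
The conversion and range match described above can be sketched with Python's standard ``ipaddress`` and ``bisect`` modules.
This is an illustration outside madIS, not the code of :func:`~functions.row.iptools.ip2long`: the helper names,
the IP ranges and the country codes ``AA`` and ``BB`` are made up, and the ranges are assumed not to overlap.

.. code-block:: python

    import bisect
    import ipaddress

    def ip_to_long(ip):
        """Convert a dotted IPv4 address to its integer ("long") form."""
        return int(ipaddress.IPv4Address(ip))

    def first_ip(records):
        """Pick the IP of the record with the smallest id; records are (id, ip) pairs."""
        return min(records, key=lambda r: r[0])[1]

    # Made-up, non-overlapping (ipfrom, ipto, CountryCode) ranges, sorted by ipfrom.
    RANGES = sorted([
        (ip_to_long("1.2.3.0"), ip_to_long("1.2.3.255"), "AA"),
        (ip_to_long("10.0.0.0"), ip_to_long("10.255.255.255"), "BB"),
    ])
    STARTS = [r[0] for r in RANGES]

    def country_of(ip):
        """Return the country code whose range contains ip, or None."""
        value = ip_to_long(ip)
        # Last range that starts at or before the address.
        pos = bisect.bisect_right(STARTS, value) - 1
        if pos >= 0 and RANGES[pos][0] <= value <= RANGES[pos][1]:
            return RANGES[pos][2]
        return None

    # A hypothetical session with two requests from different IPs.
    session = [(36511, "10.20.30.40"), (36510, "1.2.3.4")]
    print(ip_to_long("1.2.3.4"), country_of(first_ip(session)))

Keeping the ranges sorted allows a binary search per lookup, which plays a role similar to the index that the cached table provides for the range condition in the query.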

**Step 5**

Finally, the query text and the session-to-country mappings are used together to extract term associations
with an `apriori-like <http://en.wikipedia.org/wiki/Apriori_algorithm>`_ technique, using the aggregate function :func:`~functions.aggregate.mining.freqitemsets`.
Frequent query term sets are computed for each country code that occurs in the *Session2Country* table.

::

   create table FrequentItemsets as
   select 'nat', CountryCode,
	freqitemsets(cleanquery,'threshold:2','maxlen:5')
   from QueriesPerSession as qs, Session2Country as sc
   where  qs.sesidnew = sc.sesidnew
   group by CountryCode;
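
For readers unfamiliar with the technique, the following is a minimal level-wise (Apriori-style) sketch in plain Python.
It is not the implementation of :func:`~functions.aggregate.mining.freqitemsets`: the helper ``frequent_itemsets`` is hypothetical,
``min_count`` and ``max_len`` are only assumed to correspond to the ``threshold`` and ``maxlen`` options,
and the queries and country codes are made up.

.. code-block:: python

    from itertools import combinations

    def frequent_itemsets(transactions, min_count=2, max_len=5):
        """Return {sorted term tuple: count} for term sets found in at least min_count transactions."""
        baskets = [frozenset(t) for t in transactions]
        # Level 1: count single terms.
        counts = {}
        for basket in baskets:
            for item in basket:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        result, size = {}, 1
        while current:
            result.update(current)
            if size == max_len:
                break
            size += 1
            # Join frequent sets into candidates one term larger and keep
            # only those whose smaller subsets are all frequent.
            candidates = set()
            for a, b in combinations(list(current), 2):
                union = a | b
                if len(union) == size and all(
                        frozenset(sub) in current for sub in combinations(union, size - 1)):
                    candidates.add(union)
            counts = {c: sum(1 for basket in baskets if c <= basket) for c in candidates}
            current = {s: c for s, c in counts.items() if c >= min_count}
        return {tuple(sorted(s)): c for s, c in result.items()}

    # Made-up cleaned queries, grouped by made-up country codes.
    queries_by_country = {
        "AA": ["frank leonhard", "frank leonhard biography",
               "paradis riedinger", "leonhard euler"],
        "BB": ["paradis riedinger"],
    }
    itemsets = {country: frequent_itemsets([q.split() for q in queries])
                for country, queries in queries_by_country.items()}
    print(itemsets["AA"])

Each level only counts candidates built from term sets that were already frequent at the previous level, which is the pruning step that characterizes Apriori-style algorithms.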