# Document Analysis using Support Vector Machines

This post details the Vector Space Kernel Model for document analysis outlined in Shawe-Taylor and Cristianini.

## Create Encoded Matrix for each Document #

1. Select Dictionary of Terms
2. Calculate Term Frequency
3. Encode using Dictionary of Terms

### Calculate Document-Term Matrix #

The matrix representation of the document term frequencies shows the frequency of a term across a collection of documents.

## Kernel Methods for Support Vector Machines #

If x and y are vectors representing a document then a kernel mapping K would be defined as:

``````K(x, y) = φ(x) · φ(y) = φ(x · y)
``````

where the kernel K, the dot product in the new feature space, is defined as a function of the dot product in the original feature space.

## Document Analysis #

Given the document-term matrix (D) and the term-document matrix (D’), define K = DD’ the co-occurrence matrix. Then for documents d_1 and d_2, define the Vector Space Kernel as

``````VSK(d_1, d_2) = φ(d_1) D D' φ(d_2)
``````

## Relevance Matrix #

A relevancy matrix R defines the weight given to each term in the document, and a proximity matrix P defines the distance between terms. The relevancy matrix captures a distribution of the inverse document frequency. Although not hierarchical, a non-zero term in the proximity matrix implies a co-occurrence of terms, which implies less semantic distance. In order to reduce the impact of the document length on the semantic distance, the matrices require normalization.

``````S = RP
``````

The relevancy matrix R is defined based on the term weight w shown in equation,

``````w(t) = ln( L / df(t) )
``````

where L is the number of documents and df(t) is the document frequency for the term t. This ratio is the inverse of the document frequency across the entire corpus.

## Proximity Matrix #

The proximity matrix P is defined as the transpose of matrix D.

``````P = D'
``````

Then, based on P the Proximity Kernel, D’D - indicates a value of term co-occurrence. From co-occurrence information semantic relations can be inferred between terms.

Given a kernel mapping of the term co-occurrence matrix D’D, singular value decomposition yields the matrices U, , and V, where is a semantic matrix and V are the relevant topics within the document.

``````φ(d1) D’ D φ(d2)’
D’ = U ∑ V’
``````

# Semantic Analysis #

Several approaches exist the representation of for semantic information and construction of semantic kernels.

## Semantic Matrix Composition #

Given a document-term matrix and a term co-occurrence kernel mapping (D’D), a semantic matrix can be composed from the product of a relevancy matrix and proximity matrix.

## Implicit Semantic Mapping #

By taking the inner product of each basic block’s document term frequency, a kernel is composed from the corpus of semantic matrices.

``````PK(d_1, d_2) = φ(d_1) U_prime ∑ U φ(d_2)_prime
``````

## Topic Discovery #

In order to discover topics within a vector space representation of a document, Singular Value Decomposition is used which yields a matrix V such that columns of V are Eigenvectors of the linear combination DD’ representing term co-occurrence. Minimizing the combination of topics while maximizing the classification accuracy allows relevant topics to be extracted.

## Explicit Semantic Mapping #

Semantic information inferred from the relevance and proximity information can be explicitly specified in a semantic or conceptual network. These structures have an intrinsic metric of semantic distance, and proximity can be measured via distance within the network.

J.M.
February 2020