# Document Analysis using Support Vector Machines

This post details the Vector Space Kernel Model for document analysis outlined in Shawe-Taylor and Cristianini.

## Create Encoded Matrix for each Document #

- Select Dictionary of Terms
- Calculate Term Frequency
- Encode using Dictionary of Terms

### Calculate Document-Term Matrix #

The document-term matrix records the frequency of each term across the collection of documents: each row corresponds to a document and each column to a term in the dictionary.
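
As a concrete illustration, the sketch below builds the dictionary and the document-term matrix in Python for a toy three-document corpus; the corpus and the whitespace tokenization are assumptions for illustration only.

```python
import numpy as np

# Toy corpus; in practice the documents would come from a real collection.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog",
]

# Select the dictionary of terms: here, the sorted set of all words seen.
dictionary = sorted({word for doc in corpus for word in doc.split()})
index = {term: j for j, term in enumerate(dictionary)}

# Encode each document as a row of term frequencies, giving the
# document-term matrix D (one row per document, one column per term).
D = np.zeros((len(corpus), len(dictionary)))
for i, doc in enumerate(corpus):
    for word in doc.split():
        D[i, index[word]] += 1

print(dictionary)
print(D)
```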

## Kernel Methods for Support Vector Machines #

If **x** and **y** are vectors representing documents, then a kernel **K** is defined as:

```
K(x, y) = φ(x) · φ(y)
```

where the kernel **K** gives the dot product in the new feature space defined by the mapping **φ**. For suitable mappings, **K** can be evaluated directly as a function of the dot product in the original feature space, without computing **φ** explicitly.
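
As a sanity check of this idea, the sketch below uses the homogeneous quadratic kernel, a standard example in which the feature-space dot product equals the squared dot product of the original vectors; the feature map `phi` is illustrative and is not the document mapping used later in this post.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

# Explicit feature map for the homogeneous quadratic kernel in 2D:
# phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2).
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

# Dot product computed explicitly in the feature space ...
explicit = phi(x) @ phi(y)

# ... equals a function of the dot product in the original space: (x . y)^2.
implicit = (x @ y) ** 2

print(explicit, implicit)  # both print 25.0
```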

## Document Analysis #

Given the *document-term matrix* (**D**), whose rows are the documents' term-frequency vectors, and its transpose, the *term-document matrix* (**D’**), define the term co-occurrence matrix **K = D’D**. Then for documents **d_1** and **d_2**, define the *Vector Space Kernel* as

```
VSK(d_1, d_2) = φ(d_1) D' D φ(d_2)'
```
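
A minimal NumPy sketch of this kernel, assuming `D` holds raw term counts with one row per document; the random matrix stands in for a real corpus.

```python
import numpy as np

# Stand-in document-term matrix (5 documents, 8 terms) with random counts.
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

# Term co-occurrence matrix K = D'D (terms x terms).
K = D.T @ D

# Vector space kernel between two documents, where phi(d) is the
# row of term frequencies for document d.
def vsk(phi_1, phi_2):
    return phi_1 @ K @ phi_2

print(vsk(D[0], D[1]))
```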

## Relevance Matrix #

A relevancy matrix **R** assigns a weight to each term in the document, and a proximity matrix **P** encodes the semantic distance between terms. The relevancy matrix captures the distribution of the inverse document frequency. Although the proximity matrix is not hierarchical, a non-zero entry implies a co-occurrence of terms, which in turn implies a smaller semantic distance. To reduce the impact of document length on the semantic distance, the matrices require normalization. Together they compose the semantic matrix:

```
S = RP
```

The relevancy matrix **R** is defined in terms of the term weight **w**, given by

```
w(t) = ln( L / df(t) )
```

where **L** is the number of documents and **df(t)** is the document frequency for the term **t**. This ratio is the inverse of the document frequency across the entire corpus.
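
A short sketch of the weight computation, assuming, as is conventional though not spelled out above, that **R** is the diagonal matrix of these weights; terms that appear in no document are given zero weight here.

```python
import numpy as np

# Stand-in document-term matrix (5 documents, 8 terms).
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

L = D.shape[0]             # number of documents
df = (D > 0).sum(axis=0)   # document frequency df(t) for each term t

# Term weights w(t) = ln(L / df(t)); unseen terms get zero weight.
w = np.where(df > 0, np.log(L / np.maximum(df, 1)), 0.0)

# Relevancy matrix R as a diagonal matrix of the weights.
R = np.diag(w)
print(np.round(w, 3))
```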

## Proximity Matrix #

The proximity matrix **P** is defined as the transpose of matrix **D**.

```
P = D'
```

With this choice of **P**, the resulting *Proximity Kernel* involves the matrix **D’D**, whose entries measure term co-occurrence. From this co-occurrence information, semantic relations between terms can be inferred.

Given a kernel mapping based on the term co-occurrence matrix **D’D**, singular value decomposition of **D’** yields the matrices **U**, **∑**, and **V**, where **∑** is the diagonal matrix of singular values, the columns of **U** are eigenvectors of **D’D**, and the columns of **V** identify the relevant topics within the corpus.

```
κ(d_1, d_2) = φ(d_1) D' D φ(d_2)'
D' = U ∑ V'
```
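
This relationship can be checked numerically; the sketch below again uses a random count matrix in place of a real corpus.

```python
import numpy as np

# Stand-in document-term matrix (5 documents, 8 terms).
rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

# SVD of the term-document matrix: D' = U S V'.
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

# Columns of U are eigenvectors of the term co-occurrence matrix D'D,
# with eigenvalues equal to the squared singular values.
print(np.allclose(U @ np.diag(s**2) @ U.T, D.T @ D))  # True
```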

## Semantic Analysis #

Several approaches exist for the representation of semantic information and the construction of semantic kernels.

## Semantic Matrix Composition #

Given a *document-term matrix* and a *term co-occurrence kernel* mapping (**D’D**), a semantic matrix can be composed as the product of a *relevancy matrix* and a *proximity matrix*, as in **S = RP** above.

## Implicit Semantic Mapping #

By taking the inner product of each document's term-frequency vector through this decomposition, a kernel is composed that implicitly encodes the semantic structure of the corpus.

```
PK(d_1, d_2) = φ(d_1) U ∑^2 U' φ(d_2)'
```
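
A sketch verifying that the kernel computed through the decomposition agrees with the proximity kernel computed through **D’D** directly, using the same stand-in matrix as before.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

# PK(d_1, d_2) = phi(d_1) U S^2 U' phi(d_2)', evaluated through the SVD.
def pk(phi_1, phi_2):
    return phi_1 @ U @ np.diag(s**2) @ U.T @ phi_2

# Agrees with phi(d_1) D'D phi(d_2)' computed directly.
print(np.isclose(pk(D[0], D[1]), D[0] @ (D.T @ D) @ D[1]))  # True
```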

## Topic Discovery #

To discover topics within a vector space representation of a collection of documents, singular value decomposition is applied, yielding a matrix **V** whose columns are eigenvectors of the matrix **DD’**. Minimizing the number of topics retained while maximizing classification accuracy allows the relevant topics to be extracted.
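
A sketch of extracting the leading topic directions; here **k** is fixed by hand, whereas in practice it would be chosen by the accuracy trade-off just described.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.integers(0, 3, size=(5, 8)).astype(float)

# Matching the notation above, D' = U S V', so the columns of V
# are eigenvectors of DD'.
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)
V = Vt.T
print(np.allclose(V @ np.diag(s**2) @ V.T, D @ D.T))  # True

# Keep the k leading directions as topics.
k = 2
doc_topics = V[:, :k]    # document loadings on the k topics
term_topics = U[:, :k]   # term loadings on the same topics
print(doc_topics.shape, term_topics.shape)
```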

## Explicit Semantic Mapping #

Semantic relations inferred from relevance and proximity can also be specified explicitly in a semantic or conceptual network. These structures carry an intrinsic metric of semantic distance, so the proximity of two terms can be measured as the distance between them within the network.
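
As a toy illustration of this idea, the sketch below measures semantic proximity as shortest-path length in a small hand-built concept network; the network is invented for the example and is not drawn from any real ontology.

```python
from collections import deque

# Toy concept network; edges link semantically related terms.
network = {
    "cat": ["feline", "pet"],
    "feline": ["cat", "animal"],
    "dog": ["canine", "pet"],
    "canine": ["dog", "animal"],
    "pet": ["cat", "dog"],
    "animal": ["feline", "canine"],
}

def semantic_distance(a, b):
    """Shortest-path length between two terms in the network (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in network.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the terms are not connected

print(semantic_distance("cat", "dog"))  # 2, via "pet"
```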

J.M.

February 2020
