Document Analysis using Support Vector Machines

This post details the Vector Space Kernel model for document analysis outlined in Shawe-Taylor and Cristianini's Kernel Methods for Pattern Analysis.

Create Encoded Matrix for each Document #

  1. Select Dictionary of Terms
  2. Calculate Term Frequency
  3. Encode using Dictionary of Terms
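A minimal sketch of these three steps, assuming a toy corpus and a simple whitespace tokenizer (both are illustrative choices, not from the original text):

```python
# Encode one document against a dictionary built from a toy corpus.
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

# 1. Select a dictionary of terms: every unique token in the corpus.
dictionary = sorted({term for doc in corpus for term in doc.split()})

# 2. Calculate the term frequency of one document.
term_counts = Counter(corpus[0].split())

# 3. Encode the document as a vector indexed by the dictionary.
encoded = [term_counts.get(term, 0) for term in dictionary]
print(encoded)
```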

Calculate Document-Term Matrix #

The document-term matrix collects the term frequencies of a collection of documents: each row is a document, each column a dictionary term, and each entry the frequency of that term in that document.
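Stacking one encoded vector per document yields the matrix, sketched here with the same toy corpus and tokenizer as above:

```python
# Build the document-term matrix D: rows are documents, columns are terms.
from collections import Counter
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]
dictionary = sorted({term for doc in corpus for term in doc.split()})

def encode(doc):
    counts = Counter(doc.split())
    return [counts.get(term, 0) for term in dictionary]

D = np.array([encode(doc) for doc in corpus])
print(D.shape)  # (number of documents, number of terms)
```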

Kernel Methods for Support Vector Machines #

If x and y are vectors representing documents, then a kernel K is defined through a feature mapping φ as:

K(x, y) = φ(x) · φ(y)

where the kernel K, the dot product in the new feature space, can often be computed directly as a function of the original vectors x and y without forming φ explicitly (the kernel trick).
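As a sketch, with the identity feature map the kernel reduces to a plain dot product of term-frequency vectors; a richer mapping would only change phi below:

```python
# Kernel as an inner product in feature space, K(x, y) = phi(x) . phi(y).
import numpy as np

def phi(x):
    # Identity feature map; a richer map would embed x in a new space.
    return np.asarray(x, dtype=float)

def K(x, y):
    return phi(x) @ phi(y)

x = [1, 0, 2, 1]  # term-frequency vector of one document
y = [0, 1, 2, 0]
print(K(x, y))    # 4.0
```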

Document Analysis #

Given the document-term matrix (D) and the term-document matrix (D'), the kernel matrix K = DD' holds the inner products between every pair of document vectors, while D'D is the term co-occurrence matrix. For documents d_1 and d_2, the Vector Space Kernel is

VSK(d_1, d_2) = φ(d_1) φ(d_2)'

the dot product of their term-frequency vectors.
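A small numeric sketch with an assumed toy document-term matrix; the kernel matrix K = DD' collects VSK(d_i, d_j) for every pair of documents:

```python
# The Gram matrix of the document vectors is K = D D'.
import numpy as np

D = np.array([[2, 1, 0, 1],   # toy document-term matrix
              [0, 1, 1, 2],
              [1, 0, 2, 0]])

K = D @ D.T                    # entry (i, j) is VSK(d_i, d_j)
print(K)
print(K[0, 1] == D[0] @ D[1])  # True: VSK(d_1, d_2) two ways
```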

Relevance Matrix #

A relevance matrix R defines the weight given to each term in the document, and a proximity matrix P defines the distance between terms. The relevance matrix captures the inverse document frequency of each term. The proximity matrix is not hierarchical, but a non-zero entry indicates that two terms co-occur, and co-occurrence implies a smaller semantic distance. To reduce the impact of document length on the semantic distance, the matrices require normalization. The two matrices compose the semantic matrix S:

S = RP

The relevance matrix R is a diagonal matrix of term weights w, where

w(t) = ln( L / df(t) )

with L the number of documents and df(t) the document frequency of the term t, i.e. the number of documents containing t. The weight is the logarithm of the inverse of the document frequency across the entire corpus.
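A sketch of building R as a diagonal matrix of these weights, again with an assumed toy matrix; a term that appears in every document receives weight zero:

```python
# Relevance matrix R = diag(w), with w(t) = ln(L / df(t)).
import numpy as np

D = np.array([[2, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 2, 0]])

L = D.shape[0]                    # number of documents
df = np.count_nonzero(D, axis=0)  # document frequency of each term
w = np.log(L / df)                # idf weight per term
R = np.diag(w)
print(np.round(w, 3))
```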

Proximity Matrix #

The proximity matrix P is defined as the transpose of matrix D.

P = D'

With this choice, the proximity kernel contains the factor D'D, whose (i, j) entry measures the co-occurrence of terms i and j across the corpus. Semantic relations between terms can then be inferred from this co-occurrence information.
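A sketch of the co-occurrence factor, with the same kind of toy matrix:

```python
# Term co-occurrence matrix D'D: entry (i, j) sums, over all documents,
# the product of the frequencies of terms i and j.
import numpy as np

D = np.array([[2, 1, 0, 1],
              [0, 1, 1, 2],
              [1, 0, 2, 0]])

P = D.T       # proximity matrix
print(P @ D)  # term co-occurrence matrix D'D
```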

Given the term co-occurrence kernel D'D, singular value decomposition of D' yields the matrices U, Σ, and V, where Σ holds the singular values, UΣ serves as a semantic matrix, and the columns of V give the representation of the documents over the latent topics:

κ(d_1, d_2) = φ(d_1) D'D φ(d_2)'
D' = U Σ V'
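A sketch of the decomposition with numpy; the product UΣ reproduces the co-occurrence matrix exactly, since D'D = UΣ²U':

```python
# Singular value decomposition of D' = U Sigma V'.
import numpy as np

D = np.array([[2, 1, 0, 1],
              [0, 1, 1, 2],
              [1, 0, 2, 0]], dtype=float)

U, sigma, Vt = np.linalg.svd(D.T, full_matrices=False)
S = U @ np.diag(sigma)                # semantic matrix U Sigma
print(np.allclose(S @ S.T, D.T @ D))  # True: S S' = D'D
```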

Semantic Analysis #

Several approaches exist for the representation of semantic information and the construction of semantic kernels.

Semantic Matrix Composition #

Given a document-term matrix and the term co-occurrence kernel mapping (D'D), a semantic matrix can be composed as the product of a relevance matrix and a proximity matrix, S = RP.
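A sketch of the composition, assuming a toy matrix and taking length normalization of the document rows as the normalization step:

```python
# Semantic matrix S = R P and the resulting semantic kernel
# K(d_i, d_j) = phi(d_i) S S' phi(d_j)'.
import numpy as np

D = np.array([[2, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 2, 0]], dtype=float)
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)  # normalize rows

R = np.diag(np.log(D.shape[0] / np.count_nonzero(D, axis=0)))
P = Dn.T                 # proximity matrix
S = R @ P                # semantic matrix
K = Dn @ S @ S.T @ Dn.T  # semantic kernel over all document pairs
print(np.round(K, 3))
```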

Implicit Semantic Mapping #

By mapping each document's term-frequency vector through the singular directions of D' and taking the inner product, a kernel is composed over the corpus:

PK(d_1, d_2) = φ(d_1) U Σ U' φ(d_2)'
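A sketch of this kernel following the formula above; note that D'D itself equals UΣ²U', so the single-Σ form here acts as a damped variant of the proximity kernel:

```python
# Implicit semantic kernel through the singular directions of D'.
import numpy as np

D = np.array([[2, 1, 0, 1],
              [0, 1, 1, 2],
              [1, 0, 2, 0]], dtype=float)

U, sigma, Vt = np.linalg.svd(D.T, full_matrices=False)

def PK(x, y):
    # phi(d_1) U Sigma U' phi(d_2)'
    return x @ U @ np.diag(sigma) @ U.T @ y

print(PK(D[0], D[1]))
```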

Topic Discovery #

In order to discover topics within a vector space representation of a document collection, Singular Value Decomposition is used: the columns of U are eigenvectors of the term co-occurrence matrix D'D, and the columns of V are eigenvectors of the document kernel matrix DD'. Minimizing the number of topics retained while maximizing the classification accuracy allows relevant topics to be extracted.
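A sketch of extracting k topics via a truncated SVD; the value of k is an assumption standing in for the accuracy-driven selection described above:

```python
# Keep the k leading singular directions of D' and project documents.
import numpy as np

D = np.array([[2, 1, 0, 1],
              [0, 1, 1, 2],
              [1, 0, 2, 0]], dtype=float)

U, sigma, Vt = np.linalg.svd(D.T, full_matrices=False)

k = 2                    # number of topics retained (assumed)
topics = U[:, :k]        # term loadings of each topic
doc_topics = D @ topics  # documents in the k-dimensional topic space
print(np.round(doc_topics, 3))
```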

Explicit Semantic Mapping #

Semantic information inferred from relevance and proximity can instead be specified explicitly in a semantic or conceptual network. These structures have an intrinsic metric of semantic distance, and proximity can be measured via distance within the network.
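A sketch of measuring semantic distance in an explicit network, using shortest-path length over a small hypothetical term graph:

```python
# Semantic distance as shortest-path length in a conceptual network,
# computed with breadth-first search over a hand-specified toy graph.
from collections import deque

network = {
    "cat":    ["feline", "pet"],
    "feline": ["cat", "animal"],
    "dog":    ["canine", "pet"],
    "canine": ["dog", "animal"],
    "pet":    ["cat", "dog"],
    "animal": ["feline", "canine"],
}

def semantic_distance(a, b):
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nbr in network[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # terms are not connected

print(semantic_distance("cat", "dog"))  # 2, via "pet"
```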

February 2020
