Document Analysis using Support Vector Machines

This post details the Vector Space Kernel Model for document analysis outlined in Shawe-Taylor and Cristianini's Kernel Methods for Pattern Analysis.

Create Encoded Matrix for each Document #

  1. Select Dictionary of Terms
  2. Calculate Term Frequency
  3. Encode using Dictionary of Terms
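
A minimal sketch of these three steps, assuming a toy corpus and plain whitespace tokenization (the corpus, dictionary, and helper names are illustrative, not from the original post):

```python
import numpy as np

# Toy corpus; tokenization is a plain whitespace split (an assumption).
corpus = ["the cat sat", "the dog sat", "the cat ran"]

# 1. Select a dictionary of terms from the corpus.
dictionary = sorted({t for doc in corpus for t in doc.split()})

# 2-3. Calculate term frequencies and encode each document as a
#      vector indexed by the dictionary.
def encode(doc):
    tokens = doc.split()
    return np.array([tokens.count(t) for t in dictionary], dtype=float)

phi = [encode(doc) for doc in corpus]
print(dictionary)  # ['cat', 'dog', 'ran', 'sat', 'the']
print(phi[0])      # [1. 0. 0. 1. 1.]
```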

Calculate Document-Term Matrix #

The document-term matrix collects the term-frequency vectors of a document collection: each row corresponds to a document and each column to a dictionary term, so each entry records the frequency of a term in a particular document.
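
A sketch of the stacking step, reusing the toy encodings from the previous sketch (the values are illustrative):

```python
import numpy as np

# Term-frequency vectors from the previous sketch (toy values),
# over the dictionary ['cat', 'dog', 'ran', 'sat', 'the'].
phi = [np.array([1., 0., 0., 1., 1.]),   # "the cat sat"
       np.array([0., 1., 0., 1., 1.]),   # "the dog sat"
       np.array([1., 0., 1., 0., 1.])]   # "the cat ran"

D = np.vstack(phi)   # rows index documents, columns index terms
print(D.shape)       # (3, 5): 3 documents, 5 dictionary terms
```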

Kernel Methods for Support Vector Machines #

If x and y are vectors representing documents, then a kernel K is defined through a feature mapping φ as:

K(x, y) = φ(x) · φ(y)

where the kernel K computes the dot product in the new feature space. For suitable mappings this value can be evaluated directly as a function of x and y, without ever constructing the feature vectors explicitly.
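
For document vectors the simplest instance is the linear kernel, where φ is the identity map; a minimal sketch:

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = phi(x) . phi(y) with phi the identity map:
    # the inner product of the two term-frequency vectors.
    return float(np.dot(x, y))

x = np.array([1., 0., 0., 1., 1.])   # "the cat sat"
y = np.array([0., 1., 0., 1., 1.])   # "the dog sat"
print(linear_kernel(x, y))           # 2.0: the documents share two terms
```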

Document Analysis #

Given the document-term matrix D and the term-document matrix D' (its transpose), the kernel matrix of pairwise document inner products is K = DD', while D'D is the term co-occurrence matrix. Then for documents d_1 and d_2, define the Vector Space Kernel as

VSK(d_1, d_2) = φ(d_1) D'D φ(d_2)'
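
A sketch computing both matrices and the kernel on the toy document-term matrix from above (the variable names are illustrative):

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

K = D @ D.T    # kernel matrix: pairwise document inner products
C = D.T @ D    # term co-occurrence matrix D'D

def vsk(phi1, phi2, C):
    # VSK(d1, d2) = phi(d1) D'D phi(d2)'
    return float(phi1 @ C @ phi2)

print(vsk(D[0], D[1], C))
```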

Relevance Matrix #

A relevance matrix R defines the weight given to each term in the document, and a proximity matrix P defines the semantic distance between terms. The relevance matrix captures the distribution of the inverse document frequency. Although co-occurrence is not a hierarchical relation, a non-zero entry in the proximity matrix implies that two terms co-occur, which in turn implies a smaller semantic distance between them. To reduce the impact of document length on the semantic distance, the matrices require normalization. Their product defines the semantic matrix S:

S = RP

The relevance matrix R is defined in terms of the term weight w:

w(t) = ln( L / df(t) )

where L is the number of documents and df(t) is the document frequency of the term t, i.e., the number of documents in which t appears. This ratio is the inverse of the document frequency across the entire corpus.
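
A sketch of the weighting, assuming R is diagonal with entries w(t) (a common construction; the toy matrix is illustrative):

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

L = D.shape[0]               # number of documents
df = (D > 0).sum(axis=0)     # df(t): number of documents containing t
w = np.log(L / df)           # w(t) = ln(L / df(t))
R = np.diag(w)               # relevance matrix: diagonal idf weights
print(np.round(w, 3))        # a term in every document gets weight 0
```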

Proximity Matrix #

The proximity matrix P is defined as the transpose of the document-term matrix D.

P = D'

With this choice, the proximity kernel PP' = D'D measures term co-occurrence. Semantic relations between terms can then be inferred from this co-occurrence information.
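
A sketch of the proximity kernel; off-diagonal entries of D'D count how often two terms appear in the same document:

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

P = D.T        # proximity matrix: transpose of D
C = P @ P.T    # proximity kernel PP' = D'D
# C[i, j] > 0 means terms i and j co-occur in at least one document.
print(C)
```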

Given the kernel mapping through the term co-occurrence matrix D'D, singular value decomposition of D' yields the matrices U, Σ, and V, where Σ is the diagonal matrix of singular values and the columns of V represent the relevant topics within the documents:

φ(d_1) D'D φ(d_2)'
D' = U Σ V'
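
A sketch of the decomposition with NumPy, which returns V' directly:

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

# D' = U Sigma V'; numpy returns V' as Vt.
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, D.T)   # reproduces D'
print(s)   # singular values in decreasing order
```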

Semantic Analysis #

Several approaches exist for the representation of semantic information and the construction of semantic kernels.

Semantic Matrix Composition #

Given a document-term matrix and the term co-occurrence kernel mapping D'D, a semantic matrix can be composed as the product of a relevance matrix and a proximity matrix.
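
A sketch composing S = RP from the earlier definitions and using it in the kernel φ(d_1) S S' φ(d_2)' (assuming a diagonal idf-based R, as above):

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

R = np.diag(np.log(D.shape[0] / (D > 0).sum(axis=0)))  # relevance: diagonal idf
P = D.T                                                # proximity: transpose of D
S = R @ P                                              # semantic matrix S = RP

def semantic_kernel(phi1, phi2, S):
    # kappa(d1, d2) = phi(d1) S S' phi(d2)'
    return float(phi1 @ S @ S.T @ phi2)

print(semantic_kernel(D[0], D[1], S))
```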

Implicit Semantic Mapping #

Substituting the singular value decomposition D' = U Σ V' into the proximity kernel gives an implicit semantic mapping: since D'D = U Σ² U', the kernel can be evaluated through the SVD factors alone.

PK(d_1, d_2) = φ(d_1) U Σ² U' φ(d_2)'
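
A sketch verifying the identity D'D = U Σ² U' on the toy matrix, so the kernel agrees with the proximity kernel computed directly:

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

def pk(phi1, phi2):
    # PK(d1, d2) = phi(d1) U Sigma^2 U' phi(d2)'
    return float(phi1 @ U @ np.diag(s**2) @ U.T @ phi2)

# D'D = U Sigma^2 U', so this matches the direct computation.
assert np.isclose(pk(D[0], D[1]), float(D[0] @ D.T @ D @ D[1]))
print(pk(D[0], D[1]))
```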

Topic Discovery #

To discover topics within a vector space representation of a document collection, singular value decomposition is used: it yields a matrix V whose columns are eigenvectors of DD', while the columns of U are eigenvectors of the term co-occurrence matrix D'D. Minimizing the number of topics retained while maximizing classification accuracy allows the relevant topics to be extracted.
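
A sketch retaining the top-k singular directions as topics (the choice of k here is illustrative):

```python
import numpy as np

D = np.array([[1., 0., 0., 1., 1.],    # toy document-term matrix
              [0., 1., 0., 1., 1.],
              [1., 0., 1., 0., 1.]])

U, s, Vt = np.linalg.svd(D.T, full_matrices=False)
k = 2                        # number of topics retained (a modelling choice)
doc_topics = D @ U[:, :k]    # documents projected onto the top-k directions
print(doc_topics)            # each row: a document's weights on the k topics
```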

Explicit Semantic Mapping #

Semantic information inferred from the relevance and proximity information can be explicitly specified in a semantic or conceptual network. These structures have an intrinsic metric of semantic distance, and proximity can be measured via distance within the network.
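
A minimal sketch of network-based proximity, assuming a hypothetical hand-built semantic network and shortest-path length as the distance metric (both are illustrative assumptions):

```python
from collections import deque

# Hypothetical semantic network: edges link semantically related terms.
network = {"cat": ["animal"], "dog": ["animal"],
           "animal": ["cat", "dog"], "ran": ["moved"], "moved": ["ran"]}

def semantic_distance(a, b):
    # Proximity measured as shortest-path length within the network (BFS).
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in network.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float("inf")   # terms with no connecting path

print(semantic_distance("cat", "dog"))   # 2: via the shared concept "animal"
```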

J.M.
February 2020
