<h1>Document Analysis using Support Vector Machines</h1>
<p>This post details the Vector Space Kernel model for document analysis outlined in Shawe-Taylor and Cristianini.</p>
<h2 id="create-encoded-matrix-for-each-document_2">Create Encoded Matrix for each Document <a class="head_anchor" href="#create-encoded-matrix-for-each-document_2">#</a>
</h2>
<ol>
<li>Select Dictionary of Terms</li>
<li>Calculate Term Frequency</li>
<li>Encode using Dictionary of Terms</li>
</ol>
<h3 id="calculate-documentterm-matrix_3">Calculate Document-Term Matrix <a class="head_anchor" href="#calculate-documentterm-matrix_3">#</a>
</h3>
<p>The matrix representation of the document term frequencies shows the frequency of a term across a collection of documents.</p>
<h2 id="kernel-methods-for-support-vector-machines_2">Kernel Methods for Support Vector Machines <a class="head_anchor" href="#kernel-methods-for-support-vector-machines_2">#</a>
</h2>
<p>If <strong>x</strong> and <strong>y</strong> are vectors representing a document then a kernel mapping <strong>K</strong> would be defined as:</p>
<pre><code class="prettyprint">K(x, y) = φ(x) · φ(y) = φ(x · y)
</code></pre>
<p>where the kernel <strong>K</strong>, the dot product in the new feature space, is defined as a function of the dot product in the original feature space.</p>
<h2 id="document-analysis_2">Document Analysis <a class="head_anchor" href="#document-analysis_2">#</a>
</h2>
<p>Given the <em>document-term matrix</em> (<strong>D</strong>) and the <em>term-document matrix</em> (<strong>D’</strong>), define <strong>K = DD’</strong> the co-occurrence matrix. Then for documents <strong>d_1</strong> and <strong>d_2</strong>, define the <em>Vector Space Kernel</em> as</p>
<pre><code class="prettyprint">VSK(d_1, d_2) = φ(d_1) D D' φ(d_2)
</code></pre>
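<p>This quadratic form over the term co-occurrence matrix can be sketched in plain Python; the 2×3 matrix <strong>D</strong> below is a hypothetical toy corpus:</p>
<pre><code class="prettyprint">def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def vsk(phi1, phi2, D):
    # VSK(d_1, d_2) = phi(d_1) (D'D) phi(d_2)'
    DtD = matmul(transpose(D), D)
    row = matmul([phi1], DtD)[0]
    return sum(a * b for a, b in zip(row, phi2))

D = [[1, 0, 1],
     [0, 1, 1]]
k = vsk(D[0], D[1], D)
</code></pre>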
<h2 id="relevance-matrix_2">Relevance Matrix <a class="head_anchor" href="#relevance-matrix_2">#</a>
</h2>
<p>A relevancy matrix <strong>R</strong> defines the weight given to each term in the document, and a proximity matrix <strong>P</strong> defines the distance between terms. The relevancy matrix captures a distribution of the inverse document frequency. Although not hierarchical, a non-zero term in the proximity matrix implies a co-occurrence of terms, which implies less semantic distance. In order to reduce the impact of the document length on the semantic distance, the matrices require normalization.</p>
<pre><code class="prettyprint">S = RP
</code></pre>
<p>The relevancy matrix <strong>R</strong> is defined based on the term weight <strong>w</strong> shown in equation,</p>
<pre><code class="prettyprint">w(t) = ln( L / df(t) )
</code></pre>
<p>where <strong>L</strong> is the number of documents and <strong>df(t)</strong> is the document frequency for the term <strong>t</strong>. This ratio is the inverse of the document frequency across the entire corpus.</p>
<h2 id="proximity-matrix_2">Proximity Matrix <a class="head_anchor" href="#proximity-matrix_2">#</a>
</h2>
<p>The proximity matrix <strong>P</strong> is defined as the transpose of matrix <strong>D</strong>. </p>
<pre><code class="prettyprint">P = D'
</code></pre>
<p>Then, based on <strong>P</strong> the <em>Proximity Kernel</em>, <strong>D’D</strong> - indicates a value of term co-occurrence. From co-occurrence information semantic relations can be inferred between terms.</p>
<p>Given a kernel mapping of the term co-occurrence matrix <strong>D’D</strong>, singular value decomposition yields the matrices <strong>U</strong>, <strong>∑</strong>, and <strong>V</strong>, where <strong>∑</strong> is a semantic matrix and <strong>V</strong> are the relevant topics within the document.</p>
<pre><code class="prettyprint">φ(d1) D’ D φ(d2)’
D’ = U ∑ V’
</code></pre>
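<p>The decomposition can be checked numerically; the sketch below assumes NumPy and a hypothetical 2×3 document-term matrix:</p>
<pre><code class="prettyprint">import numpy as np

D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # documents x terms

# Decompose the term-document matrix: D' = U S V'.
U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

# The term co-occurrence matrix D'D then equals U S^2 U'.
cooc = D.T @ D
reconstructed = U @ np.diag(s**2) @ U.T
</code></pre>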
<h1 id="semantic-analysis_1">Semantic Analysis <a class="head_anchor" href="#semantic-analysis_1">#</a>
</h1>
<p>Several approaches exist the representation of for semantic information and construction of semantic kernels.</p>
<h2 id="semantic-matrix-composition_2">Semantic Matrix Composition <a class="head_anchor" href="#semantic-matrix-composition_2">#</a>
</h2>
<p>Given a <em>document-term matrix</em> and a <em>term co-occurrence kernel</em> mapping (<strong>D’D</strong>), a semantic matrix can be composed from the product of a <em>relevancy matrix</em> and <em>proximity matrix</em>. </p>
<h2 id="implicit-semantic-mapping_2">Implicit Semantic Mapping <a class="head_anchor" href="#implicit-semantic-mapping_2">#</a>
</h2>
<p>By taking the inner product of each basic block’s document term frequency, a kernel is composed from the corpus of semantic matrices. </p>
<pre><code class="prettyprint">PK(d_1, d_2) = φ(d_1) U ∑^2 U' φ(d_2)'
</code></pre>
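<p>The equivalence between the SVD form <strong>U ∑^2 U’</strong> and the direct co-occurrence form <strong>D’D</strong> of the kernel can be verified numerically; NumPy and the toy matrix are assumptions:</p>
<pre><code class="prettyprint">import numpy as np

D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # documents x terms
phi1, phi2 = D[0], D[1]

U, s, Vt = np.linalg.svd(D.T, full_matrices=False)

# Kernel via the SVD factors: phi(d_1) U S^2 U' phi(d_2)' ...
pk_svd = phi1 @ U @ np.diag(s**2) @ U.T @ phi2
# ... equals the quadratic form over the co-occurrence matrix D'D.
pk_direct = phi1 @ (D.T @ D) @ phi2
</code></pre>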
<h2 id="topic-discovery_2">Topic Discovery <a class="head_anchor" href="#topic-discovery_2">#</a>
</h2>
<p>In order to discover topics within a vector space representation of a document, Singular Value Decomposition is used which yields a matrix <strong>V</strong> such that columns of <strong>V</strong> are Eigenvectors of the linear combination <strong>DD’</strong> representing term co-occurrence. Minimizing the combination of topics while maximizing the classification accuracy allows relevant topics to be extracted.</p>
<h2 id="explicit-semantic-mapping_2">Explicit Semantic Mapping <a class="head_anchor" href="#explicit-semantic-mapping_2">#</a>
</h2>
<p>Semantic information inferred from the relevance and proximity information can be explicitly specified in a semantic or conceptual network. These structures have an intrinsic metric of semantic distance, and proximity can be measured via distance within the network.</p>
<p>J.M.<br>
February 2020</p>
<p>Subscribe to updates <a href="https://tinyletter.com/musgravejw">here</a>.</p>
<h1>Notes on Uncertainty</h1>
<p>These notes summarize assertions made in Pearl ’88, <em>Probabilistic Reasoning in Intelligent Systems</em>.</p>
<p>Encoding knowledge into rules requires enumerating examples. Positive examples are difficult to satisfy, and ambiguously defined. As a compromise, exceptions can be summarized. Each proposition can be assigned a measure of uncertainty which is aggregated. This uncertainty value is not a truth value, but closer to a counter-example. There is a restrictive assumption of independence. Three schools appear, <strong>non-monotonic logic</strong> which is non-numerical, <strong>probability calculus</strong> that is numerical including Demspter-Schaefer, fuzzy logic, and certainty factors, and <strong>probability theory</strong>, Bayesian probability.</p>
<pre><code class="prettyprint">A->C
B->C
(A^B) -> C
</code></pre>
<p>What do these propositions say about the interaction of A and B, and what are their exceptions?</p>
<p>Extensional systems use productions, and Intensional systems use declarative knowledge. In an extensional system uncertainty is defined as a generalized truth value. Certainty values are composable, aggregated with weights. This system relies on the principal of modularity, which is made up of the principal of locality and detachment, which I would characterize as forms of universality. This treats all rules equally. Extensional systems have challenges in the areas of <strong>bidirectional inference</strong>, <strong>retracting conclusions</strong>, <strong>correlating sources of evidence</strong>, as well as <strong>abductive reasoning</strong>.</p>
<pre><code class="prettyprint">A->B
P(A|B) > 0
System cannot infer from B to A.
Evidence of A->B removed.
</code></pre>
<p>In order for extensional or production systems to work, there must be no cycles present in the graph. This removes any predictive ability, and focuses solely on prescriptive, or diagnostic ability. In order to remove cycles, exceptions can be enumerated, but the principals of locality and detachment in modularity must be removed.</p>
<pre><code class="prettyprint">A->B
C->B
B is true.
If C is true, NOT A is also more probable than A.
</code></pre>
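<p>The effect in the propositions above, where evidence for C makes NOT A more probable, can be checked by brute-force enumeration over a toy model in which B = A OR C; the priors are hypothetical:</p>
<pre><code class="prettyprint">from itertools import product

p_a, p_c = 0.3, 0.3   # hypothetical priors for the independent causes

def joint():
    # Enumerate the joint distribution over (A, C, B) with B = A OR C.
    for a, c in product([0, 1], repeat=2):
        p = (p_a if a else 1 - p_a) * (p_c if c else 1 - p_c)
        yield a, c, int(a or c), p

def prob(pred):
    return sum(p for a, c, b, p in joint() if pred(a, c, b))

p_a_given_b = prob(lambda a, c, b: a and b) / prob(lambda a, c, b: b)
p_a_given_bc = prob(lambda a, c, b: a and b and c) / prob(lambda a, c, b: b and c)
</code></pre>
<p>Observing B raises belief in A, but additionally observing the alternative cause C drops it back to the prior: the rival explanation has explained the evidence away.</p>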
<p>The <strong>conflict is between modularity and coherence</strong>, not the binary truth value of classical logic. Exceptions are not modular, and are not local or detached. P1 -> Q1 ignores P2, P3…Pn. In a graph of propositions, <strong>evidence towards an antecedent may be evidence against the consequent</strong>. Mitigations include attempts to correlate evidence by means of bounds propagation and user specified combination functions. However, higher order correlations are required beyond a pairwise correlation. Higher order dependencies imply a dynamic relationship dependent on evidence. This cannot be specified prior to experience. Given the formalism of a certainty factor, the domain of extensional systems can be calculated to apply only to <strong>Directed Acyclic Graphs</strong>.</p>
<p>In Intensional systems, certainty measures are given to <strong>sets of worlds</strong> and their associated weights. Connectives use <strong>set theoretic operations</strong> to relate the sets of worlds. <strong>These are not composable.</strong> <strong><em>Declarative knowledge is semantically sound</em></strong> (almost by definition), it is bidirectional and highly correlated. However, this knowledge cannot be acted upon in itself. A belief network falling into a category of a <strong>Bayesian Network</strong>, or <strong>Constraint / Qualitative Markov Network</strong> using Dempster-Schaefer must be used in order to act upon the semantic knowledge.</p>
<p><strong>J.M.</strong><br>
<strong>December 2019</strong></p>