Saturday, May 9, 2015

Extracting text from XML using R

Below is the part of the XML file I am working on. I need to extract specific pieces of text from it.

The XML code:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <dependencies type="collapsed-dependencies"> <!-- the collapsed-dependencies tag -->
      <dep type="root">
        <governor idx="0">ROOT</governor>
        <dependent idx="8">provide</dependent>
      </dep>
      <dep type="mark">
        <governor idx="2">requested</governor>
        <dependent idx="1">If</dependent>
      </dep>
    </dependencies>
  </document>
</root>

I would like the output in the following format:

root(ROOT-0,provide-8)
mark(requested-2,If-1)
advcl(provide-8,requested-2)
case(TD-4,by-3)

I am able to extract each of the parameters separately, but I cannot get the whole thing out in one go.

library(XML) # assumes doc is the XML above, already parsed with xmlParse()
abstracts <- xpathSApply(doc, "//*/dependencies[@type='collapsed-dependencies']", xmlValue) # concatenated text of all words within collapsed-dependencies
abstracts #value
#"ROOTproviderequestedIfproviderequestedTDby"

type <- xpathSApply(doc, "//dependencies/dep", xmlGetAttr, 'type') #gives the types 
type #value
#[1] "root"       "mark"       "advcl"      "case"

idx1 <- xpathSApply(doc, "//dependencies/dep/governor", xmlGetAttr, 'idx') # gives the idx for governor
idx1 #value
#[1] "0"   "2"   "8"   "4"   "2"   "8"

# gives the idx for dependent
idx2 <- xpathSApply(doc, "//dependencies/dep/dependent", xmlGetAttr, 'idx') 
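
One possible way to get everything out in one go, sketched with the same XML package: instead of running a separate query per attribute, apply a small helper to each dep node and let it read the type, the governor and the dependent together. The helper name format_dep and the inline xml_text sample (just the two dep elements shown above) are illustrative assumptions, not something from the original code.

```r
library(XML)

# Sample input: only the two <dep> elements shown in the question.
xml_text <- '<root>
  <document>
    <dependencies type="collapsed-dependencies">
      <dep type="root">
        <governor idx="0">ROOT</governor>
        <dependent idx="8">provide</dependent>
      </dep>
      <dep type="mark">
        <governor idx="2">requested</governor>
        <dependent idx="1">If</dependent>
      </dep>
    </dependencies>
  </document>
</root>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: builds "type(governor-idx,dependent-idx)" for one <dep> node.
format_dep <- function(dep) {
  gov <- dep[["governor"]]   # first <governor> child of this <dep>
  dpt <- dep[["dependent"]]  # first <dependent> child of this <dep>
  sprintf("%s(%s-%s,%s-%s)",
          xmlGetAttr(dep, "type"),
          xmlValue(gov), xmlGetAttr(gov, "idx"),
          xmlValue(dpt), xmlGetAttr(dpt, "idx"))
}

# One formatted string per <dep> inside the collapsed-dependencies block.
deps <- xpathSApply(doc,
                    "//dependencies[@type='collapsed-dependencies']/dep",
                    format_dep)
writeLines(deps)
```

For the two dep elements above this prints root(ROOT-0,provide-8) and mark(requested-2,If-1), one per line. Working per dep node keeps each governor paired with its own dependent, which separate idx1/idx2 vectors do not guarantee when the document contains more than one dependencies block.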
