XML

XML is a widely used markup language, intended mainly as a means of serialising data structures as text documents. Go has basic support for processing XML documents.

Introduction

XML is now a widespread way of representing complex data structures serialised as text. It is used to describe documents, as in DocBook and XHTML, and in specialised markup languages such as MathML and CML (Chemical Markup Language). It is used to encode data as SOAP messages for Web Services, and the Web Service itself can be specified using WSDL (Web Services Description Language).

At the simplest level, XML allows you to define your own tags for use in text documents. Tags can be nested and can be interspersed with text. Each tag can also carry attributes with values. For example,


<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>
    

The structure of an XML document can be described in a number of ways, for example with a DTD (document type definition), XML Schema, or RELAX NG.

There is argument over the relative value of each way of defining the structure of an XML document. We won't buy into that, as Go does not support any of them. Go cannot check a document for validity against a schema; it can only check for well-formedness.
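As a rough illustration of what a well-formedness check looks like, the sketch below reads every token with the tokenizer described in the next section and reports the first syntax error. The helper name checkWellFormed is an assumption made for this example, and the check is approximate: it only tests what the tokenizer itself enforces, such as properly nested and matched tags.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// checkWellFormed is a hypothetical helper: it reads every token and
// returns the first syntax error, or nil if the input ends cleanly.
func checkWellFormed(doc string) error {
	d := xml.NewDecoder(strings.NewReader(doc))
	for {
		_, err := d.Token()
		if err == io.EOF {
			return nil // reached the end without a syntax error
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	fmt.Println(checkWellFormed("<a><b>text</b></a>") == nil) // true
	fmt.Println(checkWellFormed("<a><b>text</a>") == nil)     // false
}
```

No schema is consulted here: a document that nests its tags correctly passes, whatever elements it contains.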

This chapter discusses four topics: parsing an XML stream, marshalling Go data into XML, unmarshalling XML into Go data, and XHTML.

Parsing XML

Go has an XML parser in the encoding/xml package, created using NewDecoder. This takes an io.Reader as its parameter and returns a pointer to a Decoder. The main method of this type is Token, which returns the next token in the input stream. The token is one of the types StartElement, EndElement, CharData, Comment, ProcInst or Directive.

The types are:

StartElement

The type StartElement is a structure with two fields:


type StartElement struct {
    Name Name
    Attr []Attr
}

type Name struct {
    Space, Local string
}

type Attr struct {
    Name  Name
    Value string
}
        
EndElement

This is also a structure:


type EndElement struct {
    Name Name
}
        
CharData

This type represents the text content enclosed by a tag and is a simple type:


type CharData []byte
        
Comment

Comment is likewise a simple type:


type Comment []byte
        
ProcInst

A ProcInst represents an XML processing instruction of the form <?target inst?>


type ProcInst struct {
    Target string
    Inst   []byte
}
        
Directive

A Directive represents an XML directive of the form <!text>. The bytes do not include the <! and > markers.


type Directive []byte
        

A program to print out the tree structure of an XML document is


/* Parse XML
 */

package main

import (
        "encoding/xml"
        "fmt"
        "io"
        "os"
)

func main() {
        if len(os.Args) != 2 {
                fmt.Println("Usage: ", os.Args[0], "file")
                os.Exit(1)
        }
        // Open the file and let the decoder read from it directly.
        f, err := os.Open(os.Args[1])
        checkError(err)
        defer f.Close()

        parser := xml.NewDecoder(f)
        depth := 0
        for {
                token, err := parser.Token()
                if err == io.EOF {
                        break // end of input
                }
                // Any other error means the input is not well-formed.
                checkError(err)

                switch t := token.(type) {
                case xml.StartElement:
                        printElmt(t.Name.Local, depth)
                        depth++
                case xml.EndElement:
                        depth--
                        printElmt(t.Name.Local, depth)
                case xml.CharData:
                        printElmt("\""+string(t)+"\"", depth)
                case xml.Comment:
                        printElmt("Comment", depth)
                case xml.ProcInst:
                        printElmt("ProcInst", depth)
                case xml.Directive:
                        printElmt("Directive", depth)
                default:
                        fmt.Println("Unknown")
                }
        }
}

func printElmt(s string, depth int) {
        for n := 0; n < depth; n++ {
                fmt.Print("  ")
        }
        fmt.Println(s)
}

func checkError(err error) {
        if err != nil {
                fmt.Println("Fatal error ", err.Error())
                os.Exit(1)
        }
}

Note that the parser reports all CharData, including the whitespace between tags.

If we run this program against the person data structure given earlier, it produces


person
  "
  "
  name
    "
    "
    family
      " Newmarch "
    family
    "
    "
    personal
      " Jan "
    personal
    "
  "
  name
  "
  "
  email
    "
    jan@newmarch.name
  "
  email
  "
  "
  email
    "
    j.newmarch@boxhill.edu.au
  "
  email
  "
"
person
"
"

Note that as no DTD or other XML specification has been used, the tokenizer correctly prints out all the whitespace. (A DTD may specify that the whitespace can be ignored, but without it that assumption cannot be made.)

There is a potential trap in using this parser. It reuses the memory behind the byte slices in its tokens, so a token's bytes are only valid until the next call to Token; if you want to refer to a value later, you must copy it. Go has methods such as func (c CharData) Copy() CharData, and the function CopyToken, to make a copy of the data.

Unmarshalling XML

Go provides a function Unmarshal, and the Decoder methods Decode and DecodeElement, to unmarshal XML into Go data structures. The unmarshalling is not perfect: Go and XML are different languages.

We consider a simple example before looking at the details. We take the XML document given earlier:


<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>
    

We would like to map this onto the Go structures


type Person struct {
        Name Name
        Email []Email
}

type Name struct {
        Family string
        Personal string
}

type Email struct {
        Type string
        Address string
}
    

This requires several comments:

  1. Unmarshalling uses the Go reflection package. This requires that all fields be public, i.e. start with a capital letter. Earlier versions of Go used case-insensitive matching to match fields such as the XML string "name" to the field Name. Now, though, case-sensitive matching is used. To perform a match, the structure fields must be tagged to show the XML string that will be matched against. This changes Person to
    
    type Person struct {
            Name Name `xml:"name"`
            Email []Email `xml:"email"`
    }
            
  2. While tagging of fields can attach XML strings to fields, it can't do so with the names of the structures. An additional field is required, with field name "XMLName" and type xml.Name. This only affects the top-level struct, Person
    
    type Person struct {
            XMLName xml.Name `xml:"person"`
            Name Name `xml:"name"`
            Email []Email `xml:"email"`
    }
            
  3. Repeated tags in the XML map to a slice in Go
  4. Attributes within tags will match fields in a structure only if the Go field has the tag ",attr". This occurs with the field Type of Email, where matching the attribute "type" of the "email" tag requires `xml:"type,attr"`
  5. If an XML tag has no attributes and only has character data, then it matches a string field of the same name (case-sensitively, though). So the tag `xml:"family"` with character data "Newmarch" maps to the string field Family
  6. But if the tag has attributes, then it must map to a structure. Go assigns the character data to the field with tag ",chardata". This occurs with the "email" data and the field Address with tag ",chardata"

A program to unmarshal the document above is


/* Unmarshal
 */

package main

import (
        "encoding/xml"
        "fmt"
        "os"
)

type Person struct {
        XMLName xml.Name `xml:"person"`
        Name    Name     `xml:"name"`
        Email   []Email  `xml:"email"`
}

type Name struct {
        Family   string `xml:"family"`
        Personal string `xml:"personal"`
}

type Email struct {
        Type    string `xml:"type,attr"`
        Address string `xml:",chardata"`
}

func main() {
        str := `<?xml version="1.0" encoding="utf-8"?>
<person>
  <name>
    <family> Newmarch </family>
    <personal> Jan </personal>
  </name>
  <email type="personal">
    jan@newmarch.name
  </email>
  <email type="work">
    j.newmarch@boxhill.edu.au
  </email>
</person>`

        var person Person

        err := xml.Unmarshal([]byte(str), &person)
        checkError(err)

        // now use the person structure e.g.
        fmt.Println("Family name: \"" + person.Name.Family + "\"")
        fmt.Println("Second email address: \"" + person.Email[1].Address + "\"")
}

func checkError(err error) {
        if err != nil {
                fmt.Println("Fatal error ", err.Error())
                os.Exit(1)
        }
}

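Note that the whitespace in the printed values is correct: character data is copied as it appears in the document, so the family name is printed as " Newmarch " with its surrounding spaces. The precise rules are given in the documentation of the encoding/xml package.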

Marshalling XML

Go 1 also has support for marshalling data structures into an XML document. The function is

    
func Marshal(v interface{}) ([]byte, error)
    
  

The program above does not call it, but the same struct tags that control unmarshalling also control how a structure is marshalled.
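
A minimal sketch of marshalling, reusing the tagged structures from the unmarshalling section. The helper name marshalPerson is an assumption made for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type Person struct {
	XMLName xml.Name `xml:"person"`
	Name    Name     `xml:"name"`
	Email   []Email  `xml:"email"`
}

type Name struct {
	Family   string `xml:"family"`
	Personal string `xml:"personal"`
}

type Email struct {
	Type    string `xml:"type,attr"`
	Address string `xml:",chardata"`
}

// marshalPerson is a hypothetical helper that returns the XML encoding
// of p as a string.
func marshalPerson(p Person) (string, error) {
	out, err := xml.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	p := Person{
		Name:  Name{Family: "Newmarch", Personal: "Jan"},
		Email: []Email{{Type: "personal", Address: "jan@newmarch.name"}},
	}
	s, err := marshalPerson(p)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Marshal writes the document on a single line with no added whitespace. The package also provides MarshalIndent when indented output is wanted.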

XHTML

HTML does not conform to XML syntax: it has unterminated tags such as '<br>'. XHTML is a cleanup of HTML to make it compliant with XML. Documents in XHTML can be managed using the techniques above for XML.

HTML

There is some support in the XML package for handling HTML documents even though they are not XML-compliant. The XML parser discussed earlier can handle many HTML documents if it is configured as follows:

    
        parser := xml.NewDecoder(r)
        parser.Strict = false
        parser.AutoClose = xml.HTMLAutoClose
        parser.Entity = xml.HTMLEntity
    
  

Conclusion

结论

Go has basic support for dealing with XML strings. It does not as yet have mechanisms for dealing with XML specification languages such as XML Schema or Relax NG.

Copyright Jan Newmarch, jan@newmarch.name
