code and games

Cleaning XML Files Before Unmarshaling in Go

Everyone loves parsing XML files, right? The inconsistent formatting, undocumented fields and attributes, they’re always a hoot. If you’re especially unlucky, you’ll also come up against invalid Unicode characters that cause the parser to choke, as I did recently:

2017/01/24 11:28:06 error parsing data/201701182200040_58647400_2.xml: XML syntax error on line 96: illegal character code U+001E

Fortunately, Go makes it really easy to clean that part up, thanks to the excellent unicode and strings packages in the standard library. In my case, I had read the contents of an XML file into a byte slice called xmlData using os.ReadFile. I simply put the code below between the file read and the XML unmarshal, and it’s been smooth sailing ever since:

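// printOnly keeps printable runes plus tab, newline and carriage return,
// which XML allows as whitespace; strings.Map drops any rune mapped to -1.
printOnly := func(r rune) rune {
  if unicode.IsPrint(r) || r == '\t' || r == '\n' || r == '\r' {
    return r
  }
  return -1
}
xmlData = []byte(strings.Map(printOnly, string(xmlData)))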