I don’t write technical posts often because I have had to dig through more than my fair share of irrelevant blog posts from 2007. However, every once in a while I run across a problem that has a solution so convoluted and inconvenient that I feel the need to put it in one place; so here we are.
Disclaimers:
I have been working in Scala for the past year or so (but am not an expert or particularly fond of it). If you haven’t touched Scala before, it’s very similar to Java, but the syntax is not exactly the same (and also it’s a functional language, so you end up creating a lot of vals). I modified a lot of Java snippets to get this working in Scala. It is not perfect Scala; if you have a better solution, feel free to share it.
Also, I’ve pulled these code snippets out of multiple files in a much larger project, so you will run into issues if you try to just copy and paste them.
The Problem
If you’ve ever had to handle data, then you know the biggest headache is in the structure of it. JSON and XML are both ways to describe and structure how data is organized. XML came to be around 1996; JSON was created in the early 2000s. In my experience, JSON is generally seen as the more “modern” and usable approach (it is the data format used most often in JavaScript), but some people are still diehard XML fans.
If you’re working with third-party APIs that have been around for a while, then there’s a pretty good chance that they’ll be returning XML when you’re using JSON. If you’re unfamiliar with both formats, let me try to explain why this is what’s been keeping me up for the past five nights.
XML
[pastacode lang=”markup” manual=”%3Croot%3E%0A%09%3Cperson%3E%0A%09%09%3Cname%3EJones%3C%2Fname%3E%0A%09%09%3Cage%3E23%3C%2Fage%3E%0A%09%09%3Coccupation%3EConsultant%3C%2Foccupation%3E%0A%09%3C%2Fperson%3E%0A%3C%2Froot%3E” message=”” highlight=”” provider=”manual”/]
JSON
[pastacode lang=”javascript” manual=”%7B%0A%09%22person%22%3A%20%7B%0A%09%09%22name%22%3A%20%22Jones%22%2C%0A%09%09%22age%22%3A%2023%2C%0A%09%09%22occupation%22%3A%20%22Consultant%22%0A%09%7D%0A%7D” message=”” highlight=”” provider=”manual”/]
Right off the bat you can see that they’re organized a little bit differently. XML uses tags; JSON uses braces and colons. The examples above are extremely simple and can pretty easily be parsed by libraries like LiftWeb, no big deal.
Unfortunately, in real life the examples are rarely so simple. Your XML is probably going to look a little bit more like this:
[pastacode lang=”markup” manual=”%3Croot%3E%0A%09%3CPeople%3E%0A%09%09%3CPerson%20id%3D%221%22%3E%0A%09%09%09%3CPersonalDetails%3E%0A%09%09%09%09%3CName%3EJones%3C%2FName%3E%0A%09%09%09%09%3CAge%3E23%3C%2FAge%3E%0A%09%09%09%09%3CLocations%3E%0A%09%09%09%09%09%3CLocation%20reason%3D%22Born%22%3EAlaska%3C%2FLocation%3E%0A%09%09%09%09%09%3CLocation%20reason%3D%22Work%22%3ETexas%3C%2FLocation%3E%0A%09%09%09%09%09%3CLocation%20reason%3D%22Study%22%3EMoscow%3C%2FLocation%3E%0A%09%09%09%09%09%3CLocation%20reason%3D%22Study%22%3EBeijing%3C%2FLocation%3E%0A%09%09%09%09%3C%2FLocations%3E%0A%09%09%09%3C%2FPersonalDetails%3E%0A%09%09%09%3CWorkDetails%3E%0A%09%09%09%09%3CJob%3E%0A%09%09%09%09%09%3CJobTitle%3EConsultant%3C%2FJobTitle%3E%0A%09%09%09%09%09%3CCompany%3ECredera%3C%2FCompany%3E%0A%09%09%09%09%09%3CHireDate%3E2016%3C%2FHireDate%3E%0A%09%09%09%09%09%3CEndDate%3E%3C%2FEndDate%3E%0A%09%09%09%09%09%3CSkills%3E%0A%09%09%09%09%09%09%3CSkill%20expertise%3D%221%22%3EScala%3C%2FSkill%3E%0A%09%09%09%09%09%09%3CSkill%20expertise%3D%223%22%3EJavascript%3C%2FSkill%3E%0A%09%09%09%09%09%09%3CSkill%3EAngularJS%3C%2FSkill%3E%0A%09%09%09%09%09%3C%2FSkills%3E%0A%09%09%09%09%3C%2FJob%3E%0A%09%09%09%09%3CJob%3E%0A%09%09%09%09%09%3CJobTitle%3EPersonal%20Assistant%3C%2FJobTitle%3E%0A%09%09%09%09%09%3CCompany%3ESMU%3C%2FCompany%3E%0A%09%09%09%09%09%3CHireDate%3E2014%3C%2FHireDate%3E%0A%09%09%09%09%09%3CEndDate%3E2016%3C%2FEndDate%3E%0A%09%09%09%09%09%3CSkills%3E%0A%09%09%09%09%09%09%3CSkill%3EDewey%20Decimal%20System%3C%2FSkill%3E%0A%09%09%09%09%09%3C%2FSkills%3E%0A%09%09%09%09%3C%2FJob%3E%0A%09%09%09%3C%2FWorkDetails%3E%0A%09%09%3C%2FPerson%3E%0A%09%3C%2FPeople%3E%0A%3C%2Froot%3E” message=”” highlight=”” provider=”manual”/]
Suddenly some of the tags (but not all of them) have extra values inside the angle brackets (those are called attributes), some of the tags don’t have anything between them, and the XML is a lot less readable. The equivalent JSON might look like this:
[pastacode lang=”javascript” manual=”%7B%20%22people%22%3A%20%5B%0A%09%7B%0A%09%09%22id%22%3A%201%2C%0A%09%09%22personalDetails%22%3A%20%7B%0A%09%09%09%22name%22%3A%20%22Jones%22%2C%0A%09%09%09%22age%22%3A%2023%2C%0A%09%09%09%22locations%22%3A%20%5B%0A%09%09%09%09%7B%22reason%22%3A%20%22Born%22%2C%20%22location%22%3A%20%22Alaska%22%7D%2C%0A%09%09%09%09%7B%22reason%22%3A%20%22Work%22%2C%20%22location%22%3A%20%22Texas%22%7D%2C%0A%09%09%09%09%7B%22reason%22%3A%20%22Study%22%2C%20%22location%22%3A%20%22Moscow%22%7D%2C%0A%09%09%09%09%7B%22reason%22%3A%20%22Study%22%2C%20%22location%22%3A%20%22Beijing%22%7D%0A%09%09%09%5D%0A%09%09%7D%2C%0A%09%09%22workDetails%22%3A%20%7B%0A%09%09%09%22jobs%22%3A%20%5B%0A%09%09%09%09%7B%0A%09%09%09%09%09%22jobTitle%22%3A%20%22Consultant%22%2C%0A%09%09%09%09%09%22company%22%3A%20%22Credera%22%2C%0A%09%09%09%09%09%22hireDate%22%3A%202016%2C%0A%09%09%09%09%09%22endDate%22%3A%20null%2C%0A%09%09%09%09%09%22skills%22%3A%20%5B%0A%09%09%09%09%09%09%7B%22expertise%22%3A%201%2C%20%22skill%22%3A%20%22Scala%22%7D%2C%0A%09%09%09%09%09%09%7B%22expertise%22%3A%203%2C%20%22skill%22%3A%20%22Javascript%22%7D%2C%0A%09%09%09%09%09%09%7B%22expertise%22%3A%20null%2C%20%22skill%22%3A%20%22AngularJS%22%7D%0A%09%09%09%09%09%5D%0A%09%09%09%09%7D%2C%0A%09%09%09%09%7B%0A%09%09%09%09%09%22jobTitle%22%3A%20%22Personal%20Assistant%22%2C%0A%09%09%09%09%09%22company%22%3A%20%22SMU%22%2C%0A%09%09%09%09%09%22hireDate%22%3A%202014%2C%0A%09%09%09%09%09%22endDate%22%3A%202016%2C%0A%09%09%09%09%09%22skills%22%3A%20%5B%0A%09%09%09%09%09%09%7B%22expertise%22%3A%20null%2C%20%22skill%22%3A%20%22Dewey%20Decimal%20System%22%7D%0A%09%09%09%09%09%5D%0A%09%09%09%09%7D%0A%09%09%09%5D%0A%09%09%7D%0A%09%7D%0A%5D%7D” message=”” highlight=”” provider=”manual”/]
You can see how the similarities between XML and the JSON models start to fall apart due to convention as the model gets more complicated. In XML you might have a variable data model (the types of data can change according to what data is available) but in JSON you will rarely run into a key that has a value that is occasionally a list, often a string, and sometimes just doesn’t exist, all on the same API call.
Parsing XML into JSON
For this first part we used scala.xml.NodeSeq to extract the information we wanted and place it into objects accordingly.
- You can pull out child nodes by using \ followed by the node name
- You can pull out attributes by using \@ followed by the attribute name
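As a minimal sketch of those two operators on their own (this assumes the scala-xml module is on your classpath; the sample document and the object name NodeSeqBasics are made up for illustration and are not from our project):

```scala
import scala.xml.{Elem, NodeSeq, XML}

object NodeSeqBasics {
  // Hypothetical sample document, modeled on the simple XML example earlier in this post.
  val doc: Elem = XML.loadString(
    """<root><person id="1"><name>Jones</name><age>23</age></person></root>"""
  )

  // `\` selects direct children by element name and returns a NodeSeq.
  val person: NodeSeq = doc \ "person"

  // `.text` concatenates the text content of the selected nodes.
  val name: String = (person \ "name").text

  // `\@` reads an attribute and returns its value as a plain String.
  val id: String = person \@ "id"

  def main(args: Array[String]): Unit =
    println(s"id=$id name=$name")
}
```

Note that \ only looks one level down, so nested elements need one \ per level.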
[pastacode lang=”java” manual=”import%20scala.xml.NodeSeq%0A%0A%2F%2F%20Person%20and%20Job%20are%20case%20classes%20defined%20elsewhere%20in%20the%20project%20(not%20shown)%0Adef%20toJSONObject(xmlObject%3A%20NodeSeq)%3A%20List%5BPerson%5D%20%3D%20%7B%0A%20%20%2F%2F%20Each%20%3CPerson%3E%20sits%20inside%20%3CPeople%3E%2C%20so%20step%20down%20two%20levels%0A%20%20val%20people%20%3D%20xmlObject%20%5C%20%22People%22%20%5C%20%22Person%22%0A%0A%20%20people.map%20%7B%20person%20%3D%3E%0A%20%20%20%20val%20personId%20%3D%20person%20%5C%40%20%22id%22%20%2F%2F%20%5C%40%20already%20returns%20a%20String%0A%20%20%20%20val%20personalDetails%20%3D%20person%20%5C%20%22PersonalDetails%22%0A%20%20%20%20val%20personName%20%3D%20personalDetails%20%5C%20%22Name%22%0A%20%20%20%20val%20personAge%20%3D%20personalDetails%20%5C%20%22Age%22%0A%0A%20%20%20%20%2F%2F%20Each%20%3CJob%3E%20sits%20directly%20inside%20%3CWorkDetails%3E%0A%20%20%20%20val%20jobs%20%3D%20person%20%5C%20%22WorkDetails%22%20%5C%20%22Job%22%0A%0A%20%20%20%20val%20jobList%20%3D%20jobs.map%20%7B%20job%20%3D%3E%0A%20%20%20%20%20%20val%20title%20%3D%20job%20%5C%20%22JobTitle%22%0A%20%20%20%20%20%20val%20company%20%3D%20job%20%5C%20%22Company%22%0A%20%20%20%20%20%20Job(jobTitle%20%3D%20title.text%2C%20companyName%20%3D%20company.text)%0A%20%20%20%20%7D.toList%0A%0A%20%20%20%20Person(id%20%3D%20personId%2C%20name%20%3D%20personName.text%2C%20age%20%3D%20personAge.text%2C%20jobs%20%3D%20jobList)%0A%20%20%7D.toList%0A%7D” message=”” highlight=”” provider=”manual”/]
After we figured out how to do it, this became simple and even enjoyable. scala.xml.NodeSeq allows us to walk down the XML tree structure, grab the exact text and attributes that we want, then reformulate them in JSON objects that we’ve defined. If the node is blank or doesn’t exist, it returns an empty string instead of a parsing error. You just have to make sure that every field in your pre-defined JSON objects is an Option[].
Voilà, the problem of parsing the weird, ambiguous XML structure has been solved.
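To make the empty-string behavior concrete, here is a small sketch, assuming scala-xml is available. The Skill case class and the nonBlank and textOf helpers are hypothetical names for illustration; our real model classes look different.

```scala
import scala.xml.{NodeSeq, XML}

object OptionalFields {
  // Hypothetical target model: every field is optional.
  final case class Skill(name: Option[String], expertise: Option[String])

  // Treat empty or whitespace-only text as "not there".
  def nonBlank(s: String): Option[String] = Option(s.trim).filter(_.nonEmpty)

  // A missing or blank child yields "" from .text, which becomes None here.
  def textOf(parent: NodeSeq, child: String): Option[String] =
    nonBlank((parent \ child).text)

  // A missing attribute also yields "" from \@, which becomes None here.
  def toSkill(node: NodeSeq): Skill =
    Skill(name = nonBlank(node.text), expertise = nonBlank(node \@ "expertise"))

  def main(args: Array[String]): Unit = {
    val skills = XML.loadString(
      """<Skills><Skill expertise="1">Scala</Skill><Skill>AngularJS</Skill></Skills>"""
    )
    (skills \ "Skill").map(toSkill).foreach(println)
  }
}
```

Converting the empty string to None in one place keeps the "is this field present?" decision out of the rest of the mapping code.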
Parsing JSON as XML
This is where it gets weird. Unfortunately, it seems like it’s a lot harder to make elegant code that reliably parses your complex JSON objects back into XML.
The Scala Elem type that’s found in scala.xml._ allows you to create XML structures and mix in values in an incredibly simple way:
[pastacode lang=”java” manual=”val%20person%20%3D%20Person(id%20%3D%201%2C%20name%20%3D%20%22Jones%22%2C%20age%20%3D%2023%2C%20occupation%20%3D%20%22Consultant%22)%0A%0A%2F%2F%20Curly%20braces%20embed%20a%20Scala%20expression%20in%20the%20XML%20literal%0Aval%20xmlObject%20%3D%20%0A%09%3Croot%3E%0A%09%09%3Cpeople%3E%0A%09%09%09%3Cperson%3E%0A%09%09%09%09%3Cid%3E%7Bperson.id%7D%3C%2Fid%3E%0A%09%09%09%09%3Cpersonaldetails%3E%0A%09%09%09%09%09%3Cname%3E%7Bperson.name%7D%3C%2Fname%3E%0A%09%09%09%09%09%3Cage%3E%7Bperson.age%7D%3C%2Fage%3E%0A%09%09%09%09%3C%2Fpersonaldetails%3E%0A%09%09%09%09%3Cworkdetails%3E%0A%09%09%09%09%09%3Ctitle%3E%7Bperson.occupation%7D%3C%2Ftitle%3E%0A%09%09%09%09%3C%2Fworkdetails%3E%0A%09%09%09%3C%2Fperson%3E%0A%09%09%3C%2Fpeople%3E%0A%09%3C%2Froot%3E” message=”” highlight=”” provider=”manual”/]
If you’re dealing with elements that have lists or fields that may or may not exist, then Elem isn’t going to cut it. You want something that can parse your JSON into XML with attributes and a minimum amount of typing on your part. A lot of the libraries will cleanly parse JSON objects into XML even if they have complex organizations, but it was a struggle to find a library that would also dynamically parse attributes.
Staxon
This is where the Staxon library comes in (you can find GitHub documentation here). They have examples on their wiki for converting XML to JSON and JSON to XML, so I won’t steal their thunder by copying and pasting their exact code here, but I will show you what we did.
Staxon solves the attribute issue by changing the way you name the keys in your JSON objects. @ symbols denote a key that is an attribute for the containing key (so in the example below, if you had a list of jobs, the XML would look like <job order="2"><title>Con... etc. etc.</job>).
[pastacode lang=”javascript” manual=”%7B%0A%09%22person%22%3A%20%7B%0A%09%09%22%40id%22%20%3A%201%2C%0A%09%09%22name%22%20%3A%20%22Jones%22%2C%0A%09%09%22job%22%20%3A%20%5B%0A%09%09%09%7B%0A%09%09%09%09%22%40order%22%20%3A%202%2C%0A%09%09%09%09%22title%22%20%3A%20%22Consultant%22%2C%0A%09%09%09%09%22company%22%20%3A%20%22Credera%22%0A%09%09%09%7D%2C%0A%09%09%09%7B%0A%09%09%09%09%22%40order%22%20%3A%201%2C%0A%09%09%09%09%22title%22%20%3A%20%22UX%20Consultant%22%2C%0A%09%09%09%09%22company%22%20%3A%20%22New%20Economic%20School%20of%20Moscow%22%0A%09%09%09%7D%0A%09%09%5D%0A%09%7D%0A%7D” message=”” highlight=”” provider=”manual”/]
Unfortunately, Scala being the finicky beast that it is, you can’t use an @ symbol as the beginning of a key name in a JSON object. If you use backticks (`) to escape the @ symbol, your IDE will likely not give you any errors, but it will probably throw a runtime error. Our way around this was to add underscores ( _ ) in the model where we wanted the @ symbol to be, and then when we stringified the object we simply did a replaceAll("_", "@") to get the desired format.
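Here is a rough sketch of that replacement on a plain String, using only the standard library. The first function is the blunt approach we used. The second is a narrower variant that is my assumption of how you could limit the damage, not something from our project: it only rewrites an underscore at the start of a quoted key. The object and function names are made up for illustration.

```scala
object AttributeKeys {
  // Blunt approach described above: every underscore becomes an @,
  // including underscores inside values.
  def replaceAllUnderscores(json: String): String =
    json.replaceAll("_", "@")

  // Narrower variant (an assumption, not from the original project):
  // only rewrite a leading underscore in a quoted key, so "_id": becomes "@id":
  def prefixedKeysToAttributes(json: String): String =
    json.replaceAll("\"_(\\w+)\"\\s*:", "\"@$1\":")

  def main(args: Array[String]): Unit = {
    val json = """{"person":{"_id":1,"name":"snake_case"}}"""
    println(replaceAllUnderscores(json))
    println(prefixedKeysToAttributes(json))
  }
}
```

The blunt version is fine as long as none of your keys or values contain underscores; if they might, something like the narrower pattern is worth considering.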
We also modified the Input and Output streams (originally Java inputStream and outputStream) from the original Staxon documentation into ByteArrayInputStream/ByteArrayOutputStream so we could pass in and parse out Strings instead of just printing to a file or the command line.
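The String-in, String-out wrapping can be sketched with nothing but java.io. In this sketch, viaStreams and copy are hypothetical names; copy just passes bytes through and stands in for whatever stream-based converter you plug in (for us, the Staxon reader/writer pair).

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.nio.charset.StandardCharsets

object StringStreams {
  // Hypothetical stand-in for a stream-based converter: copies bytes unchanged.
  def copy(in: InputStream, out: OutputStream): Unit = {
    val buffer = new Array[Byte](4096)
    var read = in.read(buffer)
    while (read != -1) {
      out.write(buffer, 0, read)
      read = in.read(buffer)
    }
  }

  // Wrap a String in an in-memory input stream, run the converter,
  // and read the result back out of an in-memory output stream.
  def viaStreams(text: String)(convert: (InputStream, OutputStream) => Unit): String = {
    val input = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))
    val output = new ByteArrayOutputStream()
    try {
      convert(input, output)
      new String(output.toByteArray, StandardCharsets.UTF_8)
    } finally {
      input.close()
      output.close()
    }
  }

  def main(args: Array[String]): Unit =
    println(viaStreams("""{"person":{"@id":1}}""")(copy))
}
```

Naming the charset explicitly on both sides avoids surprises when the platform default is not UTF-8.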
Disclaimer 2.0: To re-emphasize before I get 50 code reviews, this snippet is not code complete. We declare implicit values of objects, translators, and JSON formatters with Spray in other files in our code. The base of this function is usually intended to return the result of an API call, not just to transform an object (that’s where the Future() comes in at the end). The functionality of this snippet is spread out over at least 4-5 files and multiple functions; I ordered it this way for simplicity in reading, not for efficiency.
[pastacode lang=”java” manual=”import%20javax.xml.stream.XMLEventReader%3B%0Aimport%20javax.xml.stream.XMLEventWriter%3B%0Aimport%20javax.xml.stream.XMLOutputFactory%3B%0Aimport%20javax.xml.stream.XMLStreamException%3B%0A%0Aimport%20de.odysseus.staxon.json.JsonXMLConfig%3B%0Aimport%20de.odysseus.staxon.json.JsonXMLConfigBuilder%3B%0Aimport%20de.odysseus.staxon.json.JsonXMLInputFactory%3B%0Aimport%20de.odysseus.staxon.xml.util.PrettyXMLEventWriter%3B%0A%0Aimport%20java.io.ByteArrayInputStream%0Aimport%20java.io.ByteArrayOutputStream%0A%2F%2FThis%20may%20not%20be%20the%20complete%20list%20you%20need%20%5E%20so%20don’t%20hate%20me%20if%20you%20still%20have%20to%20import%20some%20other%20libraries%0A%0Adef%20editPerson(person%3A%20Person)%3A%20Future%5BUnit%5D%20%3D%20%7B%0A%20%20%20%0A%09%2F%2FThis%20toPerson()%C2%A0is%20a%20different%20transformational%20function%C2%A0where%20we%20add%20in%20default%20attribute%20values%C2%A0like%20the%20namespace%2C%20not%20included%20in%20this%20snippet%20%0A%20%20%20%20val%20formattedPerson%20%3D%20toPerson(person)%0A%09%0A%09%2F*%C2%A0EDIT%3A%20Because%20this%20has%20been%20mentioned%20in%20the%20comments%0A%C2%A0%09*%C2%A0We%20put%20this%20object%20into%20a%20JSON%20format%C2%A0-%20This%20JSON%C2%A0-%3E%20String%20-%3E%20XML%20can%20(and%20should)%20be%20put%20into%20a%20separate%20modularized%C2%A0function%2C%C2%A0%0A%09*%C2%A0I’m%20laying%20it%20out%20this%20way%C2%A0so%20you%20can%20see%20the%20linear%20process%C2%A0and%20not%20have%20to%20jump%20between%20functions%0A%09*%C2%A0We%20use%20implicit%20values%20in%20order%20to%20get%20the%20%22toJson%22%20to%20work%20(case%20class%20Person()%20)%20etc.%C2%A0%0A%09*%C2%A0%0A%09*%2F%0A%20%20%20%20val%20jsonPerson%20%3D%20formattedPerson.toJson%0A%C2%A0%20%C2%A0%C2%A0%0A%C2%A0%20%C2%A0%20%2F%2FWe%20stringify%20the%20JSON%20format%20and%20replace%20all%20the%20_%20with%20%40%20signs%20to%20indicate%20an%20attribute%0A%20%20%20%20val%20stringPerson%20%3D%20jsonPerson.toString().replac
eAll(%22_%22%2C%20%22%40%22)%0A%09%09%0A%09%2F%2FThe%C2%A0input%20is%20established%20as%20a%20ByteArrayInputStream%20(this%20is%20so%20it%20works%20with%20the%20Staxon%20methods)%0A%20%20%20%20val%20input%20%3D%20%20new%20java.io.ByteArrayInputStream(stringPerson.getBytes)%0A%09%0A%09%2F%2FWe%20send%20it%20to%20a%20translator%C2%A0that%20parses%20it%20to%20XML%0A%20%20%20%20val%20requestBody%20%3D%20toXml(input)%0A%09%09%0A%09%2F%2FWe%20have%20to%20add%20the%20content%20type%20to%20the%C2%A0httpEntity%20before%20sending%20it%20off%0A%20%20%20%20val%20httpEntity%20%3D%20HttpEntity.apply(MediaTypes.%60application%2Fxml%60%2C%20requestBody)%0A%09%09%0A%09%2F%2FAttach%20it%20to%20the%20request%C2%A0and%20get%20the%20result%09%0A%20%20%20%20val%20request%20%3D%20Put(%22%2Fapi%2Fcall%3Faction%3Dedit%22%2C%20httpEntity).withHeaders()%0A%20%20%20%20val%20result%20%3D%20pipeline(request)%0A%20%20%20%20%0A%20%20%20Future()%0A%20%20%7D%0A%0A%0A%2F%2FThis%20is%20a%20modified%20version%20of%20what%20is%20on%20the%20Staxon%20GitHub%20to%20allow%20for%20Stringification%0Adef%20toXml(json%3A%20ByteArrayInputStream)%3A%20String%20%3D%20%7B%0A%20%20%20%20%0A%20%20%20%20val%20config%20%3D%20new%20JsonXMLConfigBuilder().multiplePI(false).build()%3B%0A%20%20%20%20val%20output%20%3D%20new%20ByteArrayOutputStream()%3B%0A%09%09try%20%7B%0A%09%09%09%2F*%0A%09%09%09%20*%20Create%20reader%20(JSON).%0A%09%09%09%20*%2F%0A%09%09%09val%20reader%20%3D%20new%20JsonXMLInputFactory(config).createXMLEventReader(json)%3B%0A%09%09%09%0A%09%09%09%2F*%0A%09%09%09%20*%20Create%20writer%20(XML).%0A%09%09%09%20*%2F%0A%09%09%09val%20writer%20%3D%20XMLOutputFactory.newInstance().createXMLEventWriter(output)%3B%0A%09%09%09val%20prettyWriter%20%3D%20new%20PrettyXMLEventWriter(writer)%3B%20%2F%2F%20format%20output%0A%09%09%09%0A%09%09%09%2F*%0A%09%09%09%20*%20Copy%20events%20from%20reader%20to%20writer.%0A%09%09%09%20*%2F%0A%09%09%09prettyWriter.add(reader)%3B%0A%09%09%09%0A%09%09%09%2F*%0A%09%09%09%20*%20Close%2
0reader%2Fwriter.%0A%09%09%09%20*%2F%0A%09%09%09reader.close()%3B%0A%09%09%09writer.close()%3B%0A%09%09%09val%20finalOutput%20%3D%20output.toString()%0A%09%09%09finalOutput%0A%09%09%7D%20finally%20%7B%0A%09%09%09%2F*%0A%09%09%09%20*%20As%20per%20StAX%20specification%2C%20XMLEventReader%2FWriter.close()%20doesn’t%20close%0A%09%09%09%20*%20the%20underlying%20stream.%0A%09%09%09%20*%2F%0A%09%09%09%0A%09%09%09json.close()%3B%0A%09%09%09output.close()%3B%09%0A%09%09%7D%0A%20%20%7D” message=”Copying JSON to XML via StAX Event API (Modified)” highlight=”” provider=”manual”/]
There you go. It’s not the prettiest way to parse something with all of the transformations, but trust me when I say it is super effective.
Also, a large amount of credit goes to the technical lead on my project who did a lot of research and was the one who eventually found the Staxon library. When I say “we” in this article, the research that went into finding and implementing this solution was truly a team effort.