We live in a world of complex systems that are continually generating vast amounts of data. The catch is that not all of this data is created the same. There are hundreds upon hundreds of different file formats, ranging from typical office tools like documents and spreadsheets to specialized formats that support scientific analysis and 3D graphic rendering. Each one of these formats collects different types of data for different reasons.
So why is that a problem? The challenge is that enterprise organizations are continually adopting new technology platforms in order to maintain their competitive edge. Simply converting data from one system to another can require a considerable investment in time and money. This is especially challenging in industries like aerospace, where regulations require a high degree of data precision.
Data Conversion in a Highly Regulated Industry
Accuracy is the number one requirement for data conversion in an industry like aerospace. That is a challenge, because the process typically demands zero data loss, well-structured output, and delivery to multiple formats. On top of this, a successful data conversion must minimize manual review. Otherwise, the process can require an army of subject matter experts and cause lengthy delays.
Many industry experts point to using Extensible Markup Language (XML) as an intermediary format in order to establish successful data conversion processes. This is because converting data into XML provides a foundation for accurate formatting and reporting. Further, XML-based data, tools, and processes provide an excellent foundation for automated QA and data production to multiple output formats.
What is XML?
XML is used in technical information communities to author and maintain structured data. The data is hierarchical, and each data component is described using XML elements and attributes. Under the covers, XML is plain text. Emphasis such as bold, italic, or underline is not stored as formatting; it is marked up with elements, and a stylesheet decides how those elements are rendered. XML files can live on a file system or be stored in a Content Management System (CMS).
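To make the idea of elements, attributes, and hierarchy concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `task`, `title`, `step`, and `emphasis` names are invented for illustration and do not come from any particular aerospace schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical maintenance-task fragment; all element and attribute
# names here are assumptions made up for this example.
SAMPLE = """<task id="T-100" revision="2">
  <title>Inspect landing gear</title>
  <step>Check tire pressure with <emphasis>calibrated</emphasis> gauge.</step>
  <step>Record readings.</step>
</task>"""

root = ET.fromstring(SAMPLE)

# Attributes describe the component itself.
task_id = root.get("id")

# Child elements carry the hierarchical content.
title = root.findtext("title")

# itertext() flattens inline markup such as <emphasis> back to plain text.
steps = ["".join(step.itertext()) for step in root.findall("step")]

print(task_id, title, steps)
```

Note that the emphasized word is just another element in the plain-text source; how it looks on the page is left to whatever renders the document.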
Besides technical content, other uses for XML include communicating with web services, distributing news feeds, reporting stock market prices and financial trends, describing graphic formats such as SVG, and many, many more.
As a markup language, XML is not an end-point – it is a dynamic framework that allows additional types of content to be developed from it. This means XML stylesheets can be used to produce multi-channel output (HTML, PDF, and more) from a single source, or you can even render XML directly in a web browser.
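As a rough illustration of single-source, multi-channel output, the sketch below renders one small XML document as both HTML and plain text. In practice an XSLT stylesheet would usually do this job; plain Python functions stand in for it here so the example runs with the standard library alone. The `notice`, `title`, and `para` element names are assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single source; the element names are invented for this sketch.
SOURCE = (
    "<notice><title>Service Bulletin</title>"
    "<para>Torque bolts to spec.</para>"
    "<para>Use parts &amp; tools listed.</para></notice>"
)

def to_html(doc):
    """Render the notice as an HTML fragment."""
    parts = [f"<h1>{escape(doc.findtext('title', ''))}</h1>"]
    parts += [f"<p>{escape(p.text or '')}</p>" for p in doc.findall("para")]
    return "\n".join(parts)

def to_text(doc):
    """Render the same notice as plain text with an underlined title."""
    title = doc.findtext("title", "")
    lines = [title, "=" * len(title)]
    lines += [p.text or "" for p in doc.findall("para")]
    return "\n".join(lines)

doc = ET.fromstring(SOURCE)
html_output = to_html(doc)
text_output = to_text(doc)

print(html_output)
print(text_output)
```

Both outputs come from the same parsed tree, so a correction made once in the source shows up in every channel.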
Even if data is being converted between non-XML formats, XML can be used to “neutralize” the data and facilitate a precise transformation. For example, you may have MS Word documents with very specific and complex formats that need to be converted into text that another system can read plainly and simply.
Or you may have GML content that needs to be converted to specific SGML document types (GML was the predecessor of SGML and SGML was the predecessor of XML). This may sound like an easy jump on the surface, but in reality GML and SGML markup are worlds apart.
So Why Use XML for Data Conversions?
Quite simply, XML has many brilliant and powerful tools to help get the job done, including:
- Product-specific import and export tools – The Import feature of PTC’s Arbortext Publishing Engine imports files from Microsoft Word, FrameMaker, RTF, HTML, PDF and text into XML.
- Extensible Stylesheet Language Transformations – XSLT is a language for transforming an XML document into another type of XML document, into HTML, or into plain text. Like XML, XSLT is a W3C Recommendation.
- Application Programming Interfaces – APIs bring the power of programming languages to XML-centric tools and data, alongside the standard file input and output utilities that are part of virtually every programming language.
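As one small example of the API route, the following sketch uses Python's standard `xml.etree.ElementTree` module to turn simple `key: value` text into XML. The input layout and the `record` and `field` names are assumptions chosen for illustration, not the behavior of any product named above.

```python
import xml.etree.ElementTree as ET

def text_to_xml(text):
    """Turn 'key: value' lines into <field name="key">value</field> elements."""
    record = ET.Element("record")
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that have no key/value separator
        key, value = line.split(":", 1)
        field = ET.SubElement(record, "field", name=key.strip())
        field.text = value.strip()
    return record

# Hypothetical input; a real conversion would read this from a file.
raw = "part: 4471-B\nstatus: serviceable\n"
xml_string = ET.tostring(text_to_xml(raw), encoding="unicode")

print(xml_string)
```

Once the data is in this neutral form, the same tree can be handed to a stylesheet or another writer to produce the target format.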
The Proof is in the Pudding
We don’t try to hide the fact that we have a lot of experience working with XML here at Terra. That being the case, it’s easy to offer two real-world cases of data conversion, where even though the input and output formats are very different, they both rely on XML as the core interim format.
Example 1: In the case of converting from MS Word, we leverage XML in the interim to be able to neutralize the data and then write it to the desired output format. Because XML offers structured, characterized data, we can glean all properties needed from the input and store them in the XML, then leverage the structure and characteristics to write the desired output.
Example 2: In the case of GML to SGML conversion, we took the data beyond SGML to XML so that we could manipulate and neutralize it there, and then wrote the XML files out to SGML. This approach proved to be very reliable, accurate, and economical (it also saved our customer a ton of money).
Like you, we want data conversion to be as accurate as possible. And like you, we want it to require as little manual labor as possible without sacrificing data integrity. To achieve this, our approach employs three strategies:
- Proofing the concept manually
- Batch conversions
- Writing logs with status information
As mentioned earlier, visual checks and manual QA can be time-consuming. Writing logs with error, warning, and information messages lets reviewers quickly scan for and identify issues. Logs also pinpoint the exact locations where input or output is ambiguous and, ultimately, where errors are most likely. This is a more proactive approach: known conditions are flagged during the process rather than discovered in post-process visual checks.
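A minimal sketch of batch conversion with status logging might look like the following. It uses Python's standard `logging` module; the file names, the `title` check, and the message wording are all assumptions for illustration rather than a description of our actual tooling.

```python
import logging
import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("conversion")

def convert_batch(documents):
    """Check each (name, xml_text) pair; return the names that parsed cleanly."""
    converted = []
    for name, xml_text in documents:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError as err:
            # Error: the exact input location goes into the log for reviewers.
            line, column = err.position
            log.error("%s: not well-formed at line %d, column %d", name, line, column)
            continue
        if root.find("title") is None:
            # Warning: a known condition that may need a human decision.
            log.warning("%s: no <title> element; output will lack a heading", name)
        log.info("%s: converted (%d child elements)", name, len(root))
        converted.append(name)
    return converted

# Hypothetical batch; a real run would read these documents from disk.
batch = [
    ("a.xml", "<doc><title>Ok</title></doc>"),
    ("b.xml", "<doc><para>No title</para></doc>"),
    ("c.xml", "<doc><title>Broken</doc>"),
]
clean = convert_batch(batch)
```

A reviewer can then filter the log for ERROR and WARNING lines instead of visually checking every converted file.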