By: Bruno Dirkx, Team Leader Data Science, NGDATA
When parsing a JSON file, or an XML file for that matter, you have two options. You can read the file entirely into an in-memory data structure (a tree model), which allows easy random access to all the data. Or you can process the file in a streaming manner: either the parser is in control, pushing out events (as is the case with XML SAX parsers), or the application pulls the events from the parser. The push approach has the advantage that it's easy to chain multiple processors, but it's quite hard to implement. The pull approach has the advantage that it's rather easy to program and that you can stop parsing when you have what you need.
I was working on a little import tool for Lily which would read a schema description and records from a JSON file and put them into Lily.
Since I did not want to spend hours on this, I thought it was best to go for the tree model, thus reading the entire JSON file into memory. Still, it seemed like the sort of tool which might be easily abused: generate a large JSON file, then use the tool to import it into Lily. In this case, reading the file entirely into memory might be impossible.
So I started using Jackson’s pull API, but quickly changed my mind, deciding it would be too much work. But then I looked a bit closer at the API and found out that it’s very easy to combine the streaming and tree-model parsing options: you can move through the file as a whole in a streaming way, and then read individual objects into a tree structure.
As an example, let’s take the following input:
{
  "records": [
    {"field1": "outer", "field2": "thought"},
    {"field2": "thought", "field1": "outer"}
  ],
  "special message": "hello, world!"
}
For this simple example it would be better to use plain CSV, but just imagine the fields being sparse or the records having a more complex structure.
The following snippet illustrates how this file can be read using a combination of stream and tree-model parsing. Each individual record is read in a tree structure, but the file is never read in its entirety into memory, making it possible to process JSON files gigabytes in size while using minimal memory.
import org.codehaus.jackson.map.*;
import org.codehaus.jackson.*;
import java.io.File;

public class ParseJsonSample {
    public static void main(String[] args) throws Exception {
        JsonFactory f = new MappingJsonFactory();
        JsonParser jp = f.createJsonParser(new File(args[0]));

        JsonToken current = jp.nextToken();
        if (current != JsonToken.START_OBJECT) {
            System.out.println("Error: root should be object: quitting.");
            return;
        }

        while (jp.nextToken() != JsonToken.END_OBJECT) {
            String fieldName = jp.getCurrentName();
            // move from field name to field value
            current = jp.nextToken();
            if (fieldName.equals("records")) {
                if (current == JsonToken.START_ARRAY) {
                    // for each of the records in the array
                    while (jp.nextToken() != JsonToken.END_ARRAY) {
                        // read the record into a tree model;
                        // this moves the parsing position to the end of it
                        JsonNode node = jp.readValueAsTree();
                        // and now we have random access to everything in the object
                        System.out.println("field1: " + node.get("field1").getValueAsText());
                        System.out.println("field2: " + node.get("field2").getValueAsText());
                    }
                } else {
                    System.out.println("Error: records should be an array: skipping.");
                    jp.skipChildren();
                }
            } else {
                System.out.println("Unprocessed property: " + fieldName);
                jp.skipChildren();
            }
        }
    }
}
As you can guess, each nextToken() call yields the next parsing event: start object, start field, start array, start object, …, end object, …, end array, …
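To make that event sequence concrete, here is a small sketch (assuming the same Jackson 1.x org.codehaus.jackson dependency as the snippet above; the TokenDump class name is just for illustration) that prints every token the parser emits for a trimmed-down version of the sample input:

```java
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import java.util.ArrayList;
import java.util.List;

public class TokenDump {
    // Collect the raw event stream for a JSON document as token names.
    public static List<String> tokens(String json) throws Exception {
        JsonParser jp = new JsonFactory().createJsonParser(json);
        List<String> events = new ArrayList<String>();
        JsonToken t;
        while ((t = jp.nextToken()) != null) {
            events.add(t.name());
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        // prints START_OBJECT, FIELD_NAME, START_ARRAY, START_OBJECT,
        // FIELD_NAME, VALUE_STRING, END_OBJECT, END_ARRAY, END_OBJECT
        for (String event : tokens("{\"records\":[{\"field1\":\"outer\"}]}")) {
            System.out.println(event);
        }
    }
}
```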
The jp.readValueAsTree() call lets you read what is at the current parsing position, a JSON object or array, into Jackson's generic JSON tree model. Once you have this, you can access the data randomly, regardless of the order in which things appear in the file (in the example, field1 and field2 are not always in the same order). Jackson also supports mapping onto your own Java objects. The jp.skipChildren() call is convenient: it lets you skip over a complete object tree or array without having to walk through all the events contained in it.
Once again, this illustrates the great value there is in the open source libraries out there.
Parsing a Large JSON File for 2017 and Beyond
While the example above is still popular, I wanted to update it with newer methods and libraries that have emerged since.
GSON
There are some excellent libraries for parsing large JSON files with minimal resources. One is the popular GSON library. It achieves the same effect of parsing the file as both a stream and an object: it handles each record as it passes and then discards it, keeping memory usage low.
Here’s a great example of using GSON in a “mixed reads” fashion (using both streaming and object model reading at the same time).
If you’re interested in using the GSON approach, there’s a great tutorial for that here.
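As a rough sketch of what such mixed reads can look like, the snippet below streams through the top-level object with GSON's JsonReader but binds each element of the records array as a whole object (assuming Gson 2.x on the classpath; the GsonMixedReads class name and the Map-based binding are illustrative choices, not part of the original article):

```java
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import com.google.gson.stream.JsonReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GsonMixedReads {
    // Stream the top-level object; bind each "records" element via data binding.
    public static List<Map<String, String>> parse(String json) throws Exception {
        Gson gson = new Gson();
        List<Map<String, String>> records = new ArrayList<>();
        try (JsonReader reader = new JsonReader(new StringReader(json))) {
            reader.beginObject();                  // consume the root object
            while (reader.hasNext()) {
                if (reader.nextName().equals("records")) {
                    reader.beginArray();           // stream the array...
                    while (reader.hasNext()) {
                        // ...but read each record as a whole object
                        Map<String, String> record = gson.fromJson(
                                reader, new TypeToken<Map<String, String>>() {}.getType());
                        records.add(record);
                    }
                    reader.endArray();
                } else {
                    reader.skipValue();            // ignore other properties
                }
            }
            reader.endObject();
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        String json = "{ \"records\": [ {\"field1\": \"outer\", \"field2\": \"thought\"} ],"
                + " \"special message\": \"hello, world!\" }";
        for (Map<String, String> record : parse(json)) {
            System.out.println("field1: " + record.get("field1"));
            System.out.println("field2: " + record.get("field2"));
        }
    }
}
```

For a genuinely large file you would pass a FileReader (or better, a buffered one) to JsonReader instead of a StringReader; only one record is held in memory at a time.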
.NET Processing of Large JSON Files
If you’re working in the .NET stack, Json.NET is a great tool for parsing large files. It’s fast, efficient, and it’s the most downloaded NuGet package out there.
JSON Processing API
Another good tool for parsing large JSON files is the JSON Processing API. For an example of how to use it, see this Stack Overflow thread. To download the API itself, click here.
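To give a flavor of the API, here is a minimal sketch of its pull-style event loop (assuming a JSON-P implementation such as org.glassfish:javax.json on the classpath; the JsonpStreaming class name is hypothetical), which collects every string value together with its key:

```java
import javax.json.Json;
import javax.json.stream.JsonParser;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class JsonpStreaming {
    // Walk the pull-parser event stream and collect "key: value" pairs
    // for every string value encountered.
    public static List<String> stringValues(String json) {
        List<String> values = new ArrayList<>();
        String currentKey = null;
        try (JsonParser parser = Json.createParser(new StringReader(json))) {
            while (parser.hasNext()) {
                switch (parser.next()) {
                    case KEY_NAME:
                        currentKey = parser.getString();
                        break;
                    case VALUE_STRING:
                        values.add(currentKey + ": " + parser.getString());
                        break;
                    default:
                        break; // START_OBJECT, START_ARRAY, END_*, numbers, ...
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String json = "{\"records\":[{\"field1\":\"outer\",\"field2\":\"thought\"}]}";
        for (String line : stringValues(json)) {
            System.out.println(line);
        }
    }
}
```

As with the Jackson and GSON examples, the parser only ever holds the current event, so the same loop works on an InputStream over a multi-gigabyte file.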
Large JSON File Parsing for Python
One programmer friend who works in Python and handles large JSON files daily uses the pandas Python Data Analysis Library. For Python and JSON, this library offers the best balance of speed and ease of use. For added functionality, pandas can be used together with scikit-learn, the free Python machine-learning library.
Additional Material
Here's some additional reading material on processing huge JSON files with minimal resources.
- Stack Overflow thread on processing large JSON files.
- Parsing JSON files for Android. (See answer by Genson author.)
- Maven and parsing JSON files. (See answer about GSON, ORG.JSON, and Jackson.)
- Stack Overflow GSON JSON large file parsing example.