Crystal Language Logo

Logo of the Crystal Language

Crystal's performance is useful in IO intense tasks such as mapping JSON files into an object. Recently I worked on a project with my friend David Colombo. David needed to take JSON files, map the data onto an object, and insert it into a SQLite Database. This post describes mapping JSON files in Crystal.

Background

Previously, David had created a parser in NodeJS that reads through the JSON files containing structured data and inserts the data into an SQLite database. These JSON Files contained 1,000s of keys, and are 100s of megabytes each. Unfortunately due to limitations of the language and some anti-practices that JavaScript allows, the script took 14 hours to complete it’s task. Mapping JSON files this way is too slow. David attempted to write a parser in Python. The parser was limited to 9999 keys and couldn't meet the project's requirements. He sent me a message asking if I was still working with Ruby and asked if it'd be any faster. I said that while I'm able to write a script in Ruby just fine, I had been experimenting with a language called Crystal (a language with Ruby-like syntax and C-like performance) and asked if he'd be open to trying it instead of Ruby. I also wasn't able to predict the performance of Ruby ahead of time and was not prepared to provide an answer on whether Ruby would be faster. Since I’ve been learning Crystal I decided to give the rewrite in it a shot.

A few requirements

We want the code to be long lived and require minimal changes to the project's dependencies. We also wanted to avoid using third party libraries, only using Crystal’s standard library was a goal. The standard library does not support databases. We decided to use the official shard for SQLite. It’s maintained by the Crystal Core Team so it will probably be well maintained.

An object to represent the JSON Files

To maximize performance I decided to use a nested struct to contain the data using Crystal's JSON.mapping() (https://crystal-lang.org/api/0.28.0/JSON.html#mapping). The data won't be changed at runtime. Data is only copied into SQLite. Stack memory is cheaper than heap memory so the drawbacks of structs were worth it. In exchange we gained performance benefits while mapping JSON files.

module MyProgram
extend self
struct CVE_Data_Entity
struct CVE_Items
struct CVE
struct DataMeta
JSON.mapping(
"id": {key: "ID", type: String, nilable: true},
"assigner": {key: "ASSIGNER", type: String, nilable: true}
)
end
struct Affects
struct Vendor
struct VendorData
struct Product
struct Data
struct Version
struct Data
JSON.mapping(
"version_value": {type: String, nilable: true},
"version_affected": {type: String, nilable: true},
)
end
JSON.mapping(
"version_data": {type: Array(Data), nilable: true},
)
end
JSON.mapping(
"product_name": {type: String, nilable: true},
"version": {type: Version, nilable: true}
)
end
JSON.mapping(
"product_data": {type: Array(Data), nilable: true},
)
end
JSON.mapping(
"vendor_name": {type: String, nilable: true},
"product": {type: Product, nilable: true},
)
end
JSON.mapping(
"vendor_data": {type: Array(VendorData), nilable: true},
)
end
JSON.mapping(
"vendor": {type: Vendor, nilable: true},
)
end
struct Problemtype
struct Data
struct Description
JSON.mapping(
"lang": {type: String, nilable: true},
"value": {type: String, nilable: true}
)
end
JSON.mapping(
"description": {type: Array(Description), nilable: true}
)
end
JSON.mapping(
"problemtype_data": {type: Array(Data), nilable: true},
)
end
struct References
struct Data
JSON.mapping(
"url": {type: String, nilable: true},
"name": {type: String, nilable: true},
"refsource": {type: String, nilable: true},
"tags": {type: Array(String), nilable: true},
)
end
JSON.mapping(
"reference_data": {type: Array(Data), nilable: true},
)
end
struct Description
struct Data
JSON.mapping(
"lang": {type: String, nilable: true},
"value": {type: String, nilable: true},
)
end
JSON.mapping(
"description_data": {type: Array(Data), nilable: true}
)
end
JSON.mapping(
"data_type": {type: String, nilable: true},
"data_format": {type: String, nilable: true},
"data_version": {type: String, nilable: true},
"cve_data_meta": {key: "CVE_data_meta", type: DataMeta, nilable: true},
"affects": {type: Affects, nilable: true},
"problemtype": {type: Problemtype, nilable: true},
"references": {type: References, nilable: true},
"description": {type: Description, nilable: true},
)
end
struct Configurations
struct Nodes
struct CPE
JSON.mapping(
"vulnerable": {type: Bool, nilable: true},
"cpe23Uri": {type: String, nilable: true},
)
end
JSON.mapping(
"operator": {type: String, nilable: true},
"cpe_match": {type: Array(CPE), nilable: true},
)
end
JSON.mapping(
"cve_data_version": {key: "CVE_data_version", type: String, nilable: true},
"nodes": {type: Array(Nodes), nilable: true},
)
end
struct Impact
struct BaseMetricV3
struct CvssV3
JSON.mapping(
"version": {type: String, nilable: true},
"vectorString": {type: String, nilable: true},
"attackVector": {type: String, nilable: true},
"attackComplexity": {type: String, nilable: true},
"privilegesRequired": {type: String, nilable: true},
"userInteraction": {type: String, nilable: true},
"scope": {type: String, nilable: true},
"confidentialityImpact": {type: String, nilable: true},
"integrityImpact": {type: String, nilable: true},
"availabilityImpact": {type: String, nilable: true},
"baseScore": {type: Float64, nilable: true},
"baseSeverity": {type: String, nilable: true},
)
end
JSON.mapping(
"cvssV3": {type: CvssV3, nilable: true},
"exploitabilityScore": {type: Float64, nilable: true},
"impactScore": {type: Float64, nilable: true},
)
end
struct BaseMetricV2
struct CvssV2
JSON.mapping(
"version": {type: String, nilable: true},
"vectorString": {type: String, nilable: true},
"accessVector": {type: String, nilable: true},
"accessComplexity": {type: String, nilable: true},
"authentication": {type: String, nilable: true},
"confidentialityImpact": {type: String, nilable: true},
"integrityImpact": {type: String, nilable: true},
"availabilityImpact": {type: String, nilable: true},
"baseScore": {type: Float64, nilable: true},
)
end
JSON.mapping(
"cvssV2": {type: CvssV2, nilable: true},
"severity": {type: String, nilable: true},
"exploitabilityScore": {type: Float64, nilable: true},
"impactScore": {type: Float64, nilable: true},
"acInsufInfo": {type: Bool, nilable: true},
"obtainAllPrivilege": {type: Bool, nilable: true},
"obtainUserPrivilege": {type: Bool, nilable: true},
"obtainOtherPrivilege": {type: Bool, nilable: true},
"userInteractionRequired": {type: Bool, nilable: true},
)
end
JSON.mapping(
"baseMetricV2": {type: BaseMetricV2, nilable: true},
"baseMetricV3": {type: BaseMetricV3, nilable: true},
)
end
JSON.mapping(
"cve": {type: CVE, nilable: true},
"configurations": {type: Configurations, nilable: true},
"impact": {type: Impact, nilable: true},
"publishedDate": {type: String, nilable: true},
"lastModifiedDate": {type: String, nilable: true},
)
end
JSON.mapping(
"cve_data_type": {key: "CVE_data_type", type: String, nilable: true},
"cve_data_format": {key: "CVE_data_format", type: String, nilable: true},
"cve_data_version": {key: "CVE_data_version", type: String, nilable: true},
"cve_data_numberofcves": {key: "CVE_data_numberOfCVEs", type: String, nilable: true},
"cve_data_timestamp": {key: "CVE_data_timestamp", type: String, nilable: true},
"cve_items": {key: "CVE_Items", type: Array(CVE_Items), nilable: true},
)
end
end

Improving the JSON parsing time

In Crystal's development mode, mapping JSON files (2018.json containing around 200MB of data) into an object took 30 seconds. In release mode it took 5 seconds. This performance was pretty good already but I'd like it to be faster as the datasets would get larger over time. The first thing I tried was changing the class to a struct (which is being used now). That had minimal impact on performance. Next I changed from using the File.open() method to the File.read() method which improved the file read speed and brought the parse time down to under a second. From this we learned that it's much faster to open a file in read mode, than in read and write mode. When writing code we know to only ask for read permissions except when we also need to write to it. There are probably more file optimizations we could try, although that's a topic of it's own.

Inserting the data into SQLite

Mapping JSON files onto an object was only half the challenge. I also needed an efficient way to insert data into a SQLite Database. In my first attempt, I tried iterating over the various arrays. This involved several individual queries and was slow. On my Macbook it took around five minutes. To make matters worse it never finished on a Linux Laptop. After asking for help, I learned that I could group these queries into one bulk transaction. I came up with the following code that ran in under a second.

require "json"
require "sqlite3"
require "./cve_data_entity.cr"
module MyProgram
VERSION = "0.1.0"
filepath = "./src/example-json-files/example-full-2019-dataset.json"
myobject = CVE_Data_Entity.from_json(File.read(filepath))
DB.open "sqlite3://./src/example-json-files/dbname.sqlt" do |db|
db.transaction do |tx|
tx.begin_transaction
myobject.try(&.cve_items).try(&.each do |item|
# Insert General Information into the Database
cve_item_id = item.try(&.cve).try(&.cve_data_meta).try(&.id) || "Not available"
data_type = item.try(&.cve).try(&.data_type) || "Not available"
data_format = item.try(&.cve).try(&.data_format) || "Not available"
data_version = item.try(&.cve).try(&.data_version) || "Not available"
published_date = item.try(&.publishedDate) || "Not available"
last_modified_date = item.try(&.lastModifiedDate) || "Not available"
tx.connection.exec("INSERT INTO GENERAL_INFORMATION (\"ID\", \"DATA_TYPE\", \"DATA_FORMAT\", \"DATA_VERSION\", \"PUBLISHDATE\", \"LASTMODIFIEDDATE\") VALUES (?, ?, ?, ?, ?, ?)", [cve_item_id, data_type, data_format, data_version, published_date, last_modified_date])
end)
tx.commit
end
end
end

Admittedly the .try() method calls can be a bit messy and we're looking into cleaner ways to write this type of code. One recommendation was to port the JSON .dig() method to my struct, in the future I might attempt that. Read about capturing blocks and procs in the Crystal Documentation if you're having trouble reading the above code. This method is not ideal when working with larger amounts of code. Although other than the readability issues, there was not a noticeable performance impact.

Conclusion

By writing a parser and database insertion logic in Crystal we were forced to use better coding practices. We learned about SQLite transactions, and saved about 14 hours on our database's build time. The database now builds in under a second. If you have a similar challenge in your organization, consider trying to solve it using Crystal. Consider sharing your story in a comment below.