Crystal Lang: Mapping JSON files in under a second

Crystal Language Logo
Logo of the Crystal Language

Crystal’s performance is useful in IO intense tasks such as mapping JSON files into an object. Recently I worked on a project with my friend David Colombo. David needed to take JSON files, map the data onto an object, and insert it into a SQLite Database. This post describes mapping JSON files in Crystal.

Background

Previously, David had created a parser in NodeJS that reads through the JSON files containing structured data and inserts the data into an SQLite database. These JSON Files contained 1,000s of keys, and are 100s of megabytes each. Unfortunately due to limitations of the language and some anti-practices that JavaScript allows, the script took 14 hours to complete it’s task. Mapping JSON files this way is too slow. David attempted to write a parser in Python. The parser was limited to 9999 keys and couldn’t meet the project’s requirements. He sent me a message asking if I was still working with Ruby and asked if it’d be any faster. I said that while I’m able to write a script in Ruby just fine, I had been experimenting with a language called Crystal (a language with Ruby-like syntax and C-like performance) and asked if he’d be open to trying it instead of Ruby. I also wasn’t able to predict the performance of Ruby ahead of time and was not prepared to provide an answer on whether Ruby would be faster. Since I’ve been learning Crystal I decided to give the rewrite in it a shot.

A few requirements

We want the code to be long lived and require minimal changes to the project’s dependencies. We also wanted to avoid using third party libraries, only using Crystal’s standard library was a goal. The standard library does not support databases. We decided to use the official shard for SQLite. It’s maintained by the Crystal Core Team so it will probably be well maintained.

An object to represent the JSON Files

To maximize performance I decided to use a nested struct to contain the data using Crystal’s JSON.mapping() (https://crystal-lang.org/api/0.28.0/JSON.html#mapping). The data won’t be changed at runtime. Data is only copied into SQLite. Stack memory is cheaper than heap memory so the drawbacks of structs were worth it. In exchange we gained performance benefits while mapping JSON files.

module MyProgram
  extend self
  struct CVE_Data_Entity
    struct CVE_Items
      struct CVE
        struct DataMeta
          JSON.mapping(
            "id": {key: "ID", type: String, nilable: true},
            "assigner": {key: "ASSIGNER", type: String, nilable: true}
          )
        end
        struct Affects
          struct Vendor
            struct VendorData
              struct Product
                struct Data
                  struct Version
                    struct Data
                      JSON.mapping(
                        "version_value": {type: String, nilable: true},
                        "version_affected": {type: String, nilable: true},
                      )
                    end
                    JSON.mapping(
                      "version_data": {type: Array(Data), nilable: true},
                    )
                  end
                  JSON.mapping(
                    "product_name": {type: String, nilable: true},
                    "version": {type: Version, nilable: true}
                  )
                end
                JSON.mapping(
                  "product_data": {type: Array(Data), nilable: true},
                )
              end
              JSON.mapping(
                "vendor_name": {type: String, nilable: true},
                "product": {type: Product, nilable: true},
              )
            end
            JSON.mapping(
              "vendor_data": {type: Array(VendorData), nilable: true},
            )
          end
          JSON.mapping(
            "vendor": {type: Vendor, nilable: true},
          )
        end
        struct Problemtype
          struct Data
            struct Description
              JSON.mapping(
                "lang": {type: String, nilable: true},
                "value": {type: String, nilable: true}
              )
            end
            JSON.mapping(
              "description": {type: Array(Description), nilable: true}
            )
          end
          JSON.mapping(
            "problemtype_data": {type: Array(Data), nilable: true},
          )
        end
        struct References
          struct Data
            JSON.mapping(
              "url": {type: String, nilable: true},
              "name": {type: String, nilable: true},
              "refsource": {type: String, nilable: true},
              "tags": {type: Array(String), nilable: true},
            )
          end
          JSON.mapping(
            "reference_data": {type: Array(Data), nilable: true},
          )
        end
        struct Description
          struct Data
            JSON.mapping(
              "lang": {type: String, nilable: true},
              "value": {type: String, nilable: true},
            )
          end
          JSON.mapping(
            "description_data": {type: Array(Data), nilable: true}
          )
        end
        JSON.mapping(
          "data_type": {type: String, nilable: true},
          "data_format": {type: String, nilable: true},
          "data_version": {type: String, nilable: true},
          "cve_data_meta": {key: "CVE_data_meta", type: DataMeta, nilable: true},
          "affects": {type: Affects, nilable: true},
          "problemtype": {type: Problemtype, nilable: true},
          "references": {type: References, nilable: true},
          "description": {type: Description, nilable: true},
        )
      end
      struct Configurations
        struct Nodes
          struct CPE
            JSON.mapping(
              "vulnerable": {type: Bool, nilable: true},
              "cpe23Uri": {type: String, nilable: true},
            )
          end
          JSON.mapping(
            "operator": {type: String, nilable: true},
            "cpe_match": {type: Array(CPE), nilable: true},
          )
        end
        JSON.mapping(
          "cve_data_version": {key: "CVE_data_version", type: String, nilable: true},
          "nodes": {type: Array(Nodes), nilable: true},
        )
      end
      struct Impact
        struct BaseMetricV3
          struct CvssV3
            JSON.mapping(
              "version": {type: String, nilable: true},
              "vectorString": {type: String, nilable: true},
              "attackVector": {type: String, nilable: true},
              "attackComplexity": {type: String, nilable: true},
              "privilegesRequired": {type: String, nilable: true},
              "userInteraction": {type: String, nilable: true},
              "scope": {type: String, nilable: true},
              "confidentialityImpact": {type: String, nilable: true},
              "integrityImpact": {type: String, nilable: true},
              "availabilityImpact": {type: String, nilable: true},
              "baseScore": {type: Float64, nilable: true},
              "baseSeverity": {type: String, nilable: true},
            )
          end
          JSON.mapping(
            "cvssV3": {type: CvssV3, nilable: true},
            "exploitabilityScore": {type: Float64, nilable: true},
            "impactScore": {type: Float64, nilable: true},
          )
        end
        struct BaseMetricV2
          struct CvssV2
            JSON.mapping(
              "version": {type: String, nilable: true},
              "vectorString": {type: String, nilable: true},
              "accessVector": {type: String, nilable: true},
              "accessComplexity": {type: String, nilable: true},
              "authentication": {type: String, nilable: true},
              "confidentialityImpact": {type: String, nilable: true},
              "integrityImpact": {type: String, nilable: true},
              "availabilityImpact": {type: String, nilable: true},
              "baseScore": {type: Float64, nilable: true},
            )
          end
          JSON.mapping(
            "cvssV2": {type: CvssV2, nilable: true},
            "severity": {type: String, nilable: true},
            "exploitabilityScore": {type: Float64, nilable: true},
            "impactScore": {type: Float64, nilable: true},
            "acInsufInfo": {type: Bool, nilable: true},
            "obtainAllPrivilege": {type: Bool, nilable: true},
            "obtainUserPrivilege": {type: Bool, nilable: true},
            "obtainOtherPrivilege": {type: Bool, nilable: true},
            "userInteractionRequired": {type: Bool, nilable: true},
          )
        end
        JSON.mapping(
          "baseMetricV2": {type: BaseMetricV2, nilable: true},
          "baseMetricV3": {type: BaseMetricV3, nilable: true},
        )
      end
      JSON.mapping(
        "cve": {type: CVE, nilable: true},
        "configurations": {type: Configurations, nilable: true},
        "impact": {type: Impact, nilable: true},
        "publishedDate": {type: String, nilable: true},
        "lastModifiedDate": {type: String, nilable: true},
      )
    end
    JSON.mapping(
      "cve_data_type": {key: "CVE_data_type", type: String, nilable: true},
      "cve_data_format": {key: "CVE_data_format", type: String, nilable: true},
      "cve_data_version": {key: "CVE_data_version", type: String, nilable: true},
      "cve_data_numberofcves": {key: "CVE_data_numberOfCVEs", type: String, nilable: true},
      "cve_data_timestamp": {key: "CVE_data_timestamp", type: String, nilable: true},
      "cve_items": {key: "CVE_Items", type: Array(CVE_Items), nilable: true},
    )
  end
end

Improving the JSON parsing time

In Crystal’s development mode, mapping JSON files (2018.json containing around 200MB of data) into an object took 30 seconds. In release mode it took 5 seconds. This performance was pretty good already but I’d like it to be faster as the datasets would get larger over time. The first thing I tried was changing the class to a struct (which is being used now). That had minimal impact on performance. Next I changed from using the File.open() method to the File.read() method which improved the file read speed and brought the parse time down to under a second. From this we learned that it’s much faster to open a file in read mode, than in read and write mode. When writing code we know to only ask for read permissions except when we also need to write to it. There are probably more file optimizations we could try, although that’s a topic of it’s own.

Inserting the data into SQLite

Mapping JSON files onto an object was only half the challenge. I also needed an efficient way to insert data into a SQLite Database. In my first attempt, I tried iterating over the various arrays. This involved several individual queries and was slow. On my Macbook it took around five minutes. To make matters worse it never finished on a Linux Laptop. After asking for help, I learned that I could group these queries into one bulk transaction. I came up with the following code that ran in under a second.

require "json"
require "sqlite3"
require "./cve_data_entity.cr"
module MyProgram
  VERSION = "0.1.0"
  filepath = "./src/example-json-files/example-full-2019-dataset.json"
  myobject = CVE_Data_Entity.from_json(File.read(filepath))
  DB.open "sqlite3://./src/example-json-files/dbname.sqlt" do |db|
    db.transaction do |tx|
      tx.begin_transaction
      myobject.try(&.cve_items).try(&.each do |item|
        # Insert General Information into the Database
        cve_item_id = item.try(&.cve).try(&.cve_data_meta).try(&.id) || "Not available"
        data_type = item.try(&.cve).try(&.data_type) || "Not available"
        data_format = item.try(&.cve).try(&.data_format) || "Not available"
        data_version = item.try(&.cve).try(&.data_version) || "Not available"
        published_date = item.try(&.publishedDate) || "Not available"
        last_modified_date = item.try(&.lastModifiedDate) || "Not available"
        tx.connection.exec("INSERT INTO GENERAL_INFORMATION (\"ID\", \"DATA_TYPE\", \"DATA_FORMAT\", \"DATA_VERSION\", \"PUBLISHDATE\", \"LASTMODIFIEDDATE\") VALUES (?, ?, ?, ?, ?, ?)", [cve_item_id, data_type, data_format, data_version, published_date, last_modified_date])
      end)
      tx.commit
    end
  end
end

Admittedly the .try() method calls can be a bit messy and we’re looking into cleaner ways to write this type of code. One recommendation was to port the JSON .dig() method to my struct, in the future I might attempt that. Read about capturing blocks and procs in the Crystal Documentation if you’re having trouble reading the above code. This method is not ideal when working with larger amounts of code. Although other than the readability issues, there was not a noticeable performance impact.

Conclusion

By writing a parser and database insertion logic in Crystal we were forced to use better coding practices. We learned about SQLite transactions, and saved about 14 hours on our database’s build time. The database now builds in under a second. If you have a similar challenge in your organization, consider trying to solve it using Crystal. Consider sharing your story in a comment below.

%d bloggers like this: