How To Build a Command-Line JSON Splitter
Create a tool to split large JSON files
Getting Started
First things first, create a file named json-splitter.py. We'll need three modules from Python's standard library for this project, so there is nothing to install:
- os: responsible for reading the file size and extracting the file name from a path
- json: responsible for the decoding and encoding of JSON data
- math: necessary for calculating the number of files when splitting the JSON file
Our first step will be importing these three modules and printing a welcome message.

    import os
    import json
    import math

    print('Welcome to the JSON Splitter')
    print('First, enter the name of the file you want to split')
Requesting the JSON File
To split a JSON file we first need to ask the user for one. We'll use a try/except block to prompt the user for the file name and load the file, then check that the top-level value is a JSON Array. If any of these three steps fails, the script cannot continue and we exit.

    try:
        # request file name
        file_name = input('filename: ')
        file_size = os.path.getsize(file_name)
        # the with block closes the file once it has been parsed
        with open(file_name) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # OSError: missing or unreadable file; ValueError: invalid JSON
        print('Error loading JSON file ... exiting')
        exit()

    # the check sits outside the try block so its exit() is not caught
    if isinstance(data, list):
        data_len = len(data)
        print('Valid JSON file found')
    else:
        print("JSON is not an Array of Objects")
        exit()
Some highlights from the code above:
- The input() function is used to prompt the user for text input
- We use os.path.getsize() to get the file size, which is needed later when splitting
- The isinstance() function is used to make sure the JSON file is an Array, presumably of Objects
- If opening or parsing the file fails inside the try section, the except block will execute
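The loading and validation steps above can also be wrapped in a function, which makes them easier to test without typing at a prompt. This is a minimal sketch rather than part of the tutorial's script: load_json_array is a hypothetical helper name, and it assumes that raising ValueError for a non-Array file is acceptable to the caller.

```python
import json
import os


def load_json_array(file_name):
    """Return (data, file_size) for a file whose top-level value is a JSON Array.

    Hypothetical helper, not part of the original script.
    Raises OSError if the file cannot be read, and ValueError if the
    content is not valid JSON or is not an Array.
    """
    file_size = os.path.getsize(file_name)
    with open(file_name) as f:
        # json.JSONDecodeError is a subclass of ValueError
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError('JSON is not an Array')
    return data, file_size
```

A caller would wrap load_json_array() in a single try/except for OSError and ValueError, then print the error message and exit as the script does.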
Defining the Chunk Size
Now that the JSON file has been loaded into the variable data, it's time to find out how to split the file. We're going to split our file based on a maximum file size for each chunk. If the file is already smaller than the chunk size, there is nothing to split, so we'll tell the user and gracefully exit the script.

    # get numeric input
    try:
        mb_per_file = abs(float(input('Enter maximum file size (MB): ')))
        # a size of zero would cause a division by zero below
        if mb_per_file == 0:
            raise ValueError('size must be greater than zero')
    except ValueError:
        print('Error entering maximum file size ... exiting')
        exit()

    # check that file is larger than max size
    if file_size < mb_per_file * 1000000:
        print('File smaller than split size, exiting')
        exit()

    # determine number of files necessary
    num_files = math.ceil(file_size/(mb_per_file*1000000))
    print('File will be split into',num_files,'equal parts')
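The most important aspect of the above code is that we convert the String input to a Float (and take the absolute value for good measure). To finish up, we calculate, store and print the number of files that will be created based on the maximum file size. Keep in mind that the script divides the Array by element count, not by bytes, so each output file only approximates the maximum size.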
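To make the arithmetic concrete, here is a small sketch of the file-count calculation on its own. The function name count_files and the constant BYTES_PER_MB are assumptions introduced for illustration; the formula is the one used in the script.

```python
import math

BYTES_PER_MB = 1000000  # the script treats 1 MB as 1,000,000 bytes


def count_files(file_size, mb_per_file):
    """Number of output files for a file size in bytes and a maximum size in MB."""
    return math.ceil(file_size / (mb_per_file * BYTES_PER_MB))


print(count_files(25000000, 10))  # 3
print(count_files(20000000, 10))  # 2
print(count_files(20000001, 10))  # 3
```

Rounding up with math.ceil() means a file that is even one byte over a multiple of the maximum size gets an extra chunk.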
Converting JSON to a Tabled Structure
Okay, next is to set up the data structure for holding the split pieces and the cutoffs for each piece. We'll use a 2D Array, also known as an Array of Arrays. Think of a 2D Array as a spreadsheet with a Y-axis (number of nested arrays) and an X-axis (size of each nested array). We'll create the correct number of nested arrays by using a for loop with the calculated number of chunks. To find the cutoff points, let's divide the length of the JSON Array by the number of files and multiply by each chunk's position. Last, add the length to the end of the array of indices so the final chunk has an end point.

    # initialize 2D array
    split_data = [[] for i in range(0,num_files)]

    # determine indices of cutoffs in array
    starts = [math.floor(i * data_len/num_files) for i in range(0,num_files)]
    starts.append(data_len)
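A quick worked example may help show what the cutoff indices look like. The sketch below wraps the same expression in a hypothetical function, cutoff_indices, and runs it on a 10-element Array split into 3 files.

```python
import math


def cutoff_indices(data_len, num_files):
    """Start index of each chunk, followed by data_len as the final end point."""
    starts = [math.floor(i * data_len / num_files) for i in range(num_files)]
    starts.append(data_len)
    return starts


print(cutoff_indices(10, 3))  # [0, 3, 6, 10]
```

Chunk i covers the indices from starts[i] up to, but not including, starts[i+1], so here the three chunks hold 3, 3 and 4 elements. If the Array has fewer elements than there are files, some chunks come out empty.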
Looping Through Each Chunk
We're all set to slice up our array. For every chunk to create, we'll loop through the JSON array starting at the current cutoff index and stopping at the next cutoff index. As we complete each chunk, we'll create a file and write the chunk to it, then print a message saying that part is complete. Once the loop is done, a final message lets the user know the entire script has completed.

    # loop through 2D array
    for i in range(0,num_files):
        # loop through each range in array
        for n in range(starts[i],starts[i+1]):
            split_data[i].append(data[n])

        # create file when section is complete
        name = os.path.basename(file_name).split('.')[0] + '_' + str(i+1) + '.json'
        with open(name, 'w') as outfile:
            json.dump(split_data[i], outfile)

        print('Part',str(i+1),'... completed')

    print('Success! Script Completed')
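As an alternative to the nested loop, Python's list slicing can pull out each chunk in one step. The following is a sketch under a few assumptions, not the script from the repository: write_chunks is a hypothetical function name, and the out_dir parameter is added here only so the output location can be chosen.

```python
import json
import math
import os


def write_chunks(data, file_name, num_files, out_dir='.'):
    """Write data to num_files JSON files and return the list of paths.

    Hypothetical helper; out_dir is an added parameter for illustration.
    """
    data_len = len(data)
    starts = [math.floor(i * data_len / num_files) for i in range(num_files)]
    starts.append(data_len)

    # same naming scheme as the script: <base>_<part number>.json
    base = os.path.basename(file_name).split('.')[0]
    paths = []
    for i in range(num_files):
        # a slice replaces the inner loop and the 2D array
        chunk = data[starts[i]:starts[i + 1]]
        path = os.path.join(out_dir, base + '_' + str(i + 1) + '.json')
        with open(path, 'w') as outfile:
            json.dump(chunk, outfile)
        paths.append(path)
    return paths
```

Because each slice is written as soon as it is created, this version does not need to hold the split_data structure in memory alongside the original Array.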
Hope you enjoyed the tutorial! Remember to check out the full script on GitHub: https://github.com/jhsu98/json-splitter. If you have any questions please let me know.