---
title: Initial parsing work on large JSON corpus
layout: post
---

Yesterday, I wrote of a challenge that I faced in working out which texts in a corpus have decent OCR and, then, which texts they actually are. This morning, I put together a small script that has a first go at this. I include it below for anybody who is interested. The basic steps are:

1. Read input from CSV and build a list of titles and identifiers (stripping all punctuation from the titles and limiting each to its first five words).
2. Sequentially read in the JSON files from disk, performing the same stripping transform on the first ten elements of each file.
3. See if the transformed title is in the transformed first ten elements of the JSON file.

This currently yields me about 15 good titles out of every 1,000 JSON files. That said, these JSON files are not all in English, and many of them have bad OCR. In any case, I'll continue to refine this and expand the filters as safely as I can.

    # coding=UTF-8
    import csv
    import re
    import glob
    import json


    def load_csv():
        titles = {}

        # load the CSV file of titles
        with open('/home/martin/Mounts/THREETB/Corpus/book-list.csv', 'rb') as csvfile:
            csv_file = csv.reader(csvfile, delimiter=',', quotechar='"')

            # iterate over the CSV
            for row in csv_file:
                # extract a potential title and substitute out all punctuation
                titles[row[0]] = re.sub('[\.\?\(\)\]\[,;:\'!\*!“]', '', re.sub('\[.+?\]', '', row[7])).lower().replace('  ', ' ').strip()

                # remove any titles here that are either blank or less than three words long or less than 6 chars total
                if titles[row[0]] == '' or len(titles[row[0]].split(' ')) < 3 or len(titles[row[0]]) < 6:
                    del titles[row[0]]

                if row[0] in titles:
                    try:
                        # shorten title to first five words
                        titles[row[0]] = ' '.join(titles[row[0]].split(' ')[0:5])
                    except IndexError:
                        # title was short
                        pass

        return titles


    def parse_json(folder, titles):
        directory_to_parse = '/home/martin/Mounts/THREETB/Corpus/json/{0}'.format(folder)
        jsons = glob.glob('{0}/*.json'.format(directory_to_parse))

        ret = {}
        file_counter = 0

        for json_file in jsons:
            with open(json_file, 'rb') as json_file_handle:
                file_counter += 1

                if file_counter == 1000:
                    print "Processed 1,000 JSON files"
                    file_counter = 0

                loaded_json = json.load(json_file_handle)

                # check the first ten entries of the JSON
                try:
                    for x in range(0, 10):
                        subbed_text = re.sub('[\.\?\(\)\]\[,;:\'!\*!“]', '', re.sub('\[.+?\]', '', str(loaded_json[x]))).lower().replace('  ', ' ').strip()
                        should_break = False

                        for key, title in titles.iteritems():
                            if title in subbed_text:
                                print '[{0}] {1}: {2}'.format(key, title, json_file)
                                ret[key] = json_file
                                should_break = True

                                # remove the key to avoid duplicates
                                del titles[key]
                                break

                        if should_break:
                            break
                except IndexError:
                    # if we arrive here it's a short JSON
                    pass
                except:
                    pass

        return ret


    if __name__ == '__main__':
        titles = load_csv()

        for folder_name in range(0, 25):
            parse_json(str(folder_name).zfill(4), titles)
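
The three steps listed above can also be tried on in-memory data, without the CSV or the JSON files on disk. What follows is a minimal Python 3 sketch under stated assumptions, not the script itself: `normalise`, `match_title`, and the sample title and page strings are hypothetical names and data chosen for illustration, and the sketch collapses whitespace with `split()` rather than replacing double spaces.

```python
import re

# Patterns modelled on the script's two re.sub calls.
BRACKETED = re.compile(r"\[.+?\]")
PUNCTUATION = re.compile(r"[.?()\]\[,;:'!*“]")


def normalise(text, max_words=None):
    """Drop bracketed notes and punctuation, lowercase, collapse whitespace."""
    stripped = PUNCTUATION.sub('', BRACKETED.sub('', text)).lower()
    words = stripped.split()
    if max_words is not None:
        words = words[:max_words]
    return ' '.join(words)


def match_title(titles, elements, limit=10):
    """Return the identifier of the first title found in the first
    `limit` elements, or None if nothing matches."""
    for element in elements[:limit]:
        candidate = normalise(str(element))
        for key, title in titles.items():
            if title in candidate:
                return key
    return None


# Hypothetical sample data, not taken from the real corpus.
titles = {
    '000123': normalise('The History of the Decline [microform].', max_words=5),
}
pages = ['[blank]', 'THE HISTORY OF THE DECLINE, and fall: vol. 1']

print(match_title(titles, pages))  # prints 000123
```

Returning the key instead of deleting it from `titles` inside the loop leaves the choice of how to avoid duplicate matches to the caller.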