Capturing > 99% of DXs in ICD 10 CM.

by Gilbert Keith

Here’s basic code that will traverse the ICD 10 CM codeset (you’ll need the XML file from here). This has only been tested with the D index.

Update the filename and workfile variables with the right file paths.

#attempt to import cElementTree before importing ElementTree
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Method to write to file
# Traversing the dict structure coming out of parse_term(term)
# and writing it to the outputfile.
# See the parse_term method for details on the exception in the else
# statement.
# If you run this on the whole XML file for the D index, you should see
# the following output to stdout.
# * None F45.41
# * None A02.0
# * None A02.0
# * None T75.4
# * None D69.59

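def write_to_file(f, dx, prevtitle):
    # dx holds exactly one {title: list_of_data} pair
    title = list(dx.keys())[0]

    for val in dx[title]:
        if val is None:
            # term is too general to have a code of its own
            string = prevtitle + "\\" + title + "\t" + "None" + "\n"
            f.write(string.encode('utf-8'))
        elif isinstance(val, dict):
            # subterm: recurse, using the path built so far as the prefix
            string = prevtitle + "\\" + title
            write_to_file(f, val, string)
        else:
            try:
                string = prevtitle + "\\" + title + "\t" + str(val) + "\n"
            except TypeError:
                # title is None (see parse_term); report it and skip the
                # write so a stale line is not written a second time
                print("%s %s" % (title, val))
                continue
            f.write(string.encode('utf-8'))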

# Method to parse a term
# Assumptions:
# * Every element passed in here will have a <title>...</title>
# * Not every element passed in here will have a <code>...</code>
# * Elements passed in here could have a <term>...</term>
# ** These are essentially more specific additions to the general term
# ** and there are several possible layers of hierarchy
# Because I am lazy and don't want to write code to accommodate every
# level of specificity, I've only written one method which is recursively
# called if more specific data is encountered. So far this works.
# The format of dict that is returned:
# * For all terms: {title: list_of_data}
# * list_of_data can have:
# ** a NoneType value. 
# *** i.e. a phrase which doesn't have a code assigned because it is too general
# ** a Str value.
# *** i.e. a code which corresponds to a title
# ** a dict.
# *** this is further divided into {title: list_of_data} (i.e. recursive)
# What needs work:
# * I currently ignore any tag that is not 'title', 'code', or 'term'
# ** I could be looking at the 'nemod' and 'see also' tags for synonyms
# ** and extra specifying information
# * Currently noticing an issue where a tag formatted thusly:
# ** <foo><bar>barvalue</bar>foovalue</foo>
# ** doesn't get parsed well. The element ends up with a NoneType
# ** for its value. The XML validates ok on W3C, but something is obviously
# ** wrong.
# *** example:
# ***      <term>
# ***        <title><nemod>(massive)</nemod> blood transfusion</title>
# ***        <code>D69.59</code>
# ***      </term>
# ** Right now, I am just catching the error when printing it out
# ** (that's when I noticed it; Python allows a dict to have None as a key)

def parse_term(term):
    ret_dict={}
    termtitle=term.findall('title')[0].text
    try:
        ret_dict[termtitle]=[term.findall('code')[0].text]        
    except IndexError:
        ret_dict[termtitle]=[None]
    for subterm in term.findall('term'):
        ret_dict[termtitle].append(parse_term(subterm))
    return ret_dict

filename='/Users/gk/Desktop/iHF/icd_10_xml/icd_10_cm_2014_D.xml'
icd_10_etree = ET.parse(filename) #parse into icd_10_etree documentwrapper
rootnode=icd_10_etree.getroot() #get the tree
workfile = 'ihftesting/workfile.txt' #output file
f = open(workfile, 'wb') #open the file in binary mode; write_to_file writes UTF-8 encoded bytes
dx="" #initialize dx; we'll use this per tag
#################################
# The basic division of the file is:
#
# <letter>
#   ...
#   <mainTerm>...</mainTerm>
#   <mainTerm>
#     <term>...</term>
#   </mainTerm>
# </letter>
#
##################################
# We parse out the mainTerm elements and pass them into parse_term()
for letter in rootnode.findall('letter'):
    for mainterm in letter.findall('mainTerm'):
        dx = parse_term(mainterm)
        write_to_file(f,dx,"")

f.close()

There are only 5 codes that I haven’t been able to parse because of the exception I mentioned. Not bad for only 2 days’ worth of work. I left the backslashes in to show exactly how each line was constructed, which makes reverse lookup for verification easy.
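
A likely explanation for those five misses is how ElementTree stores mixed content: an element's .text holds only the text before its first child, and whatever follows the child is stored in that child's .tail. When a title starts with a nemod tag, .text is therefore None. The sketch below is one possible workaround, not something tested against the full XML file. The full_title helper and the SAMPLE fragment are hypothetical, and the fragment's layout is an assumption modelled on the example in the comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the problem term; the exact layout
# of the real file is an assumption.
SAMPLE = (
    "<term>"
    "<title><nemod>(massive)</nemod> blood transfusion</title>"
    "<code>D69.59</code>"
    "</term>"
)

def full_title(term):
    """Return the whole title text, including text inside child tags."""
    title = term.find('title')
    if title is None:
        return None
    # itertext() yields the element's text plus every descendant's
    # text and tail, in document order
    return "".join(title.itertext()).strip()

term = ET.fromstring(SAMPLE)
print(repr(term.find('title').text))  # .text stops at the first child
print(full_title(term))
```

Using a helper like this in place of the .text lookup in parse_term would keep the nemod wording as part of the title. Whether it should be kept or dropped is a separate decision.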

This is roughly what the output looks like. You can find the entire output here.

\Aarskog's syndrome    Q87.1
\Abandonment    None
\Abasia    F44.4
\Abderhalden-Kaufmann-Lignac syndrome    E72.04
\Abdomen, abdominal    None
\Abdomen, abdominal\acute    R10.0
\Abdomen, abdominal\angina    K55.1
\Abdomen, abdominal\muscle deficiency syndrome    Q79.4
\Abdominalgia    None
\Abduction contracture, hip or other joint    None
\Aberrant    None
\Aberrant\adrenal gland    Q89.1
\Aberrant\artery    Q27.8
\Aberrant\artery\basilar NEC    Q28.1
\Aberrant\artery\cerebral    Q28.3
\Aberrant\artery\coronary    Q24.5
\Aberrant\artery\digestive system    Q27.8
\Aberrant\artery\eye    Q15.8
\Aberrant\artery\lower limb    Q27.8
\Aberrant\artery\precerebral    Q28.1
\Aberrant\artery\pulmonary    Q25.79
\Aberrant\artery\renal    Q27.2
\Aberrant\artery\retina    Q14.1
\Aberrant\artery\specified site NEC    Q27.8
\Aberrant\artery\subclavian    Q27.8
\Aberrant\artery\upper limb    Q27.8
\Aberrant\artery\vertebral    Q28.1
\Aberrant\breast    Q83.8
\Aberrant\endocrine gland NEC    Q89.2
\Aberrant\hepatic duct    Q44.5
\Aberrant\pancreas    Q45.3
\Aberrant\parathyroid gland    Q89.2
\Aberrant\pituitary gland    Q89.2
\Aberrant\sebaceous glands, mucous membrane, mouth, congenital    Q38.6
\Aberrant\spleen    Q89.09
\Aberrant\subclavian artery    Q27.8
\Aberrant\thymus    Q89.2
\Aberrant\thyroid gland    Q89.2
\Aberrant\vein    Q27.8
\Aberrant\vein\cerebral    Q28.3
\Aberrant\vein\digestive system    Q27.8
\Aberrant\vein\lower limb    Q27.8
\Aberrant\vein\precerebral    Q28.1
\Aberrant\vein\specified site NEC    Q27.8
\Aberrant\vein\upper limb    Q27.8
\Aberration    None
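
Lines in this format can be read back for the reverse lookup mentioned earlier. This is a minimal sketch, assuming the path and the code are separated by a tab (as write_to_file writes them) and that no title contains a backslash. The names parse_line and build_lookup are made up for illustration.

```python
def parse_line(line):
    """Split one output line into (list of path parts, code or None)."""
    path, code = line.rstrip("\n").rsplit("\t", 1)
    parts = path.lstrip("\\").split("\\")
    return parts, (None if code == "None" else code)

def build_lookup(lines):
    """Map each code to the list of term paths that produced it."""
    lookup = {}
    for line in lines:
        parts, code = parse_line(line)
        if code is not None:
            lookup.setdefault(code, []).append(parts)
    return lookup

# A few lines taken from the listing above, with the tab restored
sample = [
    "\\Aberrant\tNone\n",
    "\\Aberrant\\artery\tQ27.8\n",
    "\\Aberrant\\artery\\basilar NEC\tQ28.1\n",
    "\\Aberrant\\artery\\precerebral\tQ28.1\n",
]
lookup = build_lookup(sample)
print(lookup["Q28.1"])
```

Because one code can appear under several paths, each code maps to a list of paths rather than a single one.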

Here is the entire output.
