Handling non-standard CSV encoding and dialect
Most CSV data is now encoded in a standard Unicode format such as UTF-8, which is what Python typically uses by default. Occasionally, however, you may come across a data file with an older or more obscure encoding. To read and process such a file correctly, you need to specify the encoding in the call to the open() function that creates the file object. The pandas.read_csv() function also accepts a non-standard encoding. I've included a link to the list of encodings accepted by Python in the Links and Further Reading document in the external resources.
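As a minimal sketch of the idea above, the example below writes a small Latin-1 encoded file and reads it back by passing encoding="latin-1" to open(). The helper name read_csv_rows, the sample data, and the choice of Latin-1 are assumptions made for illustration; substitute whatever encoding your file actually uses.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf-8"):
    """Return every row of a CSV file, decoding it with the given encoding.

    read_csv_rows is a hypothetical helper written for this example.
    """
    # newline="" is the setting the csv module documentation recommends
    # for file objects handed to csv.reader().
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


# Create a small sample file encoded as Latin-1 (illustrative data only).
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write("name,city\nJosé,Málaga\n".encode("latin-1"))

# Reading succeeds because the encoding matches how the bytes were written.
rows = read_csv_rows(sample_path, encoding="latin-1")
print(rows)

# Clean up the temporary sample file.
os.remove(sample_path)
```

With pandas, the equivalent would be to pass the same name through the encoding parameter, for example pandas.read_csv(path, encoding="latin-1").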
There may also be variations in the delimiter (the character used to separate values), the newline character used to indicate the end of a line, and a few other formatting attributes. These variations are collectively referred to as the CSV dialect. Both the pandas.read_csv() function and the csv.reader() function have parameters that allow you to specify variations in the...
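To make the notion of a dialect concrete, here is a small sketch that parses semicolon-separated text whose fields are quoted with single quotes, using the delimiter and quotechar parameters of csv.reader(). The helper name parse_delimited and the sample text are assumptions made for illustration, not part of the original material.

```python
import csv
import io


def parse_delimited(text, delimiter=",", quotechar='"'):
    """Parse CSV-like text using a caller-supplied delimiter and quote character.

    parse_delimited is a hypothetical helper written for this example.
    """
    # io.StringIO lets csv.reader() treat an in-memory string like a file.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter=delimiter, quotechar=quotechar))


# Illustrative data: semicolons separate values, single quotes wrap a field
# that itself contains a semicolon.
sample = "name;note\nAda;'likes; semicolons'\n"

rows = parse_delimited(sample, delimiter=";", quotechar="'")
print(rows)
```

pandas.read_csv() exposes comparable options, such as sep and quotechar, so the same file could be loaded with pandas.read_csv(path, sep=";", quotechar="'").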