
List:       python-list
Subject:    Re: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
From:       "hongy... () gmail ! com" <hongyi ! zhao () gmail ! com>
Date:       2021-09-30 3:53:44
Message-ID: aa18ac06-02d2-4b26-976a-0f38ab8b5c6cn () googlegroups ! com

On Thursday, September 30, 2021 at 9:20:37 AM UTC+8, hongy...@gmail.com wrote:
> On Thursday, September 30, 2021 at 5:20:04 AM UTC+8, Peter J. Holzer wrote: 
> > On 2021-09-29 01:22:03 -0700, hongy...@gmail.com wrote: 
> > > I tried to convert an xls file to CSV with the following command, but it failed:
> > > 
> > > $ in2csv --sheet 'Sheet1' 2021-2022-1.xls 
> > > XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> > > The test file is located here [1].
> > > 
> > > [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls 
> > Why is that file name .xls when it's obviously an HTML file?
> Good catch! Thank you for pointing this out. This file is automatically exported
> from my university's teaching management system, and it was assigned the .xls
> extension by default.

Following the comment above, once I change the extension to .html, the following
Python code does the trick:


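import sys
import pandas as pd

if len(sys.argv) != 2:
    print('Usage: ' + sys.argv[0] + ' input-file')
    sys.exit(1)

# read_html returns a list of DataFrames, one per <table> in the file.
myhtml_pd = pd.read_html(sys.argv[1])
#In [25]: len(myhtml_pd)
#Out[25]: 3

# Work on the third table; skip its first row and its first two columns,
# and print every remaining cell that is not empty.
table = myhtml_pd[2]
for i in table.index:
    if i > 0:
        for j in table.columns:
            if j > 1 and not pd.isnull(table.loc[i, j]):
                print(table.loc[i, j])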
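
To check up front whether a file with an .xls extension is really HTML, as it was here, one can look at its leading bytes before choosing a reader. The following is only a rough sketch: the names sniff_format and sniff_file and the labels they return are made up for illustration, and the HTML test is a simple heuristic, not a real parser.

```python
# Sketch: guess a file's real format from its leading bytes.
# sniff_format, sniff_file and the returned labels are hypothetical names.

OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # container used by binary .xls
ZIP_MAGIC = b'PK\x03\x04'                          # .xlsx files are ZIP archives


def sniff_format(head: bytes) -> str:
    """Return 'xls', 'xlsx-or-zip', 'html' or 'unknown' for the leading bytes."""
    if head.startswith(OLE2_MAGIC):
        return 'xls'
    if head.startswith(ZIP_MAGIC):
        return 'xlsx-or-zip'
    # Ignore leading whitespace such as the b'\r\n\r\n\r\n\r\n' from the error.
    text = head.lstrip().lower()
    if text.startswith((b'<!doctype html', b'<html', b'<table')) or b'<html' in text:
        return 'html'
    return 'unknown'


def sniff_file(path: str, size: int = 2048) -> str:
    """Read the first `size` bytes of `path` and classify them."""
    with open(path, 'rb') as f:
        return sniff_format(f.read(size))
```

If sniff_file reports 'html', pd.read_html is the reader to try; if it reports 'xls', tools built on xlrd such as in2csv should be able to open the file.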

HZ
-- 
https://mail.python.org/mailman/listinfo/python-list

