List:       coreutils
Subject:    Re: BUG in sort --numeric-sort --unique
From:       "Kaz Kylheku (Coreutils)" <962-396-1872 () kylheku ! com>
Date:       2020-02-13 23:32:35
Message-ID: 9d85bba401c2b15a5d18a846e01dbc9b () mail ! kylheku ! com

On 2020-02-13 14:00, Stefano Pederzani wrote:
> In fact, separating the parameters:
> # cat controllareARCHIVIO_2020/02/controllare20200213.txt | sort -u |
> sort -n | wc -l
> 1262
> we workaround the bug.

My own experiment confirms that the behavior is reasonable.

When -n and -u are combined, then uniqueness is based on numeric
equivalence. Since numeric equivalence is weaker, de-duplication
based on numeric equivalence can cull out more records than
de-duplication based on textual equivalence.

$ printf "0\n00\n000\n" | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -n
0
00
000
$ printf "0\n00\n000\n" | sort -nu
0
$ printf "0\n00\n000\n" | sort -n | sort -u
0
00
000
$ printf "0\n00\n000\n" | sort -u | sort -n
0
00
000

As you can see, sort -nu is not equivalent to any combination
of sort -n and sort -u.   sort -nu has de-duplicated a file of
different "spellings" of zero down to a single entry.

sort -u cannot de-duplicate these entries because "0"
is textually different from "00".

> Every line is only something like "1.2.3.4".

Unfortunately, "sort -n" will probably not do what you think with
this data.

Please read sort's GNU Info documentation; the man page lacks
detail about what numeric sorting means.

Also see the POSIX standard's description of -n:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

In short, what -n does is recognize a *prefix* of each line as a number
according to a pattern that includes optional blanks, an optional sign,
digits, a radix character, and digit group separators.

-n does not deal with compound numeric identifiers like 1.2.3.4.

Basically 1.2.3.4 and 1.2.4.4 both look like the number 1.2.
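
One way to see which lines -n treats as equal, sketched under the
assumption of GNU sort in the C locale: with -s (stable), the last-resort
byte comparison is turned off, so lines whose numeric keys compare equal
stay in input order. The function name same_numeric_key is made up for
this illustration.

```shell
# Assumes GNU sort; LC_ALL=C makes "." the radix character and
# disables digit grouping, so only the "1.2" prefix is the key.
same_numeric_key() {
    # -s keeps lines with equal -n keys in their input order.
    printf '%s\n' "$@" | LC_ALL=C sort -n -s
}

same_numeric_key 1.2.9.9 1.2.3.4 1.10.0.0
```

Here 1.2.9.9 stays ahead of 1.2.3.4 because both are just 1.2 to -n,
while 1.10.0.0 comes first because 1.10 is numerically less than 1.2.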

$ sort -nu
1.2.3.4
1.2.4.4
1.2.5.6
[Ctrl-D][Enter]
1.2.3.4

Oops! This result is correct; under numeric sort (-n), all these lines
are considered to have the key 1.2.  And if we de-duplicate based on
that, they are all considered to be duplicates; they de-duplicate down
to a single line.
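
If the goal is to order and de-duplicate such dotted identifiers
component by component, one possible approach is to give sort a
separate numeric key for each field. This is only a sketch: it assumes
every line has exactly four dot-separated decimal fields, and the
function name sort_dotted is hypothetical.

```shell
# Assumes exactly four dot-separated decimal fields per line.
sort_dotted() {
    # -t . splits fields on "."; each -k N,Nn compares one field
    # numerically; -u drops lines whose four keys all compare equal.
    LC_ALL=C sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n -u
}

printf '1.2.4.4\n1.2.3.4\n1.2.5.6\n1.2.3.4\n10.0.0.1\n9.0.0.1\n' | sort_dotted
```

With the keys spelled out this way, 1.2.3.4 and 1.2.4.4 are distinct,
and 9.0.0.1 sorts before 10.0.0.1. Note that -u still works on the
keys, so 01.2.3.4 and 1.2.3.4 would count as duplicates.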





