Curves correspond to the box-counting results for these two words in the original and shuffled text. The area corresponding to CELL is larger than that for HYBRID; CELL is more important than HYBRID in the book The Origin of Species.
doi:10.1371/journal.pone.0130617.g

Evaluation of Our Method

The best way to evaluate the efficiency of our approach to keyword detection is to compare its results with those of other methods. We use two metrics in this comparison: precision and recall. These tell us to what extent the retrieved list of keywords conforms to the manually selected list, as described in the previous section. In this work, we compare our method with two efficient keyword-extraction methods, C Value [14] and Entropy [17]. These methods were selected on the basis of our experience: we found that C Value has the highest recall and Entropy the highest precision among the other methods [18, 23] (these methods are reviewed in further detail in the appendix). For the assessment we use the glossary written by W. S. Dallas [24]. Note that the choice of glossary can considerably alter the result of the comparison.

Two points are relevant before proceeding to the comparison. First, the glossary of the book contains not only words but also some phrases. To deal with multi-word keywords in the glossary, we separate them into single words. For example, we convert the phrase GANOID FISHES into the two separate words GANOID and FISHES. Second, in any method a value is assigned to each vocabulary word, so we can sort the words from the highest value to the lowest. We give rank 1 to the first word in the sorted list, rank 2 to the second word, and so on. Unlike Zipfian ranking, this ranking process allows for ties: words with the same assigned value receive the same rank.
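The two evaluation metrics and the glossary pre-processing described in this section can be illustrated with a minimal sketch. It assumes the standard set-based definitions of precision and recall (the text does not spell out its formulas), and the function names, the lowercasing step, and the toy word lists are illustrative assumptions rather than the authors' code or data.

```python
def split_glossary(entries):
    """Split multi-word glossary entries into a set of single words.

    Lowercasing is an assumption made here so that glossary words
    can be matched against a lowercased vocabulary.
    """
    words = set()
    for entry in entries:
        words.update(entry.lower().split())
    return words


def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved keyword list.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data, made up for illustration only.
glossary = split_glossary(["GANOID FISHES", "PUPAE"])
retrieved = ["pupae", "ganoid", "the", "wax"]
precision, recall = precision_recall(retrieved, glossary)
print(precision, recall)
```

In this toy case two of the four retrieved words are glossary words, and two of the three glossary words are retrieved.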
As an example, in Table 3 the words FORWARD and MONTHS have equal values. In this case we assign them equal rank (2128), and the next word in the list will have rank 2130.

PLOS ONE | DOI:10.1371/journal.pone.0130617 | June 19 | The Fractal Patterns of Words in a Text

Table 1. List of the twenty top-ranked words according to degree of fractality (left) and the twenty most frequent words (right) from the book The Origin of Species. Words with a high degree of fractality are important words with respect to the subject of the book, whereas common words have a low degree of fractality. The string un, in the second row of the top-ranked list, is a French determiner that appears four times in a single sentence; it is therefore highly clustered and has a high fractality value. Because we do not perform any pre-processing to eliminate foreign words, this word appears in the list.

Word            Frequency   Fractality   Word      Frequency   Fractality
slaves          34          17.42        the       13368       2.54
un              4           16.70        of        9071        2.67
illegitimate    21          16.52        and       5482        2.79
saliva          5           16.42        in        4973        2.97
pedicellariae   15          16.03        to        4477        2.79
floated         18          15.98        a         3143        2.71
pupae           13          15.72        that      2612        2.77
wax             42          15.65        as        2122        3.16
vibracula       12          15.54        have      2051        2.79
masters         17          15.52        be        2045        2.78
avicularia      13          15.28        is        1975        2.80
dried           9           15.11        species   1745        2.42
movable         10          15.10        by        1665        2.82
segment         5           15.04        which     1646        2.76
caudicle        6           14.59        are       1556        2.69
neuters         12          14.93        or        1489        3.22
cuckoo          32          14.89        it        1462        3.04
lamellae        20          14.67        on        1432        3.12
dun             8           14.60        with      1383        3.02
bucket          7           14.59        for       1381        2.

doi:10.1371/journal.pone.0130617.t

There are two ap.
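The tie rule described in this section, where words with equal values share a rank and the following word skips ahead (as in 2128, 2128, 2130), matches what is commonly called standard competition ranking. A minimal sketch follows; the function name and the scores in the example are made up for illustration and are not taken from the paper's tables.

```python
def rank_with_ties(scores):
    """Map each word to its rank, highest value first.

    Words with the same value share a rank; the next distinct value
    gets the rank equal to its position in the sorted list, so ranks
    are skipped after a tie (1, 2, 2, 4, ...).
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    previous_value = None
    previous_rank = 0
    for position, (word, value) in enumerate(ordered, start=1):
        # Exact equality is used here; real-valued scores may need a tolerance.
        rank = previous_rank if value == previous_value else position
        ranks[word] = rank
        previous_value, previous_rank = value, rank
    return ranks


# Hypothetical values: FORWARD and MONTHS are given the same score.
scores = {"slaves": 17.42, "forward": 3.10, "months": 3.10, "the": 2.54}
ranks = rank_with_ties(scores)
print(ranks)
```

Here the two tied words both receive rank 2 and the next word receives rank 4, mirroring the 2128, 2128, 2130 pattern in the text.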