UTF and str_word_count

lolshik · 7 Sep 2008

Good day.

I'm writing something like a grabber. I count words with this code:

PHP:

$str = "лол 555 хай"; 
$c = str_word_count($str); 
echo $c;

I save the PHP file in UTF-8 (Unix) encoding.

With English words everything works fine. With Russian words it counts the number of words incorrectly.

What should I do?
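
For context on why the count goes wrong: as far as the PHP manual describes it, str_word_count() treats a word as a locale-dependent run of alphabetic characters and looks at single bytes, while every Cyrillic letter in UTF-8 takes two bytes. A minimal sketch of that mismatch, with the word "лол" written as explicit byte escapes so the result does not depend on how the file is saved (the variable name is just for illustration):

```php
<?php
// "лол" as explicit UTF-8 bytes: each Cyrillic letter is two bytes long.
$word = "\xD0\xBB\xD0\xBE\xD0\xBB";

echo strlen($word), "\n";                       // 6: strlen() counts bytes
echo bin2hex($word), "\n";                      // d0bbd0bed0bb
echo preg_match_all('/./su', $word), "\n";      // 3: with /u, "." matches whole UTF-8 characters
```

Any function that walks the string byte by byte sees six unknown bytes here instead of three letters.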

alexzh · 7 Sep 2008

Go to php.net and look through the comments.
There is something there, I think you'll find what you need... for example:

PHP:

<?php

/**
 * Returns the number of words in a string.
 * As far as I have tested, it is very accurate.
 * The string can have HTML in it,
 * but you should do something like this first:
 *
 *    $search = array(
 *      '@<script[^>]*?>.*?</script>@si',
 *      '@<style[^>]*?>.*?</style>@siU',
 *      '@<![\s\S]*?--[ \t\n\r]*>@'
 *    );
 *    $html = preg_replace($search, '', $html);
 *
 */

function word_count($html) {

  # strip all html tags
  $wc = strip_tags($html);

  # replace every run of characters that are not letters, digits, underscores
  # or common punctuation with a space; the /u modifier makes PCRE read the
  # string as UTF-8, so Cyrillic letters are kept
  $pattern = '#[^\p{L}\p{N}_\'".!?;,\\\\/\-:&@]+#u';
  $wc = trim(preg_replace($pattern, " ", $wc));

  # remove 'words' that consist only of punctuation
  $wc = trim(preg_replace('#(?<!\S)[\'".!?;,\\\\/\-:&@]+(?!\S)#u', " ", $wc));

  # remove superfluous whitespace
  $wc = preg_replace("/\s\s+/", " ", $wc);

  # split string into an array of words, dropping empty elements
  # (array_filter() without a callback would also drop the word "0")
  $wc = preg_split('/\s+/', $wc, -1, PREG_SPLIT_NO_EMPTY);

  # return the number of words
  return count($wc);

}
?>

EugeneVC · 7 Sep 2008

There are several ways:
1) take a popular framework such as Zend Framework or Kohana; they already have the functions you need.
2) use the functions with the mb_ prefix; in your case that would be

Code:

mb_strlen($name,'utf8')

PHP_Master · 7 Sep 2008

EugeneVC, don't you see the difference between mb_strlen() and str_word_count()?
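
To make that difference concrete: mb_strlen() returns the number of characters, while the original question is about the number of words. Below is a rough sketch of a UTF-8-aware word count. The helper name utf8_word_count is hypothetical (it is not a PHP built-in), it assumes the mbstring extension is loaded for the comparison, and it assumes that a run of digits such as "555" should count as a word, which str_word_count() itself does not do.

```php
<?php
// Hypothetical helper, not part of PHP: counts runs of Unicode letters and digits.
function utf8_word_count(string $text): int
{
    // \p{L} matches any letter, \p{N} any number; the /u modifier makes
    // PCRE treat both pattern and subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{N}]+/u', $text);

    // preg_match_all() returns false on failure (for example invalid UTF-8).
    return $count === false ? 0 : $count;
}

// "лол 555 хай" written as explicit UTF-8 bytes.
$str = "\xD0\xBB\xD0\xBE\xD0\xBB 555 \xD1\x85\xD0\xB0\xD0\xB9";

echo mb_strlen($str, 'UTF-8'), "\n"; // 11: characters, not words
echo utf8_word_count($str), "\n";    // 3: words
```

If digit runs should not count as words, dropping \p{N} from the character class would be the place to change it.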
